
University of Groningen

Linguistic Knowledge and Word Sense Disambiguation
Gaustad, Tanja

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record

Publication date: 2004

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA): Gaustad, T. (2004). Linguistic Knowledge and Word Sense Disambiguation. s.n.

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license. More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne-amendment.

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 19-12-2021


Linguistic Knowledge and

Word Sense Disambiguation

Tanja Gaustad


The work in this thesis has been carried out under the auspices of the Behavioral and Cognitive Neurosciences (BCN) research school, Groningen, and has been part of the Pionier project Algorithms for Linguistic Processing, supported by grant number 220-70-001 from the Netherlands Organisation for Scientific Research (NWO).

Groningen Dissertations in Linguistics 50
ISSN 0928-0030

Document typeset in LaTeX
© 2004 Tanja Gaustad
Cover design and illustration by Matty de Vries (www.mattydevries.com)
Printed by PrintPartners Ipskamp, Enschede


Rijksuniversiteit Groningen

Linguistic Knowledge

and

Word Sense Disambiguation

PhD thesis

to obtain the degree of doctor in Arts
at the Rijksuniversiteit Groningen,
on the authority of the Rector Magnificus, dr. F. Zwarts,
to be defended in public on
Monday 1 November 2004 at 16.15

by

Tanja Gaustad

born on 14 November 1973
in Basel, Switzerland


Promotor: Prof. dr. ir. J. Nerbonne

Copromotor: Dr. G.J.M. van Noord

Reading committee: Prof. dr. W. Abraham
Dr. J.A. Carroll
Prof. dr. W. Daelemans
Prof. dr. M. Pinkal


Acknowledgements

As with every PhD project, even though there is but one author on the front cover, many different people have implicitly contributed to the book you now have in front of you. They have encouraged and helped me before and during my PhD position at the University of Groningen and I would like to use this opportunity to thank them.

It all started with a few of my University teachers who were involved in my MA thesis project: Georges Lüdi and Pius ten Hacken at the University of Basel, and Ulrich Heid at the Institut für Maschinelle Sprachverarbeitung in Stuttgart. Their feedback and encouragement made me believe in my talents as a researcher and helped me decide to continue my education with a PhD.

I am especially grateful to my supervisors, John Nerbonne and Gertjan van Noord. Without all their support, scrupulous (and fast!) reading and critical comments this thesis would not be what it now is. I would also like to thank the members of my reading committee, Werner Abraham, John Carroll, Walter Daelemans, and Manfred Pinkal, for their valuable comments and suggestions.

Among the people who have made me feel welcome and at home in Groningen during the last four and a half years, I would like to mention my colleagues at the Alfa Informatica corridor. Stasinos Konstantopoulos, Ivelin Stoianov and Wouter Jansen were a great “welcoming committee”, always ready to go for coffee, a beer or a movie. Special thanks go to my (current and past) office mates Leonoor van der Beek, Jan Daciuk, Rob Koeling, Mark-Jan Nederhof, Erik-Jan Smits, and Jennifer Spenader for discussions on subjects ranging from linguistic examples, translations to various languages, and LaTeX questions to traveling, music and recipes (tested during the AiO etentjes with Leonoor, Menno and Robbert). I would also like to express my gratitude to everyone involved in the Pionier project for listening to (unfinished) ideas and giving valuable comments on presentations: Gertjan van Noord, Gosse Bouma, Leonoor van der Beek, Jan Daciuk, Rob Malouf, Robbert Prins, and Begoña Villada Moirón. I am especially indebted to Rob for patiently explaining his maximum entropy package to me. I would also like to thank Leonie Bosveld-de Smet for the many moments we have spent discussing life and everything related to it, and the rest of my colleagues for providing a stimulating and pleasant working environment.

Part of this research has been carried out while I was working on the KOP (Kennisontwikkeling in Partnerschap) project on email classification, jointly funded by BSC Customer Care, Groningen, and the University of Groningen. I really appreciated the opportunity to work on a “real world” project and the smooth and very pleasant collaboration with Gosse Bouma, the project leader.

Apart from “our” corridor, a few other people from the University also deserve thanks: all the CLCG graduate students, Anna Hausdorf, Jack Hoeksema, Laurie Stowe, Rob Visser, Wyke van der Meer and all the secretaries from the Cluster Nederlands. Their support—be it intellectual, financial or administrative—was greatly appreciated.

Besides research, social life was also important during this period. I am indebted to Mirella Derungs, Aletta Eikelboom, Ainhoa de Federico, Andrea Haase and Kurt Brauchli, Gerhard van Huyssteen, Janetta Kuperus and Sjors Hoffer, Joanneke Prenger and Gerard Smeenk, Laura Sabourin, Martine Verheul and Martin Korver, and Catherine Zenhäusern, for their friendship and all the entertaining evenings spent together. I am also grateful to my parents, Bluette and Kjell Gaustad, for their continuing unwavering support. I will definitely miss all of you!

Thanks to Leonoor and Joanneke for being my paranimfs. I am looking forward to defending my thesis with both of you by my side. Also, this thesis would never have been finished in time without Joanneke and our weekly meetings, and my Dutch summary would be unintelligible without Leonoor's scrupulous rephrasing.

Finally, and most importantly, I would like to thank you, Menno, for all your support and encouragement during the last four years. A big cheers to Tante Anna's and Sydney, here we come!


Contents

Acknowledgements

List of Tables

List of Figures

1 Introduction
  1.1 Ambiguity in Language
  1.2 Overview

2 Word Sense Disambiguation
  2.1 Defining Word Senses
  2.2 Approaches
    2.2.1 Knowledge-Based Approaches
    2.2.2 Corpus-Based Approaches
  2.3 Information Sources
    2.3.1 PoS Information
    2.3.2 Syntactic Structure
    2.3.3 Selectional Preferences
    2.3.4 Combination of Information Sources
  2.4 Problem of Evaluation
    2.4.1 Senseval: A Common Evaluation Framework
  2.5 General Approach

3 Initial Experiments: Pseudowords
  3.1 Pseudowords
  3.2 Naive Bayes Classification
  3.3 Varying Corpus Size
    3.3.1 Corpus and Pseudowords
    3.3.2 Underlying Assumptions
    3.3.3 Results and Evaluation
  3.4 Varying Thresholds for Context Words
    3.4.1 Results and Evaluation
  3.5 Pseudowords versus Real Ambiguous Words
    3.5.1 Outline of the Problem
    3.5.2 Way of Proceeding
    3.5.3 Corpus and Ambiguous Words/Pseudowords
    3.5.4 Results and Evaluation

4 Experimental Setup
  4.1 Senseval-2 Corpus for Dutch
  4.2 WSD as Classification Problem
    4.2.1 Maximum Entropy Classification
    4.2.2 Smoothing with Gaussian Priors
  4.3 Building Individual Classifiers
  4.4 Implementation
  4.5 Tuning versus Testing
  4.6 Results and Evaluation

5 Lemma-Based Approach
  5.1 Accurate Stemming of Dutch
  5.2 Stemmers
    5.2.1 Dutch Porter Stemmer
    5.2.2 Stemmer with Dictionary Lookup
    5.2.3 Stand-Alone Evaluation
  5.3 Dictionary-Based Lemmatizer for Dutch
  5.4 Introducing the Lemma-Based Approach
  5.5 Results and Evaluation

6 Impact of Part-of-Speech Information
  6.1 Application-Oriented Evaluation of Three PoS Taggers
  6.2 Comparison of PoS Taggers
    6.2.1 Hidden Markov Model PoS Tagger
    6.2.2 Memory-Based PoS Tagger
    6.2.3 Transformation-Based PoS Tagger
    6.2.4 Stand-Alone Results for the PoS Taggers
  6.3 Integrating PoS Information
  6.4 Results and Evaluation
  6.5 PoS Information in Context

7 Impact of Structural Syntactic Information
  7.1 Prior Work
  7.2 Dependency Relations
    7.2.1 Alpino Dependency Parser
    7.2.2 Dependency Triples as Features
  7.3 Results and Evaluation

8 Final Results on Dutch Senseval-2 Test Data
  8.1 Summary of Findings on Tuning Data
  8.2 Results and Evaluation

9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
    9.2.1 Semantic Information
    9.2.2 EuroWordNet to Acquire More Data
    9.2.3 Other Languages
    9.2.4 Applications

Bibliography

Summary

Samenvatting


List of Tables

3.1 Overview Pseudowords.

3.2 Results with varying corpus size (in %), optimal performance per row in bold.

3.3 Results with varying thresholds (in %), optimal performance per row in bold.

3.4 Overview of ambiguous words and corresponding pseudowords.

3.5 Pseudowords vs. real ambiguous words: Results (in %).

4.1 Statistics for the training and test section of the Senseval-2 data for Dutch.

4.2 Comparison of context features with bag of words (“bow”) and with relative ordering (“order”).

4.3 Comparison of results (in %) with different thresholds on the number of training instances using leave-one-out on training data with a Gaussian prior of 1000, context size ±3, context = “order”; † denotes a significant improvement over a threshold of ≥10; ‡ denotes a significant improvement over the model using only context (to be read vertically).

4.4 Comparison of results (in %) with different context sizes using leave-one-out on training data with a Gaussian prior of 1000, threshold >1, context = “order”; † denotes a significant improvement over a context of ±5, ‡ denotes a significant improvement over a context of ±10.

4.5 Comparison of results (in %) with context words vs. context lemmas and “bow” vs. “order” using leave-one-out on training data with a Gaussian prior of 1000, threshold >1, context size ±3; † denotes a significant improvement over “bow”, ‡ denotes a significant improvement over context words (to be read vertically).

5.1 Examples of stemming and lemmatization of Dutch words.

5.2 Accuracy of the Dutch Porter stemmer and the dictionary-based stemmer for Dutch on a 45,000 word evaluation corpus.

5.3 Overview of classifiers built and used with the lemma-based approach and with word forms as basis.

5.4 Results (in %) on the test section of the Dutch Senseval-2 data with the lemma-based approach compared to classifiers based on word forms; † denotes a significant improvement over the word form classifiers (to be read vertically).

5.5 Comparison of results (in %) for lemma-based and word form-based approach for words with different models only; † denotes a significant improvement over the word form classifiers (to be read vertically).

5.6 Comparison of results (in %) from different systems on the test section of the Dutch Senseval-2 data.

6.1 Stand-alone results (in %) for the three PoS taggers on 10% of the Eindhoven corpus data.

6.2 Frequencies of PoS tags assigned by each PoS tagger in the Dutch Senseval-2 WSD data and distribution of PoS in the Eindhoven training corpus.

6.3 Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, integrating the output from different PoS taggers; † denotes a significant improvement over the results with the HMM tagger, ‡ denotes a significant improvement over the results with the TBL tagger.

6.4 Comparison of accuracy with more than one PoS tag assigned by the PoS tagger; † denotes a significant improvement over not including PoS.

6.5 Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, including PoS of the ambiguous word form and PoS of context; † denotes a significant improvement over the results with the HMM tagger, ‡ denotes a significant improvement over the results with the TBL tagger.

7.1 Dependency triples associated with the dependency tree represented in figure 7.1; numbering only given for reference, no order implied.

7.2 Frequency of dependency relations linked to a single word in the Dutch Senseval-2 training data.

7.3 Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, including dependency relations; † denotes a significant improvement over the results without dependency relations.

7.4 Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, using dependency relations including words; † denotes a significant improvement over the results with dependency relations including words.

8.1 Results (in %) on the tuning and test data; † denotes a significant improvement over the model including PoS in context (to be read vertically).

8.2 Comparison of results (in %) on the test section of the Dutch Senseval-2 data with the word form and the lemma-based approach; † denotes a significant improvement over the word form approach.

8.3 Comparison of results (in %) from different systems on the test section of the Dutch Senseval-2 data.


List of Figures

1.1 Figure illustrating the possible interpretations of the sentence The guests left John’s party right away. The dotted lines show all possible combinations of senses for all words, the black line indicates the correct path.

4.1 Example of Gaussian distributions with varying σ².

5.1 Diagram of the alternative stemmer with dictionary lookup (SteDL).

5.2 Schematic overview of the lemma-based approach building our WSD system for Dutch.

7.1 Dependency structure of the sentence Een oorverdovende donderslag deed de aarde beven.


Chapter 1

Introduction

1.1 Ambiguity in Language

In the field of computational linguistics, researchers are mainly concerned with the computational processing of natural language. A number of results have already been obtained, ranging from concrete and applicable systems able to understand or produce language to theoretical descriptions of the underlying algorithms.

However, a number of important research problems have not been solved. A particular challenge for computational linguistics pertaining to all levels of language is ambiguity. Most people are quite unaware of how vague and ambiguous human languages really are, and they are disappointed when computers are hardly able to understand language and linguistic communication the way humans do. Ambiguity means that a word or sentence can be interpreted in more than one way, i.e. has more than one meaning. It should not be confused with vagueness, in which a word or phrase has only one meaning whose boundaries are not sharply defined. Mostly, ambiguity does not pose a problem for humans and is therefore not perceived as such. The only exceptions, where ambiguity is actively employed, are jokes and puns. For a computer, however, ambiguity is one of the main problems encountered in the analysis and generation of natural languages.

We can distinguish various kinds of ambiguity. A word can be ambiguous with regard to its internal structure (morphological ambiguity). Compounds are a typical source of morphological ambiguity. Two Dutch examples are massagebed, which can be analyzed as massage-bed (massage bed) or massa-gebed (mass prayer), and computertaalkunde, with the two analyses computer-taalkunde (computer linguistics) and computertaal-kunde (programming language knowledge).


This kind of ambiguity can also be observed more implicitly, such as for example with the English verb form look: It can either be the infinitive, first or second person singular or plural, but as soon as the word immediately preceding look is known, the ambiguity can be resolved in most cases (e.g. to look is the infinitive, I look is first person singular, etc.).1 Look can also be ambiguous with regard to its syntactic class, its so called part-of-speech. In the sentence “We look at her” look is a verb whereas in “She gave him a warning look” it is a noun.

Another kind of syntactic ambiguity can be found at sentence level. A classic example is so called PP attachment ambiguity: The sentence “The man saw the girl with the telescope” is ambiguous with respect to whether the man had the telescope and was using it to see the girl or whether the girl was carrying the telescope. In contrast, the sentence “The man saw the girl with the ice cream” is not ambiguous for the human reader (we know that ice cream cannot be used to see), while it presents the same difficulty as the telescope sentence for the computer to resolve. Pragmatics can also lead to ambiguity, as e.g. with the interpretation of pronouns. Consider for example the two utterances in (1):

(1) Mary’s mother is a gardener. John likes her.

The pronoun her in the second sentence can either refer to Mary or to her mother. The preferred (and congruent) reading would be that John likes Mary's mother, but once more there is potential ambiguity that needs to be resolved.

At word level again, lexical semantic ambiguity occurs when a single word is associated with multiple senses. We will be focusing on this type of ambiguity in the present thesis. To illustrate the problem of lexical ambiguity, consider the noun party. It can refer to (at least) 4 different things:

• an organization to gain political power (political party),

• a band of people associated temporarily in some activity (search party/party of three),

• a group of people gathered together for pleasure (birthday party),

• a person involved in legal proceedings (third party rights).

1 An exception occurs if the preceding word is the personal pronoun you which can either be singular or plural and which, in addition, can also be used as a direct object instead of as the subject. In those cases, more context and more information has to be taken into account to achieve disambiguation.


Without any further information, a list of possible senses like the one above is the best we can do to decide what party refers to. One could also argue that all these meanings are related and could be subsumed in a more general sense of party, namely “group of people” (but for many other words no such general sense can be found).

However, for various applications, such as information retrieval queries or machine translation, it is important to be able to distinguish between the different senses of the word party. In order to correctly translate an English sentence containing party to Dutch for example, we first have to know which meaning of party is intended in English and then find the best translation equivalent in the given context in Dutch. The preferred translation for birthday party would be (verjaardags)feestje, whereas for political party it would be partij—two words with quite distinct meanings. Also, when we formulate an Internet query, there is usually one specific meaning we intend and we only want to retrieve documents or links relevant for that particular meaning. So, if, for instance, we are looking for information on a political party, we are not interested in documents on search parties that have been conducted or legal issues. For this reason, it is crucial to be able to distinguish the various senses of a word.

Now let us consider the meaning of party in the following sentence:

(2) The guests left John’s party right away.

It is quite clear to the human reader that the only possible reading here is the ‘social gathering for pleasure’. It is interesting to note that most people are not even aware of the potential ambiguity contained in this sentence. Humans are so skilled at resolving potential ambiguities that they do not realize that they are doing it. There has been research on how people resolve ambiguities (see Small et al. (1988) for a collection of articles from a psycholinguistic and neurolinguistic point of view), but since we (still) do not exactly know how lexical ambiguity resolution is done by humans, it is even more difficult to teach a computer to achieve the same thing. Especially if more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence “explodes”: the number of interpretations is the product of all possible meanings of the words. Assume that only left and party are ambiguous in the example sentence, and that they both have 4 senses. This brings the number of possible interpretations to 16. Imagine what happens if there are more senses to take into account as illustrated in figure 1.1 or if the sentence gets longer.
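
To see the multiplication concretely, here is a tiny Python sketch; the sense labels for left and party are invented for illustration:

    from itertools import product

    # Invented sense inventories for the two ambiguous words in the example
    # sentence; all other words are treated as unambiguous.
    senses = {
        "left": ["departed", "political left", "remaining", "left-hand side"],
        "party": ["political party", "search party", "birthday party",
                  "legal party"],
    }

    # Each reading of the sentence picks one sense per ambiguous word, so the
    # number of readings is the product of the sense counts: 4 * 4 = 16.
    readings = list(product(senses["left"], senses["party"]))
    print(len(readings))  # -> 16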

The most prominent way to determine the meaning of a word in a particular usage is to examine its context.


Figure 1.1: Figure illustrating the possible interpretations of the sentence The guests left John’s party right away. The dotted lines show all possible combinations of senses for all words, the black line indicates the correct path.

The context can be seen as the words surrounding the ambiguous word, in this case party. A word such as guest might be a good cue for a particular sense of party. But the words surrounding the ambiguous word are not the only kind of information that is available. Underneath the simple words lies information on whether a word in the context is a noun or a verb (its syntactic class), on whether that same word plays the role of subject or object, on the syntactic structure of the entire sentence, etc. All this information is certainly available to people in the process of disambiguation and a combination of all these different kinds of information together with general knowledge about the situation and the world is used to rule out improbable readings.

The main research question we will try to answer in the present thesis is which linguistic knowledge sources are most useful for word sense disambiguation, more specifically word sense disambiguation of Dutch. Therefore, the structure of the thesis is based on the various levels of linguistic information tested for word sense disambiguation, including morphology, information on the syntactic class of a particular ambiguous word, and the syntactic structure of the entire sentence containing an ambiguous word. Each source of linguistic knowledge is tested and evaluated individually in order to assess its value for word sense disambiguation. Finally, combinations of knowledge sources are investigated and evaluated.

The goal of our project was to develop a tool which is able to automatically determine the meaning of a particular ambiguous word in context, a so called word sense disambiguation system. In order to achieve this, we make use of the information contained in the context—similar to what humans do. So we use the words surrounding the ambiguous word, and additional underlying information, such as syntactic class and structure, to build a statistical language model. This model is then used to determine the meaning of examples of that particular ambiguous word in new contexts.
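
As a rough illustration of this setup, the sketch below trains on a tiny invented set of labeled contexts for party and then labels a new context. The thesis itself uses maximum entropy models (introduced in chapter 4); the Naive Bayes classifier here is only a stand-in to show the general shape of such a statistical model:

    import math
    from collections import Counter, defaultdict

    # Invented toy training data: each example is the bag of context words
    # observed around an occurrence of "party", paired with its sense label.
    train = [
        (["guests", "birthday", "cake"], "social gathering"),
        (["music", "guests", "dancing"], "social gathering"),
        (["election", "votes", "leader"], "political organization"),
        (["coalition", "seats", "election"], "political organization"),
    ]

    sense_freq = Counter(sense for _, sense in train)
    word_freq = defaultdict(Counter)  # sense -> counts of context words
    for context, sense in train:
        word_freq[sense].update(context)
    vocab = {w for counts in word_freq.values() for w in counts}

    def disambiguate(context):
        """Choose the sense maximizing log P(sense) + sum log P(word | sense)."""
        best, best_score = None, float("-inf")
        for sense, freq in sense_freq.items():
            score = math.log(freq / len(train))
            total = sum(word_freq[sense].values())
            for word in context:
                # Add-one smoothing keeps unseen context words from ruling
                # out a sense entirely.
                score += math.log((word_freq[sense][word] + 1)
                                  / (total + len(vocab) + 1))
            if score > best_score:
                best, best_score = sense, score
        return best

    print(disambiguate(["the", "guests", "left", "the", "birthday"]))
    # -> social gathering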

1.2 Overview

Chapter 2 contains an overview of word sense disambiguation, starting with an outline of the problem of word sense disambiguation and the difficulty of defining word senses. We then continue with an elaboration of the different approaches possible and the various information types used for sense disambiguation in computational linguistics. Next, a crucial, yet difficult issue in word sense disambiguation is addressed, namely the problem of evaluation. A description of the general approach adopted in this thesis concludes this chapter.

In chapter 3, preliminary experiments with pseudowords instead of real ambiguous words are reported on, investigating the importance of corpus size and frequency of context words. Furthermore, the equivalence between employing pseudowords or real ambiguous words to test word sense disambiguation algorithms is examined. The main conclusion is that the task of disambiguating pseudowords and real ambiguous words is not comparable.

The experimental setup used in the remainder of this thesis is introduced in chapter 4. We describe the classification algorithm and smoothing techniques as well as the corpus employed. A detailed explanation of the system and its implementation, as well as first results make up the rest of the chapter. These first results using only the context for disambiguation show that maximum entropy works well as a classification algorithm for word sense disambiguation when compared to the frequency baseline.

Chapter 5 presents a variation on the word sense disambiguation system introduced, the “lemma-based” approach. It tests the hypothesis that lemmas as bases for classifiers improve generalization and therefore accuracy. Comparing the lemma-based approach with the (traditional) word form approach on the Dutch Senseval-2 data shows a significant improvement when lemmatization is used. Furthermore, the resulting word sense disambiguation system is smaller and more robust. We can conclude from this that the lemma-based approach is a better alternative than the word form-based approach. A detailed description and evaluation of a newly built stemmer/lemmatizer for Dutch (a necessary pre-processing tool for the lemma-based approach) are included, too.

Extending our word sense disambiguation system with information on part-of-speech and reporting on its impact on word sense disambiguation is the subject of chapter 6. We were especially interested in the importance of the quality of the part-of-speech tagger used during pre-processing. We therefore compare the accuracy of our system including the part-of-speech of the ambiguous word generated by three different part-of-speech taggers. Two conclusions can be drawn from our results: first, that the most accurate tagger on a stand-alone task also outperforms the other taggers on the word sense disambiguation task, and second, that including information about the part-of-speech of the ambiguous word increases performance significantly. Including parts-of-speech of the context leads to an even bigger improvement of the disambiguation accuracy achieved.

The addition of deep linguistic knowledge, in the form of syntactic dependency relations, is discussed and evaluated in chapter 7. The results of our maximum entropy word sense disambiguation system including dependency relations are preceded by a detailed explanation of Alpino, the wide-coverage parser for Dutch used to annotate the data, as well as a description of the dependency relations employed. The results show that adding dependency relations to our statistical disambiguation system results in a significant increase in performance compared with all results presented earlier. The best results on the tuning data are achieved with a combination of features, including the part-of-speech of the ambiguous word, the context, and the dependency relations linked to the ambiguous word.

Chapter 8 presents the results on the Dutch Senseval-2 test data with the best model based on the tuning evaluation. First, we summarize our findings using the training data in a leave-one-out approach. Then, the results on the test data are presented. The first conclusion we reach is that the best model on the tuning data including syntactic information also works best on the test data. When applying the same model in a comparison between the word form-based approach and the lemma-based approach, we find that the lemma-based approach using dependency relations as features achieves the best overall performance of our system on the test data. In a last step, we compare our best model to another word sense disambiguation system which, to the best of our knowledge, has produced the best results for Dutch to date. Our system achieves significantly higher disambiguation accuracy than the other model which makes it state-of-the-art for Dutch word sense disambiguation. This is mainly due to the combination of the lemma-based approach and the integration of deep linguistic knowledge in the form of dependency relations.

We conclude in chapter 9 with some final remarks on the findings presented in the present thesis and thoughts on future work.


Chapter 2

Word Sense Disambiguation

Lexical semantic ambiguity remains a major problem in natural language processing (NLP). Word sense disambiguation (WSD) refers to the resolution of lexical semantic ambiguity and its goal is to attribute the correct sense(s) to words in a given context.

In our opinion, there are many uses for WSD (even though they are not uncontested, as recent discussions on the Senseval mailing list show). Accurate disambiguation of word senses is important for e.g. machine translation (MT) and information retrieval (IR). For MT applications, disambiguating the sense of a source language word is crucial for accurately selecting its translation equivalent in the target language. The English drug for example can either have the sense of ‘medicine’ (that has been prescribed by a doctor) and is translated to the Dutch word medicijn, or it can mean ‘dope’ (an illegal substance like heroin) which has the Dutch translation drugs. In order to be able to correctly translate a text containing drug, we first need to know which sense is intended before we proceed to finding a translation.

Similarly, IR benefits from WSD (if it is accurate enough): the orthographic representation of a word conflates a number of its senses, many of which may be irrelevant in the context of a specific query. For instance, the orthographic form party subsumes both the ‘social gathering’ sense and the ‘political organization’ sense whereas the user is only interested in one of the two meanings and, consequently, only needs to retrieve documents associated with that particular meaning and not the other. Hence retrieval engines are faced with the hard problem of retrieving documents which contain the relevant sense of the word while filtering out documents with senses irrelevant to the query.

In general, the extent of disambiguation required is dependent on the application. For example, in an MT application, a word may have two senses when dealing with one pair of languages, while it may be redundant to discriminate the senses when dealing with a different pair of languages. For instance the Dutch word berg can either mean mountain or heap in English, whereas both meanings are translated to one word, Berg, in German. In other cases, it may be enough to distinguish between coarse-grained senses of selected words, thus considering a restricted version of the WSD problem. We will come back to the granularity of sense distinctions in section 2.1.

Even though word sense disambiguation is concerned with the attribution of semantic distinctions, namely the meanings of words, syntactic ambiguity also comes into play, such as ambiguity with regard to the syntactic class of a given ambiguous word or with regard to the entire sentence containing an ambiguous word. We will therefore not focus on the difference between (lexical) semantic and syntactic ambiguity, but will concentrate on attributing the correct sense to an ambiguous word given a predefined list of meanings using whatever source of knowledge is useful.

We will first proceed with a discussion of the problem we are trying to solve, the definition of word senses and sense inventories in the context of WSD (section 2.1). This introduction into WSD will be followed by an overview of (a selection of) previous work done in the field of WSD, attempting a division according to the approach used, on the one hand (section 2.2), and according to the information sources used by the different methods, on the other hand (section 2.3). By approach or strategy we refer to the primary resource used to extract information about the different senses of words, in contrast to information sources which refer to the type of knowledge used to find the correct senses. Next, we talk about the problem of evaluation and the Senseval evaluation exercises for WSD (section 2.4). A description of the general approach adopted in the present thesis in section 2.5 will conclude chapter 2.

2.1 Defining Word Senses

The phenomenon of lexical ambiguity is traditionally subdivided into polysemy and homonymy. Polysemy refers to one word having several related meanings (e.g. line meaning ‘thread’, ‘row’, ‘course of conduct’, etc.) whereas homonymy describes the fact that two words have the same lexical form, but different etymologies and unrelated meanings (e.g. bank—‘financial institution’ versus ‘river bank’).

Early disambiguation strategies, e.g. Hirst (1987), assumed that domain-priming in the different senses of homonyms is strong enough to make disambiguation easy, therefore focusing more on polysemy. This strategy can be useful when applied to terminology in specialized domains, but is most probably not sufficient for words with more common senses in everyday language utterances (demonstrated in various papers contained in Pustejovsky and Boguraev (1996)). It can be the case, however, that the contexts of homonyms have more discriminatory power than the contexts of polysemes (as argued in the introduction of Ravin and Leacock (2000)). Most current research in WSD treats both kinds of lexical ambiguity, homonymy and polysemy, without drawing a (clear) line between the two phenomena.

One of the most difficult issues in applied lexical semantics is the definition of word senses. In dictionaries, each word is listed with a number of discrete senses and subsenses, possibly different from dictionary to dictionary. But the assumption of a finite number of discrete senses is quite problematic for natural languages. Often the various senses are actually related to one another and it is unclear where to draw a line between them. Kilgarriff (1997) takes a rather radical position with regard to this issue stating that word senses do not exist as such, but only relative to a task. His proposition is to use corpus citations as basic objects in an ontology. These citations can then be clustered into “senses” according to the purpose for which they are needed. Kilgarriff's conclusion remains that he does not believe in word senses.

“The point of departure for most work on word sense disambiguation is the multiple lexical entry view [...]: a lexical item is associated with discrete senses identified in advance, and the job of the disambiguation module is to select one of these senses as the meaning intended by the use of a particular word in a particular context. This approach is therefore subject to the criticisms of inadequacy put forth above, in that it ignores potential contextual influences on the precise sense a use of a word has—context can only influence the selection of a sense, not the determination of a sense.” (Verspoor, 1997, p. 225)

Instead of assuming an a priori established set of senses which exists independent of context, a generative lexicon (Pustejovsky, 1995) could provide a different approach to the concept of word sense. In this approach a sense remains underspecified until context is taken into account and its representation is dependent on the discourse in which a word is found. Little attention has been paid so far to this potential refuge from the difficulties of determining an adequate and appropriate set of senses for WSD. We will therefore assume a priori senses nonetheless.

There is also a linguistic notion, i.e. that meanings should be recognized to be distinct whenever some linguistic process can be shown to be sensitive to the distinction (Zwicky and Sadock, 1975). Nerbonne (1993) suggested that quantification and anaphora are appropriate phenomena for common noun phrase meanings. Thus “two parties” must refer to two entities of the same type, never one social gathering and one political group. Similarly “Smith shunned one party only to be overtaken by another” must refer to two entities of the same type. But even if this notion is linguistically plausible, there are no lexica which have been compiled using these principles.

The first step involved in the task of WSD is the determination of different senses for all words in the text to be disambiguated, the sense inventory. The determination of senses can either be exhaustive, i.e. all possible meanings of a given word are identified, or tuned to the particular domain of the text under consideration. Most recent work in WSD relies on predefined senses consisting of either dictionary senses, a group of features or categories (as in a thesaurus), or translations from other languages.

All sense inventories have potential flaws. The major problem with a translingual definition of senses is mutual ambiguity in both (all) languages, i.e. that the word pairs are ambiguous in the source and target language, and limited coverage of (minor) senses. Thesaurus categories only provide a rather coarse-grained distinction because the categories correspond to general conceptual classes, such as animal or body, which only provide very broad senses. Also, words in very general categories will not easily be disambiguated because they usually have many closely related senses that will not be captured by the thesaurus categories. Even though dictionary or ontology listings have been used most extensively, they nevertheless present several problems, too. Since both the information contained in dictionaries and their structure vary considerably, tasks using different sense inventories cannot be compared. Obviously, it is very difficult to decide on which representation should count as ultimate standard, and it is even questionable whether any such standard should be chosen and set for all applications. Even so, we will be working with a predefined sense inventory consisting of the sense labels contained in the data used and compiled on the basis of a dictionary.

Another difficulty researchers face is the granularity of sense distinctions that needs to be taken into account. One might expect the major distinctions between word senses to overlap in most dictionaries which would favor a coarse-grained sense inventory in order to make results more comparable. Depending on the application, however, this level of sense distinction might not be detailed enough. In that case, the more fine-grained distinctions also need to be included in the inventory in order to be able to distinguish senses on a more detailed level.

Resnik and Yarowsky (1999) propose to restrict a word sense inventory to “distinctions that are typically lexicalized cross-linguistically” (p. 122). In their view, this approach is situated in the middle between distinguishing only homographs within one language (a very coarse-grained distinction) and trying to capture all fine-grained distinctions made in monolingual dictionaries. The basic idea is to define a set of target languages and dictionaries and then require every sense distinction to be realized lexically in a minimum number of these languages. To our knowledge, this definition of a sense inventory has not been taken up by the WSD community.

The sense inventory for the Dutch data used in this thesis has been compiled in a different way. The data originates from a sociolinguistic research project which investigated the active vocabulary of children between 4 and 12 in the Netherlands (Schrooten and Vermeer, 1994). For this purpose, a realistic word list containing the most common words used at elementary school was put together on the basis of 102 illustrated children's books and a basic dictionary of Dutch (Van Dale, 1996). The sense inventory is non-hierarchical (in contrast to inventories extracted from dictionaries or ontologies) and has been chosen by the project leaders on the basis of the Dutch dictionary.

2.2 Approaches

With regard to the approaches or strategies employed, there are three ways to approach the problem of assigning the correct sense(s) to ambiguous words in context: a knowledge-based approach, which uses an explicit lexicon (machine readable dictionary (MRD), thesaurus) or ontology (e.g. WordNet), corpus-based disambiguation, where the relevant information about word senses is gathered from training on a large corpus, or, as a third alternative, a hybrid approach combining aspects of both of the aforementioned methodologies (based on Agirre and Martínez (2001) and Ide and Véronis (1998)).

2.2.1 Knowledge-Based Approaches

WSD systems building on the information contained in MRDs use the available material in various ways. Lesk (1986) was the first to use dictionary definitions to disambiguate ambiguous words. To automatically decide which sense of a word is intended, he counts overlapping content words in the sense definitions of the ambiguous word and in the definitions of context words occurring nearby. The by now classic example mentioned by Lesk is the word cone which can either mean ‘pine cone’ or ‘ice cream cone’. Suppose that the word preceding cone in a given sentence is pine. If we compare the dictionary definitions of pine and cone, we find an overlap between the two definitions (marked in bold):

• Pine: kind of evergreen tree

• Cone: fruit of a certain evergreen tree

So if pine occurs in the same context as cone, we can decide by counting definition overlaps that cone is used in the sense of ‘pine cone’ in that occurrence.
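
In Python, Lesk's overlap count for this example reduces to a set intersection. The sketch below uses the pine and pine cone definitions quoted above plus an invented gloss for the ice cream sense; the small stoplist stands in for Lesk's restriction to content words:

    # Definitions for the two senses of 'cone'; the ice cream gloss is
    # invented for illustration.
    STOP = {"a", "an", "of", "the", "for", "kind", "certain"}

    def content_words(definition):
        return {w for w in definition.lower().split() if w not in STOP}

    cone_senses = {
        "pine cone": "fruit of a certain evergreen tree",
        "ice cream cone": "crisp wafer for holding ice cream",
    }
    pine_definition = "kind of evergreen tree"

    # Score each sense of 'cone' by its content-word overlap with the
    # definition of the neighbouring word 'pine'; the highest overlap wins.
    scores = {
        sense: len(content_words(gloss) & content_words(pine_definition))
        for sense, gloss in cone_senses.items()
    }
    print(max(scores, key=scores.get))  # -> pine cone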

Computing every combination of senses using Lesk's idea and seeking the optimal combination with respect to mutual overlap in entry content words, however, is computationally very expensive because of the huge amount of data that needs to be compared. The introduction of simulated annealing in NLP (Cowie et al., 1992) made the approach practically feasible: rather than computing the definition overlap for all possible combinations of senses, the simulated annealing optimization algorithm (Metropolis et al., 1953) identifies an approximate solution. Using the Longman Dictionary of Contemporary English (LDOCE) (Procter, 1978) and simulated annealing, Cowie et al. correctly disambiguated 47% of words to the sense level. When choosing a certain sense, a simple count of the number of tokens in common between all the definitions for a given choice of senses was used. But this method prefers longer definitions because more words can contribute to the overlap. Stevenson and Wilks (2001) instead normalize the contribution of a word to the overlap count by the number of words of the definition that contained the overlapping word. A different extension of Lesk's algorithm is described in Pedersen and Banerjee (2002). Instead of using a standard dictionary as the source of definitions, they employ glosses1 contained in WordNet (Fellbaum, 1998; Miller et al., 1990), a lexical database for English which we will explain in more detail later in this section. Their algorithm also exploits the hierarchy of semantic relations contained in the ontology.

1 In contrast to dictionary definitions, glosses are examples of the usage of a particular word either gathered from corpora or constructed artificially.
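
A generic sketch of such an annealing search over joint sense assignments is given below. The names and parameters are illustrative, and score is a placeholder for the total definition overlap of an assignment (Cowie et al.'s actual energy function and cooling schedule differ):

    import math
    import random

    def anneal(words, senses, score, steps=10_000, temperature=1.0,
               cooling=0.999):
        """Approximately maximize score(assignment) over joint sense choices.

        words:  the ambiguous words of the sentence
        senses: dict mapping each word to its list of candidate senses
        score:  stand-in for the total definition overlap of an assignment
        """
        current = {w: random.choice(senses[w]) for w in words}
        current_score = score(current)
        for _ in range(steps):
            # Propose a small change: re-draw the sense of one random word.
            proposal = dict(current)
            w = random.choice(words)
            proposal[w] = random.choice(senses[w])
            delta = score(proposal) - current_score
            # Always accept improvements; accept worsenings with probability
            # exp(delta / temperature), which shrinks as the system cools.
            if delta >= 0 or random.random() < math.exp(delta / temperature):
                current, current_score = proposal, current_score + delta
            temperature *= cooling
        return current

Because each step changes only one sense choice, each step costs a single re-scoring rather than an enumeration of all sense combinations.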

Certain dictionaries, such as e.g. LDOCE, contain additional information which can be used for disambiguation. Besides using simulated annealing on the basis of the dictionary definitions, Stevenson and Wilks (2001) integrate information on pragmatic codes and selectional restrictions contained in LDOCE into their dictionary-based WSD system.

The major problem with using MRDs is that dictionaries are created for human use, and due to inconsistencies (a well-known problem among lexicographers, cf. Kilgarriff (1994)), automatic extraction of large knowledge bases from MRDs has not fully been achieved so far.2 Regardless of these shortcomings, MRDs are widely used in WSD for English and provide a ready-made source of information about word senses. Unfortunately, for Dutch no such source of information is available.

Thesauri, on the other hand, have not been used extensively, but seem nevertheless to be applicable to WSD (Gale et al., 1992d; Yarowsky, 1992), albeit only for coarse-grained sense distinctions. In Yarowsky (1992), the sense of a word is defined as its category in Roget's International Thesaurus (Chapman, 1977). These categories correspond to conceptual classes, such as for example animal or tool for the ambiguous word crane.3 Since different word senses tend to belong to different classes and these classes, in turn, tend to appear in recognizably different contexts, sense disambiguation (i.e. listing a category in this particular setting) can be achieved by identifying salient words in the context, determining weights for them and then using these weights to predict the appropriate category of a new word. As with many early WSD algorithms, Yarowsky's thesaurus-based algorithm has only been evaluated on 12 words. On this task, though, his approach achieves an accuracy of 92% which is higher than that of all the other systems the results are being compared to.
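
A rough sketch of the salience weighting behind such a thesaurus-based classifier is given below. All counts are invented, and Yarowsky's actual model adds smoothing and collects wide (roughly ±50 word) contexts for every word in a Roget category:

    import math
    from collections import Counter

    # Invented corpus statistics for a handful of context words.
    corpus_freq = Counter({"water": 50, "fly": 30, "lift": 20,
                           "nest": 10, "steel": 40})
    corpus_size = sum(corpus_freq.values())

    # Invented counts of context words seen around members of two categories.
    category_freq = {
        "ANIMAL": Counter({"fly": 20, "nest": 9, "water": 25}),
        "TOOL": Counter({"lift": 15, "steel": 30, "water": 10}),
    }
    category_size = {cat: sum(c.values()) for cat, c in category_freq.items()}

    def salience(word, cat):
        """log Pr(word | category) / Pr(word): how strongly word signals cat."""
        p_word_cat = category_freq[cat][word] / category_size[cat]
        p_word = corpus_freq[word] / corpus_size
        return math.log(p_word_cat / p_word) if p_word_cat else 0.0

    def categorize(context):
        # The category of an ambiguous word such as crane is the one whose
        # salient words best match its observed context.
        return max(category_freq,
                   key=lambda cat: sum(salience(w, cat) for w in context))

    print(categorize(["nest", "fly", "water"]))  # -> ANIMAL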

Lately, the use of WordNet as an ontology for WSD has become increasingly popular. WordNet includes various potential sources of information, such as definitions and glosses of word senses, synsets which subsume synonyms representing a single lexical concept and are organized in a conceptual hierarchy, and semantic relations (hyponymy and hyperonymy, antonymy, meronymy) between words/synsets. The fact that WordNet provides the broadest set of lexical information in a single resource is one of the reasons for its wide-spread use. Another important characteristic is that it is the first broad-coverage lexical resource that is freely and widely available. WordNet has its limitations as well: its fine-grained sense distinctions and the irregular and varying relative granularity pose a problem often cited in the literature. WordNet's sense division and lexical relations have nonetheless become a standard for English WSD.

A WordNet for Dutch has been built in the context of the EuroWordNet project (Vossen, 1998).4 Unfortunately, the Dutch WordNet contains less information than its English counterpart. There are no glosses included to provide examples of the usage of a given sense of a word and certain semantic relations, e.g. antonyms, are barely annotated in the database. Moreover, it has not been used to annotate WSD data for Dutch.

2 There has been some work on extracting large knowledge bases from MRDs, e.g. in the ACQUILEX project (http://www.cl.cam.ac.uk/Research/NL/acquilex/), as well as on the automatic extraction of subcategorization lexicons (e.g. Gahl (1998)).
3 Example taken from Yarowsky (1992).
4 More extensive information on this project can be found at http://www.illc.uva.nl/EuroWordNet.

The English WordNet has been applied in various ways in WSD. Leacock et al. (1998) employ WordNet to counter data sparseness. They test automatically acquired training examples for a noun (line), a verb (serve) and an adjective (hard) with a statistical classifier and evaluate the test results in comparison to manually tagged training examples. WordNet's lexical relations are used to locate the training examples in a text corpus. By identifying unambiguous words (i.e. words with only one sense) that stand in a (direct) relation to an ambiguous word (synset, parent or daughter node), annotated corpora can be built for all senses for which an unambiguous counterpart could be found. Leacock et al. use the example of the noun suit to illustrate the technique: one sense of it has business suit as unambiguous daughter and another sense has legal proceedings as a parent. By collecting instances containing these nouns, automatically extracted training corpora for these two senses of suit can be used to train a statistical classifier. If, however, a certain sense of suit does not have an unambiguous correlate among its direct relations, no training material can be acquired.
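
The following sketch shows what harvesting such "monosemous relatives" looks like with NLTK's WordNet interface (not the tool Leacock et al. used; the NLTK WordNet data must be installed, and the exact output depends on the WordNet version):

    from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

    def monosemous_relatives(noun):
        """For each sense of `noun`, collect directly related words that have
        only one sense themselves; corpus hits for those words can then serve
        as training examples for that sense of `noun`."""
        relatives = {}
        for synset in wn.synsets(noun, pos=wn.NOUN):
            candidates = set()
            # Consider synonyms in the synset plus parent and daughter synsets.
            for related in [synset] + synset.hypernyms() + synset.hyponyms():
                for lemma in related.lemmas():
                    if (lemma.name().lower() != noun
                            and len(wn.synsets(lemma.name(), pos=wn.NOUN)) == 1):
                        candidates.add(lemma.name().replace("_", " "))
            relatives[synset.name()] = sorted(candidates)
        return relatives

    # For 'suit', one sense should list 'business suit' among its unambiguous
    # daughters, mirroring the example in the text.
    for sense, words in monosemous_relatives("suit").items():
        print(sense, words[:5])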

Hawkins (1999) learns contextual scores from higher level nodes in the WordNet hierarchy to disambiguate all words in a given sentence. His WSD system works with frequency and contextual information that is based on WordNet. Frequency information is used to measure the likelihood of each possible sense appearing in the text. Contextual information is based on the WordNet hierarchy and corresponds to learning contextual scores between nodes in the hierarchy. A contextual matrix of high-level concepts in WordNet is computed and stored along with the contextual matrix scores between all nodes contained in the matrix. The contextual score is different from semantic distance in that it aims to represent the “likelihood of two concepts appearing in the same sentence”. In contrast, semantic distance represents the extent to which two concepts are semantically similar based on a lexical hierarchy.

The algorithm itself iteratively learns the contextual scores from the training data and tests them on the validation data. If the correct sense has not been assigned, the scores are adapted depending on which context senses were responsible for the misclassification. The changes become smaller with every iteration (in a way similar to simulated annealing). The two knowledge sources are then combined by additive weights. Frequency information provides fine-grained evidence as it operates at the sense level. The contextual information, on the other hand, provides more coarse-grained evidence by working above word level.


Since the algorithm aims at disambiguating all ambiguous words in a sentence, interdependencies between sense choices need to be considered. Hawkins chose to eliminate senses by considering the scores at word level (frequency information) and the contribution of each sense to the best overall score for the sentence. The system was evaluated on SemCor with reasonable results. Hawkins and Nettleton (2000) present the same system, but tested on the English Senseval-1 data and with clue words added as a third source of knowledge. These clue words consist of manually identified context words that identify fixed or idiomatic expressions.

WordNet has also variously been used to determine the semantic distance between senses. Agirre and Rigau (1996) employ WordNet to determine the conceptual distance among concepts for the disambiguation of nouns, whereas Mihalcea and Moldovan (1999) exploit semantic density and WordNet glosses in an all words design. Other approaches using an ontology include Agirre and Martínez (2000), Agirre and Rigau (1996), Haynes (2001), Lin (1997), and Lin (2000). Also, a combination of MRDs and WordNet has been tried with success (Litkowski, 2000, 2001; Mihalcea and Moldovan, 1998).

2.2.2 Corpus-Based Approaches

A corpus-based approach extracts information on word senses from a large annotated data collection, a so-called sense-tagged corpus. The possible means used to attribute senses to ambiguous words are then distributional information, context, and further knowledge that has either been annotated in the corpus or added during pre-processing. Distributional information about an ambiguous word refers to the frequency distribution of its senses. Context is composed of the words found to the right and/or the left of a certain word, thus collocational or co-occurrence information.5 Additional knowledge sources can be exploited, such as lemmas, part-of-speech (PoS), syntactic annotations, etc. (see section 2.3). Examples of corpus-based systems are plentiful (see e.g. Agirre and Martínez (2000), Ng and Lee (1996), and Yarowsky (1993)) because performance is usually very accurate and more sense-tagged material is (slowly) becoming available in the context of common evaluation exercises (see section 2.4).

The major difficulty of a corpus-based approach, however, remains the data acquisition bottleneck (Gale et al., 1992b; Ng, 1997): raw corpora do not indicate which sense is applicable for a word in a given context. In order to be able to use corpora as an information resource for WSD, they have to be annotated with word senses, and this process is very labor-intensive. So far, not a lot of sense-tagged material has been made publicly available, especially for languages other than English. One approach to solving the problem (and also the predominant one) has been to manually sense-tag corpora using a given sense inventory, e.g. (Euro)WordNet hierarchies or dictionary sense listings. Another, less time-consuming possibility is the application of less data-intensive (with respect to annotated data) approaches to WSD, such as bootstrapping or unsupervised techniques (although sense-tagged data is still needed for evaluation, see section 2.4).

5 This term should not be confused with “collocations” or “idiomatic expressions”, which denote lexically fixed expressions with a non-compositional meaning, e.g. kick the bucket ‘to die’. Collocational information is literally the information that is “co-located” around the ambiguous word.

There are two possible approaches to corpus-based WSD systems: supervised and unsupervised WSD. Supervised approaches use annotated training data and basically amount to a classification task. During training on a disambiguated corpus, information about context words and other knowledge sources included in the system, as well as distributional information about the different senses of an ambiguous word, is collected. In the testing phase, the sense with the highest probability or similarity computed on the basis of the training data is chosen. Training and evaluating such an algorithm presupposes the existence of sense-tagged corpora.

Depending on the machine learning (ML) algorithm used, corpus-based supervised WSD systems can roughly be classified into exemplar-based, rule-based, or probabilistic approaches. In the exemplar-based paradigm, the k-nearest neighbor technique has been employed most often (Dini et al., 2000; Fujii, 1998; Federici et al., 1999; Ng and Lee, 1996), also for Dutch (Hendrickx et al., 2002; Hoste et al., 2002a; Veenstra et al., 2000). The basic intuition behind the systems based on this method is that, because of the distribution of linguistic events with many hapaxes and low-frequency events, all information needs to be taken into account, and other learning algorithms are at a disadvantage because they prune training examples that may be useful models to extrapolate from (Daelemans et al., 1999). Therefore, all instances encountered during training are stored in a database, and test instances are disambiguated by extrapolating the class of the nearest neighbors contained in the database.

Rule-based approaches (Li et al., 1995; Martínez et al., 2002; Pedersen, 2002; Yarowsky, 2000) use algorithms, e.g. decision lists, which search for discriminatory features in the training data and build an ordered set of rules on the basis of the discriminatory power of these features. The rules are then applied to the test instances.

A third technique is the use of different probabilistic classifiers. Despite its relative simplicity, naive Bayes has been frequently applied in WSD with good results (Chodorow et al., 2000; Gale et al., 1992d; Leacock et al., 1998; Pedersen, 2000). Various sorts of log-linear models have also been introduced with success (Bruce and Wiebe, 1994; Pedersen and Bruce, 1997b; Pedersen et al., 1997; Pedersen, 1998). Lately, combining various probabilistic classifiers has been tested in order to reach better results (Escudero et al., 2000a; Florian et al., 2002; Hoste et al., 2002b; Klein et al., 2002).

Unsupervised algorithms, on the other hand, are applied to raw text material, and annotated data is only needed for evaluation. They correspond to a clustering task rather than a classification (or sense tagging) task. Sense tagging is not possible in a completely unsupervised way since it requires that some characterization of the senses be provided. Disambiguation as sense discrimination can be achieved through unsupervised clustering: cluster the contexts of an ambiguous word into a number of groups and discriminate between them without labeling them (Pedersen and Bruce, 1997a; Schütze, 1998). A clear disadvantage is that, so far, the performance of unsupervised systems lies far below that of supervised systems (see e.g. Escudero et al. (2000b) for a comparison).

Bootstrapping can be seen as a middle way between using no annotated data at all and using only annotated data. Bootstrapping means that a small corpus is sense-tagged by hand and statistical information is extracted from the context of these occurrences. Iteratively, large amounts of unlabeled data are then labeled using this information, and the new, correctly labeled data is in turn used as input to collect statistical information. In this way, labeled data can be acquired quickly and incrementally. Its quality is assured through hand-correction (which is a lot less time-consuming than hand-labeling all the data). This method has been applied to WSD with good results (Basili et al., 1997; Hearst, 1991; Karov and Edelman, 1996; Mihalcea and Moldovan, 2001a; Yarowsky, 1995).
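The bootstrapping loop just described can be summarized in a few lines. The sketch below is only an illustration of the general scheme, not an implementation from this thesis or from the cited work: the `train` argument, the confidence threshold, and the fixed number of rounds are assumptions made for the example, and the hand-correction step is indicated only by a comment.

```python
def bootstrap(seed, unlabeled, train, threshold=0.9, rounds=5):
    """Grow a labeled data set from a small hand-tagged seed corpus.

    seed      -- list of (context, sense) pairs tagged by hand
    unlabeled -- list of contexts without sense tags
    train     -- function returning a classifier with a method
                 classify(context) -> (sense, confidence)
    """
    labeled = list(seed)
    for _ in range(rounds):
        model = train(labeled)
        still_unlabeled = []
        for context in unlabeled:
            sense, confidence = model.classify(context)
            if confidence >= threshold:
                # in practice these labels would be hand-corrected here,
                # which is cheaper than hand-labeling everything
                labeled.append((context, sense))
            else:
                still_unlabeled.append(context)
        if len(still_unlabeled) == len(unlabeled):
            break  # no new labels were added, so stop early
        unlabeled = still_unlabeled
    return labeled
```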

Another means to alleviate the need for hand-tagged data is parallel corpora (Dagan et al., 1991; Dagan and Itai, 1994; Diab and Resnik, 2002; Gale et al., 1992b; Ide et al., 2002; Ng et al., 2003). In a bilingual corpus, word correspondences are identified and the translations are used as sense tags. As noted by Ng et al. (2003), tying sense distinctions to the different translations in a target language introduces a more “data-oriented” view of sense distinction and also adds some more objectivity to defining senses—although the choice of languages remains subjective and can have a major influence on the objectivity achieved. Practical issues with an approach using parallel corpora mainly include the size of the parallel corpus required and the quality of the word alignment. Other problems with this method are the limited coverage (words do not appear in the corpus or lack examples for secondary senses) and mutual ambiguity across languages. As more parallel corpora become available, the coverage can be increased, but there is still no guarantee that all senses of a given word appear. Another problem, noted in Diab and Resnik (2002), is that even though a word-sense combination is translated with some consistency into a relatively small set of words in the second language, that set of words rarely contains unambiguous words. They therefore assume that words having the same translation at least share a dimension of meaning if not the exact sense. Diab and Resnik use WordNet as a sense inventory for the translations and an algorithm which reinforces the correct sense of a word by the semantic similarity of other words with which it shares those dimensions of meaning.

2.3 Information Sources

We will now present the different information sources or types of knowledge that can potentially be employed in a WSD system. These information sources can be used in any WSD system, whether it is based on MRDs or the like, corpora, or a combination of approaches. In a probabilistic corpus-based approach, this kind of knowledge is usually encoded as sets of features and corresponding feature values.

In Agirre and Martínez (2001), a comparison of WSD systems on the basis of the information sources they employ is presented. The authors try to evaluate the contribution of each knowledge type separately and to systematize the relation between desired knowledge types and approaches used. The knowledge types deemed useful for WSD (based on Hirst (1987), McRoy (1992) and their own research) are the following.

• Frequency of senses

• Part-of-speech (PoS)

• Morphology

• Collocational information

• Semantic word properties

– taxonomical organization

– situation

– topic

– argument-head relation

• Syntactic cues (subcategorization information)

• Semantic roles


• Selectional preferences

• Domain

• Pragmatics

Using the frequency of senses exploits the distribution of senses in a given corpus, whereas PoS can serve as a first step to disambiguate between tokens which have the same orthographic form, but different syntactic classes. Morphological information is important to generalize over different morphological instantiations of the same lemma, for instance. Disambiguation usually relies heavily on context information, mostly collocational information, but more abstract context information can also be useful for WSD. Taxonomical organization refers to the classification of words in a hierarchy and the lexical-semantic relationships holding between words (e.g. a cat is a kind of animal), information on the situation and the topic place a given ambiguous word in a broader context (e.g. if the word mouse is used in an office situation and the topic is computer use, the most probable sense will be ‘computer tool’ and not ‘animal’), and argument-head relations help identify strong disambiguation clues (e.g. two words in a coordinate relationship like cat and mouse usually share certain properties, in this case that they are both animals). Subcategorization information specifies the valency of a verb, which can, in turn, be enough to disambiguate it. For example, the verb drink is transitive for the ‘take in liquids’ sense whereas it is intransitive for the ‘consume alcohol’ sense(s). Semantic roles and selectional preferences encode similar information. Consider the sentence “The cat chased the mouse”. The verb chase can either refer to ‘the act of pursuing’ or to ‘chiseling’. The object of chase fills the experiencer role, information which can be used to constrain the possible senses of chase. At the same time, the object of chase is animate, which corresponds to the (selectional) preference of the ‘pursue’ sense of chase for animate objects, in contrast to the ‘chisel’ sense, which prefers inanimate objects. Domain information is used to constrain the considered senses and seems to include valuable information (Magnini et al., 2002). We will come back to this issue in section 9.2. Taking pragmatics into account can help to solve problems related to general reasoning and discourse implications, such as in the example from Hirst (1987) where head needs to be disambiguated: “Nadia swung the hammer at the nail, and the head flew off”.

Traditionally, lexical knowledge bases (LKB) containing the desired knowledge listed above have been built mainly by hand (Kelly and Stone, 1975; Hirst, 1987; McRoy, 1992). McRoy (1992), for instance, includes a lexicon (which contains information on PoS, morphology, subcategorization, and frequency) and dynamic lexicons (domain information), a concept hierarchy (containing a taxonomy, semantic roles, and selectional preferences), collocational patterns, and clusters of related definitions (to capture knowledge of the situation and topic). This is a very impressive accomplishment if done by hand. Unfortunately, so far no (semi-)automatic means have been found to build such LKBs.

Agirre and Martínez proceed to an evaluation of the contribution of each knowledge type separately in a common setting: the English sense inventory from WordNet 1.6 (Miller et al., 1990) and a test set from SemCor (Miller et al., 1993). The authors conclude that algorithms based on hand-tagged corpora perform best, a conclusion reached by many researchers in WSD. Mostly two sets of features are distinguished when using ML algorithms trained on hand-tagged corpora: local features which—according to the authors—take into account local dependencies in the form of collocational information, argument-head relations, and syntactic cues, as well as word forms, lemmas or PoS, and global features such as a bag of lemmas or words6 in a large window and semantic properties related to situation and/or topic.7

Their results show that good indicators, if learned from hand-tagged corpora, are the following information sources: frequency of senses, collocational information, semantic word properties, syntactic cues, and selectional preferences. Semantic word properties regarding topic and situation provide good cues, but are difficult to separate from each other, and argument-head relations work as a strong indicator if given as a sense-to-word relation. Taxonomical information, on the other hand, is very weak. Selectional preferences are reliable, but are not easily applied (see section 2.3.3 for a more detailed discussion of the use of selectional preferences in WSD systems).

PoS, morphology, semantic roles, domain, and pragmatics have not been tested at all.8 Agirre and Martínez’s results seem to confirm McRoy’s findings that collocational information and semantic word properties are the most important knowledge types to consider for WSD, but syntactic cues seem to be equally reliable. The authors conclude with the following remark:

6 The notion of “bag” refers to the fact that the lemmas or words are considered independently of their position in the context, as an unordered set. Multiple entries are possible.

7 We do not quite agree with Agirre and Martínez on the fact that situation and topic are automatically included when a large context window is used. Also, we think it is better to specifically include information about syntactic cues instead of relying on the local features to inherently represent these dependencies, as will be shown in chapter 7.

8 Recently, however, some of these knowledge sources have been integrated in WSD modules (PoS in Gaustad (2003), morphology in Gaustad (2004) and Yarowsky (1994), domain in Magnini et al. (2002)).


“We think that future research directions should resort to all available information sources, extending the set of features to more informed features. Organizing the information sources around knowledge types would allow for more powerful combinations. [. . . ] Having a large number of knowledge types at hand can be the key to success in this process.” (Agirre and Martínez, 2001, p. 9)

Continuing along the same lines, we have chosen to investigate the extent to which a corpus-based supervised statistical WSD system for Dutch might benefit from various sources of linguistic knowledge. We especially included knowledge types that have not been tested in Agirre and Martínez (2001), such as PoS and morphological information, along with syntactic cues. Several of these types of information sources will now be introduced in more detail, starting with PoS information, then syntactic structure, followed by selectional restrictions. Finally, a discussion of the combination and interaction of information sources, including several examples from the literature, will conclude section 2.3.

2.3.1 PoS Information

Since the beginning of research into WSD, it has been generally agreed that morpho-syntactic disambiguation and sense disambiguation are to be treated as two separate problems (see e.g. Kelly and Stone (1975)). This means that for homographs with different PoS, morpho-syntactic disambiguation at the same time performs sense disambiguation. Especially since the development of accurate PoS-taggers, WSD has primarily focused on distinguishing senses of words belonging to the same syntactic category.

However, Wilks and Stevenson (1998) argue that “part-of-speech ambiguity should be treated as part of the problem of word sense disambiguation” (p. 136). Their general conclusion is that a majority of coarse-grained sense distinctions can be resolved by knowing the word’s PoS. In their approach, however, they disambiguate to the homograph level only. In the Longman Dictionary of Contemporary English (LDOCE), the senses of an ambiguous word are grouped into sets of senses with related meanings, homographs, and therefore disambiguating to the homograph level represents a less fine-grained distinction than disambiguation to the sense level. In this case, resolving PoS ambiguity is indeed very narrowly tied to the problem of word sense disambiguation.

There are indisputably some similarities between PoS tagging and sense tagging, but there are also marked contrasts. Kilgarriff (1998a) lists three major differences between the two. Syntactic tags used in PoS tagging have clear, uncontested definitions, whereas, for sense tagging, there are no such general categories. Every word has a new and different set of senses which, furthermore, depends on the particular dictionary employed. Secondly, “while PoS tagging is one task, WSD is as many tasks as there are ambiguous words in the lexicon” (p. 456). Also, the primary goal of PoS tagging can be seen to be a preliminary to parsing. In the case of sense tagging, there is no single dominant purpose for which the sense-tagged text will be used. It could be used in lexicography, lexical acquisition, parsing, information retrieval, information extraction, machine translation, etc. For all these tasks, it might be the case that different sets of senses will be pertinent. All of these differences make it much harder to concretely define sense tagging.

The Dutch WSD data available from Senseval-2 is ambiguous with regard to senses and PoS. The strategy we decided on was to explicitly integrate information on the PoS for each instance of an ambiguous word in our model, following the reasoning that current PoS-taggers provide very reliable classifications. In order to test the intuition that high-quality input is likely to influence the final results of a complex system, several PoS-taggers were compared extensively. First, the data was automatically tagged by the different taggers, and in a second step, the acquired information was integrated into the feature model of the disambiguation algorithm. We reach the conclusion that PoS information of the ambiguous word itself, as well as of the context, is a useful feature for WSD and yields improvements in performance (see chapter 6 for more details).

2.3.2 Syntactic Structure

Not many approaches to WSD exist which use syntactic information, and the ones that do exist have only been tried for English. Also, in most cases syntactic information is used in combination with an ontology in order to reduce the need for sense-tagged data.

Li et al. (1995) explore the idea of using surface-syntactic analyses together with WordNet to disambiguate nouns in object position.9 The basic idea is that the combination of the WordNet ontology and syntactic information will minimize the need for other information sources. The semantic similarity of words is defined on the basis of the WordNet IS-A hierarchy. The disambiguation process itself makes use of the verbs that dominate noun objects in a sentence, so-called verb-noun pairs. Most of the system is based on heuristic rules to reach a decision. The two main conclusions the authors draw are that some verb contexts are not strong enough to limit the possible senses of their noun objects to the correct sense, and that in some cases the meaning obtained by the algorithm is suitable in the verbal context considered, but not for the whole text.

9 Only nouns in object position are considered for disambiguation, but according to the authors it could easily be extended to nouns in other positions.

Lin (1997, 2000) presents a way of defining local context as the syntactic dependencies between words in a sentence. Instead of building separate classifiers for each word, past usages of other words are used to disambiguate the current word, based on the hypothesis that “two different words are likely to have similar meanings if they occur in identical local contexts” (Lin, 1997, p. 64). The local context consists of dependency triples containing the type of dependency relation of a given word, as well as word-frequency-likelihood triples containing the frequency of a word in a particular local context and the likelihood ratio of the context and the word. Disambiguation takes place by finding words which appear in the same local contexts as the target word and then maximizing the similarity between all those words and the target word. WordNet is used as a sense inventory and to derive the similarity. Lin’s main conclusion is that defining local context in terms of dependency relations instead of as surrounding words gives better results, especially when the size of the training set is very small.

A more exhaustive discussion of the use of syntactic information in WSD can be found in chapter 7. The results of testing the use of dependency relations in our WSD system for Dutch clearly show that syntactic information provides very useful disambiguation clues and leads to a significant increase in performance.

2.3.3 Selectional Preferences

As we have mentioned above, selectional preferences are a good source of information for WSD, but their applicability and availability are rather low. Selectional preferences encode knowledge similar to argument-head relations, but given in terms of semantic classes rather than plain words.

Agirre and Martínez (2002) propose a method to make information about the selectional preferences of words more generally available and accessible. They present an algorithm for the integration of selectional preferences in WordNet, extending existing selectional restriction learning methods from word-to-class relations to class-to-class preferences. Precision rates on SemCor are about the same for both approaches, but recall significantly increases with class-to-class preferences due to the better generalization and the higher coverage achieved. Despite the efforts to include this kind of information in WordNet, few systems make use of it in WSD.


Resnik (1997) reports on an experiment to integrate a statistical model of selectional preferences into an unsupervised WSD system. His main conclusion is:

“Although selectional preferences are widely viewed as an important factor in disambiguation, their practical broad-coverage application appears limited—at least when disambiguating nouns—because many verbs and modifiers simply do not select strongly enough to make a significant difference.” (Resnik, 1997, p. 56)

Notwithstanding these rather negative results, research on WSD using selectional preferences as a source of knowledge has been carried on. McCarthy and Carroll (2003) present a detailed evaluation of integrating automatically acquired selectional preferences in a WSD system for English. Their main conclusion is that selectional preferences (in isolation) perform well in comparison with unsupervised systems, but additional information sources are needed to achieve a more satisfactory level of accuracy and coverage.

2.3.4 Combination of Information Sources

“What is now needed is further comparative work to see the relative strengths and weaknesses of different approaches and to identify when and how complementary knowledge sources can be combined.” (Carroll and McCarthy, 2000, p. 113)

Most WSD systems do not use a single source of information, but rather rely on the combination of various features. Some attempts have been made to identify particularly useful combinations of features. Ng and Lee (1996) introduce a WSD system based on exemplar-based learning including PoS, morphological form, a bag of context words found in the same sentence, local collocational information, and verb-object syntactic relations (for ambiguous nouns only). The system was tested on the “interest” corpus (Bruce and Wiebe, 1994) and the DSO corpus (presented in Ng and Lee (1996) for the first time).

An interesting feature used by Ng and Lee is the bag of context words or “unordered set of surrounding words”: within the same sentence as the word to be disambiguated, all word tokens are considered candidate keywords. For a candidate to be chosen as keyword, its conditional probability with respect to a given sense of an ambiguous word and its frequency of occurrence have to be greater than some predefined minimum. Also, a maximum number of keywords allowed is set in order to select the most frequent keywords only.
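The selection criterion can be made concrete with a small sketch. This is not Ng and Lee's code, and the threshold values below are invented for illustration; only the general shape of the criterion, two minima plus a cap on the number of keywords, follows the description above.

```python
from collections import Counter

def select_keywords(instances, sense, min_prob=0.8, min_freq=3, max_keywords=5):
    """Pick context words that are strong indicators of `sense`.

    instances -- (context_words, sense) pairs from the training corpus.
    A word qualifies if P(sense | word) >= min_prob and the word occurs
    at least min_freq times; only the most frequent qualifying words
    are kept, up to max_keywords.
    """
    word_freq, word_with_sense = Counter(), Counter()
    for context, s in instances:
        for word in set(context):
            word_freq[word] += 1
            if s == sense:
                word_with_sense[word] += 1
    keywords = [w for w in word_freq
                if word_freq[w] >= min_freq
                and word_with_sense[w] / word_freq[w] >= min_prob]
    return sorted(keywords, key=word_freq.get, reverse=True)[:max_keywords]
```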


When testing their system on the two corpora, the authors separately test the different knowledge sources employed. They conclude that local collocational information yields the highest accuracy, followed by PoS and morphology. Keywords do not work as well, maybe due to the fact that only words within the same sentence (with an average length of 20 words) were taken into account. Verb-object syntactic relations are the weakest knowledge source because they are only applied in the case of ambiguous nouns.

Stevenson and Wilks (2001) also investigate a system which uses a combination of knowledge sources for WSD. They opted for a combination of machine learning with an MRD to extract the necessary knowledge and senses. Their main goal is to optimize the combination of types of lexical knowledge.

The system works in three phases: pre-processing, disambiguation via modules, and module combination. Pre-processing consists of tokenizing, PoS tagging and sentence splitting. In a second step, named entities are filtered out and treated by a separate process, shallow syntactic analysis is performed, and lexical lookup from LDOCE is carried out. Before applying the actual disambiguation modules, a PoS filter is applied to the data, which removes any senses which do not correspond to the PoS category of a given (content) word. In combination with the lexical lookup from LDOCE this means that “senses whose grammatical categories do not correspond to the tag assigned are never attached to the ambiguous word” (p. 332).

Four WSD modules (all partial filters) are incorporated into the system: dictionary definitions, subject codes, selectional restrictions, and a collocation extractor. To optimize the dictionary definition overlap, simulated annealing is applied (as described in section 2.2.1). Selectional restrictions are used to reduce the number of senses considered or even to resolve the ambiguity. LDOCE contains 36 non-hierarchically organized semantic codes associated with senses. In order to capture the corresponding level of generality, Stevenson and Wilks organized these semantic codes in a hierarchy. They also used a mapping between the named entities identified during pre-processing and the semantic codes, in combination with a shallow syntactic parser (Stevenson, 1998), to build a preference resolution algorithm (which is only applied to verbs and nouns since adverbs do not have semantic codes in LDOCE). Words in the lexicon have to be categorized into subject areas in order for the subject code algorithm (a re-implementation of Yarowsky (1992)) to work. LDOCE contains pragmatic codes indicating the general topic of a text in which a particular sense is most likely to be used. Only about half of the senses in LDOCE are assigned a subject code. For the rest, a dummy code was created to indicate no association with a specific topic.


All these disambiguation modules were then combined in TiMBL (Daelemans et al., 2002b), an exemplar-based machine learning package. Each sense which had not yet been removed by the PoS filter is presented to the system as a separate feature vector. The WSD system was evaluated on a combination corpus of SemCor and Sensus, with an underlying mapping from SemCor WordNet senses to LDOCE senses. The results were compared to the frequency baseline (30.9%) and to the average polysemy, i.e. the number of possible senses we can expect for each ambiguous word in the corpus (14.62). It must be noted that this system can mark more than one sense as correct for a particular token—a side-effect of the one-to-many mapping between WordNet and LDOCE senses. The performance is evaluated using an exact match metric with a twist (since more than one sense can be tagged as correct): the score of a token is computed by dividing the number of correct senses identified by the total number of senses it returns. If we have one sense per word, the scoring then corresponds to 1 if the sense returned is correct, and 0 otherwise. At sense level, the system performed at 90% accuracy, at homograph level (a more coarse-grained evaluation) at 94%.
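To make the scoring metric concrete, here is a minimal sketch of the per-token score just described; the function name and the example senses are ours, not Stevenson and Wilks'.

```python
def token_score(returned, correct):
    """Exact match score allowing several returned senses: the number of
    correct senses among those returned, divided by the number returned."""
    return sum(sense in correct for sense in returned) / len(returned)

print(token_score(["sense1"], {"sense1"}))            # 1.0: single correct sense
print(token_score(["sense1", "sense2"], {"sense1"}))  # 0.5: one of two is correct
```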

In order to get a better idea of the contribution of each knowledge source, the output of each knowledge source (dictionary definitions, selectional preferences, subject codes) is checked against the correct sense. If more than one sense is assigned, the first (i.e. most frequent) sense is chosen. Subject codes work best, followed by dictionary definitions and then selectional preferences. Selectional preferences work very well for verbs, which seem to have the strongest selectional restrictions (a conclusion similar to Resnik (1997)). What emerges clearly from the results is that the combination of partial taggers yields better results than the partial taggers independently. This means that the combination of orthogonal information sources is useful. Preiss (2004a,b) reaches the same conclusion for a system which combines various modules using Bayes rule.

Lee and Ng (2002) present a systematic investigation of the interaction of knowledge sources and supervised machine learning algorithms for WSD, tested on the Senseval-1 and Senseval-2 English lexical sample data sets. Four types of information sources were tested for their contribution to disambiguation accuracy: PoS of the ambiguous word w itself and of ±3 neighboring words (within the same sentence), single words or keywords in their morphological root form in all the surrounding context provided (no stop words, numbers or punctuation marks), local collocations (within the same sentence) of ±3 words left and right, and syntactic relations in the form of the head word and its PoS related to the ambiguous word and (depending on the PoS of w) voice and/or relative position of the head word. The learning algorithms tested unfortunately do not include maximum entropy classification. The authors test support vector machines, AdaBoost with decision stumps, naive Bayes, and decision trees using the Weka implementation (Witten and Eibe, 2000).

Lee and Ng conclude that there is no single, universally best information source since information sources and machine learning algorithms interact and influence each other. Also, different algorithms react differently to feature selection. It is shown, however, that for most algorithms a combination of information sources gives a better performance than a single source of information.

2.4 Problem of Evaluation

Evaluation is an important matter within the discipline of NLP in general, and in WSD in particular. To evaluate means to compare the results of a particular system with what is seen as the correct solution to the problem at stake. Evaluation can either be intrinsic, i.e. with respect to a gold standard defined in terms of the task itself, or extrinsic, i.e. where the performance of an entire application (containing WSD as a sub-component) is measured. We will first elaborate on the more widely used intrinsic evaluation of WSD systems, and will then proceed to a brief description of application-oriented evaluation.

In WSD, sense-tagged corpora are needed as gold standards for (intrinsic) evaluation. So far, reliable evaluation data can only be produced through hand-annotation, which is very time- and expertise-intensive as well as dependent on the skills of the annotator(s). Gale et al. (1992a) review early WSD programs and present an extensive discussion of WSD evaluation. They note that the difficulty of the disambiguation task depends on the word chosen, and that to assess the real performance of WSD programs, they have to be tested on a random sample of a language. They also introduce the upper and lower bounds for a WSD system which are still standardly used nowadays. The upper bound is defined by the agreement between human judges, whereas the lower bound is given by the accuracy obtained when always choosing the most frequent sense.10

Kilgarriff (1998a) thoroughly discusses the production of Gold Standard datasets, vital for evaluation. His main point is that high standards of replicability can be achieved if the dictionary providing the sense inventory as well as the human taggers are chosen with care.

10 For unsupervised systems, the lower bound is computed by randomly assigning a sense to each occurrence of an ambiguous word instead, since these systems do not have any information about the distribution of senses beforehand.
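Computing this lower bound is straightforward: always predict the sense that is most frequent in the training material. A minimal sketch, with invented sense labels:

```python
from collections import Counter

def mfs_baseline(train_senses, test_senses):
    """Lower-bound accuracy: always predict the most frequent training sense."""
    most_frequent = Counter(train_senses).most_common(1)[0][0]
    return sum(s == most_frequent for s in test_senses) / len(test_senses)

# Hypothetical sense-tagged occurrences of one ambiguous word:
train = ["sense1", "sense1", "sense2", "sense1"]
test = ["sense1", "sense2", "sense1"]
print(mfs_baseline(train, test))  # 0.666...: two of three test items carry the majority sense
```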


In order to measure the quality of the hand-annotated text, inter-tagger agreement (ITA) has been established. This measure compares the compatibility of human judgments on the tagged data and at the same time allows difficult sense distinctions to be identified. Bruce and Wiebe (1998) tested the ITA (or Inter-Coder Agreement, as they called it) of five human judges on their “interest” corpus (Bruce and Wiebe, 1994) in order to find a way to adapt the initial sense tags to a refined and more reliable set of categories. Cohen’s (1960) κ measure, a coefficient of agreement, was applied to evaluate inter-coder reliability which, in turn, was used to adapt the original sense tag set, e.g. conflating two tags based on the judgments. Bruce and Wiebe conclude that while their procedure provides researchers with a refined set of senses using the valuable information provided by manual annotations, it also, in the process, establishes the upper bound, i.e. the agreement among human judges.11
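For two annotators, Cohen's κ can be computed directly from the observed agreement p_o and the chance agreement p_e derived from the annotators' marginal tag distributions: κ = (p_o - p_e) / (1 - p_e). A minimal sketch with made-up judgments:

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Cohen's (1960) kappa for two annotators over the same items."""
    n = len(tags_a)
    # observed agreement: fraction of items tagged identically
    p_o = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # chance agreement: product of the annotators' marginal tag frequencies
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    p_e = sum(freq_a[t] / n * freq_b[t] / n for t in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical judges tagging six instances of an ambiguous word:
judge1 = ["money", "money", "attention", "money", "share", "attention"]
judge2 = ["money", "money", "attention", "share", "share", "attention"]
print(round(cohens_kappa(judge1, judge2), 3))  # 0.75
```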

As we have mentioned before in section 2.1, another difficulty of evaluating WSD systems is the question of the sense inventory, specifically which senses to assign and at what level of granularity. If two WSD systems use different sense inventories, there is no basis on which to compare their performance. In the same way, results on different test data sets can hardly be compared. The use of different additional information sources in various systems does not facilitate comparison either.

Resnik and Yarowsky (1997) reopened the issue of evaluation in WSD—which eventually led to Senseval (see section 2.4.1). Their main observations were that WSD evaluation is not standardized, that different tasks (might) require different WSD approaches, that decent-sized sense-tagged data sets do not exist, and that the WSD field is just beginning to focus on which approaches work and which do not. Based on these observations they made four proposals.

1. Adapt the evaluation criterion from counting only exact hits to an alternative scheme giving a positive score to any reduction in ambiguity.

2. Introduce different penalties for minor and gross errors.

3. Set up a common framework for testing and evaluation through producing a gold standard corpus.

4. Use a multilingual sense inventory in the sense that, if two meanings of a word are sufficiently different to receive different translations, their meanings should be treated as distinct senses.

11 Testing the observer differences or bias for five judges on the six senses of interest, κ ranged between 0.821 and 0.977, with κ = 0 meaning chance agreement and κ = 1 perfect agreement.


In Resnik and Yarowsky (1999), an extended version of their 1997 paper in the light of the Senseval-1 exercise can be found, including an additional study on translingually motivated sense inventories (see section 2.1).

The trend towards proper evaluation is continued with the 2002 special issue of Natural Language Engineering on evaluating WSD systems. Edmonds and Kilgarriff (2002) stress the importance of evaluation in order to explain the fact that even though state-of-the-art WSD systems perform better than the baseline, recent improvements appear quite small. They also state that “[t]he evaluation of WSD has turned out to be as difficult as designing systems in the first place” (p. 279).

Lately, a more application-oriented definition of evaluation is being taken into account. If it were made explicit from the start for which particular NLP application WSD was needed, it would make the task itself clearer and would therefore also help achieve thorough evaluation. Different approaches to disambiguation might prove successful in different areas, such as IR, MT, or parse selection. Vossen et al. (1999b) report on using a WSD module in IR, stressing the importance of evaluating WSD systems on concrete tasks.

“Although the Agirre-Rigau algorithm (Agirre and Rigau, 1996) performs much worse than the First Sense heuristic in terms of WSD accuracy, it gives slightly better results for IR, as it just filters the most unlikely senses. This is experimental evidence in favor of evaluating WSD algorithms within concrete tasks, in addition to general-purpose evaluations such as the Senseval one.” (Vossen et al., 1999b, p. 89)

2.4.1 Senseval: A Common Evaluation Framework

A first attempt within WSD to set up a common task for several systems in order to allow for evaluation is Senseval. Senseval-1, held in 1998, was “the first open, community-based evaluation exercise for WSD programs”, in which 18 systems participated (Kilgarriff and Rosenzweig, 2000; Kilgarriff and Palmer, 2000). The setup allowed for supervised and unsupervised systems to participate, and included a coarse- and a fine-grained level of sense distinctions.

Several choices regarding task design and the corpus and dictionary used had to be made. The task was chosen to be a lexical sample task, which means that only a (small) set of previously chosen ambiguous words is disambiguated. An all words approach, in contrast, would mean annotating all ambiguous (content) words in a given corpus. The Hector lexical database (Atkins, 1993) was chosen for corpus and dictionary since this database had not been widely used in WSD before and was readily available. The results of Senseval-1 showed the state of the art for supervised (fine-grained) WSD to be 78% correct. Unfortunately, no precise results on unsupervised systems are reported. It is only stated that for unsupervised systems “scores were both lower and more variable” (although of the 18 participating systems 10 were supervised and 8 were unsupervised).

After the success of Senseval-1, Senseval-2 was started in 2000, broadening the task to different languages, to a choice between lexical sample or all words disambiguation, as well as to a more flexible framework (see Edmonds and Cotton (2001) for an overview).12

The results for the Senseval-2 English lexical sample task show a much lower state-of-the-art disambiguation rate for supervised (fine-grained) WSD, namely 64% correct. This amounts to a drop in performance of around 14% in comparison to the Senseval-1 results. According to Kilgarriff (2001), the difference is due to the different lexicon. For the Senseval-2 English task, WordNet was used as sense inventory. This choice was motivated by the fact that WordNet is very widely used (not only in WSD) and has become almost a de facto standard. The biggest drawback of using WordNet, however, is that some of the sense distinctions are not clear and/or well-motivated due to the fact that WordNet is organized around groups of words with similar meanings (so-called synsets), and not around words (as in a dictionary), and it is generally more fine-grained than the Hector lexicon used in Senseval-1. Also, WordNet was not constructed by trained lexicographers. If the sense distinctions are not clear to start with, the task of disambiguating is obviously more difficult, which explains the lower results.

In the context of Senseval-2, the first sense-tagged corpus for Dutch was made available (see chapter 4 for a detailed description), which underlines the importance of Senseval for this project. After the release of the data, new experiments were conducted using real ambiguous words (see chapters 4 ff.)—in contrast to the preliminary experiments on pseudowords presented in chapter 3.

In 2004, Senseval-3 was held using the same setup as Senseval-2 for different languages (all words and lexical sample tasks), additionally enlarging the competition with a few new, more application-oriented tasks (such as automatic subcategorization acquisition (Preiss and Korhonen, 2004), WSD of WordNet glosses (Litkowski, 2004b) or automatic labeling of semantic roles (Litkowski, 2004a)). Unfortunately, no Dutch task was included due to the low interest in Dutch during Senseval-2.

12The data for various languages is available from http://www.senseval.org.


2.5 General Approach

The research question that we try to answer in this thesis is: Does the addition of linguistic knowledge improve a word sense disambiguation system for Dutch? We mainly investigate which kinds of linguistic knowledge, in isolation as well as in combination, increase the number of correctly sense-tagged ambiguous words. To that end, we combine statistical parameter estimation techniques (in the form of naive Bayes and maximum entropy) with linguistic cues of different orders which are extracted from the sense-tagged corpus or added during pre-processing. In terms of the categorization presented at the beginning of this chapter, we implemented a WSD module based on the use of a sense-tagged corpus (corpus-based and supervised) using a probabilistic classifier to achieve disambiguation.

Supervised WSD algorithms need sense-tagged data for training and, especially, for evaluation. An alternative method that has been proposed is the use of pseudowords, artificially created ambiguous words. To test whether pseudowords might be a substitute for using annotated data, we compared the task of disambiguating real ambiguous words and pseudowords. Our results show that these two tasks are not comparable. Assigning correct sense distinctions to pseudowords is easier than disambiguating real ambiguous words, which means that pseudowords are not a good substitute for sense-tagged data to evaluate the performance of WSD algorithms. Following these results, we continued our research using the publicly available Senseval-2 data for Dutch.

A first source of information is the context surrounding each ambiguous word, undeniably a very important feature and one employed by most existing WSD systems. Using context as the only source of knowledge, our maximum entropy disambiguation system already clearly outperforms the frequency baseline. Also, we have found that a small context window of three words on either side of an ambiguous word provides sufficient and more precise information than bigger context sizes.

Since we are working on Dutch, a more heavily inflected language than English, it seemed useful to integrate morphological knowledge into our WSD system as well, a rather novel feature in WSD. This information is employed to group all inflectional variants of a given word together and thereby generalize the clues available to the statistical classification algorithm. As our results in chapter 5 show, this technique decidedly improves WSD for a moderately inflected language like Dutch. Further improvement can probably be expected when applying this method to languages with even more inflection.

We also tested the integration of syntactic knowledge in the form of PoS and dependency relations. Due to the format of our data, where sense distinctions were not separated according to syntactic class, PoS of the ambiguous word proved to be valuable evidence for our system. In addition, the PoS of the surrounding context improves the performance of our WSD system for Dutch, a fact previously only shown for English (Hoste et al., 2002a; Lee and Ng, 2002). Deep linguistic knowledge in the form of dependency relations, in isolation as well as in combination with other features, further ameliorates disambiguation accuracy, yielding the best results on the tuning data. By testing various feature models, we also found that PoS in context and dependency relations can be seen as similar sources of information. This means that if no parsing output is available, PoS of the context can be used as a (albeit less informative) substitute for deep linguistic information.

Finally, we applied the best model found during tuning to the test data. Our results show that syntactic knowledge is beneficial for WSD and that especially the integration of deep linguistic knowledge, such as dependency relations, markedly improves disambiguation accuracy. Moreover, the combination of orthogonal information sources yields the best results.

To summarize, our model reflects the current trend in WSD research to investigate different information sources. In this thesis, we report on experiments systematically investigating the influence of different sources of linguistic information on disambiguation accuracy for a less studied language, i.e. Dutch. We also include linguistic sources of knowledge which have not been tested extensively for WSD before, such as morphological information and PoS (in context), in combination with syntactic cues. Each knowledge type is tested and evaluated independently in order to assess its value for WSD. Our main goal is to determine the relative contribution of each information source in the context of our WSD system for Dutch and which combination of linguistic cues works best.


Chapter 3

Initial Experiments: Pseudowords

In order to train and test supervised WSD algorithms, annotated data is needed. An alternative method that has been used is pseudowords, artificially created ambiguous words. Since at the time no sense-tagged corpora for Dutch were available, our initial experiments are conducted on this sort of simulated data. We investigated whether corpus size is of importance in WSD for Dutch (section 3.3) and whether a general frequency cutoff for the context words used can be found that consistently improves disambiguation accuracy (section 3.4). The main question we try to answer in section 3.5 is whether using pseudowords yields results comparable to real WSD and whether they can be seen as equivalent to the disambiguation of real ambiguous words.

All the experiments use a supervised ML algorithm, namely naive Bayes (see section 3.2), which is trained either on the European Corpus Initiative (ECI) corpus of Dutch1 or on the (English) Senseval-1 corpus2.

1 The ECI corpus is a digitally available multilingual corpus distributed by ELSNET which contains material on a number of European languages, among others Dutch. See http://www.elsnet.org/eci.html for a complete listing of available languages and ordering information.

2 Publicly available at http://www.senseval.org.

3.1 Pseudowords

The technique of pseudowords consists of introducing a form of artificial ambiguity in (untagged) corpora. First of all, two or more words, sensewords, are chosen. Training then takes place on the disambiguated corpus, collecting probabilities for the chosen sensewords. For testing, all occurrences of the sensewords are replaced by a non-existing word, a pseudoword. The goal is ultimately to recover the correct senseword for every pseudoword introduced in the corpus.

Suppose we choose the Dutch sensewords aantal ‘number/amount’ and tijd ‘time/moment’ and combine them to form the (random, non-existing) pseudoword aantijd. The original sentences (1) and (2)—which are used during training—will then become test sentences (3) and (4).

(1) Hun aantal groeit en volgens justitie lijkt aan die groei geen einde te komen.
(Their number increases and according to justice there seems to be no end to the increase.)

(2) Tot die tijd blijven de stellingen betrokken.
(Until that moment the assumptions will hold.)

(3) Hun aantijd groeit en volgens justitie lijkt aan die groei geen einde te komen.

(4) Tot die aantijd blijven de stellingen betrokken.
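The construction of such test material is mechanical. The sketch below turns training sentences like (1) and (2) into test sentences like (3) and (4) while keeping the original senseword as the gold label; the tokenization and punctuation handling are deliberately simplistic.

```python
SENSEWORDS = ("aantal", "tijd")   # the two sensewords from the example
PSEUDOWORD = "aantijd"            # the artificial ambiguous word

def make_test_sentence(sentence, sensewords=SENSEWORDS, pseudoword=PSEUDOWORD):
    """Replace every senseword by the pseudoword, keeping the original
    senseword as the gold label for later evaluation."""
    tokens, gold = [], []
    for token in sentence.split():
        bare = token.lower().strip(".,;:!?")   # crude punctuation stripping
        if bare in sensewords:
            tokens.append(token.lower().replace(bare, pseudoword))
            gold.append(bare)
        else:
            tokens.append(token)
    return " ".join(tokens), gold

sent, gold = make_test_sentence("Tot die tijd blijven de stellingen betrokken.")
print(sent)  # Tot die aantijd blijven de stellingen betrokken.
print(gold)  # ['tijd']
```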

Evaluation takes place on held-out data of original (disambiguated) sentences. Gale et al. (1992d) used pseudowords to overcome the “testing material bottleneck”, as did Schütze (1992, 1998), who tried to escape the need for hand-labeling by using artificial ambiguous words for evaluation purposes.

3.2 Naive Bayes Classification

In the case of the preliminary experiments reported here, we work with a naive Bayes classifier (Duda and Hart, 1973) because it is easy to implement, performs relatively well, is rather fast, and is used fairly often. The Bayes classifier used in our experiments only incorporates distributional information and context words to compute probabilities, which corresponds to only using information which is available from the corpus itself, without the need for any additional material, such as a dictionary or the like.

First, the disambiguation algorithm is trained on part of the unambiguous corpus, attributing probabilities to the context words found to the right and the left of the senseword(s) for various context window sizes. Training as used here amounts to counting which sensewords are used in a given context. This is done using Bayes rule:


    P(s_k | c) = P(c | s_k) P(s_k) / P(c)

where s_k is sense k of ambiguous word w in context c = [c_1, ..., c_n], the context words within the specified context window. Since we are only interested in choosing the correct class, the classification task can be simplified by eliminating P(c), which remains a constant for all senses and therefore does not influence what the best class is.

The context words constitute a bag of words, which means that they are assumed to be independent of position and of each other. This corresponds to the Bayes independence assumption:

    P(c_1, ..., c_n | s_k) = P(c_1 | s_k) P(c_2 | s_k) ... P(c_n | s_k)

It is clearly not true that words are independent of each other, but the simplifying assumption allows us to adopt an effective model which leads to decisions that can still be optimal even if the probability estimates are inaccurate due to dependencies between features (Domingos and Pazzani, 1997).

Testing takes place on the ambiguous text, where the algorithm selects the most probable senseword for each pseudoword according to the Bayes decision rule:

    Decide s' if s' = argmax_{s_k} P(s_k | c)
                    = argmax_{s_k} P(c | s_k) P(s_k) / P(c)
                    = argmax_{s_k} P(c | s_k) P(s_k)
                    = argmax_{s_k} P(c_1 | s_k) ... P(c_n | s_k) P(s_k)

Finally, the computed sensewords are compared to the original sensewords in the disambiguated corpus and the percentage of correctly disambiguated instances of pseudowords is calculated. Despite its relatively “naive” approach, the naive Bayes classifier performs relatively well, especially in comparison with other, more sophisticated approaches (Mooney, 1996; Escudero et al., 2000c).

Sparse data is a problem in statistical corpus-based WSD. If a context word has not been seen with a particular sense of an ambiguous word in the training data, the probability P(c_i) of context word c_i in the context of all senses s_k of ambiguous word w will be 0. This means that no choice can be made using the naive Bayes classification algorithm explained above. Smoothing techniques are applied to ensure the proper treatment of infrequent or unseen data. In the experiments described in the remainder of this chapter, a fixed correction of a probability p = 0.01 has been used for unseen data during testing. A possible extension would be to apply more sophisticated smoothing techniques, such as e.g. Good-Turing (Good, 1953).
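The whole procedure, collecting training counts, applying the independence assumption and the decision rule, and using the fixed correction for unseen words, can be sketched in a short program. The code below is a minimal illustration, not the implementation used in the thesis; it works with log probabilities, which is equivalent to the products above, and the toy training data is invented.

```python
import math
from collections import Counter, defaultdict

UNSEEN_P = 0.01  # fixed correction for unseen context words, as above

class NaiveBayesWSD:
    """Sketch of the naive Bayes classifier described in this section.

    Training instances are (context_words, senseword) pairs, where
    context_words is the bag of words in the window around the
    pseudoword occurrence.
    """

    def train(self, instances):
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        for context, sense in instances:
            self.sense_counts[sense] += 1
            for word in context:
                self.word_counts[sense][word] += 1
        self.total = sum(self.sense_counts.values())

    def prob(self, word, sense):
        """P(word | sense), with a fixed probability for unseen words."""
        count = self.word_counts[sense][word]
        if count == 0:
            return UNSEEN_P
        return count / sum(self.word_counts[sense].values())

    def classify(self, context):
        """Bayes decision rule: argmax_s P(s) * prod_i P(c_i | s)."""
        def log_score(sense):
            score = math.log(self.sense_counts[sense] / self.total)
            for word in context:
                score += math.log(self.prob(word, sense))
            return score
        return max(self.sense_counts, key=log_score)

# Toy usage with the aantal/tijd pseudoword:
nb = NaiveBayesWSD()
nb.train([(["hun", "groeit", "en"], "aantal"),
          (["tot", "die", "blijven"], "tijd")])
print(nb.classify(["tot", "die"]))  # -> 'tijd'
```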

3.3 Varying Corpus Size

In a first experiment, we look at the changes in performance of the classification algorithm depending on corpus size. When working with statistical methods, changes in corpus size/number of training instances are expected to be reflected in changes in performance (Langley et al., 1992). The usual assumption is that the bigger the corpus, the better the performance, since more and hence better counts are available.

3.3.1 Corpus and Pseudowords

The corpus used in this experiment is the ECI corpus of Dutch, which contains approximately 3 million words of raw text. The corpus includes transcripts of radio programs, newspaper articles, magazine issues, and some technical texts.

Choosing high-frequency nouns, six pseudowords are created, four of which consist of two sensewords and two of which consist of three sensewords. Table 3.1 gives an overview of the pseudowords, the sensewords they consist of, as well as their frequency and the frequency baseline.

3.3.2 Underlying Assumptions

In the experiments described, we proceed from two underlying ideas: "one sense per discourse" and "all information". The idea of topic coherence underlying the "one sense per discourse" heuristic states that words usually keep the same sense within a paragraph or document (Gale et al., 1992c; Yarowsky, 1993).3 We therefore restricted the size of the context window used to paragraphs.

3 Krovetz (1998) has shown that this is only (partially) true for homonymous senses, but is not the case for polysemous words.



Pseudoword   Sensewords                  Frequency   Baseline
aantijd      aantal (amount)                  1995
             tijd (time)                      1741    53.31%
lagem        land (country)                   2991
             gemeente (municipality)          1018    74.60%
nedmin       nederland (Netherlands)          2675
             minister (minister)              3155    54.11%
prespol      president (president)            2356
             politie (police)                 1568    60.04%
neduir       nederland (Netherlands)          2675
             duitsland (Germany)               719
             irak (Iraq)                      2818    45.36%
plonbe       plan (project)                   1059
             onderwijs (education)             960
             beleid (management)               908    36.18%

Table 3.1: Overview of the pseudowords.

This means that if the window size on either side extends beyond the paragraph boundary, everything past that boundary is not taken into account.4

Furthermore, no stop list is used in the reported experiments. One of the working hypotheses is that taking into consideration all available context information, including function words, produces good results. In the case of nouns with different articles, for instance, working with a stop list would definitely be counter-productive. An example of such an ambiguous word is bal 'ball'. If used with the determiner de, it has the meaning of either a ball used to play in sports or, more generally, of something round. The expression "het bal" including the determiner het, on the other hand, denotes a public dance. Also, for separable verbs, like opeten 'to finish eating' or uitleggen 'to explain', the (in some instances) separated prepositions (uit and op) are indispensable clues contained in the context which indicate a certain sense with great certainty.

3.3.3 Results and Evaluation

In the reported experiment, results are ten-fold cross-validated. The context window is restricted to 3 words to the left and the right of the pseudoword.

4 There is a large variation in paragraph lengths (1-15 sentences). It is not quite clear yet what sort of noise is introduced by this.



Pseudoword   Baseline   0.5M Words   1.5M Words   3M Words
aantijd      53.31      80.32        84.08        84.97
lagem        74.60      78.54        80.08        82.65
nedmin       54.11      84.45        83.18        85.02
prespol      60.04      73.99        79.09        83.21
neduir       45.36      65.83        66.38        70.83
plonbe       36.18      58.32        67.25        67.70

Table 3.2: Results with varying corpus size (in %), optimal performance per row in bold.

We take a similar approach to Chodorow et al. (2000), choosing a fixed context window size of ±3. Similar results can be observed when different context sizes are used.

The results obtained (see table 3.2) clearly show that more training instances help improve the performance of the naive Bayes classification algorithm used. The overall performance of the algorithm is quite good, especially considering the fact that the results are purely based on statistical information.

3.4 Varying Thresholds for Context Words

In this second experiment, we look at the use of context words. The main idea is to only use context words of a certain informative value (expressed through placing a threshold on the probability of each context word) and to find the point at which the informative value of the data is most efficiently exploited, i.e. where the informativeness is maximal with respect to the amount of data used in the disambiguation process.

The thresholds represent how well a particular context word helps to disambiguate an ambiguous word/pseudoword. A threshold of 1.0 means that a context word is only used for disambiguation if the probability of context word c_i given sense k of ambiguous word w is 1 (p(c_i|s_k) = 1). Alternative feature selection methods that could have been applied include information gain, χ², etc.

The corpus and overall settings are the same as in the experiment reported in section 3.3. Only the four pseudowords consisting of two senses are used.



Pseudoword   Baseline   all     0.6     0.7     0.8     0.9     1
aantijd      53.31      84.97   85.03   84.97   79.83   76.64   72.11
lagem        74.60      82.65   82.65   82.62   82.88   81.83   81.60
nedmin       54.11      85.02   85.09   84.25   82.85   71.66   69.49
prespol      60.04      83.21   83.43   81.97   81.15   79.23   78.63

Table 3.3: Results with varying thresholds (in %), optimal performance per row in bold.

3.4.1 Results and Evaluation

The purpose of the reported experiment is to test the value of a frequency threshold on the context words used for disambiguation. It is generally assumed that it is better to use fewer, but highly informative features instead of many noisy ones. As the results in table 3.3 show, there is no clear cutoff value for the context words at which the performance of the algorithm improves for all pseudowords. A tendency can be observed that using a threshold of 0.6 (which means that all context words are used except those which are (almost) equally likely to occur with both senses of a given ambiguous word) works best. A possible explanation for this result might be that there is not enough data to warrant the use of less information through a cutoff.

3.5 Pseudowords vs. Real Ambiguous Words

In the last experiments reported, we investigated whether the task of disambiguating pseudowords is comparable to the task of disambiguating real ambiguous words, and we reached the conclusion that these two tasks are not equivalent (Gaustad, 2001).

3.5.1 Outline of the Problem

The idea to compare the task of disambiguating real ambiguous words to disambiguating artificially ambiguous words arose from our work on supervised WSD for Dutch. Since there were no sense-tagged corpora available for Dutch at the time, another means of testing algorithms had to be found. An obvious solution is the use of pseudowords: they are easily created, since only raw text material is needed, and any supervised algorithm can be tested. The one question that remained unanswered was whether using pseudowords would yield results comparable to real WSD and whether the seemingly "easy way out" could really be seen as equivalent to the disambiguation of real ambiguous words.

Unfortunately, there has not been a lot of work on pseudowords and, to the best of our knowledge, no work at all on their usefulness in testing word sense disambiguation systems. The major problem is to find a valid setting for a comparison: the elements to be compared, pseudowords and real ambiguous words, are too different from each other to be compared directly. Schütze (1998) explains it in the following way:

"[The better performance on pseudowords] can be explained by the fact that pseudowords have two focused senses—the two word pairs they are composed of." (Schütze, 1998, p. 109)

Real ambiguous words, on the other hand, consist of subsenses that are often difficult to identify for humans as well as for computers.

3.5.2 Way of Proceeding

A direct comparison of the task of WSD and the task of disambiguating pseudowords is not possible. The only way to compare these two tasks is to indirectly compare their results on the same corpus, using the same algorithm and general settings. The comparison does have its limitations: although we use the same settings for both tasks, the difference between them lies in the actual words (or pseudowords) to be disambiguated. There is no measure to express their differences or similarities. This is precisely why there is no possibility of a direct comparison.

We decided to proceed in two steps. First, real ambiguous words were chosen from the Senseval-1 corpus, making use of the dictionary entries as well as the training and test material provided. Only nouns which were not ambiguous regarding part-of-speech and for which there was training data were taken into account.

In a second step, we chose the sensewords of a pseudoword according to the frequency distribution of the senses of the real ambiguous words that were tested. Among the possible sensewords that exhibited the same frequency distributions as the real ambiguous words and which fulfilled the constraint of having approximately the same baseline, an arbitrary selection was made.5

5 We are aware of the fact that there is a considerable amount of variation in accuracy within the category of real ambiguous words with similar distributional properties. Therefore, we compare the performance on each real ambiguous word to several pseudowords with a similar distribution.



If the results of this second task are significantly different from the results of the first task on the same corpus, this will show that the results involving pseudowords depend entirely on the choice of sensewords. This means that the disambiguation of pseudowords is not identical to the real WSD task. Note that if one does not have access to sense-tagged corpora, no information about the distribution of the senses of real ambiguous words is available, which means that it is not really a comparable setup.

3.5.3 Corpus and Ambiguous Words/Pseudowords

The resource used in this experiment is the English Senseval-1 data set. The advantage of using this material is that it is (lexically) sense-tagged for a number of real ambiguous words, which means that the evaluation data for real ambiguous words is at hand. Furthermore, there have been numerous publications on the construction of the material, on choices made regarding annotation, on inter-annotator agreement, etc. (cf. (Kilgarriff, 1998b; Kilgarriff and Rosenzweig, 2000), and see also section 2.4), which allow for a thorough understanding of the real world disambiguation task. This is an important precondition to being able to extensively compare this task to nearly the same task using pseudowords.

Perhaps the most important factor in this comparison is the choice of elements of comparison, in this case the ambiguous words and the sensewords chosen to constitute the different pseudowords. The choice of ambiguous words depended, on the one hand, on the available Senseval-1 material. On the other hand, we only selected nouns which were not ambiguous in part-of-speech.6 No stemming was used. An overview of the ambiguous words and their senses7 chosen for the experiments can be seen in table 3.4.8

The main criteria for choosing the sensewords constituting the pseudowords were their frequency in the corpus as well as their part-of-speech. For the comparison with each ambiguous word, five pseudowords were made up. The distribution of these pseudowords' sensewords was chosen to be as similar as possible to the distribution of the different senses of the ambiguous words (see table 3.4 for details).

6 A number of ambiguous words in the Senseval-1 material had to be simultaneously part-of-speech and lexically disambiguated, e.g. bet, giant, promise. There were also cases with no training material provided (disability, hurdle, rabbit, steering) which were not taken into account, given that we worked with a supervised algorithm.

7 The senses were taken from the Senseval-1 dictionary entries. Only the coarse-grained distinctions were taken into account.

8 Since the sense hairsh does not occur in the test data, we decided to only consider two senses for shirt and, consequently, for the pseudowords.



Amb./Ps.word   Senses/S.words        Freq. train   Freq. test   Baseline

Ambig. word    accident   crash             1058          248    92.88%
                          chance             178           19
Pseudowords    timwe      time               722          306    91.90%
                          weekend             73           27
               yeatra     year               708          307    92.47%
                          traffic             86           25
               peolang    people             673          268    92.10%
                          language            54           23
               woan       world              422          187    92.12%
                          animal              39           16
               goveq      government         396          184    92.35%
                          equipment           31           15

Ambig. word    behaviour  social             969          267    95.70%
                          of thing            29           12
Pseudowords    peostan    people             673          268    93.40%
                          standards           41           19
               tima       time               722          306    95.33%
                          machine             49           15
               yeagro     year               708          307    95.34%
                          growth              58           15
               wodat      world              422          187    94.92%
                          data                36           10
               gopay      government         396          181    95.26%
                          payment             30            9

Ambig. word    excess     a glut             103          108    58.06%
                          of or after poss    65           67
                          surplus             10            9
                          too much            73            2
Pseudowords    womuconba  world              422          187    58.62%
                          music              231           97
                          concert             43           16
                          battle              42           19
               gopoemch   government         396          184    57.64%
                          police             218           98
                          empire              37           16
                          champion            45           19
               dacipapro  day                373          161    57.71%
                          city               211           83
                          palace              37           16
                          protection          45           19
               pemanora   people             673          268    58.64%
                          man                377          154
                          noise               33           16
                          railway             33           19
               heterite   head               349          150    58.37%
                          team               162           72
                          river               42           16
                          technology          34           19

Ambig. word    shirt      t-shirt            132           73    57.06%
                          garment            336          105
Pseudowords    schoclu    school             178           87    59.02%
                          club               140           72
               mastre     market             190           89    58.55%
                          street             158           63
               cimon      city               211           83    58.04%
                          month              130           60
               coufam     country            201           91    57.96%
                          family             117           66
               wogia      women              189           91    58.33%
                          giants             140           65

Table 3.4: Overview of ambiguous words and corresponding pseudowords.

3.5.4 Results and Evaluation

The results in table 3.5 clearly show that the performance of the naive Bayes classification algorithm used is significantly better on pseudowords than on real ambiguous words. A possible reason for this is the relatedness of sense distinctions in real ambiguous words, whereas the sensewords that constitute pseudowords represent very clearly distinct senses.

A probable explanation for the fact that the performance on real ambiguous words is quite poor (it constantly fails to reach the baseline) is that there is not enough training data. Note that the baseline of most ambiguous nouns in the Senseval-1 corpus is relatively high, which means that one sense accounts for most occurrences of the ambiguous word. This makes the disambiguation task comparatively harder and might also explain the bad performance on real ambiguous words.

We conclude from the results that the task of disambiguating pseudowords is comparable only in a limited way to the task of disambiguating real ambiguous words. The results on pseudowords will usually be better, which might lead to false assumptions about the performance of a given algorithm on the real problem.

             Baseline   Results   Difference
accident     92.88      84.45      -8.43
timwe        91.90      91.56      -0.34
yeatra       92.47      91.77      -0.70
peolang      93.10      91.88      -0.59
woan         92.12      93.44      +0.97
goveq        92.35      91.33      -1.14
mean                               -0.40 [± 0.89]
behaviour    95.70      84.95     -10.75
peostan      93.40      92.99      -0.41
tima         95.33      95.64      +0.31
yeagro       95.34      94.04      -1.30
wodat        94.92      93.79      -1.13
gopay        95.26      96.36      +1.10
mean                               -0.29 [± 1.24]
excess       58.06      50.35      -7.71
womuconba    58.62      71.86     +13.24
gopoemch     57.64      72.92     +15.28
dacipapro    57.71      73.98     +16.27
pemanora     58.64      73.00     +14.36
heterite     58.37      74.39     +16.02
mean                              +15.03 [± 1.55]
shirt        58.98      57.50      -1.48
schoclu      59.02      72.79     +13.77
mastre       58.55      74.83     +16.28
cimon        58.04      78.69     +20.65
coufam       57.96      63.91      +5.95
wogia        58.33      72.22     +13.89
mean                              +14.1 [± 6.6]

Table 3.5: Pseudowords vs. real ambiguous words: Results (in %).

The results obtained from disambiguating artificial ambiguous words differ greatly from the results on real ambiguous words. This indicates that pseudowords cannot be taken as a substitute for testing with real ambiguous words.9 It might be possible to employ pseudowords for the setting of parameters. More detailed research has to be conducted, however, to establish this potential use of pseudowords.

Testing of WSD algorithms is very difficult without evaluation data. The assumption that artificially created ambiguous words are a good substitute for real ambiguous words is not valid, as has been shown by the experiment reported here. Thus the initial problem, wanting to test algorithms for languages without sense-tagged corpora, remains.

9 Nakov and Hearst (2003) have shown that using lexical categories as a basis to select sensewords instead of choosing them at random leads to more realistic pseudowords and more accurate lower bounds for WSD systems. Their approach was only tested on a limited domain, however.




Chapter 4

Experimental Setup

The experiments described in chapter 3 of this thesis have shown that the use of pseudowords to investigate WSD is not a viable option. Since a sense-tagged dataset for Dutch has been made available in the context of Senseval-2, we have performed experiments systematically investigating the influence of different sources of linguistic information on disambiguation accuracy using real ambiguous data instead.

Our WSD system is founded on the idea of combining statistical classification with linguistic sources of knowledge. In order to be able to take full advantage of the linguistic information, we need a classification algorithm capable of incorporating the information provided. A big advantage of maximum entropy modeling is that heterogeneous and overlapping information can be integrated into a single statistical model. Also, no independence assumptions as in e.g. naive Bayes are necessary.

In this chapter, we will explain the basic architecture of the WSD system used henceforth. We will start with a specification of the Dutch Senseval-2 corpus, the only sense-tagged corpus available for Dutch and therefore our principal development and test corpus. Next, an explanation of the classification algorithm and the Gaussian priors used for smoothing is given. We will then continue with a detailed description of the principle of building individual classifiers per ambiguous word, as well as an example of the implementation of the maximum entropy WSD system for Dutch. After a discussion of our way of testing the results for significance, we present first results with the maximum entropy system introduced.




4.1 Senseval-2 Corpus for Dutch

The corpus used in this evaluation is the Dutch part of the Senseval-2 data (Hendrickx and van den Bosch, 2001).1 The Senseval-2 Dutch corpus is the first electronic word-sense annotated corpus for Dutch (and so far the last as well). It was originally collected as part of a sociolinguistic research project investigating the active vocabulary of children between the ages of 4 and 12 in the Netherlands (Schrooten and Vermeer, 1994). For this purpose, a realistic word list containing the most common words used at elementary school was put together on the basis of 102 illustrated children's books.

The corpus was manually sense-tagged by six people, all processing different parts of the data. This means that there was no inter-annotator agreement or annotation accuracy control, and tagging mistakes and/or inconsistencies are highly likely to be (and have been) found in the corpus.

The sense inventory is a collection of non-hierarchical symbolic sense tags assembled by the project leaders, based on a basic dictionary of Dutch (Van Dale, 1996). Each sense tag is composed of the word's lemma and a description ([baan werk] (job work) vs. [baan koers] (track route)) or grammatical category ([dood n] (death N) vs. [doden v] (kill V)).2 To distinguish verb senses, their function in the sentence was sometimes used as a description ([zijn kww] (be copula) vs. [zijn hww] (be mainverb)).

Even though the symbol "=" is supposedly used for words with only one sense (which includes names and sounds), it can nonetheless be found attributed to a certain word together with other senses. We found 115 cases of unique ambiguous words in the sense inventory with a symbolic sense tag and the tag [=] as their senses.

The dataset also contains annotated multi-word expressions covering idiomatic expressions, sayings, proverbs, and strong collocations. Each word belonging to a multi-word expression gets the multi-word expression as its label, with the exception of prepositions, which all get tagged [* prepositie] (* preposition).3

A major problem with this data is that in certain cases the most frequent sense of a word has not been annotated at all (but should be covered by [=]), whereas minor senses or occurrences in multi-word units have.

1 For more information on Senseval and for downloads of the data see http://www.senseval.org and section 2.4.

2 There is no systematicity in the choice of sense labels. E.g. the ambiguous word form arm 'poor/arm' has two senses based on parts-of-speech ([arm lichaamsdeel]=N (arm bodypart) and [arm rijk]=ADJ (poor rich)), but the annotation label is based on a description.

3 In the original data, this was not annotated consistently. We obtained the corrected version from Antal van den Bosch and Iris Hendrickx from Tilburg University.



                                        training section   test section
Sentences                                      9,287           2,999
Average number of words/sentence                10.5            10.5
Words and punctuation marks                  117,338          38,699
Words (excl. punctuation marks)               97,187          31,845
Ambiguous words (tokens)                      55,349          18,528
Unique words (types)                           8,522           4,269
Unique ambiguous word forms (types)              953             512
Unique sense tags                              4,281           2,401

Table 4.1: Statistics for the training and test sections of the Senseval-2 data for Dutch.

See for instance de 'the' (determiner vs. in de watten leggen 'to pamper') or er 'R-pronoun' (pronoun vs. er uitzien 'to appear', er geweest zijn 'to be dead meat', er gaat niets boven 'there is nothing better than', etc.). Since there is no other sense-tagged data available for Dutch, we have to test our WSD system on the Senseval-2 data. First, this enables us to compare our results to other systems. Secondly, we believe that even though the data may not be ideal, the trends shown on this data are nonetheless valid and can be extrapolated to different data sets.

In contrast to the English WSD data available from Senseval-2, the Dutch WSD data is not only ambiguous in word senses, but also with regard to part-of-speech (PoS). This means that accurate PoS information is important in order for the WSD system to achieve accurate morpho-syntactic as well as semantic disambiguation. It also means that a lot of the semantic ambiguity is already resolved by PoS tagging, which makes the Dutch data "easier" than the English data.

Let us now turn to a brief statistical overview of the Senseval-2 data for Dutch. The training section of the Dutch Senseval-2 dataset contains approximately 120,000 tokens, 100,000 words, and 9,300 sentences, with an average of 10.5 words per sentence. The test section is considerably smaller with 40,000 tokens, 30,000 words, and 3,000 sentences; the average of 10.5 words per sentence is the same, however. Of the total number of words in the training corpus, 56% are actually ambiguous, with 953 unique ambiguous word form types to be disambiguated. In the test section, 512 unique ambiguous word form types account for 58% ambiguous data (see table 4.1 for more detail).



4.2 WSD as Classification Problem

Several problems in NLP have lent themselves to solutions using statistical language processing techniques. Many of these problems can be viewed as classification tasks in which linguistic classes have to be predicted given a (linguistic) context. In the case of WSD, instances of a particular word with more than one sense have to be attributed the correct sense or class.

4.2.1 Maximum Entropy Classification

The statistical classifier used in the experiments reported in this thesis is a maximum entropy classifier (Berger et al., 1996; Ratnaparkhi, 1997). Maximum entropy is a general technique for estimating probability distributions from data. A probability distribution is derived from a set of events based on the computable qualities (characteristics) of these events. The characteristics are called features, and the events are lists of feature values.

If nothing about the data is known, estimating a probability distribution using the principle of maximum entropy involves selecting the most uniform distribution where all unseen events have equal probability. In other words, it means selecting the distribution which maximizes the entropy. If data is available, a number of features extracted from the labeled training data are used to derive a set of constraints for the model. This set of constraints characterizes the class-specific expectations for the distribution. So, while the distribution should maximize the entropy, the model should also satisfy the constraints imposed by the training data (the empirical frequencies). A maximum entropy model is thus the model with maximum entropy of all models that satisfy the set of constraints derived from the training data.

The model consists of a set of features which occur on events in the training data. Training itself amounts to finding weights for each feature using the following formula:

$$p(c \mid x) = \frac{1}{Z} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x, c)\right)$$

where the property function f_i(x, c) represents the number of times feature i is used with class c for event x, and the weights λ_i are chosen to maximize the likelihood of the training data and, at the same time, maximize the entropy of p. Z is a normalizing constant, constraining the distribution to sum to 1, and n is the total number of features.

This means that during training a weight λ_i for each feature i is computed and stored. During testing, the sum of the weights λ_i of all features i found in the test instances is computed for each class c, and the class with the highest score is chosen.
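Schematically, the test phase thus reduces to a weighted sum per class followed by normalization. In this sketch, weights stands for the λ_i estimated during training, keyed by (feature, class) pairs; the names are our own illustration, not the API of the actual implementation:

import math

def maxent_classify(features, weights, classes):
    # Sum the weights of the active features per class, then apply
    # the softmax normalization (the constant Z) to obtain p(c|x).
    scores = {c: sum(weights.get((f, c), 0.0) for f in features)
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / z for c, s in scores.items()}
    return max(probs, key=probs.get), probs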

A big advantage of maximum entropy modeling is that the features may include any information which might be useful for disambiguation. Thus, dissimilar types of information, such as various kinds of linguistic knowledge, can be combined into a single model for WSD without having to assume independence of the different features. Furthermore, good results have been produced in other areas of NLP research using maximum entropy techniques (see the references cited in Klein and Manning (2003)).

4.2.2 Smoothing with Gaussian Priors

Since statistical NLP models in general (and therefore also maximum entropy models) usually have many features and problems with sparseness (e.g. features occurring in testing not seen during training), smoothing is indispensable as a way to optimize the feature weights. In the case of the Dutch Senseval-2 data, there is little training data available for many ambiguous words, making smoothing essential.

The intuition behind Gaussian priors (Chen and Rosenfeld, 2000; Klein and Manning, 2003) is that the parameters in the maximum entropy model should not be too large (either positive or negative), especially when little data is available. The prior prevents the weights from getting too big when only a few examples have been seen. In other words: we enforce the constraint that each parameter λ_i will be distributed according to a Gaussian prior with mean µ and variance σ². This prior expectation over the distribution of parameters penalizes parameters for drifting too far from their mean value, which is µ = 0. The Gaussian probability distribution is defined according to the following formula:

$$P(\lambda_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(\lambda_i - \mu_i)^2}{2\sigma_i^2}\right)$$

Using Gaussian priors has a number of effects on the maximum entropy model. We trade off some expectation-matching for smaller parameters. Also, when multiple features can be used to explain a data point, the more common ones generally receive more weight. Last but not least, accuracy generally goes up and convergence is faster.
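In optimization terms, the prior simply subtracts a quadratic penalty from the training objective. A schematic illustration with µ = 0 follows; the actual system relies on the estimation package described in section 4.4, so this is only a sketch under that assumption:

def penalized_objective(log_likelihood, lambdas, sigma2=1000.0):
    # Gaussian prior with mean 0: the term sum(lambda_i^2)/(2*sigma^2)
    # penalizes weights for drifting away from 0.
    penalty = sum(l * l for l in lambdas) / (2.0 * sigma2)
    return log_likelihood - penalty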

Figure 4.1 illustrates the effect of varying σ² on the Gaussian distributions. An optimal σ² (corresponding to the graph associated with σ² = 3 in figure 4.1) prevents the weights from getting too large without restraining them too much. If σ² is smaller, the weights are forced closer to 0, which means that they lose discriminative power and, consequently, accuracy goes down. If σ² is bigger, the model resembles the model without smoothing, which leads to overtraining and, again, to lower accuracy. In the current experiments the Gaussian prior, more explicitly σ², was set to 1000 (based on preliminary experiments), which led to an overall increase in accuracy (i.e. taking the mean of the results for all classifiers built).

[Figure 4.1: Example of Gaussian distributions with varying σ² (σ² = 20, σ² = 3, σ² = 1), plotting probability against weight.]

4.3 Building Individual Classifiers

During pre-processing, the corpus is lemmatized (see chapter 5) and PoS tagged (see chapter 6). On the basis of the lemmatized and PoS tagged corpus, all instances of the occurrence of each ambiguous word form4 are extracted from the corpus.5 These instances are then transformed into feature vectors including the features specified in the model.

So a feature vector of sentence (1), containing the ambiguous word form aarde 'earth/soil',6 corresponding to the model which comprises the PoS of aarde and context words, would look like example (2):

4 A word form is "ambiguous" if it has two or more different senses/classes in the training data. The sense [=] is seen as marking the basic sense of a word and is therefore also taken into account.

5 Suárez and Palomar (2002) have shown (for Spanish) that using a single feature model for all ambiguous words is less effective than building classifiers with different features per ambiguous word.

6 This sentence also illustrates the fact that forced sense selection can be arbitrary. Most people would argue that a third sense of aarde is involved in this example, namely a sense denoting the ground. In our sense inventory, however, only two senses are listed: [aarde planeet] (earth planet) and [aarde potgrond] (earth soil).




(1) Een oorverdovende donderslag deed de aarde beven.
    (A deafening thunderclap made the earth tremble.)

(2) aarde N donderslag deed de beven . = aarde planeet

where the first slot represents the lemma of the ambiguous word, the second the PoS, the third to eighth slots are the context words (left before right), and the last slot represents the class. Only context words within the same sentence as the ambiguous word form were taken into account. If there was e.g. a sentence boundary in the right context (as in this example), it was filled with "empty" features (=). Varying the linguistic information included, different feature sets are constructed.
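Extracting such a vector from a pre-processed sentence can be sketched as follows; the padding with "=" mirrors the treatment of sentence boundaries described above, while the function name and the per-column input lists are our own simplification:

def feature_vector(words, lemmas, pos_tags, senses, i, window=3):
    # Build [lemma, PoS, left context, right context, class] for the
    # ambiguous token at index i, staying within the sentence.
    left = words[max(0, i - window):i]
    right = words[i + 1:i + 1 + window]
    left = ["="] * (window - len(left)) + left       # pad at boundary
    right = right + ["="] * (window - len(right))
    return [lemmas[i], pos_tags[i]] + left + right + [senses[i]]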

The basic classifiers to which subsequent, more complex classifiers will be compared include either only context words or context words together with the lemma of the word to be disambiguated. On the basis of the different feature sets, separate classifiers are built for every ambiguous word form. This implies that the basis for grouping occurrences of particular ambiguous words together in one classifier is that their word form is the same. In total, there were 953 unique ambiguous word form types and therefore 953 classifiers were built.

The context size was kept to three words to the left and to the right of the ambiguous word. Experiments reported and evaluated in detail in section 4.6 have shown a context size of ±3 context words, i.e. three words to the left and three words to the right of the ambiguous word, to achieve the best and most stable results with varying feature models.

With regard to context, four variants were possible: context words could either be word forms or lemmas, on the one hand, and they could either be treated as a "bag of words" or ordered, on the other hand. In the bag of words case, the position of a context word relative to the ambiguous word form is not taken into account, whereas in the ordered case it is. Let us explain the difference with an example. In table 4.2, the context features from sentence (1) are shown for both cases, the bag of words case ("bow") and the case where the relative ordering is taken into account ("order").

As is illustrated in table 4.2, when the relative order is taken into account, each context feature is numbered according to its position in the feature vector, and the weights will be computed separately for all context words occurring in a particular position relative to the ambiguous word.




bow (feature value#position)   order (feature value#position)
donderslag#2                   donderslag#2
deed#2                         deed#3
de#2                           de#4
beven#2                        beven#5
.#2                            .#6

Table 4.2: Comparison of context features with bag of words ("bow") and with relative ordering ("order").

When the bag of words approach is applied, all context words get the same position number and will therefore be treated independently of their position in the sentence with regard to the ambiguous word.

This approach was chosen to check our intuition regarding the data sparseness problem: if the context features are all treated with respect to their position relative to the ambiguous word in the sentence, the model will have many more features to assign weights to. There are six separate context features in this case. This means that the sparse data problem will be worse, but at the same time the information contained in the features is more specific. If, on the other hand, context features are "lumped" together independently of their initial position, there are fewer features to be estimated and there is more data for the single feature "context". This also means, though, that the feature is more general. We include the results for all four variants and evaluate them in section 4.6.
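The two encodings differ only in the position suffix attached to each context feature. A minimal sketch reproducing the feature format of table 4.2 (the "value#position" convention is from the text, the function name is ours):

def context_features(context, ordered=True, start=2):
    # 'order': each context word keeps its own position number;
    # 'bow': all context words share the same position number.
    if ordered:
        return [f"{w}#{start + k}" for k, w in enumerate(context)]
    return [f"{w}#{start}" for w in context]

For the context of sentence (1), the ordered variant yields donderslag#2 through .#6, while the bag of words variant assigns #2 throughout, exactly as in table 4.2.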

In the experiments presented here, no frequency threshold on the number of training instances of a given ambiguous word was used. As will be shown in section 4.6, building classifiers even for word forms with very few training instances yields better results than applying a frequency threshold and using the baseline count (assigning the most frequent sense) for word forms with a number of training instances below the threshold.

4.4 Implementation

We will now show step by step how our maximum entropy WSD system works, starting from the original corpus and ending with the classifier output. The original corpus input (described in detail in section 4.1) contains the words and their annotated sense as well as sentence boundaries expressed by <utt>. Example (3) shows one sentence from the original Dutch Senseval-2 corpus.

(3)

een/een lidwoord

oorverdovende/oorverdovend

donderslag/=

deed/doen maken dat

de/=

aarde/aarde planeet

beven/=

./=

<utt>

We applied several pre-processing steps to this original corpus, namely lemmatization (see chapter 5), PoS tagging (see chapter 6) and parsing for dependency triples (see chapter 7). Also, in the original corpus punctuation tokens are not always split, but are sometimes regarded as one token (e.g. a question mark followed by closing quotation marks ?''). During pre-processing we separated all such tokens, regarding each of them as a separate punctuation mark. The pre-processed version of the corpus can be seen in example (4).

(4)

een een Art = det een lidwoord

oorverdovende oorverdovend Adj = mod oorverdovend

donderslag donderslag N mod/det su =

deed doen V su/vc/obj1 = doen maken dat

de de Art = det =

aarde aarde N det su/obj1 aarde planeet

beven beven V su vc =

. . Punc = = =

<s> <s> <s> <s> <s> <s>

The first column contains the word form, the second its lemma, the third its PoS, the fourth and fifth columns are the dependency relations with the ambiguous word form as a head and as a dependent respectively, and the last column contains the sense (or class) of the word.

As we have explained above in section 4.3, one classifier for every ambiguous word is built during training on the basis of feature vectors. The first step is then to find all ambiguous words, i.e. all words that occur more than once in the entire corpus and have at least two different classes associated with their instances. For each of these ambiguous words, all feature vectors with the features specified by the model are then extracted from the corpus seen in example (4). For the following steps, we will assume that we include the lemma and PoS of the ambiguous word as well as a context of ±3 words as features.

Let us take the ambiguous word form aarde 'earth/soil', which has two senses, namely [aarde planeet] (earth planet) and [aarde potgrond] (earth soil), as an example. First, all occurrences of aarde in the entire corpus are retrieved. Then the relevant features are extracted and saved in feature vector format as input for the maximum entropy classification algorithm (see example (5)).

(5) aarde N donderslag deed de beven . = aarde planeet

Once all feature vectors have been prepared as input for the classification algorithm, disambiguation takes place. In a first step, the maximum entropy disambiguation model is built using the package described in Malouf (2002). This implementation is based on PETSc (the "Portable, Extensible Toolkit for Scientific Computation"), a software library designed to facilitate the development of programs which solve large systems of partial differential equations (Balay et al., 1997, 2002).7 Furthermore, Malouf used TAO (the "Toolkit for Advanced Optimization"), a library layered on top of the foundation of PETSc for solving non-linear optimization problems (Benson et al., 2002), to implement the limited memory variable metric method which is used to achieve classification.

As input for this maximum entropy classification algorithm during training, we need to build features of a certain format. Each feature can have a number of different values associated with it. For example, the feature "PoS" can have the values N, V, Adj, etc. Also, a given feature-value pair can occur with more than one class. The classification algorithm calculates parameters (or weights) associated with each feature-value-class triple based on the events they occur in. For each training instance, these events include information on the feature-value pairs found with the correct class of that instance. This means that the training instance represents the relation between the feature-value pairs and a given class. All training instances together correspond to the empirical distribution of the data. So, the formula presented in section 4.2.1 and repeated here

$$p(c \mid x) = \frac{1}{Z} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x, c)\right)$$

models the relation between the distribution p(c|x) and the property function f_i(x, c). The weights λ_i are used to approximate the distribution with the counts of the present feature-value-class triples f_i(x, c).

7 See the PETSc homepage for more information: http://www.mcs.anl.gov/petsc.



At the same time, events include the feature-value pairs which are present in the current training instance and also occur with the other classes associated with a particular ambiguous word. In this way, negative data is also taken into account when the parameters are computed. Example (6) shows the features included in the event built for feature vector (5), with the normalized weights bound to each feature-value-class triple and computed by the maximum entropy model during training.

(6)

aarde#0#aarde planeet -0.547343

aarde#0#aarde potgrond 0.547343

N#1#aarde planeet -0.547343

N#1#aarde potgrond 0.547343

donderslag#2#aarde planeet 0.459745

deed#3#aarde planeet 0.459745

de#4#aarde planeet -0.237603

de#4#aarde potgrond 0.237603

beven#5#aarde planeet 0.459745

.#9#aarde planeet 0.459745

As we can see in example (6), this event also includes weights for three feature-value pairs which occur with the incorrect class [aarde potgrond]. The positive weights associated with these features express a strong preference for this class when these features are present in the test data. In contrast, negative weights are associated with a different class occurring with the same features. The two topmost feature-value-class triples in example (6), including the lemma of aarde, reflect the prior distribution of the two senses of aarde: [aarde potgrond] = 0.62 vs. [aarde planeet] = 0.38.8

During the actual classification or test phase, the data is pre-processed as explained above and all test instances are converted to feature vectors. On the basis of the features present in a test instance and their corresponding weights computed during training, a score for each class is computed, and the class with the highest overall score is assigned to the test data as the final answer of the classifier.

4.5 Tuning versus Testing

It has by now become common knowledge and practice for statistical corpus-based methods that only the final system should be tested on the real test data, and that for purposes of setting parameters (which is what the inclusion or exclusion and combination of linguistic information can be seen to be), tuning data should be used.

8 Experiments explicitly including the class prior lead to similar, not significantly different results.



The final test run is then performed on the held-out test data (see e.g. Manning and Schütze (1999)). Since the performance of corpus-based methods is generally biased by the data, it is important, in order to minimize this bias, to proceed by cross-validation. This allows us to test the real error of the classifiers and to assess the significance of the results obtained.

The basic strategy behind the system presented in this thesis is to test various linguistic sources of information for their usefulness for WSD. Since this approach leads to quite a large number of different feature models that need to be investigated, we prefer to initially test our system on the training data only. For this purpose, and to test the real error of the classifiers built, we used a leave-one-out approach (Weiss and Kulikowski, 1991; Manning and Schütze, 1999) on the training data. This means that every data item in turn is selected once as a test item and the classifier is trained on all remaining items. The accuracy of a single classifier is then the number of data items correctly predicted. The overall accuracy is the total of data items correctly predicted by all classifiers.
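Leave-one-out evaluation of a single classifier can be sketched as follows, where train and classify stand for any supervised learner (both names are placeholders, not the actual implementation):

def leave_one_out_accuracy(instances, train, classify):
    # Each instance is held out once; the classifier is trained on
    # all remaining items and tested on the held-out one.
    correct = 0
    for i, (features, gold) in enumerate(instances):
        model = train(instances[:i] + instances[i + 1:])
        if classify(model, features) == gold:
            correct += 1
    return correct / len(instances)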

In a leave-one-out setup (unlike ten-fold cross-validation traditionally used within the ML community), we do not obtain an interval for accuracy scores, but we are able to assess for each pair of experiments whether the results are significantly different or not. We opted for this approach because many ambiguous words had very few training instances, which means that using cross-validation would have forced us to apply a frequency threshold (see section 4.6 for evidence and arguments against a threshold). To test our results for significance, we applied a paired sign test (Buijs, 1997; Freund, 2004). Observe that, since we are using a paired test, it is not assumed that the two experiments are independent of each other. In fact, they should be related to each other so that they create pairs of data points, such as the measurements on two matched people in a case/control study, or before- and after-treatment measurements on the same person. Since the (test) instances of each experiment are also dependent (being based on the same training data), we apply a sign test, a simplified form of the Wilcoxon rank test.

Given n pairs of data, the sign test tests the null hypothesis H0 that the median of the differences in the pairs is zero. If the null hypothesis is true, there is no significant difference between the models; if, on the other hand, the alternative hypothesis is true, a significant effect is observed. In more detail, each comparison between model A and model B on a given instance is a sample with the possible values '+' (model A is better) or '–' (model B is better). All samples are drawn according to a binomial distribution. The instances on which the models perform the same are not taken into account (these are invalid values in the binomial distribution). If the null hypothesis is true, then the chance of finding a '+' or a '–' should therefore be approximately 0.5. The underlying binomial distribution has parameters n and π = 0.5. We can then compare the mean number of '+' against nπ. If the total number of '+' lies outside the value nπ + 1.65σ (1.65 being the z-value with α = 0.05), there is a significant difference between the two models at a confidence level of 95%.

Let us assume that there are two models A and B which are both tested on 55,000 instances (approximately the number of ambiguous tokens in our data). For 54,500 instances both models agree, but for the remaining 500 instances model A always performs better. In this case, with π = 0.5 and σ = √(nπ(1 − π)) = √(500 · 0.5 · (1 − 0.5)) = √125, the confidence interval is nπ + 1.65σ = 250 + 1.65√125. The null hypothesis H0 is true if n ≤ nπ + 1.65σ, i.e. if 500 lies below 269. Since this is not the case, we can say that model A performs significantly differently. Note that this means that a difference of only approximately 40 instances where one classifier outperforms another yields a reliable indication of improvement (at the α = 0.05 level), even in a sample of 55,000 elements. This corresponds to an improvement in accuracy of only 0.07%.
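The test itself is straightforward to compute; a sketch assuming per-instance correctness judgments for the two models (the z-value of 1.65 corresponds to α = 0.05, as above):

import math

def sign_test(correct_a, correct_b, z=1.65):
    # Paired sign test on two lists of per-instance booleans;
    # ties (both right or both wrong) are discarded.
    plus = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    minus = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = plus + minus
    if n == 0:
        return False
    sigma = math.sqrt(n * 0.25)        # binomial with pi = 0.5
    return max(plus, minus) > 0.5 * n + z * sigma

With the numbers from the example above (plus = 500, minus = 0), the function returns True: model A performs significantly differently.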

Even if the differences in total accuracy are rather small, they can nonetheless be significant. It is important to note here that effect size (measured as the difference in accuracy) and statistical significance are not the same and should therefore not be confused. For all results, we will compare accuracy scores horizontally (if not otherwise stated) and indicate statistically significant improvements at a confidence level of 95% with †. If the differences in accuracy are not significant, no mark-up will be used.

4.6 Results and Evaluation

All results in this chapter are compared against the frequency baseline in order to assess the gain in accuracy over the baseline system. We include the accuracy of the WSD system on all words for which classifiers were built ("ambiguous") as well as the overall performance on all words, including the non-ambiguous ones ("all"). Including the results on all words makes our results comparable to other systems which use the same data, but a different number of classifiers (e.g. in connection with a frequency threshold applied). All results were obtained using the basic classifier described in section 4.3.

Table 4.3 presents a comparison between the results from the experiments using different thresholds. A threshold of 10 means that classifiers were built only for those ambiguous words with 10 or more training instances, and the baseline classifier was applied to the rest of the ambiguous words.



Data                      ambiguous           all
Threshold                 >1       ≥10        >1       ≥10
# classifiers built       953      486        953      486

baseline training data    75.64               86.15
context words             82.07†   81.75      89.79†   89.61
context lemmas            82.29†   81.96      89.92†   89.72
lemma, context words      83.32†‡  82.78      90.50†‡  90.19
lemma, context lemmas     83.43†‡  82.88      90.56†‡  90.25

Table 4.3: Comparison of results (in %) with different thresholds on the number of training instances using leave-one-out on training data with a Gaussian prior of 1000, context size ±3, context = "order"; † denotes a significant improvement over a threshold of ≥10; ‡ denotes a significant improvement over the model using only context (to be read vertically).

All other settings were kept equal, namely a context size of ±3 words to the left and to the right of the target word, taking into account the relative position of the context word ("order").

The first thing to mention is that even the basic model including context words and nothing else already performs significantly better than the baseline. Adding the lemma of the ambiguous word as an extra linguistic feature improves accuracy even more. This is the case because the lemma is usually the same for all instances of an ambiguous word and, therefore, adding it as a feature includes information on the prior sense distribution of a given word in the disambiguation process.

Furthermore, the results in table 4.3 clearly show that building classifiers for all word forms, even those with very few training instances, yields better results than applying a frequency threshold. We will continue our experiments without applying a threshold, thus building 953 classifiers for leave-one-out tuning.

The next setting we investigated was the context size that should be observed. In the literature one can find a consensus either on using a fairly small context window (±3 words, see e.g. Chodorow et al. (2000)) or a very large window (±50–100 words (Gale et al., 1992c; Yarowsky, 1992)).9 We conducted experiments with a context of ±3, ±5, and ±10 words to the left and the right of the ambiguous word.

9 A large window is usually only adopted when all context in e.g. an entire paragraph is used.



Data                    ambiguous                    all
Context size            ±3       ±5      ±10         ±3       ±5      ±10

baseline training data  75.64                        86.15
context words           82.07    82.10   81.97       89.79    89.81   89.73
context lemmas          82.29    82.32   82.19       89.92    89.93   89.85
lemma, context words    83.32†‡  82.77‡  82.38       90.50†‡  90.19‡  89.97
lemma, context lemmas   83.43†‡  82.90‡  82.50       90.56†‡  90.26‡  90.03

Table 4.4: Comparison of results (in %) with different context sizes using leave-one-out on training data with a Gaussian prior of 1000, threshold >1, context = "order"; † denotes a significant improvement over a context of ±5, ‡ denotes a significant improvement over a context of ±10.

In the corpus used, no information on paragraphs is available, which means that there is no way to know whether preceding or following sentences originally belonged to the same document and should be considered. Consequently, only sentential context is taken into account. We limited our experiments to a maximal context of ±10 because the average sentence length is 10.5 words. The results can be seen in table 4.4.

Comparing the results with different context sizes in table 4.4, we see that the best results are achieved with a context of ±3 lemmas and the lemma of the ambiguous word as features. This confirms earlier findings in the WSD literature and in human sense resolution (Choueka and Lusignan, 1985). Even though a context of ±5 words to the left and right of the ambiguous word not including the lemma seems to work slightly better than ±3 (82.10% vs. 82.07% and 82.32% vs. 82.29%), the paired sign test clearly shows that these differences are not statistically significant, whereas for the results including the lemma they are. Since the results significantly decrease with increasing context, it does not seem necessary to test an even bigger context than ±10. The average sentence length is 10.5 words per sentence in both the training and the test data set, which means that, when only taking sentential context into account, a bigger context window will generally not add (a lot of) information. All further experiments will use a context size of ±3.

Table 4.5 summarizes the results with regard to the four variants for context words (context words vs. context lemmas and bag of words ("bow") vs. relative to the position in the sentence ("order")).

We can clearly see that the results using context features distinguished for their relative position with respect to the word to be disambiguated ("order") are significantly better than those of the "bag of words" models.



Data                     ambiguous         all
Ordering of context      order     bow     order     bow

baseline training data   75.64             86.15
context words            82.07†    81.03   89.79†    89.20
context lemmas           82.29†‡   81.07   89.92†‡   89.22
lemma, context words     83.32†    81.94   90.50†    89.71
lemma, context lemmas    83.43†‡   81.93   90.56†‡   89.71

Table 4.5: Comparison of results (in %) with context words vs. context lemmas and "bow" vs. "order" using leave-one-out on training data with a Gaussian prior of 1000, threshold >1, context size ±3; † denotes a significant improvement over "bow", ‡ denotes a significant improvement over context words (to be read vertically).

It seems to be the case that the specificity of the information contained in the features related to the context of an ambiguous word is more important than data sparseness. We will therefore continue our research retaining the attribute of position relative to the ambiguous word for the context ("order").

An example of when the position of the context word relative to an ambiguous word is important is the case where an ambiguous word is preceded by a different definite article depending on its meaning. The Dutch language has a limited gender system which only manifests itself in the use of the definite articles de and het, as well as in the declension of adjectives when used with the indefinite article een. An example of such a word is bal 'ball': "de bal" has the meaning of either a ball used to play in sports ([bal spel] (ball game)) or, more generally, something round ([bal rond] (ball round)), whereas "het bal" denotes a public dance event ([bal dansfeest] (ball dance)).

So if we look at the feature vectors (7) and (8) containing bal, it becomes clear that the relative position of words in context is important for disambiguation. Comparing vectors (7) and (8), the position of the definite article immediately to the left of the ambiguous word bal (de and het respectively) is a very important clue for successful disambiguation.

(7) bal N alleen tegen de , maar ook bal rond

(8) bal N is voor het van vanavond , bal dansfeest

Feature vectors (9) and (10) show an example of how the declension of the adjective can give an indication for one sense or the other. In most cases, we find the adjective with a final -e (its declined form), as illustrated in example (9). Only when used with a het-word and an indefinite article can the undeclined form occur. Since the undeclined form (without a final -e) is possible only for the sense of "dance event" [bal dansfeest], prachtig in example (10) is a strong indication for this particular sense.

(9) bal N grote , bonte te voorschijn . bal spel

(10) bal N het een prachtig ! = = bal dansfeest

Another obvious result is that context lemmas work better, since we get less sparse context features and therefore more generalization. Comparing this finding to classifiers using a bag of words for the context features, this effect cannot be observed. Since the bag of words approach already achieves a certain level of generalization, using context lemmas does not have any further effect. On the other hand, the combination of context lemmas and position relative to the ambiguous word (the "order" approach) seems to achieve the desired balance between specific and general levels of information. In the light of these findings, we will therefore continue only with lemmas as context.10

We can thus conclude the following from our first results with the maximum entropy WSD system for Dutch described in this chapter. Maximum entropy works well as a classification algorithm for WSD: including only the context as a feature already leads to an error rate reduction of at least 26% over the frequency baseline. Also, we have shown that building classifiers for all ambiguous words works significantly better than applying a frequency threshold based on the number of training instances, that a context of ±3 words performs better than bigger context sizes, and that using context lemmas for generalization in combination with the relative position of the context to the ambiguous word achieves better accuracy than context words and/or treating the context as a bag of words. We will therefore use these settings as defaults in all subsequent experiments.

10 We are aware of the fact that phenomena such as the declension of adjectives mentioned above will be lost when context lemmas instead of context words are used. The results in tables 4.3, 4.4 and 4.5, however, are clear enough to show that the information loss with context lemmas is minor in comparison to the overall gain in accuracy.


Chapter 5

Lemma-Based Approach

In this chapter, we focus on a lemma-based approach to WSD for Dutch. So far, systems built individual classifiers for each ambiguous word form. In the system presented here, classifiers are built for each ambiguous lemma instead. Lemmatization allows for more compact and generalizable data by clustering all inflected forms of an ambiguous word together, an effect already commented on by Yarowsky (1994) in the context of WSD. The more inflection a language has, the more lemmatization will help to compress and generalize the data. For our WSD system this means that fewer classifiers have to be built, thereby pooling the training material available to the algorithm for each ambiguous word form. Accuracy is expected to increase for the lemma-based model in comparison to the word form model. The hypothesis that inflection does not substantially contribute to the disambiguation process is implicitly tested at the same time.1

The prerequisite for this approach to work is the availability of an accurate lemmatizer for Dutch. We chose lemmatization and not stemming for our WSD task because the lemma (or canonical dictionary entry form) can be used to look up an ambiguous word in a dictionary or an ontology like WordNet. This is not the case for a stem. A stemmer reduces all word forms to the same base, ideally but not always the stem or morphological root, whereas a lemmatizer returns the lemma or (dictionary) citation form associated with a given word form. Table 5.1 gives a few examples of stemming and lemmatization results, illustrating the differences between the two processes.

1 A few English WSD approaches have used plural inflection as a feature, see e.g. Bruce and Wiebe (1994) and Pedersen (1998), but without evaluating its value for WSD. Even though there are a few examples in Dutch where the singular–plural distinction can lead to ambiguity, for instance medium–media (medium–media), we expect more benefit than harm from the conflation of inflected forms for Dutch.


Word form               Stem      Lemma
schrijven (to write)    schrijf   schrijven
geschreven (written)    schrijf   schrijven
regering (government)   regeer    regering
huisje (little house)   huis      huisje
boompje (little tree)   boom      boom
afdwingen (to force)    afdwing   afdwingen
openbaar (public)       open      openbaar

Table 5.1: Examples of stemming and lemmatization of Dutch words.

However, a newly built lemmatizer could not easily be evaluated, because no stand-alone lemmatizer for Dutch was freely available,2 whereas a stemmer for Dutch exists in the form of the Dutch Porter Stemmer (Kraaij and Pohlman, 1994). We therefore chose to implement a tool which can be used as a stemmer or a lemmatizer, and evaluated it in comparison to an established stemmer, namely the Dutch Porter Stemmer.

We will first introduce the stemmer, which combines dictionary lookup (implemented efficiently as a finite state automaton) with a rule-based backup strategy, and show that it outperforms the Dutch Porter stemmer in terms of accuracy, while not being substantially slower (section 5.2). Next, we will focus on the lemmatizer and how it differs from the stemmer (section 5.3). The lemmatizer will then be used in the lemma-based approach to WSD mentioned at the beginning of this introduction and explained in more detail in section 5.4. The results of the lemma-based approach will be presented and evaluated in section 5.5.

5.1 Accurate Stemming of Dutch

Statistical classification systems compute the most likely class for an instance by computing how likely the words and n-grams in the text are for any given class. Estimating these probabilities is difficult, as texts contain many different, often infrequent, words. One way to deal with this problem is to take into consideration only words which occur at least n times in a given training corpus, as estimation is more reliable for frequently occurring words.
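
Such a frequency cut-off is straightforward to implement; the sketch below (our own illustration, with a hypothetical threshold n) keeps only the words seen at least n times in a training corpus.

    from collections import Counter

    def frequent_vocabulary(corpus_tokens, n=5):
        """Keep only the words that occur at least n times in the corpus."""
        counts = Counter(corpus_tokens)
        return {word for word, count in counts.items() if count >= n}

    # Toy corpus: 'de' survives a threshold of 2, the hapaxes do not.
    print(frequent_vocabulary(["de", "bal", "de", "rond"], n=2))  # {'de'}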

2 With the exception of the Memory-based lemmatizer MBLEM (http://ilk.uvt.nl/mblem), of which we were not aware at the time.


Stemming is another method which can be used to reduce the number of word forms that need to be taken into consideration. Stemming reduces all inflected and derivational forms of a word to the same stem. The number of different stems in a text or training corpus can therefore in general be expected to be much smaller than the number of different word forms, and the frequency of stems will be higher than that of the corresponding individual inflected forms, which in turn suggests that probabilities can be estimated more reliably.

Stemming is a well-known technique in Information Retrieval (IR) systems, where the main goal is to retrieve the documents that correspond to a given query. In classification tasks, stemming was conceived as a way of reducing morphological variants to a single (indexing) term. Experiments to determine the effectiveness of stemming have produced mixed results. One important factor is the language of the documents involved. Harman (1991) compared the performance of data stemmed with three suffix-stripping algorithms for English against unstemmed data in IR queries and came to the conclusion that stemming does not consistently improve performance. Krovetz (1993), on the other hand, concludes that accurate stemming of English does improve the performance of an IR system. His conclusion is confirmed by Monz (2003).

Popovic and Willett (1992) investigated whether stemming would have more effect for a morphologically complex language like Slovene. They found that precision of the retrieved documents was increased when suffix-stripping was used. Dutch is a language which is morphologically more complex than English, but not as complex as the Slavic languages. Kraaij and Pohlman (1996) found that both a stemmer using their adaptation of the Porter algorithm for Dutch (a well-known suffix-stripping algorithm) and a dictionary-based stemmer led to a decrease in IR performance compared to using no stemming. However, reinterpreting earlier results, Kraaij (2004) states that various stemming methods all improve IR results for Dutch.

Recently, stemming has also been applied to text classification, the task of classifying the topic or theme of a document. Just as in IR, experiments lead to mixed results. On the basis of her experiments for English text classification, Riloff (1995) concludes that “stemming algorithms may be appropriate for some terms but not for others” and that classification systems would benefit from using all available information, including morphological variants. Busemann et al. (2000), on the other hand, have shown that morphological analysis increases performance for a series of classification algorithms applied to German email classification. Spitters (2000) compares, among others, the performance of two machine learning algorithms for topic classification of Dutch newspaper articles, using both unstemmed text and text stemmed with the Dutch Porter stemmer. He concludes that stemming does not improve the performance of either algorithm.

In Gaustad and Bouma (2002), the use of stemming for the classification of Dutch email and newspaper texts was investigated. In a comparison between the Dutch Porter stemmer and a stemmer with dictionary lookup, the stemmer with dictionary lookup outperformed the rule-based suffix stripper in terms of accuracy. For text categorization, however, both accuracy and the ability to reduce related word forms to a single stem are of importance. This aspect of the performance of the two stemmers has also been addressed in the application-specific evaluation of stemming in Gaustad and Bouma (2002). The dictionary-based system not only outperforms the Porter stemmer in accuracy, but also in the number of word forms reduced to a single stem. Nevertheless, evaluation of a Bayesian text classification system with either no stemming or the Porter or dictionary-based stemmer on an email classification and a newspaper topic classification task does not lead to significant differences in accuracy. This confirms the mixed results that have been reported in the literature.

We will now introduce the newly built dictionary-based stemmer as well as the Dutch Porter stemmer, and evaluate them in terms of accuracy and speed on a corpus of manually annotated text.

5.2 Stemmers

A stemmer tries to reduce various forms of a word to a single stem. For Dutch, for instance, a stemmer might reduce various forms of the verb schrijven (to write) such as schrijf, schrijft, schrijven, schreef, schreven, and geschreven to the stem schrijf. Stemming in general requires that inflected word forms are reduced to a stem. A simple and robust method works by removing only certain inflectional suffixes and undoing the effect of certain orthographic rules (e.g. the letter -f in the coda of a Dutch word sometimes corresponds to a -v in the stem). More accurate methods, able to deal with irregular morphology (like the fact that the past tense singular form of the stem schrijf is schreef), require a dictionary.

Below, we first briefly describe the Dutch Porter stemmer (DPS) of Kraaij and Pohlman (1994). Next, we present a stemmer for Dutch with dictionary lookup and a rule-based backup strategy (SteDL). Finally, we present some experimental results which confirm that the new stemmer is linguistically more accurate than the Porter stemmer, while not being substantially slower.


5.2.1 Dutch Porter Stemmer

The Porter stemmer (Porter, 1980) is a rule-based suffix stripper which is widely used in IR systems. Porter’s algorithm implements a series of steps which each remove a certain type of suffix by way of substitution rules. These rules only apply when certain conditions hold, e.g. the resulting stem must have a certain minimal length. Kraaij and Pohlman (1994) developed a Porter stemmer for Dutch3 which uses the implementation presented in Frakes and Baeza-Yates (1992). It removes plural -en and -s suffixes, the verbal -t suffix, diminutive inflection (realized by various suffixes ending in -je), and a number of common derivational morphemes, and it undoes the effect of the spelling rule which requires consonant doubling in certain contexts.

The advantages of this simple suffix stripper are that it is very robust and fairly easy to implement. It is also clear, however, that it will often produce the wrong stem for a word. The derivational suffix -ing, for example, can be used to form nouns from verbal stems (e.g. regering (government) from regeer (govern)). However, simply stripping the suffix -ing from all matching words also produces the nonsense form ond for the noun onding (gimcrack). Such mistakes need not be fatal: as long as words are reduced to a unique stem form, no information is lost. Potentially harmful mistakes (known as overstemming) occur when a word is reduced to a semantically unrelated stem. For instance, the noun gulden (guilder) ends in -den, which is a past tense plural suffix, and is therefore reduced to gul, an adjective meaning generous.
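
How such substitution rules work, and how overstemming arises, can be made explicit with a toy fragment (our own simplification in Python; these are not the actual DPS rules): each rule strips or replaces a suffix, guarded by a minimal-stem-length condition.

    # Illustrative only: toy suffix-stripping rules in the style of a
    # Porter stemmer for Dutch. The real rule set is far more elaborate.
    RULES = [
        ("heden", "heid"),   # e.g. mogelijkheden -> mogelijkheid
        ("den", ""),         # past tense plural suffix
        ("en", ""),          # plural / infinitive -en
        ("s", ""),           # plural -s
    ]

    def toy_suffix_strip(word, min_stem=3):
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                stem = word[: -len(suffix)] + replacement
                if len(stem) >= min_stem:   # stem must stay long enough
                    return stem
        return word

    print(toy_suffix_strip("boeken"))  # boek
    print(toy_suffix_strip("gulden"))  # gul -- the overstemming error above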

Another weak spot of the algorithm is that it has no way to handle irregular forms. Dutch has a large number of so-called strong verbs, whose past and participle forms have root vowels which differ from the one in the present tense root (e.g. present tense nemen (to take) has the past tense namen and the participle genomen). Such forms will not be reduced to the same stem, a mistake known as understemming. The most frequent verb forms tend to be strong, and thus they are an important source of understemming. A rigorous evaluation of the Dutch Porter stemmer, reporting overstemming, understemming, and IR performance, can be found in Kraaij and Pohlman (1995).

5.2.2 Stemmer with Dictionary Lookup

It is obvious that including dictionary information will have a positive effect on the accuracy of a stemmer. The linguistically correct form will be provided more often, which might be useful for some applications, and also the percentage of overstemming and understemming errors is likely to go down. The latter should have a positive effect in applications like IR or text classification.

3 Available at http://www-uilots.let.uu.nl/~uplift/.

In order to test whether a linguistically more accurate stemmer would perform better than a suffix stripper, we used various existing resources to develop an alternative stemmer with dictionary lookup.

Dictionary information is obtained from Celex (Baayen et al., 1993). Celex contains 381,292 word forms and 124,136 stems for Dutch. Furthermore, it contains information about the frequency of word forms. This information is useful for disambiguation: in those cases where a word form is listed with two different stems, the most frequent stem can be chosen. In an initialization step, information about word forms, their respective stems as well as their frequency is extracted from the database.

Dictionary lookup can be time consuming, especially for large dictionaries such as Celex. The extracted lexical information is therefore stored as a finite state automaton, using Daciuk’s (2000) finite state automata (FSA) morphology tools.4 Given a word form, the compiled automaton provides the corresponding stems in time linear in the length of the input word and independent of the size of the dictionary. Moreover, the FSA is more compact than e.g. a trie data structure: in a trie, common prefixes of words are merged, whereas in the FSA representation both common prefixes and suffixes are merged (for more details see Daciuk and van Noord (2004) and the references cited there). As a backup strategy for words which are not found in the dictionary, we use the Dutch Porter stemmer (DPS) described above.

The actual stemming procedure is shown in figure 5.1. The FSA encoding of the information in Celex assigns every word form all its possible stems. For ambiguous forms, the most frequent stem is chosen. All words that are not found in Celex are processed with DPS.
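
Schematically, this procedure amounts to the following sketch (a Python dict stands in for the compiled FSA, porter_stem for the DPS backup; the entries and frequencies are invented toy values):

    # Toy stand-in for the compiled Celex FSA: word form -> [(stem, freq)].
    CELEX = {
        "schreven": [("schrijf", 412)],                  # invented frequencies
        "kussen":   [("kus", 310), ("kussen", 25)],
    }

    def porter_stem(word):
        """Crude placeholder for the rule-based backup stemmer (DPS)."""
        return word[:-2] if word.endswith("en") else word

    def stem(word):
        entries = CELEX.get(word)
        if entries is None:                          # not in the dictionary:
            return porter_stem(word)                 # fall back to suffix stripping
        return max(entries, key=lambda e: e[1])[0]   # most frequent stem wins

    print(stem("schreven"))  # schrijf  (irregular form handled via the lexicon)
    print(stem("kussen"))    # kus      (ambiguous form: frequency decides)
    print(stem("wandelen"))  # wandel   (unknown to the toy lexicon: DPS backup)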

5.2.3 Stand-Alone Evaluation

In order to be able to assess the contribution of stemming to text classification, it is crucial to first compare the performance of DPS and SteDL independently of a specific application. For this purpose, we manually prepared a corpus with a stemmed gold standard. The corpus consisted of texts from Dutch children’s books and contained ca. 45,000 words.5

Stemming accuracy on the test corpus was 98.23% for SteDL and 79.23% for DPS.

4 Available at http://www.pg.gda.pl/~jandac/fsa.html.
5 This corpus consists of the first half of the training data from the Dutch Senseval-2 WSD task.


[Figure 5.1 depicts the stemming pipeline: the dataset is passed to the FSA dictionary lookup built from Celex; the resulting candidate stems are disambiguated using their frequencies; words not found in Celex are sent to DPS as a backup strategy; the output is the stemmed data.]

Figure 5.1: Diagram of the alternative stemmer with dictionary lookup (SteDL).

Stemmer                      Accuracy
DPS                          79.23%
SteDL (no frequency info)    96.27%
SteDL                        98.23%

Table 5.2: Accuracy of the Dutch Porter stemmer and the dictionary-based stemmer for Dutch on a 45,000 word evaluation corpus.

One has to bear in mind, however, that DPS also strips derivational suffixes, whereas in the gold standard these were retained. We estimate that approximately 4-5% of the difference in accuracy is due to the removal of derivational suffixes. The remaining 10% of the difference is mainly due to the fact that the stemmer with dictionary lookup correctly stems irregular verb forms (e.g. auxiliaries, modals), which are very frequent, whereas DPS does not. Even when taking these differences into account, the dictionary-based stemmer still clearly outperforms the rule-based stemmer in terms of accuracy.

The contributions of various components of SteDL can also be evaluated. DPS was used as a backup strategy in only 2.98% (1,339 out of 44,905) of the cases, which means that the lexical coverage of Celex for the evaluation corpus is fairly good. We also compared SteDL with a version of the system where a random stem instead of the most frequent one was chosen. This led to a stemming accuracy of 96.27%. Thus, including frequency information reduces the error rate by approximately 50% (from 3.73% to 1.77%). Table 5.2 summarizes the comparison regarding accuracy.


Finally, SteDL is not substantially slower than DPS. After building the dictionary FSA (which only needs to be done once during initialization, takes 59 seconds and results in an FSA of 2.44Mb), the stemming process on the 45,000 word evaluation corpus takes 14 seconds with SteDL, whereas it takes 5 seconds with DPS (on an XP900 workstation). The dictionary lookup FSA itself is very fast (0.5 seconds), but the scripts making up the complete system have not been optimized for speed, which explains the difference in time. An obvious improvement would be to integrate dictionary lookup and disambiguation into one FSA.

5.3 Dictionary-Based Lemmatizer for Dutch

Just like the statistical classification systems discussed in section 5.1, our WSD system determines the most likely class for a given instance by computing how likely the words or linguistic features in the instance are for any given class. Estimating these probabilities is difficult, as corpora contain many different, often infrequent, words. Comparable to stemming, lemmatization is a method that can be used to reduce the number of word forms that need to be taken into consideration, as estimation is more reliable for frequently occurring data. In the context of WSD, we opted for lemmatization and not stemming because the lemma can be used to look up an ambiguous word in a dictionary or an ontology like WordNet, a clear advantage if this kind of information is to be integrated into the WSD system.

Lemmatization reduces all inflected forms of a word to the same lemma. The number of different lemmas in a training corpus will therefore in general be much smaller than the number of different word forms, and the frequency of lemmas will be higher than that of the corresponding individual inflected forms, which in turn suggests that probabilities can be estimated more reliably.

For the WSD experiments in this chapter (and, in general, in this thesis), we used a lemmatizer for Dutch with dictionary lookup. Again, Celex provides the dictionary information, which also contains the PoS associated with each lemma. In contrast to the stemmer (where the frequency of word forms is used for disambiguation), the lemmatizer exploits PoS information for disambiguation: in those cases where a particular word form has two (or more) possible corresponding lemmas, the one matching the PoS of the word form is chosen. Thus, in a first step, information about word forms, their respective lemmas and their PoS is extracted from the database.

Just as with the stemmer, fast lookup and a compact representation are guaranteed by storing the information extracted from the dictionary as an FSA (Daciuk, 2000). During the actual lemmatization procedure, the FSA encoding of the information in Celex assigns every word form all its possible lemmas. For ambiguous word forms, the lemma with the same PoS as the word form in question is chosen. All word forms that are not found in Celex are processed with a morphological guessing automaton.6 The key features of the lemmatizer employed are that it is fast (12 seconds for the lemmatization of our data), compact (1.4Mb) and accurate.
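
The PoS-driven disambiguation step can be sketched as follows (a toy Python stand-in for the Celex lookup; guess_lemma abbreviates the morphological guessing automaton, and the lexicon is reduced to a single illustrative entry):

    LEXICON = {  # word form -> [(lemma, PoS), ...]
        "boog": [("buigen", "V"), ("boog", "N")],
    }

    def guess_lemma(word):
        """Crude placeholder for the FSA-based guessing automaton."""
        return word

    def lemmatize(word, pos):
        candidates = LEXICON.get(word)
        if candidates is None:
            return guess_lemma(word)        # unknown word: guess
        for lemma, lemma_pos in candidates:
            if lemma_pos == pos:
                return lemma                # lemma whose PoS matches the tag
        return candidates[0][0]             # no PoS match: default to first

    print(lemmatize("boog", "V"))  # buigen (past tense of 'to bend')
    print(lemmatize("boog", "N"))  # boog   (the noun 'arch')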

5.4 Introducing the Lemma-Based Approach

As we mentioned in section 5.3, lemmatization collapses all inflected forms of a given word to the same lemma. In our system, separate classifiers are built for every ambiguous item. Normally, this implies that occurrences of a particular ambiguous word are grouped together on the basis of an identical word form. Alternatively, we opted for a model that constructs classifiers based on lemmas, thereby reducing the number of classifiers that need to be built.

As has already been noted by Yarowsky (1994), using lemmas helps to produce more concise and generic evidence than inflected forms. Building classifiers based on lemmas therefore increases the data available to each classifier. We exploit the advantage of clustering all instances of e.g. one verb in a single classifier instead of several classifiers (one for each inflected form found in the data). In this way, more training data per ambiguous word is available to each classifier. The expectation is that this should increase the accuracy of our maximum entropy WSD system in comparison to the word form-based model.

Figure 5.2 shows how the system works. During training, every word form is first checked for ambiguity, i.e. whether it has more than one sense associated with all its occurrences. If the word form is ambiguous, the number of lemmas associated with it is looked up. If the word form has one lemma, all occurrences of this lemma in the training data are used to build the classifier for that particular word form (and for others with the same lemma). If a word form has more than one lemma, a classifier based on the word form is built. This strategy was decided on in order to be able to treat all ambiguous words, notwithstanding lemmatization errors or word forms that can genuinely be assigned two or more lemmas.
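
The branching just described can be summarized in a few lines (a schematic Python rendering of figure 5.2, not the actual implementation; the data structures and the sense labels "s1"/"s2" are our assumptions):

    def classifier_key(word_form, senses, lemmas):
        """Decide which classifier (if any) to train for a word form.

        senses: all senses seen with the word form; lemmas: all its lemmas.
        """
        if len(senses) == 1:
            return None                            # unambiguous: no classifier
        if len(lemmas) == 1:
            return ("lemma", next(iter(lemmas)))   # pool all inflections
        return ("wordform", word_form)             # several lemmas (e.g. boog)

    print(classifier_key("geschreven", {"s1", "s2"}, {"schrijven"}))
    # ('lemma', 'schrijven') -- trained on every inflected form of 'schrijven'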

An example of a word that has two different lemmas depending on the context is boog: it can either be the past tense of the verb buigen (to bend) or the noun boog (arch).

6 Available from Daciuk’s (2000) FSA morphology tools, http://www.pg.gda.pl/~jandac/fsa.html.


[Figure 5.2 depicts the decision flow: a word form with one sense is non-ambiguous and needs no classifier; an ambiguous word form with one lemma is assigned its sense by the lemma model, while an ambiguous word form with several lemmas is handled by the word form model.]

Figure 5.2: Schematic overview of the lemma-based approach building our WSD system for Dutch.

Since the Dutch Senseval-2 data is not only ambiguous with regard to meaning but also with regard to PoS, both lemmas are subsumed in the word form classifier for boog.

During testing, we check for each word whether there is a classifier available for either its word form or its lemma (depending on which was built during training, as explained above) and apply that classifier to the test instance.

It is important to note here that we used the test section of the Senseval-2 data, not the leave-one-out method described in section 4.5, to test the lemma-based approach. The main reason for this is that we use all instances containing a given lemma for training, including those items where the corresponding word form is not ambiguous. If we applied the leave-one-out method on this data for testing, we would evaluate on instances that are not contained in the word form based classifiers. Therefore, accuracies between the two approaches could not easily be compared.

The features we used in the experiments reported here include information on the word form or its lemma, its PoS, context lemmas to the left and right as well as the context PoS, and its sense or class.

5.5 Results and Evaluation

To be able to evaluate the results of the lemma-based approach, we also include results based on word form classifiers. During training with word form classifiers, 953 separate classifiers were built. With the lemma-based approach, 669 classifiers were built in total during training, 372 based on the lemma of an ambiguous word (subsuming 656 word forms) and 297 based on the word form.


                                        lemma-based   word forms
Training   # classifiers built                 669          953
             based on word forms               297          953
             based on lemmas                   372           na
           # word forms subsumed               656           na
Testing    # unique ambiguous word forms       512          512
           # classifiers used                  307          410
             based on word forms               237          410
             based on lemmas                    70           na
           # word forms subsumed               208           na
           # word forms seen 1st time           74          102

Table 5.3: Overview of classifiers built and used with the lemma-based approach and with word forms as basis.

                            ambiguous      all
baseline test data             78.47      89.44
word form classifiers          83.66      92.37
lemma-based classifiers        84.15†     92.45†

Table 5.4: Results (in %) on the test section of the Dutch Senseval-2 data with the lemma-based approach compared to classifiers based on word forms; † denotes a significant improvement over the word form classifiers (to be read vertically).

A total of 512 unique ambiguous word forms were found in the test data. 445 of these were classified using the lemma-based classifiers built from the training data, whereas 410 could be classified using the word form model (see table 5.3 for an overview).

We include the accuracy of the WSD system on all words for which classifiers were built (“ambiguous”) as well as the overall performance on all words (“all”), including the non-ambiguous ones. This makes our results comparable to other systems which use the same data, but a different number of classifiers (e.g. in connection with a frequency threshold applied). The baseline has been computed by always choosing the most frequent sense of a given word form in the test data.

The results in table 5.4 show the average accuracy for the two different approaches. The accuracy of both approaches improves significantly over the baseline (when applying a paired sign test with a confidence level of 95%, as explained in section 4.5).


                            ambiguous   # classifiers
baseline                       76.77         192
word form classifiers          78.66         192
lemma-based classifiers        80.39†         70

Table 5.5: Comparison of results (in %) for the lemma-based and word form-based approaches, for words with different models only; † denotes a significant improvement over the word form classifiers (to be read vertically).

This demonstrates that the general idea of the system, to combine linguistic features with statistical classification, works well. Focusing on a comparison of the two approaches, we can clearly see that the lemma-based approach works modestly but significantly better than the word form only model, thereby verifying our hypothesis.

Another advantage of the proposed approach, besides increasing classification accuracy, is that fewer classifiers needed to be built during training, so that the WSD system based on lemmas is smaller. In an online application, this might be an important aspect of the speed and the size of the application. It should be noted here that the degree of generalization through lemmatization strongly depends on the data: only inflected word forms occurring in the corpus are subsumed in the corresponding lemma classifier. The more varied inflected forms the training corpus contains, the better the “compression rate” in the WSD model. Added robustness is a further asset of our system: more word forms could be classified with the lemma-based approach than with the word form-based one (445 vs. 410).

In order to better assess the real gain in accuracy from the lemma-based model, we also evaluated a sub-part of the results for the lemma-based and the word form-based model, namely the accuracy for those word forms which were classified based on their lemma in the former approach, but based on their word form in the latter. The comparison in table 5.5 clearly shows that there is much to be gained from lemmatization. The fact that inflected word forms are subsumed in lemma classifiers leads to an error rate reduction of 8% (from 21.34% to 19.61% errors) and a system with fewer than half as many classifiers.

In table 5.6, we see a comparison with another WSD system for Dutch which uses Memory-Based Learning (MBL) in combination with local context (Hendrickx et al., 2002). This system has an architecture similar to ours in that it is an ensemble of word experts (each specialized in the disambiguation of one particular ambiguous word form). The features Hendrickx et al. employed also consist of three words to the left and to the right of the ambiguous word, as well as their PoS tags.


                             ambiguous    all
baseline test data              78.5     89.4
word form classifiers           83.7     92.4
lemma-based classifiers         84.1     92.5
(Hendrickx et al., 2002)        84.0     92.5

Table 5.6: Comparison of results (in %) from different systems on the test section of the Dutch Senseval-2 data.

A big difference with the system presented in this thesis is that extensive parameter optimization for the classifier of each ambiguous word form was conducted for the MBL approach. Also, a frequency threshold of minimally 10 training instances was applied, using the baseline classifier for all words below that threshold.

As we can see, our lemma-based WSD system scores the same as the Memory-Based WSD system, without extensive “per classifier” parameter optimization. According to Daelemans and Hoste (2002), different machine learning results should only be compared once all parameters have been optimized for all classifiers. This is not the case in our system, and yet it achieves the same accuracy as an optimized model. Optimization of parameters for each ambiguous word form and lemma classifier might increase our results even further, but since we only expect minor improvements we have not pursued this issue.

Summarizing our findings on a lemma-based approach to WSD for Dutch, we can conclude that this novel approach exploits as its key feature the more concise and more generalizable information contained in lemmas: classifiers for individual ambiguous words are built on the basis of their lemmas, instead of word forms as has traditionally been done. Therefore, more training material is available to each classifier, and the resulting WSD system is smaller and more robust.

The lemma-based approach has been tested on the Dutch Senseval-2 test data set and resulted in a modest but statistically significant improvement in accuracy over the system using the traditional word form based approach. In comparison to earlier results with a Memory-Based WSD system, the lemma-based approach performs the same, without parameter optimization.


Chapter 6

Impact of Part-of-Speech Information

After having explored the benefit of using morphological information for WSD of Dutch, we will now proceed to integrate part-of-speech (PoS) information into the WSD system explained in chapter 4. The Dutch Senseval-2 WSD data is not only ambiguous with regard to meaning, but also with regard to PoS. This means that accurate PoS information is important, since the WSD system is supposed to perform morpho-syntactic as well as semantic disambiguation. First, we investigate which of three different PoS taggers performs best in our system in an application-oriented evaluation. Following the hypothesis that the quality of the input is likely to influence the final results of a complex system, we test whether the more accurate taggers also produce better results when integrated into the WSD system. For this purpose, a stand-alone evaluation of the PoS taggers is used first to assess which tagger is the most accurate.

There are two possible outcomes to the experiment of integrating PoS information in our WSD system. Either PoS does not improve the performance of the algorithm, because the information added by the PoS tags is already implicitly contained in other features (or because it is irrelevant), or PoS does help disambiguation, since PoS disambiguation is part of the initial problem in the case of the Dutch Senseval-2 data. The results of the WSD task including the PoS information from all three taggers show that including PoS information in the WSD system improves accuracy and that the most accurate PoS tags indeed lead to the best results, thereby verifying our hypothesis.

In the second part of the chapter, not only the PoS of the ambiguous word form, but also the PoS of the context is added. This allows us to test the disambiguation value of PoS information on a greater scale and in a novel way. The results show that accurate PoS information is beneficial for WSD and that including the PoS of the ambiguous word itself as well as the PoS of the context increases disambiguation accuracy over the system presented in chapter 4.

6.1 Application-Oriented Evaluation of Three PoS Taggers

Certain NLP tools are typically used as a sub-component or a pre-processor in a more complex system, rather than as a complete application in their own right. PoS taggers are a typical example of such tools. What is usually not taken into account is the fact that the quality (in terms of accuracy) of each sub-part of a complex system is likely to influence the final results considerably. Lately, standardized evaluation of NLP resources has gained more importance in the field of computational linguistics (e.g. the CLEF workshops in information retrieval, Parseval, Senseval), but a tendency towards more application-oriented evaluation is only beginning to emerge.

In the first part of the chapter, we will proceed to an application-oriented comparison of three PoS taggers in a word sense disambiguation (WSD) system. We will evaluate to what extent differences in stand-alone PoS accuracy influence the results obtained in the complex WSD system using the acquired PoS information. Since the Dutch data we use is not only ambiguous with regard to meaning but also with regard to PoS, accurate PoS information is potentially very important to achieve high disambiguation accuracy.

We will start with a detailed description and comparison of the three PoS taggers, including a stand-alone evaluation, in order to compare their performance independently of the application to the WSD task. Then follows a short description of how the output of the different PoS taggers is incorporated into the maximum entropy WSD system for Dutch explained earlier. Next, the application-dependent results of the three PoS taggers will be presented and discussed.

6.2 Comparison of PoS Taggers

The PoS taggers we compare are:

• a Hidden Markov Model tagger (section 6.2.1),

• a Memory-Based tagger (section 6.2.2),

• a transformation-based tagger (section 6.2.3).


We chose these three taggers because they were readily available, could easily be trained for Dutch without major changes in the architecture, and represent distinct, widely used types of existing PoS taggers.

All three taggers were trained on the Dutch Eindhoven corpus (uit den Boogaart, 1975) using the WOTAN tag set (Berghmans, 1994). The original WOTAN tag set, consisting of 233 tags, was too detailed for our purpose. Instead, we used the limited WOTAN tag set of 48 PoS tags developed by Drenth (1997) for training and testing in the stand-alone comparison of the three PoS taggers.

In the context of our WSD system, however, we are chiefly interested in the main PoS categories only. Therefore, we discarded all additional information from the assigned PoS tags in the WSD corpus. This resulted in 12 different tags being kept: Adj (adjective), Adv (adverb), Art (article), Conj (conjunction), Int (interjection), Misc (miscellaneous), N (noun), Num (numeral), Prep (preposition), Pron (pronoun), Punc (punctuation), and V (verb).1

For the stand-alone results, 80% of the annotated data was actually used for training, 10% for tuning (setting of parameters, etc.), and the accuracy was computed on the remaining 10%. Note that the results of the stand-alone comparison solely serve to illustrate the difference in performance observed independently of an application, in order to be able to assess the added value of a more accurate PoS tagger in the WSD application.

6.2.1 Hidden Markov Model PoS Tagger

The first PoS tagger we used is the trigram Hidden Markov Model (HMM) tagger (Prins and van Noord, 2004) developed in the context of Alpino, a natural language understanding system for Dutch (Bouma et al., 2001; van der Beek et al., 2002).2

In this standard trigram HMM, each state corresponds to the previous two PoS tags, and the probabilities are directly estimated from the labeled training corpus (Manning and Schütze, 1999). There are two types of probabilities relevant in this model: the probability of a tag given the preceding two tags, P(t_i | t_{i-2}, t_{i-1}), and the probability of a word given its tag, P(w_i | t_i).

These probabilities are computed for each tag individually. Training the HMM with the forward-backward algorithm, we can calculate P(t_i = t) for all potential tags:

1 See table 6.2 for the distribution of the main PoS tag categories in the WSD data and the Eindhoven corpus.

2 See http://www.let.rug.nl/~vannoord/alp and chapter 7 for more information on Alpino.


    P(t_i = t) = α_i(t) · β_i(t)

where α_i(t) is the total (summed) probability of all paths that end at tag t at position i, and β_i(t) is the total probability of all paths starting at tag t at position i and continuing to the end. Comparing all the values for P(t_i = t), unlikely tags are removed.

Smoothing of the trigram probabilities is achieved through a variant of linear interpolation (Collins, 1999), where lower order (unigram) models are also taken into account and weights are assigned to each of the models to capture their relative importance.
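
In schematic form, interpolation mixes the relative-frequency estimates with weights summing to one (a sketch in the spirit of the variant cited above; we include a bigram component for illustration, and the weight values are invented, whereas in practice they are estimated from data):

    def interpolated(p_tri, p_bi, p_uni, l3=0.6, l2=0.3, l1=0.1):
        """P(t_i | t_{i-2}, t_{i-1}) smoothed with lower-order estimates."""
        assert abs(l3 + l2 + l1 - 1.0) < 1e-9   # weights must sum to one
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    # An unseen trigram (p_tri = 0) still receives probability mass
    # from the lower-order models:
    print(interpolated(0.0, 0.15, 0.05))  # 0.05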

Since the tagger’s lexicon has been created from the training data, the test data very likely contains unknown words, which means that no initial set of possible tags can be assigned to these words. Two different strategies have been incorporated in the HMM tagger used here. First, a heuristic rule for recognizing names has been added which assigns an N tag to all capitalized words.3 Second, a set of automata (also created on the basis of the training data) is used to find possible tags based on the suffixes of unknown words (Daciuk, 2000).

6.2.2 Memory-Based PoS Tagger

The second tagger we have used in the experiments reported here is the Memory-Based Tagger (MBT) (Daelemans et al., 2002a).4 It is a PoS tagger based on Memory-Based Learning, an extension of the k-Nearest-Neighbor approach, which has proved to be successful for a number of languages and NLP applications (Zavrel and Daelemans, 1999; Veenstra et al., 2000; Hoste et al., 2002a).

MBT consists of two components: a memory-based learning component and a performance component for similarity-based classification. During classification, the similarity between a previously unseen test example and the examples in memory is computed using a similarity metric. The category of the test example is then extrapolated based on the most similar example(s).

Given an annotated corpus, three data structures are automatically extracted: a lexicon, a case base5 for known words, and a case base for unknown words. During tagging, each word is looked up in the lexicon and, if it is found, its lexical representation is retrieved and its context determined. The resulting pattern is disambiguated using extrapolation from the nearest neighbors in the known words case base.

3 Words in sentence initial position are decapitalized beforehand.
4 Freely available for research purposes at http://ilk.uvt.nl/software.html.
5 A case base is a collection of “cases” or prior observations.


If a word is not present in the lexicon, its lexical representation is computed on the basis of its form, its context is determined, and the resulting pattern is disambiguated using extrapolation from nearest neighbors in the unknown words case base. In both cases, the output is a best guess of the category for the word in its current context.

For the known words, the preceding two tags and words as well as the ambiguous tag and word to the right of the current position have been used to construct the known words case base. Classification was achieved using the IGTREE algorithm with one nearest neighbor. For unknown words, the preceding tag, the ambiguous tag to the right, as well as the first and the last three letters of the ambiguous word itself were taken into account to construct the unknown words case base. For classification, the IB1 algorithm with 9 nearest neighbors was used. In both cases GainRatio feature weighting was applied. For details on the different algorithms, IGTREE, IB1, and GainRatio, see Daelemans et al. (2002b).
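
The classification step can be pictured with a stripped-down sketch (our own illustration: a flat k-nearest-neighbor search under a weighted overlap metric, where the hand-set weights stand in for GainRatio and IGTREE's tree-based indexing is omitted; the memory entries are invented):

    from collections import Counter

    def similarity(a, b, weights):
        """Weighted overlap: sum the weights of the matching features."""
        return sum(w for f_a, f_b, w in zip(a, b, weights) if f_a == f_b)

    def mbt_classify(instance, memory, weights, k=1):
        """memory: list of (features, tag) pairs; instance: feature tuple."""
        ranked = sorted(memory,
                        key=lambda ex: similarity(instance, ex[0], weights),
                        reverse=True)
        votes = Counter(tag for _, tag in ranked[:k])
        return votes.most_common(1)[0][0]

    # Features: (tag at i-2, tag at i-1, word, ambiguous tag at i+1)
    memory = [(("Art", "Adj", "boog", "Punc"), "N"),
              (("Pron", "Adv", "boog", "Prep"), "V")]
    print(mbt_classify(("Art", "Adj", "boog", "Prep"), memory, [1, 1, 2, 1]))  # N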

6.2.3 Transformation-Based PoS Tagger

As the third member of the comparison, we used a Brill-style transformation-based tagger (TBL) (Brill, 1995) for Dutch (Drenth, 1997). The main components of a transformation-based tagger are a specification of admissible transformations and a learning algorithm. Interdependencies between words and tags are modeled by starting out with an imperfect tagging which is gradually transformed into one with fewer errors. This is achieved by selecting and sequencing transformation rules using the learning algorithm.

In an initial step, each word is assigned a tag independent of context. A known word is assigned its most likely tag, determined by a maximum likelihood estimation from the training corpus. An unknown word, on the other hand, is assigned a tag based on lexical rules learned during training. All unknown words are initially tagged N. The application of lexical rules modifies the tag (where necessary) based on the local properties of the unknown word, such as its suffix.

After each word has received an initial tag, contextual rules are applied, changing the initial PoS tag (where necessary) based on the context of the word to be tagged (templates taken from Brill (1995)). The best contextual transformation rules and their order of application are selected by the learning algorithm during training.

The present implementation of the TBL PoS tagger for Dutch uses around 250 lexical rules and 300 contextual rules.
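
The contextual-rule stage can be illustrated as follows (the two rules are invented instances of the Brill-style templates, not rules actually learned by the Dutch tagger; real templates may also inspect the following tag or nearby words):

    RULES = [
        # (from_tag, to_tag, condition on the neighbouring tags)
        ("N", "V", lambda prev, nxt: prev == "Pron"),   # after a pronoun: verb
        ("V", "N", lambda prev, nxt: prev == "Art"),    # after an article: noun
    ]

    def apply_contextual_rules(tags):
        for from_tag, to_tag, cond in RULES:            # rules in learned order
            for i, tag in enumerate(tags):
                prev = tags[i - 1] if i > 0 else None
                nxt = tags[i + 1] if i + 1 < len(tags) else None
                if tag == from_tag and cond(prev, nxt):
                    tags[i] = to_tag
        return tags

    print(apply_contextual_rules(["Pron", "N", "Art", "V"]))
    # ['Pron', 'V', 'Art', 'N']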


PoS Tagger    Accuracy
TBL            94.20
HMM            95.93
MBT            96.21

Table 6.1: Stand-alone results (in %) for the three PoS taggers on 10% of the Eindhoven corpus data.

6.2.4 Stand-Alone Results for the PoS Taggers

As we have mentioned earlier, the stand-alone results for the PoS taggers were computed using 80% of the Eindhoven Corpus (containing a total of 760,000 words) for training and 10% for tuning. The accuracy shown in table 6.1 was computed on the remaining 10% of the corpus.

We can clearly see that the MBT tagger performs best, followed by the HMM tagger, with the TBL tagger being the least accurate.6 If the hypothesis that more accurate input to complex systems produces more accurate results is correct, then these stand-alone results raise the expectation that, when applying all three taggers in our WSD system (with all other settings being equal), accuracy should be highest when the MBT tagger was used to tag the data. Performance is expected to decrease with the use of the HMM tagger and to be lowest for the TBL tagger.

This expectation might be invalidated by the (possible) corpus dependency of the three PoS taggers: the capacity to generalize from the training corpus to the corpus to be tagged might be greater in one tagger than in another, which means that the results obtained in the complex system can diverge from the expectation raised by the stand-alone results. Also, it may be that a tagger is more accurate than another, but mainly on distinctions that are unimportant for our WSD application.

6.3 Integrating PoS Information

The WSD system used in these experiments is the same supervised corpus-based statistical classification algorithm described in section 4.2.1. This system explores the intuition that (high quality) linguistic information is beneficial for WSD. PoS is definitely one of the more accessible sources of linguistic knowledge.

6 All results differ significantly when applying the paired sign test with a confidence level of 95%.


PoS     TBL       HMM      MBT      Train. Corpus
N       19.46%    17.08%   17.37%   20.35%
Punc    16.87%    17.17%   17.17%   12.69%
V       15.04%    16.62%   16.66%   15.13%
Pron    11.83%    11.88%   11.83%    9.82%
Adv      9.58%     9.62%    9.53%    8.19%
Art      8.08%     7.96%    7.95%    9.39%
Prep     6.98%     7.26%    7.01%   10.54%
Conj     5.72%     5.74%    5.77%    5.18%
Adj      5.38%     5.63%    5.65%    6.53%
Num      0.74%     0.61%    0.63%    1.78%
Int      0.32%     0.47%    0.39%    0.18%
Misc     0.003%    0.04%    0.04%    0.22%

Table 6.2: Frequencies of PoS tags assigned by each PoS tagger in the Dutch Senseval-2 WSD data and the distribution of PoS in the Eindhoven training corpus.

The hypothesis behind comparing various PoS taggers in this application is that the quality of the PoS tags assigned to the data can significantly influence the accuracy obtained by our WSD system.

In contrast to the English WSD data, the Dutch Senseval-2 WSD data is ambiguous with regard to PoS. This means that accurate PoS information is even more important, since the WSD system is supposed to do morpho-syntactic as well as semantic disambiguation.

For the two basic classifiers based on ambiguous word forms, the feature set contains the corresponding lemma as well as a context of three words to the left and to the right of the ambiguous word. The context can be composed of either word forms or lemmas. For the classifiers including PoS tags, we additionally include the PoS tags of the ambiguous word from the various PoS taggers.

6.4 Results and Evaluation

Before we turn to the actual results of using the different PoS taggers in our WSD system for Dutch, let us first compare the differences regarding the assigned PoS tags. Table 6.2 shows the distribution of the different PoS tags in the WSD data depending on the PoS tagger used, as well as the distribution of the PoS tags in the training corpus.

A major difference between the distributions of PoS tags is that both the HMM and the MBT tagger assign more V tags, whereas the TBL tagger assigns more N tags.


                                 Accuracy
baseline training data            75.64
lemma, context words              83.32
lemma, context lemmas             83.43

                               TBL      HMM      MBT
lemma, pos, context words     83.53    83.63‡   83.72†‡
lemma, pos, context lemmas    83.67    83.76‡   83.82†‡

Table 6.3: Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, integrating the output from different PoS taggers; † denotes a significant improvement over the results with the HMM tagger, ‡ denotes a significant improvement over the results with the TBL tagger.

The preference for N tags in the TBL tagger can be explained by the fact that all unknown words initially get tagged N. Also, in Dutch, verbal infinitives have the same morphological suffix as plural nouns (-en). Int and Misc differ across all three taggers, but we could not detect any obvious reason for this. As we can see from table 6.2, there are bigger differences between the TBL tagger and the other two, whereas the differences between the HMM and the MBT tagger are less noticeable.

The results in table 6.3 show the average accuracy on our training data using leave-one-out as a test method with word forms as basis.7 As the table shows, the WSD system performs well. The basic classifiers, containing a minimum of information, already achieve significantly better results than the frequency baseline. Furthermore, adding (machine-generated) PoS as extra linguistic information, next to the lemma and the context already included in the basic classifiers, increases the accuracy over that achieved with a basic classifier.

This supports the underlying hypothesis behind the WSD system that more linguistic information is beneficial for WSD. Since the WSD data needs to be disambiguated morpho-syntactically as well as with regard to lexical semantic ambiguity, it is not surprising that adding PoS information achieves better results than only using the lemma and context.

Comparing the performance of the different PoS taggers, we can see quite clearly that our expectations are confirmed: the MBT tagger, which did best in the stand-alone evaluation, also works best in the WSD system.

7 The results in this table differ slightly from the results presented in Gaustad (2003), since we added smoothing with Gaussian priors and did not use a bag of words approach for the context in the experiments reported here.


                             w/o PoS    with PoS
PoS ambiguous: 204
  lemma, context lemmas       84.63      85.36† (+0.73)
non-PoS ambiguous: 749
  lemma, context lemmas       81.80      81.74  (−0.06)

Table 6.4: Comparison of accuracy for word forms with more than one PoS tag assigned by the PoS tagger; † denotes a significant improvement over not including PoS.

This is the case for classifiers including context as word forms or as lemmas.8

We are conscious of the fact that the differences in accuracy are rather small, but since the models are very similar, no big differences were expected. This means that our hypothesis that highly accurate input influences the results of a complex system is verified: the most accurate PoS tags also produce the most accurate results when integrated into our WSD system.

In order to get a better picture of the effect of adding machine-generated PoS information to our feature model, we proceeded to a more detailed evaluation. In particular, we analyzed whether adding PoS information of the actual ambiguous word form improves the performance on PoS ambiguous words. There are two ways in which a word can be PoS ambiguous: it can either be (incorrectly) assigned more than one PoS tag by the tagger (PoS ambiguity generated through the tagger) or it can really be PoS ambiguous. Since the second case is harder to verify, we started with the first and more important one.

Extracting all word forms that are assigned more than one PoS tag by the MBT PoS tagger, we retain 204 word forms of the total 953 (21.4%). Comparing the accuracy on these words, we see clearly that adding PoS information helps disambiguation (see table 6.4): an error rate reduction of 4% is achieved when PoS information of the ambiguous word is included.
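
The extraction of these PoS-ambiguous forms is a simple grouping step; the sketch below (our own, over a hypothetical list of (word form, assigned tag) pairs) keeps the forms that received more than one tag:

    from collections import defaultdict

    def pos_ambiguous_forms(tagged_corpus):
        """Return the word forms assigned more than one distinct tag."""
        tags_per_form = defaultdict(set)
        for word, tag in tagged_corpus:
            tags_per_form[word].add(tag)
        return {w for w, tags in tags_per_form.items() if len(tags) > 1}

    corpus = [("bal", "N"), ("boog", "N"), ("boog", "V"), ("worden", "V")]
    print(pos_ambiguous_forms(corpus))  # {'boog'}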

The accuracy on the remaining non-PoS ambiguous word forms decreases slightly, due to the fact that some noise is added through the PoS information, but the differences are rather small. We can also see these results in a positive light: it means that adding noisy features to the model does not have a big influence on the accuracy. In other words, the model is robust and does not assign high weights to useless features.

8 Applying the paired sign test with a confidence level of 95%, all results using MBT PoS tags were found to be statistically significantly better than the results with other PoS tags (and than the basic classifiers). The same is true of the HMM PoS tagger with regard to the TBL tagger and the basic classifiers, and of the TBL tagger with regard to the basic classifiers (see section 4.5 on the paired sign test).


                                                  Accuracy
baseline training data                             75.64
lemma, context words                               83.32
lemma, context lemmas                              83.43

                                              TBL      HMM      MBT
lemma, pos, context words                    83.53    83.63‡   83.72†‡
lemma, pos, context lemmas                   83.67    83.76‡   83.82†‡
lemma, context words, pos in context         84.16    84.21    84.22
lemma, context lemmas, pos in context        84.20    84.28    84.29
lemma, pos, context words, pos in context    84.21    84.21    84.31†‡
lemma, pos, context lemmas, pos in context   84.23    84.34‡   84.36‡

Table 6.5: Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, including PoS of the ambiguous word form and PoS of the context; † denotes a significant improvement over the results with the HMM tagger, ‡ denotes a significant improvement over the results with the TBL tagger.

We can conclude from this more detailed analysis that the PoS of the ambiguous word is definitely a useful feature in the case of the Dutch Senseval-2 data.

6.5 PoS Information in Context

As we have seen, adding PoS information of the ambiguous word significantly increases the accuracy achieved by making morpho-syntactic ambiguity distinctions easier. In order to test the use of PoS on a greater scale, we also examined the effect of including PoS for all context words or lemmas (see table 6.5).

Using the same basic model as in section 6.4, the PoS of the context was added. One model contained the lemma of the ambiguous word, the context and its PoS. This allowed us to assess the added value of using PoS for the context in comparison to using the context alone. The other model included the PoS of the ambiguous word along with the features contained in the first model, to test the complementarity of the features.

As can be seen in table 6.5, adding the PoS of the context performs better than all the models tested so far, including the one with the PoS of the ambiguous word (repeated in table 6.5 to ease comparison). It seems to be the case that adding (automatically generated) PoS of the context helps disambiguation by presenting more general information than context words or lemmas, and therefore counters possible data sparseness. Also, PoS of the context can be seen as a very raw form of subcategorization information, indisputably a useful feature, especially for verbs.

Combining the PoS of the ambiguous word and the PoS of the context also has a positive effect on the accuracy of our classifiers, although the increase in accuracy is not as big as that obtained from adding the PoS of the context. This seems to suggest that combining features improves results.

Comparing the results achieved with the three different PoS taggers, we can only conclude that the TBL tagger performs significantly worse.9 The MBT tagger and the HMM tagger do not perform significantly differently, but for both we can see a significant increase in performance when adding PoS information of the context, either alone or in combination with the PoS of the ambiguous word.

The results presented in this chapter lead to the conclusion that including PoS information for both the ambiguous word and the context in our feature model significantly improves the performance of the maximum entropy WSD system for Dutch over the model that does not include it. We mentioned two possible outcomes for the experiment where the PoS of the ambiguous word is added to the feature set. On the one hand, performance could have stayed the same because no new information was added to the model, the PoS information being already implicitly contained in other features; on the other hand, including PoS could lead to higher accuracy by adding new knowledge. As we have shown, the latter is the case. We therefore conclude that explicitly encoding information that helps disambiguation is better than relying on maximum entropy to filter out knowledge that may be implicitly present in the features used. The bias of maximum entropy does not seem to capture this information otherwise.

9 Except for the model including the lemma, the PoS of the ambiguous word, and context words, where the three taggers do not perform significantly differently.


Chapter 7

Impact of Structural Syntactic Information

Structural syntactic information can be of great importance for WSD. There is a general trend in NLP to start using deep linguistic processing instead of shallow parsing in a number of applications (e.g. question answering, information retrieval). In WSD, various syntactic features have been used, the most important ones being labels from syntactically annotated trees and dependency relations. An important question we would like to investigate is whether structural syntactic information is helpful for lexical semantic disambiguation.

Integrating deep syntactic processing into WSD taps into the potential to identify syntactic and semantic dependencies that are beyond the possibilities of shallow parsers and simple context features. For instance, the Alpino parser (which will be explained in detail in section 7.2.1) is able to process coordinations, relative pronouns, and other “long-distance” and “control” dependencies. Shallow processing is typically insufficient to identify these kinds of dependencies. Restricting the output of linguistic processing to syntactic bracketing only results in the loss of important information.

For example, in the case of the verb worden ‘to become/get’, the two senses that are distinguished in the Senseval-2 data are related to the function of the verb: [worden hww] (become main verb) and [worden kww] (become copula). Information on the deep syntactic structure of the sentence in which worden occurs provides valuable clues for disambiguation. If worden is associated with a verbal complement, it is used as a main verb. If, on the other hand, worden is found with a predicate, it is the copula. Shallow processing does not provide this kind of detailed structural information. Typically, long-distance dependencies are also difficult for PoS taggers, which means that dependency relations can be useful in distinguishing sense ambiguities related to the function of an ambiguous word.



Attempting to achieve a full semantic analysis, on the other hand, may be too ambitious for current parsing systems. Dependency relations provide a middle ground as they typically include only those aspects of syntax relevant for applications like WSD, and at the same time can be computed more robustly than abstract semantic representations. Moreover, the psychological plausibility of dependency relations has been attested in Hudson (2003), an additional reason to prefer this type of syntactic knowledge over bracketing or tree labels.

We will first introduce a number of WSD systems which have employed syntactic information in the past (section 7.1). Next, we will explain dependency relations in general, the Alpino dependency parser, and the dependency features used in our WSD system (section 7.2). In section 7.3, the different models including dependency relations and the respective results will be presented and evaluated.

7.1 Prior Work

Li et al. (1995) describe a WSD algorithm which uses WordNet and the results of surface-syntactic analyses in order to minimize the need for other knowledge sources in the WSD process. Disambiguation is achieved by computing the semantic similarity between words and applying heuristic rules based on this similarity. Their WSD system is designed to be incorporated in a larger project on learning from textual data and is only applied to nouns in object position (but could be extended to nouns in other positions).

Semantic similarity between words is defined as “inversely proportional to the semantic distance between words in WordNet IS-A hierarchy” (hyponym-hypernym hierarchy). Li et al. distinguish four levels of semantic similarity; a small code sketch illustrating the four levels follows the list:

1. Strict synonyms, i.e. a word is in the same synset

2. Extended synonyms, i.e. a word is the immediate parent node

3. Hyponyms, i.e. a word is a child node

4. Coordinate relationship, i.e. a word is a sibling node
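The four levels can be illustrated with a minimal sketch using NLTK’s WordNet interface (our assumption for illustration; Li et al. worked with WordNet directly):

    # Minimal sketch of Li et al.'s four similarity levels, assuming NLTK
    # with the WordNet data installed.
    from nltk.corpus import wordnet as wn

    def lemma_names(synset):
        return {l.name() for l in synset.lemmas()}

    def similarity_level(word, synset):
        """Return the similarity level (1-4) of `word` w.r.t. `synset`,
        or None if none of the four relations holds."""
        if word in lemma_names(synset):
            return 1                                 # strict synonym
        parents = synset.hypernyms()
        if any(word in lemma_names(p) for p in parents):
            return 2                                 # extended synonym (parent)
        if any(word in lemma_names(c) for c in synset.hyponyms()):
            return 3                                 # hyponym (child)
        if any(word in lemma_names(s)
               for p in parents for s in p.hyponyms() if s != synset):
            return 4                                 # coordinate (sibling)
        return None

    # e.g. similarity_level('amount', wn.synsets('contribution')[0])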

Verbs that dominate noun objects in a sentence provide the context for disambiguation. For example, the verb calculate can help to find the monetary sense for the noun contribution or the verb sell excludes all senses of the noun property that do not refer to material belongings.1


These verb-object pairs are extracted from the data and used to determine the meaning of nouns within a sentence. Using the semantic similarity of the ambiguous noun with a particular sense in WordNet, a final choice is made. Furthermore, special heuristic rules related to the syntactic structure of a particular sentence are used to aid disambiguation by acquiring more information on the ambiguous noun, e.g. searching for “such as” and coordination constructions.

The WSD algorithm itself works in eight steps, considering heuristics with high semantic similarity between words earliest. If any step succeeds, the remaining ones are skipped. First, it is tested whether a noun has only one sense in WordNet. If this is the case, that sense is assigned with a confidence of 1. A noun that has a synonym or hyponym in WordNet corresponding to a particular sense is given that sense with a confidence of 0.9. So, in the verb-object pair “calculate contribution”, contribution can be disambiguated with the help of the synonym amount which corresponds to the monetary sense of contribution. The same mechanism applies if a noun is in a coordinate relationship instead of synonymy or hyponymy (confidence = 0.8).

If a noun has already been assigned a particular sense but with a different verb and if that verb is a synonym, hyponym or coordinate of the other verb, the same sense is chosen (confidence = 0.7). To disambiguate the pair “prorate contribution” for example, we can take advantage of the fact that prorate is a synonym of calculate and that the meaning of contribution in “calculate contribution” has already been acquired. From that we can infer that contribution in “prorate contribution” also means ‘amount of money’. If both the verb and the noun are a synonym, hyponym and a coordinate of another noun-verb pair, the corresponding senses are selected with a confidence of 0.6 (in case the verb is the coordinate) or with a confidence of 0.5 (if the noun is the coordinate). As a last step, particular syntactic constructions are searched for.

The algorithm works fairly well, but some verb contexts are not strong enough to limit the possible senses of their noun objects to the only intended and correct sense. Also, in some cases the meaning obtained by the algorithm is suitable in the verbal context considered, but is not appropriate for the whole text.

Lin (1997) explores the fact that “two different words are likely to have similar meanings if they occur in identical local contexts” (p. 64, emphasis mine). This means that the same knowledge sources are used for all words and that instead of building separate classifiers for each word, past usages of other words are used to disambiguate the current word.

1 All examples taken from Li et al. (1995).


The advantage of this idea lies in the fact that no large sense-tagged corpus is needed and infrequent words or words that do not occur in the corpus can be treated as well.

The required resources for his algorithm are an untagged corpus, a broad-coverage parser, a concept hierarchy (WordNet), and a similarity measure between concepts. With the help of the broad-coverage parser, local contexts are extracted from a corpus, where “local context” means the syntactic dependencies between words in a sentence. These are stored as dependency triples containing the type of dependency relation, the word related to the ambiguous word via the dependency relationship, and information on whether the word is the head or the modifier (dependent). For instance, in the sentence “The cat chased the mouse”, a local context of cat would be [subj chase head] whereas a local context of mouse would be [obj1 chase dep].

Furthermore, triples are constructed which contain a word, the frequency of how often that word occurred in the particular local context it is associated with, and the likelihood ratio of the context and the word. An example of such a triple looks like this: [cat 10 6.32], meaning that the word cat occurs 10 times in the local context [subj chase head] and the bigram consisting of cat and [subj chase head] has the likelihood ratio 6.32. Using these two kinds of triples, a Local Context Database of pairs of triples is built. The similarity between two concepts in WordNet is derived from an information-theoretic notion of similarity and is a function of the commonality and the difference between the concepts.

Words are disambiguated as follows: First, the input is parsed and the local contexts are extracted. Then, the local context database is searched to find words that appear in the same local contexts as the target word. These are called selectors of the target word. Now, a sense s of word w is selected that maximizes the similarity between w and its selectors. This sense is then assigned to all occurrences of w in the text. This last step probably overgenerates (definitely with one large corpus where documents are not separated).
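The following toy sketch shows one way the Local Context Database and the selector lookup could be organized; the data structures are our own simplification, and Lin’s likelihood-ratio scores are replaced by raw counts:

    # Toy version of Lin's Local Context Database: per local context, count
    # the words observed in it; likelihood ratios are omitted here.
    from collections import defaultdict

    db = defaultdict(lambda: defaultdict(int))

    def add_occurrence(word, relation, related, position):
        """position is 'head' or 'dep', as in [subj chase head]."""
        db[(relation, related, position)][word] += 1

    add_occurrence('cat', 'subj', 'chase', 'head')
    add_occurrence('dog', 'subj', 'chase', 'head')
    add_occurrence('mouse', 'obj1', 'chase', 'dep')

    # selectors of 'cat': other words observed in the same local contexts
    selectors = {w for words in db.values() if 'cat' in words
                   for w in words if w != 'cat'}
    print(selectors)   # {'dog'}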

Lin (2000) presents the same algorithm using the Hector lexicon instead of WordNet.2 The main conclusion is that defining the local context in terms of dependency relations instead of as surrounding words (as is traditionally done) gives better results, especially when the size of the training set is very small. Since our approach is not unsupervised and does not use WordNet-like concepts to define senses, it is difficult to compare the two systems.

2 The Hector lexicon was used as a sense inventory during Senseval-1.


It seems, though, that deep syntactic analysis in the form of dependency relations could be beneficial for WSD—just as Lin concludes.

Stetina et al. (1998) present a system which uses syntactic information from a treebank in combination with WordNet senses annotated in SemCor. Their approach tries to identify all content words in a given sentence based on an estimation of the overall probability of all semantic relations in that sentence. They only use the syntactic relations between the head of a particular phrase and a modifier and assume that the relations within the same phrase are independent.

In a first step, semantic relations and the probability for each combination of the senses of the arguments in a given relation are learned. In a second step, disambiguation is achieved through evaluating the probabilities of the relations. Their reasoning about reciprocal dependencies resembles Hawkins’s (1999), but their solution is quite different (and involves syntactic information which Hawkins (1999) does not use): they apply hierarchical disambiguation based on similarity measures between the tested and the training relations, consisting of bottom-up word sense score propagation and top-down disambiguation.

During bottom-up sense propagation, only the head word is propagated to participate in semantic relations with concepts of other levels of the parse tree, but its senses are restrained by the modifiers present. The word suit, for example, has six senses in WordNet, but in the combination with the modifier silk, the number of possible senses can be restricted to two. At this stage, only the sentence head is disambiguated, but based on the whole syntactic structure of the current sentence. Then, top-down disambiguation starts. Only the senses which are possible vis-à-vis the sense of the sentence head that has been chosen are considered for disambiguation.

Stetina et al. evaluate their approach on 15 SemCor files, using the remaining files for training. Compared to the frequency baseline, their system achieves better results (exceeding 80%). Overall, their approach solves the sparse data problem by using the semantic relations in a sentence as context. This means that no exact match between the training examples and the disambiguated words is necessary. Also, they can distinguish many word senses and all PoS.

Martínez et al. (2002) discuss the contribution of various syntactic features together with basic local and topical features to WSD. The basic topical features included were open-class lemmas (either 4 lemmas around the target word or all lemmas in the sentence plus the two previous and the two following sentences). Basic local features are bigrams and trigrams including the target word.

The set of syntactic features was extracted using Minipar (Lin, 1993).


They distinguish direct (directly linked words in the parse tree) and indirect relations (two or more dependencies between words in the tree) and for each relation also store its inverse. The syntactic relations taken into account were instantiated grammatical relations (IGR) and grammatical relations (GR). IGR are triples of [wordsense relation value] where the value is either the word form or the lemma. An example of an IGR for the sentence “The cat chased the mouse”, where mouse can have two senses—the ‘animal’ and the ‘computer tool’—would look as follows: [mouse animal obj chased]. For GR, bigrams [wordsense relation] and n-grams [wordsense relation1 relation2 ...] are stored. These n-grams are similar to verbal subcategorization frames and are only used for verbs. Three types of n-grams are defined: all subcategorization information including PoS given by Minipar, subcategorization information occurring in the sentence, and all dependencies in the parse tree.
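To make the difference between IGR and GR concrete, a small sketch (with made-up triples, not actual Minipar output) could derive both feature types as follows; the function names and string encoding are illustrative:

    # Hypothetical derivation of IGR and GR features from dependency
    # triples of the form (word, relation, related_word).
    def igr_features(triples, target, sense):
        # instantiated grammatical relations: [wordsense relation value]
        return ['%s_%s %s %s' % (target, sense, rel, other)
                for word, rel, other in triples if word == target]

    def gr_features(triples, target, sense):
        # grammatical relations without the related word: [wordsense relation]
        return ['%s_%s %s' % (target, sense, rel)
                for word, rel, other in triples if word == target]

    triples = [('mouse', 'obj', 'chased')]
    print(igr_features(triples, 'mouse', 'animal'))  # ['mouse_animal obj chased']
    print(gr_features(triples, 'mouse', 'animal'))   # ['mouse_animal obj']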

Their results on the Senseval-2 lexical sample data for English indicate that both types of syntactic relations, IGR and GR, together provide the best F-score (IGR getting better precision than coverage and GR lower precision, but higher coverage). Combining all syntactic features with the basic local and topical features significantly improves performance of the AdaBoost algorithm, a general method for disambiguation linearly combining many weak classifiers (Freund and Schapire, 1997), over only using the basic features, showing that basic and syntactic features are complementary. Decision lists do not seem to profit from the addition of syntactic features, probably due to the conservative nature of the algorithm which always chooses only the best feature and does not combine them. The overall conclusion is that “syntactic features effectively contribute to WSD precision”.

7.2 Dependency Relations

Dependency structures make explicit the grammatical relations between constituents in a sentence. Each non-terminal node in such a structure consists of a head-daughter and a list of non-head daughters, whose dependency relation to the head is marked. (Note that a dependency structure does not necessarily reflect (surface) syntactic constituency.) On the one hand, dependency structures are more abstract than syntactic trees (e.g. word order is not expressed), and on the other hand, they are more explicit with regard to the dependency relations encoded. Co-indexing is used to express multiple (possibly different) dependency relations between constituents. The dependency structures used here have been developed in the context of the Corpus Spoken Dutch (CGN) project3 and are described in more detail in Moortgat et al. (2002). See figure 7.1 for an example of a dependency structure of the sentence in (1).

[Figure 7.1: Dependency structure of the sentence Een oorverdovende donderslag deed de aarde beven. Tree diagram not reproduced; the corresponding dependency triples are listed in table 7.1.]

(1) Een oorverdovende donderslag deed de aarde beven.
    (A deafening thunderclap made the earth tremble.)

Recently, there has been a growing interest in more theory neutral annotation schemes based on dependency relations (Carroll et al., 1998a), in contrast to parsers trained on treebank structures. An important reason for this is that for languages with a free(r) word order, traditional parse trees only reflect the surface order whereas dependency relations provide more insightful information. Also, comparison between sets of dependency relations is easier than between parse trees. Dependency structures or trees, furthermore, provide a reasonably abstract and theory-neutral level of linguistic representation.

In order to acquire the dependency relations for our WSD corpus, we parsed the corpus with the Alpino dependency parser which we will now explain in more detail.

7.2.1 Alpino Dependency Parser

The Alpino dependency parser (Bouma et al., 2001; van der Beek et al., 2002) is a wide-coverage and robust computational parser of Dutch.4 It consists of a large lexicon (in part derived from Celex (Baayen et al., 1993) and Parole5), a hand-written grammar, and statistical disambiguation modules trained on both annotated (over 7,000 manually annotated sentences) and unannotated data (several years of newspaper text).

3 See the CGN website for details: http://lands.let.kun.nl/cgn/ehome.htm.
4 Alpino is being developed as part of the NWO Pionier Project Algorithms for Linguistic Processing. See http://www.let.rug.nl/~vannoord/alp.



The head-driven lexicalized grammar performs a full analysis of the input, and produces dependency structures (see above for a brief explanation of dependency structures and Bouma et al. (2001) for a complete description of the architecture of Alpino). The dependency labels assigned by the parser are adopted from the syntactic annotation guidelines (Moortgat et al., 2002) of the CGN project.

The Alpino parser currently identifies dependency relations with an accuracy of approximately 85%. This compares well with results reported for identification of dependency labels in English text. Briscoe and Carroll (2002) and Carroll et al. (1998b) report accuracies between 76% and 83%. Note, however, that they use an annotation scheme which distinguishes only about 15 different labels, whereas the Alpino/CGN scheme uses 30 different labels, which can make the task of correctly identifying a label harder.

The Alpino parser is also robust and relatively fast. Methods for dealing with ungrammatical input, developed for parsing speech recognizer output (van Noord, 2001), were incorporated into the parser. This allows partial results to be returned in cases where a full parse fails. To parse unrestricted text, heuristics have been implemented which guess the syntactic properties of unknown words. Although parsing with a feature-based grammar of the kind used in Alpino remains computationally expensive, considerable improvements in efficiency have been achieved by including a PoS tagger which filters unlikely tags suggested by lexical lookup (Prins and van Noord, 2001). The system is fast enough to make parsing of large corpora (up to 300 million words) practically feasible.

7.2.2 Dependency Triples as Features

For the experiments in this chapter, the data was parsed by Alpino, resulting in a list of dependency triples. These dependency triples consist of the head, the non-head and the dependency relation that holds between them. For each of these triples, the syntactic relation, e.g. su (subject), obj1 (direct object), mod (modifier), is linked to the respective head word and non-head word. For more complex sentences in our data set, up to 6 different dependency relations have been linked to one word. See table 7.1 for an illustration of how the dependency triples are encoded.

5 See http://www.inl.nl/corp.parole.htm.


1 〈 deed, su, donderslag 〉
2 〈 donderslag, mod, oorverdovende 〉
3 〈 deed, obj1, aarde 〉
4 〈 beven, su, aarde 〉
5 〈 donderslag, det, een 〉
6 〈 aarde, det, de 〉
7 〈 deed, vc, beven 〉

Table 7.1: Dependency triples associated with the dependency tree represented in figure 7.1; numbering only given for reference, no order implied.

# relations per word               frequency
head relations only
  1                                41,644
  2                                1,190
  3                                18
  Total                            42,852
dependent relations only
  1                                4,921
  2                                4,256
  3                                2,447
  4                                653
  5                                50
  6                                1
  Total                            12,328
both head & dependent relations    34,798

Grand Total                        89,978

Table 7.2: Frequency of dependency relations linked to a single word in the Dutch Senseval-2 training data.

In total, 89,978 word forms (of a total of 97,178) are annotated with head and/or dependency relations. The remaining word forms occur in constructions without a head, such as elliptical structures. These dependency relations were not taken into account in our system. Regarding the ambiguous words found in the corpus, 926 out of 953 classifiers include information on dependency relations. Table 7.2 gives a detailed overview of how many dependency relations have been found and annotated in our WSD training data.

Only the triples relating to the ambiguous word form being disambiguated (in this example aarde) are contained in the feature vector. In the first experiments, we included the relation the ambiguous word form entered, keeping relations where the ambiguous word was the head of the relation as a


separate feature from those where it was the dependent of the relation. We only included the dependency relation labels (e.g. su or obj1) without taking into account the word completing the triple (e.g. what aarde is a subject or direct object of). Example (2) shows what the feature vector for sentence (1) including dependency relations looks like.

(2) aarde N det su/obj1 donderslag doen de beven . = aarde planeet

As before, the first slot represents the lemma of the ambiguous word, the second its PoS. The third slot contains all dependency relation labels of which aarde is the head (dependency triple 6 in table 7.1), whereas slot four contains all dependency relations of which aarde is the dependent (dependency triples 3 and 4 in table 7.1). The rest of the feature vector includes the context lemmas and the class of the instance. We will refer to the feature including the dependency labels of which the ambiguous word is the head as “head”, and as “dep” where the ambiguous word is the dependent.

Lin defined local context as the combination of the dependency relation and the words entering a particular relation. So, as an extension of the first variant, we also included the non-ambiguous word associated with a particular relation in a dependency triple as extra information, referred to as “head+w” and “dep+w” respectively. In this case the feature vector looks like example (3).

(3) aarde N det de su beef/obj1 doe donderslag doen de beven . = aarde planeet

On the one hand, the data will be sparser when all the information from the dependency triple is included as a feature. On the other hand, it might also be the case that the more detailed information provides a better clue for disambiguation, similar to the conclusion drawn in section 4.6 in chapter 4 with regard to the importance of the position of context relative to the ambiguous word.
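Both feature variants can be sketched in a few lines of code. The triples below follow table 7.1 and the “/” concatenation follows examples (2) and (3), but the function itself is our illustration, not the actual implementation; note that the thesis feature vectors use lemmatized heads (doe, beef) where this sketch shows the inflected forms from table 7.1.

    # Sketch of the "head"/"dep" features, with and without the word
    # completing the triple; triples are (head, relation, dependent).
    def dependency_features(triples, target, with_words=False):
        head = [r + (' ' + d if with_words else '')
                for h, r, d in triples if h == target]
        dep = [r + (' ' + h if with_words else '')
               for h, r, d in triples if d == target]
        return '/'.join(head) or '-', '/'.join(dep) or '-'

    triples = [('deed', 'su', 'donderslag'),
               ('donderslag', 'mod', 'oorverdovende'),
               ('deed', 'obj1', 'aarde'), ('beven', 'su', 'aarde'),
               ('donderslag', 'det', 'een'), ('aarde', 'det', 'de'),
               ('deed', 'vc', 'beven')]
    print(dependency_features(triples, 'aarde'))        # ('det', 'obj1/su')
    print(dependency_features(triples, 'aarde', True))  # ('det de', 'obj1 deed/su beven')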

7.3 Results and Evaluation

We have already introduced the notion of “bag of words” in section 4.3. Since an ambiguous word can have a varying number of dependency relations associated with a given sentence and since there is no positional issue as was the case with context, we treated dependency triples as a “bag” (without additional feature selection). Table 7.3 shows the results when including the dependency relations as linguistic features.

Data                                  ambiguous         all
Syntactic structure added             –      head,dep   –      head,dep
baseline training data                75.64             86.15
–                                     –      81.32      –      89.38
lemma                                 –      83.37      –      90.53
lemma, pos                            –      83.73      –      90.74
lemma, context lemmas                 83.43  85.96†     90.56  92.00†
lemma, pos, context lemmas            83.82  86.08†     90.79  92.07†
lemma, pos, con. lemmas, pos in con.  84.36  85.80†     91.10  91.91†

Table 7.3: Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, including dependency relations; † denotes a significant improvement over the results without dependency relations.

As we can see from the results in table 7.3, including dependency relations as the sole feature already performs significantly better than the baseline classifier, which proves that deep syntactic knowledge provides valuable information for disambiguation.6 Including the lemma along with dependency relations improves performance, and adding the PoS of the ambiguous word even outperforms the feature model using the lemma of the ambiguous word and context lemmas. This means that dependency relations as features are usable as well as useful for WSD.

The addition of context lemmas leads to a gain in accuracy of 2.23%, which corresponds to a very large increase compared to earlier results. The best results, at 86.08%, are achieved including the lemma, the PoS as well as the dependency relations linked to the ambiguous words in combination with the context lemmas.

Interestingly enough, if the PoS of the context lemmas are added as an extra feature, the performance of our system decreases. A possible explanation for this phenomenon is that PoS of the context encode similar information to dependency relation labels, but are less detailed and less precise, and therefore the two features seem to interact (indirectly via the assigned weights) with one another. If one of the two features is sparser than the other, however (as we will see in the discussion of table 7.4), this interaction no longer occurs. The conclusion that can be drawn from this result and its explanation is that if no syntactic information can be obtained, the use of PoS of the context can act as a (albeit poorer) substitute. Otherwise, dependency relations subsume more precise and more informative knowledge and, therefore, lead to better accuracy.

6 Note that not all ambiguous words are annotated with dependency triples (926 out of 953 classifiers provide results when only using deep syntactic features), which means that 27 classifiers do not output any results when only including dependency relation labels as features.


Data                                ambiguous            all
Syntactic structure added           head,dep  head+w,    head,dep  head+w,
                                              dep+w                dep+w
baseline training data              75.64                86.15
–                                   81.32†    73.92      89.38†    85.17
lemma                               83.37†    82.24      90.53†    89.89
lemma, pos                          83.73†    83.24      90.74†    90.46
lemma, context lemmas               85.96†    84.55      92.00†    91.20
lemma, pos, context lemmas          86.08†    84.74      92.07†    91.31
lemma, pos, con. lem., pos in con.  85.80†    85.13      91.91†    91.53

Table 7.4: Results (in %) using leave-one-out on training data with a Gaussian prior of 1000, using dependency relations including words; † denotes a significant improvement over the results with dependency relations including words.


In table 7.4, we show the results using dependency relations including all words involved in the relation. So, as we have explained in example (3), instead of only including the name of the relation itself, we use a combination of the relation name and the word completing the dependency triple together with the ambiguous word. We compare the results to the accuracy achieved when using only the dependency relation labels (already presented in table 7.3 and repeated here for convenience).

As we can see in table 7.4, adding the dependency relation in combination with the word completing the triple leads to poorer results than using only the name of the dependency relations. Especially when we compare the models only including the head and dependent relations as features, it becomes clear that the main problem is data sparseness: the extension of the relation names with words results in many sparse features that are not seen often enough to reliably estimate the weights associated with them.7 A possible solution to try in the future would be to include dependency relations both with and without words as features.

7 The combination of data sparseness with the fact that only 926 classifiers actually contained information on dependency relations probably accentuates the drop in the performance.



Despite the lower results with added words, all models including relation names and words still significantly outperform the models without dependency relations. However, with these settings for the dependency relations, the combination of all features works best. It seems to be the case that the sparseness of the combined feature containing the dependency relation label and the word is successfully counteracted by the information contained in the PoS of the context. This means that if the syntactic information and the PoS information we are combining are both very general, only the more informative of the two should be used. Otherwise, if one of the two sources of information is general and the other one rather sparse, both should be used in combination.

Summarizing the findings of this chapter, we can say that the addition of deep linguistic knowledge to a statistical WSD system for Dutch results in a significant rise in disambiguation accuracy compared with all results discussed so far. It is especially interesting to report that using dependency relations in conjunction with the lemma of the ambiguous word and its PoS as features performs better than the model including only immediately left and right context. We can therefore conclude that dependency relations contain a lot of valuable clues for disambiguation.

Furthermore, our results clearly show that PoS of the context and dependency relation labels provide similar information and that if they are equally sparse, they should not be used together. Turning this fact around, we think that PoS can actually act as a substitute for general deep linguistic knowledge. Obviously, its information content is poorer and less specific, but if e.g. no parser is available for a language, PoS of the context might prove to be a valuable feature for WSD. If the deep linguistic feature is sparse, however, a more general feature containing similar information, such as PoS of the context, can be used to counteract the data sparseness.


Chapter 8

Final Results on Dutch Senseval-2 Test Data

The general idea of testing is to assess how well a given model works, and that can only be done properly on data that has not been seen before. Supervised models often have a tendency to be overtrained, i.e. they expect future events to resemble training events and are not able to generalize well to new data. Therefore, it is essential to test on different data to determine the real performance of a system.

The main goal of the research reported on in this thesis is to test various sources of linguistic knowledge on their value for WSD, independently and in combination. Consequently, many different feature models have been tested. In order to avoid over-using the test data by testing each feature model on it, it was necessary to evaluate the different kinds of linguistic knowledge in a tuning scheme first. So far, most of our results (with the exception of the results presented in chapter 5) have been produced using the leave-one-out approach described in section 4.5. Now that the best feature model in the tuning setup has been determined, we can proceed to evaluate our WSD system for Dutch on the (unseen) test data. These final results will also allow us to validate the conclusions we have drawn from the results on the tuning data and to compare the accuracy of our WSD system with other published results.

We will first summarize the findings on the tuning data presented so far (section 8.1). Then, in section 8.2, the settings we used for the test run will be explained. Moreover, the final results on the test data will be presented and evaluated with respect to earlier results, results on the training data and other WSD systems for Dutch.


8.1 Summary of Findings on Tuning Data

In chapter 3 we have shown that the widely used technique of pseudowords to alleviate the need for hand-annotated sense-tagged data is not a viable substitute for real ambiguous words. The main reason for this is that the “senses” of pseudowords consist of two (or more) clearly distinct words whereas real ambiguous words usually have senses and subsenses that can be closely related and are therefore more difficult to identify correctly, even for humans.

Then the experimental setup of the supervised corpus-based WSD system was introduced in chapter 4, including a presentation of the corpus, the classification algorithm used for disambiguation, as well as its implementation. We also presented first results with only “basic” features, such as the context surrounding the ambiguous word and its lemma. From these results, we could conclude that maximum entropy works well as a classification algorithm for WSD when compared to the frequency baseline.

We additionally ran several experiments taking into account these basic features to decide which settings could best be used when more kinds of linguistic knowledge were included. It was investigated whether it was beneficial to use a frequency threshold with regard to the number of training instances of each ambiguous word found in the corpus. Our results show that maximum entropy (in combination with smoothing using Gaussian priors) is robust enough to deal with infrequent data and for this reason no threshold was applied. Moreover, various context sizes have been tested (only taking into account the context words contained in the same sentence as the ambiguous word). We have found that a context of three words to the right and the left performs better than bigger context sizes, confirming earlier findings in the WSD literature. The last important result from chapter 4 is that using context lemmas for generalization in combination with the relative position of the context to the ambiguous word achieves better accuracy than context words and/or treating the context as a bag of words.

After the presentation of our WSD system for Dutch and the experimental setup, chapter 5 introduced a novel approach to building classifiers and, at the same time, included the first type of linguistic knowledge we investigated, namely morphological information. The lemma-based approach uses the advantage of more concise and more generalizable information contained in lemmas as key feature: classifiers for individual ambiguous items are built on the basis of their lemmas, instead of word forms as has traditionally been done. Lemmatization allows for more compact and generalizable data by clustering all inflected forms of an ambiguous word together. The more inflection in a language, the more lemmatization will help to compress and generalize the data. Therefore, more training material is available to each classifier and the resulting WSD system is smaller and more robust.
Our comparison of the lemma-based approach with the traditional word form approach on the Dutch Senseval-2 test data set clearly showed that using lemmatization significantly improves accuracy. Also, in comparison to earlier results with a Memory-Based WSD system, the lemma-based approach performs equally well when using the same features, involving less work (no parameter optimization).

A second source of linguistic information that has been tested for its value for WSD is PoS (chapter 6). The PoS of an ambiguous word itself presented important information because the Dutch Senseval-2 data had to be disambiguated morpho-syntactically as well as with regard to meaning. Two hypotheses were tested. On the one hand, it was investigated what effect the quality of the PoS tagger used to tag the data had on the results of the WSD system including PoS information. The results confirmed the expectation that the most accurate PoS tagger (on a stand-alone task) also outperforms less accurate taggers in the application-oriented evaluation in our WSD system for Dutch. On the other hand, the experiments conducted allowed us to test whether adding features explicitly encoding certain types of knowledge increased disambiguation accuracy. Our results show that this is definitely the case.

We not only included the PoS of the ambiguous words, but also added the PoS of the context as an extra feature. Both sources of knowledge led to significant improvements in the performance of the maximum entropy WSD system.

The third kind of information and second kind of syntactic knowledge that has been included are dependency relations (chapter 7). This implicitly tests whether deep linguistic knowledge is beneficial for a WSD application. After an overview of previous research in WSD using syntactic information, we introduced dependency relations and their merit for NLP, as well as Alpino, the dependency parser which was used to annotate the data. Two different kinds of features including dependency relations were experimented with: on the one hand, two features containing the name of all relations of which a given ambiguous word is the head or the dependent, respectively, and, on the other hand, the same two features but with the name of the relation combined with the word completing the dependency triple.

The results in chapter 7 show that the addition of deep linguistic knowledge to a statistical WSD system for Dutch results in a significant rise in disambiguation accuracy compared with all results discussed so far. Dependency relations on their own already perform significantly better than the baseline, the combination of the lemma and PoS of the ambiguous word together with dependency relations even outperforming the model using context information.


Data                                  ambiguous       all
Data section                          tune    test    tune    test
baseline                              75.64   78.47   86.15   89.44
lemma, pos, con. lemmas, pos in con.  84.36   83.66   91.10   92.37
lemma, pos, head, dep, con. lemmas    86.08†  84.78†  92.07†  93.18†

Table 8.1: Results (in %) on the tuning and test data; † denotes a significant improvement over the model including PoS in context (to be read vertically).

The best results (on the tuning data), at 86.08%, are achieved including the lemma, the PoS as well as the dependency relations linked to the ambiguous words in combination with the context lemmas.

8.2 Results and Evaluation

As we have already mentioned above, the combination of (carefully selected) linguistic features performs best on the tuning data. Especially deep linguistic knowledge in the form of dependency relation names significantly increases accuracy. The results on the Senseval test data for Dutch were therefore computed using the lemma of the ambiguous word, its PoS, dependency relation names, as well as the context lemmas as features. Following the results presented in section 4.6, we did not use a threshold on the number of training instances of a particular ambiguous word and kept an ordered context of three words to the left and to the right of an ambiguous word.

The results in table 8.1 confirm our findings on the tuning data that maximum entropy classification works well for WSD of Dutch and significantly outperforms the frequency baseline. The results with the best settings determined during tuning also work best on the test data in comparison to the settings used in chapter 5. An error-rate reduction of 8% can be observed when adding structural syntactic information in the form of dependency relations instead of PoS of the context.

If we compare our results on the tuning and on the test data (also in table 8.1), several aspects are worth mentioning. During training, 953 classifiers were built. There are 512 unique ambiguous words in the test data and for 410 of them training data and a trained classifier exist. This means that 102 unique ambiguous words were seen for the first time and were assigned a random guess. In the case of our tuning data, all tested instances had a trained classifier since we used a leave-one-out approach. Based on this fact, the results on the test data are expected to be lower than the results on the tuning data.


Data                                  ambiguous       all
Approach                              word    lemma   word    lemma
baseline test data                    78.47           89.44
lemma, pos, con. lemmas, pos in con.  83.66   84.15†  92.37   92.45†
lemma, pos, head, dep., con. lemmas   84.78   85.74†  92.81   93.37†

Table 8.2: Comparison of results (in %) on the test section of the Dutch Senseval-2 data with the word form and the lemma-based approach; † denotes a significant improvement over the word form approach.


Notwithstanding words without a trained classifier, the ranking of the accuracy achieved with the different feature settings remains the same on the test data: deep linguistic knowledge still outperforms shallower PoS information of the context. Due to the fact that fewer instances are being classified during testing, the difference in performance between the feature models is also less pronounced (1.12%) than during leave-one-out tuning on the training data (1.72%). The difference between the two feature models on the test and tuning data is statistically significant, however, using a paired sign test with a confidence level of 95%.
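For completeness, the paired sign test used here can be written down in a few lines; this is a generic textbook formulation, not code from the thesis:

    # Two-sided paired sign test: compare two classifiers on the same
    # instances. Ties are dropped; under the null hypothesis the number of
    # instances won by system A follows Binomial(n, 0.5).
    from math import comb

    def sign_test_p(wins_a, wins_b):
        """wins_a: instances A got right and B wrong; wins_b: the reverse."""
        n = wins_a + wins_b
        k = min(wins_a, wins_b)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # significant at the 95% level if the p-value is below 0.05
    print(sign_test_p(60, 40) < 0.05)   # False: a 60/40 split is not enough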

In chapter 5, we have presented results on the test data using the lemma-based approach introduced in the same chapter. We will now also give new results with the lemma-based approach using the feature model which worked best with the word form-based approach, namely including dependency relations. This allows us to verify whether the lemma-based approach outperforms classifiers built on the basis of word forms (with the same settings). The results on the test data presented in chapter 5 will be repeated, as well, which enables us to compare the lemma-based approach on two different feature models.

As we can see in table 8.2, the lemma-based approach outperforms the word form-based approach independently of the features included in the model. Also, the best overall performance on the test data is achieved using the lemma-based approach with the feature model including information on the PoS of the ambiguous word form/lemma, its dependency relation labels, as well as the context lemmas. We can observe an error rate reduction of 10% with regard to the lemma-based model including PoS in context, and a reduction of 6% of errors with regard to the best model based on word forms.

In table 8.3, we present a comparison with another system for Dutch already described in section 5.5. The results published in Hendrickx et al. (2002) are the best results for Dutch until now.1


                             ambiguous   all
baseline on test data        78.5        89.4
word form-based classifiers  84.8        92.8
lemma-based classifiers      85.7        93.4
(Hendrickx et al., 2002)     84.0        92.5

Table 8.3: Comparison of results (in %) from different systems on the test section of the Dutch Senseval-2 data.


As we can see, both the word form-based classifiers and the lemma-based classifiers produce higher accuracy than the results from the MBL system by Hendrickx et al. (2002). We think that this is mainly due to the fact that our feature model includes deep linguistic information in the form of dependency relations whereas they include PoS of the context. From our comparison of results on the tuning data (see chapter 7), we conclude that dependency relations provide more useful clues for disambiguation than PoS of the context surrounding an ambiguous word. The same conclusion can be drawn with regard to the results on the test data and even with regard to two different ML algorithms, namely maximum entropy and MBL.

The lemma-based model actually leads to an error rate reduction of 10% if compared to the MBL WSD system. Our maximum entropy system is thus state-of-the-art for Dutch word sense disambiguation, showing that the combination of building classifiers based on lemmas instead of word forms and including dependency relation labels as linguistic features (along with context lemmas) works best.

To conclude, it is important to mention that what has already been shown in the leave-one-out tuning setup can also be found in the results on the test data. Clearly a combination of different sources of linguistic knowledge leads to the best results. Especially the addition of deep linguistic knowledge in the form of dependency relations in combination with building classifiers based on lemmas instead of word forms greatly improves accuracy over earlier results. Future work will include a more detailed error analysis of our results with dependency relations as features with respect to e.g. trends in accuracy on different PoS or ambiguous words with (very) skewed sense distributions.

1 Unfortunately, there has been very little interest in Dutch WSD and therefore there are very few results to compare our approach to.


Chapter 9

Conclusions and Future Work

We will now proceed to the general conclusions that can be reached from the research on Dutch WSD presented in this thesis. Even though our experiments were conducted on Dutch, we are convinced that some if not all of our results will also be relevant for further research on other languages. Therefore, the content of this chapter should be understood in the context of WSD in general. In section 9.1 our main findings are discussed in more detail. Ideas for future work and applications of WSD (section 9.2) conclude this chapter and this thesis.

9.1 Conclusions

The project’s main goal was to develop a word sense disambiguation system for Dutch which automatically determines the meaning of a particular (ambiguous) word in a given context. In this, we succeeded. The main question we have tried to answer in the current thesis was which kind of linguistic knowledge is most useful for word sense disambiguation of Dutch. The results from our research suggest that in the case of a statistical disambiguation algorithm the combination of linguistic features yields the best results.

Several remarks are in order to accompany the answer to our initial question of whether and to what extent linguistic knowledge improves WSD for Dutch and which information sources prove to be most successful. First of all, we have clearly shown that the addition of linguistic knowledge greatly improves a statistical WSD classification system (with regard to the baseline), for a range of knowledge types. Also, as has been shown in chapters 7 and 8, the combination of several orthogonal sources of linguistic features yields the best results. This means that WSD for Dutch profits from various sources of linguistic knowledge.


Thus, there is not a single best linguistic knowledge source, but rather a number of (carefully) selected features that work best in combination. It is important, however, that the features that are used together in a statistical model do not represent similar information and, at the same time, are equally sparse.

We found that especially deep linguistic information contains important cues for disambiguation. In combination with an approach taking advantage of morphological information, the lemma-based approach, the best results for WSD of Dutch on the Senseval-2 data set are obtained. Our system achieves significantly higher disambiguation accuracy than the (few) results for Dutch that have been reported in the literature up to now.

9.2 Future Work

9.2.1 Semantic Information

Continuing the underlying structure of the thesis, following the kind of linguistic information added to the WSD system, semantic information has not been used so far. There are several possible ways of taking advantage of semantic information.

Topic Information As we have explained in chapter 3, information beyond the sentence level is not taken into account in the current system. So an obvious extension of the existing WSD system would be to include topical information. Agirre and Martínez (2001) argue that using more than sentential context with a corpus-based algorithm implicitly acquires topical information from the wider context. One approach would therefore be to include the content words contained in a predefined number of sentences preceding and/or following the sentence containing a given ambiguous word. Following Agirre and Martínez’s reasoning, adding the collected content words as features should capture topic information. Another way of including information on the topic would be to use a topic finder. Topic detection has been tried in various ways and fields with reasonable success (see e.g. Flynn and Dunnion (2004); Chali (2001); Makkonen et al. (2004)). Using such a specialized tool has the advantage that the desired information can be encoded more explicitly and more reliably.
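A sketch of the first approach could look as follows; the stopword list and the sentence representation are placeholders standing in for a proper content-word filter (e.g. based on PoS), not part of any existing system:

    # Collect "topical" features: content lemmas from a window of sentences
    # around the target sentence; STOPWORDS is a toy stand-in filter.
    STOPWORDS = {'de', 'het', 'een', 'en', 'van', 'in', 'is', 'dat'}

    def topical_features(sentences, i, window=2):
        """sentences: list of lemma lists; i: index of the target sentence."""
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        return {lemma for sent in sentences[lo:hi]
                for lemma in sent if lemma not in STOPWORDS}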

Domain Information Another kind of information that could be taken into consideration is domain information. So far, it has only scarcely been used in WSD research. Magnini et al. (2002) report on experiments using a version of WordNet extended with domain annotations to test the intuition that “word senses occurring in a coherent section of text tend to maximize their belonging to the same domain” (p. 359). The various senses of the word to be disambiguated are checked against the domain categories of other content words in a given portion of text and the best matching domain is chosen. A prerequisite for this approach is the availability of a lexical resource annotated with domains. Also, fine-grained sense distinctions pertaining to the same general domain are difficult to capture with this technique. However, it can definitely provide a valuable first “filter” to restrict the possible senses to the most probable ones based on domain coherence. Hence, domain information could be used to (dynamically) restrict the sense inventory.


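As a sketch, such a domain filter might be implemented along these lines, assuming a hypothetical mapping from sense tags to domain labels:

    # Hypothetical domain-based sense filter in the spirit of Magnini et al.
    # (2002): keep only senses whose domain occurs among the domains of the
    # surrounding content words; fall back to all senses if nothing matches.
    def filter_by_domain(senses, context_domains, domain_of):
        """domain_of: sense tag -> domain label (an assumed resource)."""
        kept = [s for s in senses if domain_of.get(s) in context_domains]
        return kept or list(senses)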

Co-occurrence Information Also, co-occurrence data could be used in future work:

“The technique [of gathering co-occurrence data from corpora] is well-known in information theory. Words that tend to have the same co-occurrence pattern also tend to be similar in meaning (Sparck-Jones and Willett, 1997). This can be applied to independent corpora or to the definitions themselves.” (Vossen et al., 1999a)

This method can be seen as a way of counteracting data sparseness by extracting co-occurrence patterns of different words and computing their similarity. Especially on the way to less data-intensive WSD methods, this technique could be employed to extract examples of words without training samples based on their similarity to another ambiguous word. Also, hand-annotation of new data might be easier and faster using this technique.
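One standard way to operationalize “similar co-occurrence pattern” is the cosine between co-occurrence count vectors, as in this generic sketch (a common measure, not the specific one proposed by Vossen et al.):

    # Cosine similarity between co-occurrence count vectors.
    from collections import Counter
    from math import sqrt

    def cosine(c1, c2):
        dot = sum(c1[w] * c2[w] for w in c1)
        norm = sqrt(sum(v * v for v in c1.values())) * \
               sqrt(sum(v * v for v in c2.values()))
        return dot / norm if norm else 0.0

    print(cosine(Counter({'eat': 3, 'red': 1}),
                 Counter({'eat': 2, 'red': 2})))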

Further Semantic Information Selectional restrictions are another potential source of semantic information. They encode knowledge similar to subcategorization frames and argument-head relations, but at the same time, the information is more abstract since it is given in terms of semantic classes rather than word forms. Also, the information about semantic roles, such as whether a given word is an agent or the theme of a sentence, might hold important cues to resolve interdependent ambiguities of words in the same sentence.

9.2.2 EuroWordNet to Acquire More Data

Many WSD systems for English extensively use WordNet and especially its hierarchical structure (Hawkins, 1999; Li et al., 1995; Lin, 1997; Mihalcea and Moldovan, 1999, 2001b). As we have already mentioned in chapter 2, EuroWordNet (EWN) contains less information than its English counterpart. Incorporating glosses into the Dutch ontology would add a lot of possibilities for using EWN for WSD, but also a more extensive number of (annotated) semantic relations is sorely missed.



Since the Dutch Senseval-2 data has not been annotated with EWN senses, an obvious first step would be to find a mapping between the existing sense inventory and the EWN ontology. There are several difficulties to be overcome, however, in devising a strategy for a mapping with regard to the differences between the Senseval-2 sense inventory and the structure of EWN. First of all, the Senseval-2 sense inventory for Dutch is non-hierarchical while the Dutch EWN is hierarchically organized (albeit in a rather flat hierarchy). Also, the Dutch Senseval-2 sense inventory contains multi-word lexemes which are not incorporated in the Dutch EWN, and the sense tags are PoS ambiguous whereas the Dutch EWN is divided by PoS. Furthermore, the Dutch sense inventory includes “empty” senses [=] in contrast to EWN that assigns at least one but possibly more distinct senses for every word. Last but not least, the Senseval-2 data has a restricted vocabulary (due to it being collected from children’s books) and, consequently, restricted senses associated with the annotated ambiguous words. The Dutch EWN, on the other hand, is a complete semantic network for the most frequent words in the Dutch language where all possible senses are taken into consideration. This means that when a mapping is attempted, it will often be the case that many more varying senses will be found in EWN than in the original sense inventory.

Exploiting the fact that many sense tags in the Dutch Senseval-2 data consist of the lemma of the ambiguous word in combination with a synonym, near-synonym or antonym of a particular sense, a possible way to achieve a mapping is to extract all corresponding synsets for a given ambiguous word in EWN and then compare the sense description of the Senseval-2 tags with the synonyms, hyperonyms, hyponyms and antonyms associated with a given synset.
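This overlap-based comparison could be sketched as follows; the synset accessor and its output format are assumptions on our part, since the Dutch EWN offers no standard programmatic interface for this:

    # Hypothetical mapping of a Senseval-2 sense tag to a Dutch EWN synset
    # by lexical overlap; candidate_synsets would come from an (assumed)
    # EWN lookup returning (synset_id, related_words) pairs, where
    # related_words holds synonyms, hyperonyms, hyponyms and antonyms.
    def map_sense(sense_tag, candidate_synsets):
        if not candidate_synsets:
            return None
        description = set(sense_tag.split())        # e.g. 'aarde planeet'
        scored = [(len(description & set(words)), sid)
                  for sid, words in candidate_synsets]
        best_score, best_id = max(scored)
        return best_id if best_score > 0 else None

    print(map_sense('aarde planeet',
                    [('aarde-1', ['planeet', 'wereld']),
                     ('aarde-2', ['grond', 'bodem'])]))   # aarde-1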

Our main idea is to use this mapping from Senseval-2 senses to the Dutch EWN hierarchy in order to get more training material for each ambiguous word in an unsupervised way. In a first step, unambiguous synonyms of a certain sense s of an ambiguous word w need to be found in the EWN synsets. Then, using a large corpus of Dutch, such as e.g. the Twente Nieuws Corpus1, all sentences containing an unambiguous synonym of s are extracted and used as additional unannotated training material (“pseudo samples”) for that particular sense s (Leacock et al., 1998).

1 See http://wwwhome.cs.utwente.nl/~druid/TWNC/TWNC-main.html.


Wang and Matsumoto (2004) have tried a similar idea for WSD of Chinese, reaching the conclusion that pseudo samples are helpful, but sense-tagged samples remain more informative for WSD.
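The harvesting step itself is straightforward; in this sketch the corpus iterator and the synonym sets are stand-ins for the Twente Nieuws Corpus and the EWN lookup described above:

    # Harvest "pseudo samples": collect corpus sentences that contain an
    # unambiguous synonym of a sense; the inputs here are toy stand-ins.
    def harvest_pseudo_samples(corpus_sentences, synonyms_by_sense):
        samples = {sense: [] for sense in synonyms_by_sense}
        for sentence in corpus_sentences:
            tokens = set(sentence.split())
            for sense, synonyms in synonyms_by_sense.items():
                if tokens & synonyms:       # an unambiguous synonym occurs
                    samples[sense].append(sentence)
        return samples

    samples = harvest_pseudo_samples(
        ['de planeet draait om de zon'],
        {'aarde_planeet': {'planeet'}, 'aarde_grond': {'bodem'}})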

It would incontestably also be useful to sense-tag more Dutch text using EuroWordNet as an inventory—whether by hand or using less expensive methods such as bootstrapping. On the one hand, new annotated data would allow us to test our strategies on a larger data set. On the other hand, data annotated with a hierarchical sense inventory opens the possibility to explore other approaches to WSD as well.

9.2.3 Other Languages

In the context of this thesis, we have only applied our WSD system to Dutch. It would definitely be interesting to apply our findings to other languages. Even though most research has been done on English, it would nevertheless be valuable to compare our approach to existing systems. Since English is a language with little inflectional morphology, it is not certain that our lemma-based approach will lead to significant improvements. With other languages, such as German or Italian, morphology is of greater influence and therefore a lemma-based approach could yield an improvement in disambiguation accuracy.

We also think that languages with a free word order will profit from using dependency relations as a feature. They capture structural syntactic information which might not be found when using only the context.

9.2.4 Applications

In this thesis we have only been concerned with a stand-alone WSD system for Dutch. This choice was made in order to investigate the use of linguistic knowledge for WSD and assess its impact on stand-alone disambiguation accuracy. A possible extension of the present work would therefore be to include and test WSD in general, and this WSD module in particular, in concrete NLP applications.

The most obvious application is MT. In a first step, a suitable sense inventory for the languages considered needs to be established, deciding on issues such as the granularity of sense distinctions. Then, the source text has to be disambiguated in order to facilitate the choice of translations for ambiguous words. It would be very interesting to compare the quality and accuracy of the target text translated with and without word sense disambiguation information, also with respect to restricted domains and free texts.

Another application which could benefit from WSD is IR (Schütze and Pedersen, 1995). The difficulty with applying WSD to IR is that many potential documents need to be “disambiguated” accurately in order for WSD to improve retrieval accuracy. Moreover, the human formulating the query would have to be very precise in the choice of the intended meaning. Otherwise the query results might not correspond to the expectations, even though the disambiguation accuracy might be good.

Further applications in which WSD could lead to improvements include question answering, speech synthesis, grammatical analysis (e.g. parsing), and text processing (such as converting all capitals to regular case, or accent restoration (Yarowsky, 1994)). Recently, using WSD for the semantic annotation of corpora (Král, 2004) or for the evaluation of automatically built WordNet-like ontologies (Tufiş et al., 2004) has also been proposed.

Bibliography

Agirre, E. and Martínez, D. (2000). Exploring automatic word sense disambiguation with decision lists and the web. In Proceedings of the Coling 2000 Workshop “Semantic Annotation and Intelligent Annotation”, Centre Universitaire, Luxembourg.

Agirre, E. and Martínez, D. (2001). Knowledge sources for word sense disambiguation. In Matoušek, V., Mautner, P., Mouček, R., and Taušer, K., editors, Proceedings of the Fourth International Conference TSD 2001, Plzeň, Notes in Computer Science, pages 1–10, Berlin. Springer Verlag.

Agirre, E. and Martínez, D. (2002). Integrating selectional preferences in WordNet. In Proceedings of the First International WordNet Conference, Mysore.

Agirre, E. and Rigau, G. (1996). Word sense disambiguation using conceptual density. In Proceedings of the 16th International Conference on Computational Linguistics (Coling 1996), pages 16–22, Copenhagen.

Atkins, S. (1993). Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41:5–72.

Baayen, R. H., Piepenbrock, R., and van Rijn, A. (1993). The CELEX lexical database (CD-ROM). Linguistic Data Consortium, University of Pennsylvania, Philadelphia.

Balay, S., Gropp, W., Kaushik, D., McInnes, L. C., and Smith, B. (2002). PETSc users manual. Technical Report ANL-95/11, Argonne National Laboratory, Argonne. Revision 2.1.2.

Balay, S., Gropp, W., McInnes, L. C., and Smith, B. (1997). Efficient management of object oriented numerical software libraries. In Arge, E., Bruaset, A. M., and Langtangen, H. P., editors, Modern Software Tools in Scientific Computing, pages 163–202. Birkhäuser Press, Basel.

Basili, R., Rocca, M. D., and Pazienza, M. T. (1997). Towards a bootstrapping framework for corpus semantic tagging. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C.

Benson, S., McInnes, L. C., Moré, J., and Sarich, J. (2002). TAO users manual. Technical Report ANL/MCS-TM-242, Argonne National Laboratory, Argonne. Revision 1.4.

Berger, A., Pietra, S. D., and Pietra, V. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

Berghmans, J. (1994). WOTAN—een automatische grammaticale tagger voor het Nederlands. Master’s thesis, Nijmegen University, Nijmegen.

Bouma, G., van Noord, G., and Malouf, R. (2001). Alpino: Wide-coverage computational analysis of Dutch. In Daelemans, W., Sima’an, K., Veenstra, J., and Zavrel, J., editors, Computational Linguistics in the Netherlands 2000, pages 45–59, Amsterdam. Rodopi.

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543–565.

Briscoe, T. and Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 1499–1504, Las Palmas, Gran Canaria.

Bruce, R. and Wiebe, J. (1994). Word-sense disambiguation using decomposable models. In 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994), pages 139–146, Las Cruces.

Bruce, R. and Wiebe, J. (1998). Word-sense distinguishability and inter-coder agreement. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP-98), pages 53–60, Granada.

Buijs, A. (1997). Statistiek om mee te werken. Educatieve Partners Nederland, Houten, 6th edition.

Busemann, S., Schmeier, S., and Arens, R. (2000). Message classification in the call center. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP 2000), Seattle.

Carroll, J., Briscoe, T., and Sanfilippo, A. (1998a). Parser evaluation: A survey and a new proposal. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC 1998), pages 447–454, Granada.

Carroll, J. and McCarthy, D. (2000). Word sense disambiguation using automatically acquired verbal preferences. Computers and the Humanities, 34(1–2):109–114.

Carroll, J., Minnen, G., and Briscoe, T. (1998b). Can subcategorisation probabilities help a statistical parser? In Proceedings of the 6th ACL/SIGDAT Workshop on Very Large Corpora, pages 118–126, Montreal.

Chali, Y. (2001). Topic detection using lexical chains. In Monostori, L., Váncza, J., and Ali, M., editors, Proceedings of the 14th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE 2001), Lecture Notes in Computer Science 2070, pages 552–558, Budapest. Springer Publisher.

Chapman, R., editor (1977). Roget’s International Thesaurus. Harper and Row, New York, 4th edition.

Chen, S. and Rosenfeld, R. (2000). A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50.

Chodorow, M., Leacock, C., and Miller, G. (2000). A topical/local classifier for word sense identification. Computers and the Humanities, 34(1–2):115–120.

Choueka, Y. and Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19:147–158.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, Computer and Information Science Department, University of Pennsylvania, Philadelphia.

Cowie, J., Guthrie, J., and Guthrie, L. (1992). Lexical disambiguation using simulated annealing. In Proceedings of the 15th [sic] International Conference on Computational Linguistics (Coling 1992), pages 359–365, Nantes.

Daciuk, J. (2000). Finite state tools for natural language processing. In Proceedings of the Coling 2000 Workshop “Using Toolsets and Architectures to Build NLP Systems”, pages 34–37, Centre Universitaire, Luxembourg.

Daciuk, J. and van Noord, G. (2004). Finite automata for compact representation of tuple dictionaries. Theoretical Computer Science, 313(1):45–56.

Daelemans, W. and Hoste, V. (2002). Evaluation of machine learning methods for natural language processing tasks. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 755–760, Las Palmas, Gran Canaria.

Daelemans, W., van den Bosch, A., and Zavrel, J. (1999). Forgetting exceptions is harmful in language learning. Machine Learning, 34(1):11–43.

Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. (2002a). MBT: Memory-Based tagger, reference guide. Technical Report ILK 02-09, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg. Version 1.0.

Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. (2002b). TiMBL: Tilburg Memory-Based learner, reference guide. Technical Report ILK 02-10, Induction of Linguistic Knowledge, Computational Linguistics, Tilburg University, Tilburg. Version 4.3.

Dagan, I. and Itai, A. (1994). Word sense disambiguation using a second language monolingual corpus. Computational Linguistics, 20(4):563–596.

Dagan, I., Itai, A., and Schwall, U. (1991). Two languages are more informative than one. In 29th Annual Meeting of the Association for Computational Linguistics (ACL 1991), pages 130–137, Berkeley.

Diab, M. and Resnik, P. (2002). An unsupervised method for word sense tagging using parallel corpora. In 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), pages 255–262, Philadelphia.

Dini, L., di Tomaso, V., and Segond, F. (2000). GINGER II: An example-driven word sense disambiguator. Computers and the Humanities, 34(1–2):121–126.

Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130.

Drenth, E. (1997). Using a hybrid approach towards Dutch part-of-speech tagging. Master’s thesis, Alfa-Informatica, University of Groningen, Groningen.

Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons, New York.

Edmonds, P. and Cotton, S. (2001). Senseval-2: Overview. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5, Toulouse.

Edmonds, P. and Kilgarriff, A. (2002). Introduction to the special issue on evaluating word sense disambiguation systems. Natural Language Engineering, Special Issue on Word Sense Disambiguation Systems, 8(4):279–291.

Escudero, G., Màrquez, L., and Rigau, G. (2000a). Boosting applied to word sense disambiguation. In Proceedings of the 12th European Conference on Machine Learning (ECML), pages 129–141, Barcelona.

Escudero, G., Màrquez, L., and Rigau, G. (2000b). A comparison between supervised learning algorithms for word sense disambiguation. In Proceedings of the 4th Conference on Computational Natural Language Learning (CoNLL-2000), pages 31–36, Lisbon.

Escudero, G., Màrquez, L., and Rigau, G. (2000c). On the portability and tuning of supervised word sense disambiguation systems. Technical Report LSI-00-30, Software Department (LSI), Technical University of Catalonia (UPC).

Federici, S., Montemagni, S., and Pirrelli, V. (1999). Sense: An analogy-based word sense disambiguation system. Natural Language Engineering, 5(2):207–218.

Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge.

Florian, R., Cucerzan, S., Schafer, C., and Yarowsky, D. (2002). Combining classifiers for word sense disambiguation. Natural Language Engineering, Special Issue on Word Sense Disambiguation Systems, 8(4):327–341.

Flynn, C. and Dunnion, J. (2004). Domain-informed topic detection. In Gelbukh, A., editor, Proceedings of the Fifth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-04), Lecture Notes in Computer Science 2945, pages 617–626, Seoul. Springer Publisher.

Frakes, W. B. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and Algorithms. Prentice Hall, Upper Saddle River.

Freund, J. (2004). Modern Elementary Statistics. Prentice Hall, Upper Saddle River, 11th edition.

Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Computer and System Sciences, 55(1):119–139.

Fujii, A. (1998). Corpus-Based Word Sense Disambiguation. PhD thesis, Tokyo Institute of Technology, Tokyo.

Gahl, S. (1998). Automatic extraction of subcategorization frames for corpus-based dictionary-building. In Proceedings of the 8th EURALEX International Congress (EURALEX’98), pages 445–452, Liège.

Gale, W., Church, K., and Yarowsky, D. (1992a). Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In 30th Annual Meeting of the Association for Computational Linguistics (ACL 1992), pages 249–256, Newark.

Gale, W., Church, K., and Yarowsky, D. (1992b). A method for disambiguating word senses in a corpus. Computers and the Humanities, 26:415–439.

Gale, W., Church, K., and Yarowsky, D. (1992c). One sense per discourse. In Proceedings of the ARPA Workshop on Speech and Natural Language Processing, pages 233–237.

Gale, W., Church, K., and Yarowsky, D. (1992d). Work on statistical methods for word sense disambiguation. In AAAI Fall Symposium on Probabilistic Approaches to Natural Language, pages 54–60, Cambridge.

Gaustad, T. (2001). Statistical corpus-based word sense disambiguation: Pseudowords vs. real ambiguous words. In Companion Volume to the Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL/EACL 2001) – Proceedings of the Student Research Workshop, pages 61–66, Toulouse.

Gaustad, T. (2003). The importance of high quality input for WSD: An application-oriented comparison of part-of-speech taggers. In Proceedings of the Australasian Language Technology Workshop (ALTW 2003), pages 65–72, Melbourne.

Gaustad, T. (2004). A lemma-based approach to a maximum entropy word sense disambiguation system for Dutch. In Proceedings of the 20th International Conference on Computational Linguistics (Coling 2004), pages 778–784, Geneva.

Gaustad, T. and Bouma, G. (2002). Accurate stemming of Dutch for text classification. In Theune, M., Nijholt, A., and Hondorp, H., editors, Computational Linguistics in the Netherlands 2001, pages 104–117, Amsterdam. Rodopi.

Good, I. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40:237–264.

Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42(1):7–15.

Hawkins, P. (1999). DURHAM: A Word Sense Disambiguation System. PhD thesis, Laboratory for Natural Language Engineering, Department of Computer Science, University of Durham, Durham.

Hawkins, P. and Nettleton, D. (2000). Large scale WSD using learning applied to Senseval. Computers and the Humanities, 34(1–2):135–140.

Haynes, S. (2001). Semantic tagging using WordNet examples. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 79–82, Toulouse.

Hearst, M. (1991). Noun homograph disambiguation using local context in large text corpora. In Proceedings of the 7th Annual Conference of the UW Centre for the New OED and Text Research: Using Corpora, Oxford.

Hendrickx, I. and van den Bosch, A. (2001). Dutch word sense disambiguation: Data and preliminary results. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 13–16, Toulouse.

Hendrickx, I., van den Bosch, A., Hoste, V., and Daelemans, W. (2002). Dutch word sense disambiguation: Optimizing the localness of context. In Proceedings of the ACL 2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia.

Hirst, G. (1987). Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press, Cambridge.

Hoste, V., Daelemans, W., Hendrickx, I., and van den Bosch, A. (2002a). Evaluating the results of a Memory-Based word-expert approach to unrestricted word sense disambiguation. In Proceedings of the ACL 2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pages 95–101, Philadelphia.

Hoste, V., Hendrickx, I., Daelemans, W., and van den Bosch, A. (2002b). Parameter optimization for machine-learning of word sense disambiguation. Natural Language Engineering, Special Issue on Word Sense Disambiguation Systems, 8(4):311–325.

Hudson, R. (2003). The psychological reality of syntactic dependency relations. In First International Conference on Meaning-Text Theory (MTT 2003), pages 169–180, Paris. Invited talk.

Ide, N., Erjavec, T., and Tufiş, D. (2002). Sense discrimination with parallel corpora. In Proceedings of the ACL 2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, pages 56–60, Philadelphia.

Ide, N. and Véronis, J. (1998). Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40.

Karov, Y. and Edelman, S. (1996). Learning similarity-based word sense disambiguation from sparse data. In Proceedings of the 4th Workshop on Very Large Corpora, Copenhagen.

Kelly, E. and Stone, P. (1975). Computer Recognition of English Word Senses. North-Holland Linguistic Series 13. North-Holland, Amsterdam.

Kilgarriff, A. (1994). The myth of completeness and some problems with consistency (the role of frequency in deciding what goes in the dictionary). In Proceedings of the 6th International Congress on Lexicography (EURALEX’94), pages 101–106, Amsterdam.

Kilgarriff, A. (1997). “I don’t believe in word senses”. Computers and the Humanities, 31:97–113.

Kilgarriff, A. (1998a). Gold standard datasets for evaluating word sense disambiguation programs. Computer Speech and Language, Special Issue on Evaluation, 12(4):453–472.

Kilgarriff, A. (1998b). Senseval: An exercise in evaluating word sense disambiguation programs. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC 1998), pages 581–588, Granada.

Kilgarriff, A. (2001). English lexical sample task description. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 17–20, Toulouse.

Kilgarriff, A. and Palmer, M. (2000). Special issue on Senseval: Evaluating word sense disambiguation programs. Computers and the Humanities, 34(1–2).

Kilgarriff, A. and Rosenzweig, J. (2000). Framework and results for English Senseval. Computers and the Humanities, 34(1–2):15–48.

Klein, D. and Manning, C. (2003). Maxent models, conditional estimation, and optimization without the magic. ACL 2003 Tutorial Notes, Sapporo.

Klein, D., Toutanova, K., Ilhan, H. T., Kamvar, S., and Manning, C. (2002). Combining heterogeneous classifiers for word-sense disambiguation. In Proceedings of the ACL 2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia.

Kraaij, W. (2004). Variations on Language Modeling for Information Retrieval. PhD thesis, Computer Science, University of Twente, Enschede.

Kraaij, W. and Pohlmann, R. (1994). Porter’s stemming algorithm for Dutch. In Noordman, L. and de Vroomen, W., editors, Informatiewetenschap 1994: Wetenschappelijke bijdragen aan de derde STINFON Conferentie, pages 167–180, Tilburg.

Kraaij, W. and Pohlmann, R. (1995). Evaluation of a Dutch stemming algorithm. In The New Review of Document and Text Management, pages 23–45. Taylor Graham, London.

Kraaij, W. and Pohlmann, R. (1996). Using linguistic knowledge in information retrieval. OTS Working Paper OTS-WP-CL-96-001, OTS, University of Utrecht, Utrecht.

Král, R. (2004). Semantic annotating of Czech corpus via WSD. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 1807–1810, Lisbon.

Krovetz, R. (1993). Viewing morphology as an inference process. In Korfhage, R., Rasmussen, E., and Willett, P., editors, Proceedings of the 16th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 191–203, Pittsburgh.

Krovetz, R. (1998). More than one sense per discourse. In Web-Based Proceedings of Senseval-1, Herstmonceux Castle, Sussex.

Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence (AAAI-92), San Jose.

Leacock, C., Chodorow, M., and Miller, G. (1998). Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147–165.

Lee, Y. K. and Ng, H. T. (2002). An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), pages 41–48, Philadelphia.

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the ACM SIGDOC Conference, pages 24–26, Toronto.

Li, X., Szpakowicz, S., and Matwin, S. (1995). A WordNet-based algorithm for word sense disambiguation. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1368–1374, Montreal.

Lin, D. (1993). Principle-based parsing without overgeneration. In 31st Annual Meeting of the Association for Computational Linguistics (ACL 1993), pages 112–120, Columbus.

Lin, D. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL 1997), pages 64–71, Madrid.

Lin, D. (2000). Word sense disambiguation with a similarity-smoothed case library. Computers and the Humanities, 34(1–2):147–152.

Litkowski, K. (2000). Senseval: The CL Research experience. Computers and the Humanities, 34(1–2):153–158.

Litkowski, K. (2001). Use of machine readable dictionaries for word-sense disambiguation in Senseval-2. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 107–110, Toulouse.

Litkowski, K. (2004a). Senseval-3 task: Automatic labeling of semantic roles. In Proceedings of Senseval-3, Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 9–12, Barcelona.

Litkowski, K. (2004b). Senseval-3 task: Word sense disambiguation of WordNet glosses. In Proceedings of Senseval-3, Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 13–16, Barcelona.

Magnini, B., Strapparava, C., Pezzulo, G., and Gliozzo, A. (2002). The role of domain information in word sense disambiguation. Natural Language Engineering, Special Issue on Word Sense Disambiguation Systems, 8(4):359–373.

Makkonen, J., Ahonen-Myka, H., and Salmenkivi, M. (2004). Simple semantics in topic detection and tracking. Information Retrieval, 7(3–4):347–368.

Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49–55.

Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.

Martínez, D., Agirre, E., and Màrquez, L. (2002). Syntactic features for high precision word sense disambiguation. In Proceedings of the 19th International Conference on Computational Linguistics (Coling 2002), pages 626–632, Taipei.

McCarthy, D. and Carroll, J. (2003). Disambiguating nouns, verbs, and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654.

McRoy, S. (1992). Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18(1):1–30.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092.

Mihalcea, R. and Moldovan, D. (1998). Word sense disambiguation based on semantic density. In Proceedings of the Coling-ACL’98 Workshop “Usage of WordNet in Natural Language Processing Systems”, pages 16–22, Montreal.

Mihalcea, R. and Moldovan, D. (1999). A method for word sense disambiguation of unrestricted text. In 37th Annual Meeting of the Association for Computational Linguistics (ACL 1999), pages 152–158, Maryland.

Mihalcea, R. and Moldovan, D. (2001a). A highly accurate bootstrapping algorithm for word sense disambiguation. International Journal on Artificial Intelligence Tools, 10(1–2):5–21.

Mihalcea, R. and Moldovan, D. (2001b). Pattern learning and active feature selection for word sense disambiguation. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 127–130.

Miller, G., Leacock, C., Tengi, R., and Bunker, R. (1993). A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology. Morgan Kaufmann, San Mateo.

Miller, G., Leacock, C., Tengi, R., Bunker, R., and Miller, K. (1990). Five papers on WordNet. Special Issue of International Journal of Lexicography, 3(4).

Monz, C. (2003). From Document Retrieval to Question Answering. PhD thesis, Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam.

Mooney, R. (1996). Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-96), pages 82–91, University of Pennsylvania.

Moortgat, M., Schuurman, I., and van der Wouden, T. (2002). CGN Syntactische Annotatie. CGN Project.

Nakov, P. and Hearst, M. (2003). Category-based pseudowords. In Proceedings of the Human Language Technology Conference (HLT-NAACL 2003), pages 67–69, Edmonton.

Nerbonne, J. (1993). A feature-based syntax/semantics interface. Annals of Mathematics and Artificial Intelligence, 8:107–132.

Ng, H. T. (1997). Getting serious about word sense disambiguation. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C.

Ng, H. T. and Lee, H. B. (1996). Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996), pages 40–47, Santa Cruz.

Ng, H. T., Wang, B., and Chan, Y. S. (2003). Exploiting parallel texts for word sense disambiguation. In 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 455–462, Sapporo.

Pedersen, T. (1998). Learning Probabilistic Models of Word Sense Disambiguation. PhD thesis, Southern Methodist University, Dallas.

Pedersen, T. (2000). A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2000), pages 63–69, Seattle.

Pedersen, T. (2002). Evaluating the effectiveness of ensembles of decision trees in disambiguating Senseval lexical samples. In Proceedings of the ACL 2002 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia.

Pedersen, T. and Banerjee, S. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Gelbukh, A., editor, Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-02), Lecture Notes in Computer Science 2276, Mexico City. Springer Publisher.

Pedersen, T. and Bruce, R. (1997a). Distinguishing word senses in untagged text. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97), pages 197–207, Providence.

Pedersen, T. and Bruce, R. (1997b). A new supervised learning algorithm for word sense disambiguation. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI-97), pages 604–609, Providence.

Pedersen, T., Bruce, R., and Wiebe, J. (1997). Sequential model selection for word sense disambiguation. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), pages 388–396, Washington, D.C.

Popovic, M. and Willett, P. (1992). The effectiveness of stemming for natural language access to Slovene textual data. Journal of the American Society for Information Science, 43(5):384–390.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Preiss, J. (2004a). Probabilistic word sense disambiguation. Computer Speech and Language, 18(3):319–337.

Preiss, J. (2004b). Probabilistic WSD in Senseval-3. In Proceedings of Senseval-3, Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 213–216, Barcelona.

Preiss, J. and Korhonen, A. (2004). WSD for subcategorization acquisition task description. In Proceedings of Senseval-3, Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 33–36, Barcelona.

Prins, R. and van Noord, G. (2001). Unsupervised PoS-tagging improves parsing accuracy and parsing efficiency. In Proceedings of the International Workshop on Parsing Technologies (IWPT 2001), pages 154–165, Beijing.

Prins, R. and van Noord, G. (2004). Reinforcing parser preferences through tagging. Traitement Automatique des Langues, Special Issue on Evolutions in Parsing, 44(3):121–139.

Procter, P., editor (1978). Longman Dictionary of Contemporary English. Longman Group Limited, Harlow.

Pustejovsky, J. (1995). The Generative Lexicon. MIT Press, Cambridge.

Pustejovsky, J. and Boguraev, B., editors (1996). Lexical Semantics: The Problem of Polysemy. Clarendon Press, Oxford.

Ratnaparkhi, A. (1997). A simple introduction to maximum entropy models for natural language processing. Technical Report IRCS Report 97-08, IRCS, University of Pennsylvania, Philadelphia.

Ravin, Y. and Leacock, C., editors (2000). Polysemy: Theoretical and Computational Approaches. Oxford University Press, Oxford.

Resnik, P. (1997). Selectional preferences and sense disambiguation. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, pages 52–57, Washington, D.C.

Resnik, P. and Yarowsky, D. (1997). A perspective on word sense disambiguation methods and their evaluation. In ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C.

Resnik, P. and Yarowsky, D. (1999). Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2):113–133.

Riloff, E. (1995). Little words can make a big difference for text classification. In Fox, E., Ingwersen, P., and Fidel, R., editors, Proceedings of the 18th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 130–136, Seattle.

Schrooten, W. and Vermeer, A. (1994). Woorden in het basisonderwijs. 15.000 woorden aangeboden aan leerlingen, volume 6 of Studies in meertaligheid. Tilburg University Press, Tilburg.

Schütze, H. (1992). Context space. In AAAI Fall Symposium on Probabilistic Approaches to Natural Language, pages 113–120, Cambridge.

Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1):97–123.

Schütze, H. and Pedersen, J. (1995). Information retrieval based on word senses. In Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas.

Small, S., Cottrell, G., and Tanenhaus, M., editors (1988). Lexical Ambiguity Resolution: Perspectives from Psycholinguistics, Neuropsychology, and Artificial Intelligence. Morgan Kaufmann, San Mateo.

Sparck-Jones, K. and Willett, P., editors (1997). Readings in Information Retrieval. Morgan Kaufmann, San Mateo.

Spitters, M. (2000). Comparing feature sets for learning text categorization. In Proceedings of the 6th Conference on Content-Based Multimedia Information Access (RIAO 2000), pages 1124–1135, Paris.

Stetina, J., Kurohashi, S., and Nagao, M. (1998). General word sense disambiguation method based on a full sentential context. In Proceedings of the Coling-ACL’98 Workshop “Usage of WordNet in Natural Language Processing Systems”, pages 1–8, Montreal.

Stevenson, M. (1998). Extracting syntactic relations using heuristics. In Kruijff-Korbayová, I., editor, Proceedings of the Third ESSLLI Student Session, pages 248–256.

Stevenson, M. and Wilks, Y. (2001). The interaction of knowledge sources in word sense disambiguation. Computational Linguistics, 27(3):321–349.

Suárez, A. and Palomar, M. (2002). A maximum entropy-based word sense disambiguation system. In Proceedings of the 19th International Conference on Computational Linguistics (Coling 2002), pages 960–966, Taipei.

Tufiş, D., Ion, R., and Ide, N. (2004). Word sense disambiguation as a WordNets’ validation method in BalkaNet. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 1071–1074, Lisbon.

uit den Boogaart, P. (1975). Woordfrequenties in Geschreven en Gesproken Nederlands. Oosthoek, Scheltema en Holkema, Utrecht.

Van Dale (1996). Van Dale Basiswoordenboek van de Nederlandse taal. Van Dale, Utrecht.

van der Beek, L., Bouma, G., Malouf, R., and van Noord, G. (2002). The Alpino dependency treebank. In Theune, M., Nijholt, A., and Hondorp, H., editors, Computational Linguistics in the Netherlands 2001, pages 8–22, Amsterdam. Rodopi.

van Noord, G. (2001). Robust parsing of word graphs. In Junqua, J.-C. and van Noord, G., editors, Robustness in Language and Speech Technology, Text, Speech and Language Technology, pages 205–238. Kluwer Academic Publishers, Dordrecht.

Veenstra, J., van den Bosch, A., Buchholz, S., Daelemans, W., and Zavrel, J. (2000). Memory-Based word sense disambiguation. Computers and the Humanities, 34(1–2):171–177.

Verspoor, C. M. (1997). Contextually-Dependent Lexical Semantics. PhD thesis, University of Edinburgh, Edinburgh.

Vossen, P., editor (1998). EuroWordNet—A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht. Reprinted from Computers and the Humanities, 32(2–3), 1998.

Vossen, P., Bloksma, L., and Boersma, P. (1999a). The Dutch WordNet. Technical Report version 2, final, University of Amsterdam, Amsterdam.

Vossen, P., Peters, W., and Gonzalo, J. (1999b). Towards a universal index of meaning. In Proceedings of the SIGLEX Workshop on Standardizing Lexical Resources, pages 81–90, Maryland. ACL.

Wang, X. and Matsumoto, Y. (2004). Improving word sense disambiguation by pseudo samples. In Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP-04), pages 233–240, Sanya City, Hainan Island.

Weiss, S. and Kulikowski, C. (1991). Computer Systems that Learn. Morgan Kaufmann, San Mateo.

Wilks, Y. and Stevenson, M. (1998). The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering, 4(2):135–144.

Witten, I. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.

Yarowsky, D. (1992). Word-sense disambiguation using statistical models of Roget’s categories trained on large corpora. In Proceedings of the 15th [sic] International Conference on Computational Linguistics (Coling 1992), pages 454–460, Nantes.

Yarowsky, D. (1993). One sense per collocation. In Proceedings of the ARPA Human Language Technology Workshop, Princeton.

Yarowsky, D. (1994). Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994), Las Cruces.

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics (ACL 1995), pages 189–196, Cambridge.

Yarowsky, D. (2000). Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34(1–2):179–186.

Zavrel, J. and Daelemans, W. (1999). Recent advances in Memory-Based part-of-speech tagging. In VI Simposio Internacional de Comunicación Social, pages 590–597, Santiago de Cuba.

Zwicky, A. and Sadock, J. (1975). Ambiguity tests and how to fail them. In Kimball, J., editor, Syntax and Semantics, volume 4, pages 1–36. Academic Press, New York.

Summary

The main research question we try to answer in the present thesis is which linguistic knowledge sources are most useful for word sense disambiguation (WSD), more specifically word sense disambiguation of Dutch. Therefore, the structure of the thesis is based on the various levels of linguistic information tested for WSD, including morphology, information on the syntactic class of a particular ambiguous word, and the syntactic structure of the entire sentence containing an ambiguous word. Each source of linguistic knowledge is tested and evaluated individually in order to assess its value for WSD. Finally, combinations of knowledge sources are investigated and evaluated.

The goal of our project was to develop a tool which is able to automatically determine the meaning of a particular ambiguous word in context, a so-called word sense disambiguation system. In order to achieve this, we make use of the information contained in the context: we use the words surrounding the ambiguous word, as well as additional underlying information, such as syntactic class and structure, to build a statistical language model. This model is then used to determine the meaning of examples of that particular ambiguous word in new contexts.

After a general introduction in chapter 1 to the subject of WSD and the main research questions of the thesis, chapter 2 presents an overview of prior research in WSD, divided according to the possible approaches and the information sources employed by the systems presented. By approaches or strategies we refer to the primary resource of information used to extract information about the different senses of words, in contrast to information sources, which refer to the type of knowledge used to find the correct senses. Evaluation is also discussed, especially the Senseval WSD evaluation framework. The general approach chosen for our own work concludes the introduction and literature overview.

In chapter 3 we show that the widely used technique of pseudowords to alleviate the need for hand-annotated sense-tagged data is not a viable substitute for real ambiguous words. The main reason for this is that the “senses” of pseudowords consist of two (or more) clearly distinct words, whereas real ambiguous words usually have senses and subsenses that can be closely related and are therefore more difficult to identify correctly, even for humans.

Then the experimental setup of the supervised corpus-based WSD system is introduced in chapter 4, including a presentation of the corpus, the classification algorithm used for disambiguation, as well as its implementation. We also present first results on the tuning data using a leave-one-out approach with only “basic” features, such as the context surrounding the ambiguous word and its lemma. From these results, we can conclude that maximum entropy works well as a classification algorithm for WSD when compared to the frequency baseline.
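
For reference, the classifier has the standard conditional maximum entropy form used throughout the literature (cf. Berger et al., 1996; Ratnaparkhi, 1997):

    \[ p(s \mid c) = \frac{1}{Z(c)} \exp\Bigl(\sum_i \lambda_i f_i(c, s)\Bigr), \qquad Z(c) = \sum_{s'} \exp\Bigl(\sum_i \lambda_i f_i(c, s')\Bigr) \]

where the f_i(c, s) are binary features pairing a property of the context c with a candidate sense s, the weights \lambda_i are estimated from the training data, and the sense with the highest conditional probability is selected.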

The results of the various experiments with these basic features decide which settings can best be used when more kinds of linguistic knowledge are included in the system. It is investigated whether it is beneficial to use a frequency threshold with regard to the number of training instances of each ambiguous word found in the corpus. Our results show that maximum entropy (in combination with smoothing using Gaussian priors) is robust enough to deal with infrequent data, and for this reason no threshold was applied. Moreover, various context sizes have been tested (only taking into account the context words contained in the same sentence as the ambiguous word). We have found that a context of three words to the right and to the left performs better than bigger context sizes, confirming earlier findings in the WSD literature. The last important result from chapter 4 is that using context lemmas for generalization, in combination with the relative position of the context words with regard to the ambiguous word, achieves better accuracy than context words and/or treating the context as a bag of words.
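
To make the feature design concrete, the following sketch (a simplified reconstruction, not the actual implementation) builds the positional context-lemma features for one instance, assuming the sentence is available as a list of lemmas:

    def context_features(lemmas, i, window=3):
        """Positional context-lemma features for the ambiguous word at
        position i, e.g. {'lemma-1': ..., 'lemma+2': ...}; only lemmas
        within the same sentence are used."""
        features = {}
        for offset in range(-window, window + 1):
            if offset == 0:
                continue  # skip the ambiguous word itself
            j = i + offset
            if 0 <= j < len(lemmas):
                # Encoding the relative position distinguishes a lemma
                # directly left of the target from the same lemma further away.
                features["lemma%+d" % offset] = lemmas[j]
        return features

    # Toy example: the ambiguous Dutch noun 'band' at position 3.
    sentence = ["de", "man", "kopen", "band", "voor", "zijn", "fiets"]
    print(context_features(sentence, 3))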

After the presentation of our WSD system for Dutch and the experimental setup, chapter 5 introduces a novel approach to building classifiers and, at the same time, includes the first type of linguistic knowledge we investigated, namely morphological information. Instead of building a classifier for each individual word form (as has traditionally been done), we build classifiers on the basis of the more general lemmas. An ambiguous word is then classified on the basis of its lemma.

Lemmatization allows for more compact and generalizable data by clustering all inflected forms of an ambiguous word together. The more inflection in a language, the more lemmatization will help to compress and generalize the data. Therefore, more training material is available to each classifier, and the resulting WSD system is smaller and more robust.
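
The difference between the two ways of building classifiers can be sketched as follows (the field names are invented for illustration):

    from collections import defaultdict

    def group_training_instances(instances, by="lemma"):
        """Group sense-annotated instances into one training set per
        classifier key: 'form' mimics the traditional word-form-based
        approach, 'lemma' the lemma-based approach of chapter 5."""
        training_sets = defaultdict(list)
        for inst in instances:
            training_sets[inst[by]].append(inst)
        return training_sets

    # Toy instances for the Dutch verb 'lopen': under the word-form approach
    # 'loopt' and 'liep' get separate (smaller) classifiers; under the
    # lemma-based approach they share one classifier for 'lopen'.
    instances = [
        {"form": "loopt", "lemma": "lopen", "sense": "walk"},
        {"form": "liep",  "lemma": "lopen", "sense": "walk"},
        {"form": "loopt", "lemma": "lopen", "sense": "run"},
    ]
    print(len(group_training_instances(instances, by="form")))   # 2 classifiers
    print(len(group_training_instances(instances, by="lemma")))  # 1 classifier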

Our comparison of the lemma-based approach with the traditional word form-based approach on the Dutch Senseval-2 test data set clearly shows that using lemmatization significantly improves accuracy. Also, in comparison to earlier results with a Memory-Based WSD system, the lemma-based approach performs equally well when using the same features, while involving less work (no parameter optimization).

A second source of linguistic information that is tested for its value for WSD is part-of-speech (PoS) (chapter 6). The PoS of an ambiguous word itself presents important information because the Dutch Senseval-2 data had to be disambiguated morpho-syntactically as well as with regard to meaning. Two hypotheses are tested. On the one hand, it is investigated what effect the quality of the PoS tagger used to tag the data has on the results of the WSD system including PoS information. The results confirm the expectation that the most accurate PoS tagger (on a stand-alone task) also outperforms less accurate taggers in the application-oriented evaluation in our WSD system for Dutch. On the other hand, the experiments conducted allow us to test whether adding features explicitly encoding certain types of knowledge increases disambiguation accuracy. Our results show that this is definitely the case.

We not only include the PoS of the ambiguous words, but also add the PoS of the context as an extra feature. Both sources of knowledge lead to significant improvements in the performance of the maximum entropy WSD system.
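
Extending the context-feature sketch given above, the PoS information could be added along these lines (again a reconstruction; the tag names are invented):

    def pos_features(tags, i, window=3):
        """Add the PoS of the ambiguous word and of its context words to
        the feature dictionary; `tags` is the PoS-tagged sentence."""
        features = {"pos0": tags[i]}  # PoS of the ambiguous word itself
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tags):
                features["pos%+d" % offset] = tags[j]
        return features

    # Hypothetical tags for the example sentence used earlier.
    tags = ["Art", "N", "V", "N", "Prep", "Pron", "N"]
    print(pos_features(tags, 3))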

The third kind of information, and the second kind of syntactic knowledge, that is included are dependency relations (described in chapter 7). This implicitly tests whether deep linguistic knowledge is beneficial for a WSD application. After an overview of previous research in WSD using syntactic information, we introduce dependency relations and their merit for NLP, as well as Alpino, the dependency parser which was used to annotate the data. Two different kinds of features including dependency relations are experimented with. On the one hand, we test the configuration with two features containing the names of all relations of a given ambiguous word. One feature contains the head relations while the other feature contains the dependent relations of the ambiguous word. On the other hand, we test the configuration with the same two features, but this time combining the name of the relation with the word completing the dependency triple.
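
A sketch of the two configurations, assuming simplified dependency triples of the form (head, relation, dependent) such as a parser like Alpino could provide, and reading 'head relations' as the relations in which the ambiguous word acts as head:

    def dependency_features(triples, word, with_word=False):
        """Two features built from dependency triples: one collecting the
        relations in which `word` is the head, one those in which it is
        the dependent. With with_word=True, the other member of the
        triple is appended to the relation name (second configuration)."""
        heads, deps = [], []
        for head, rel, dep in triples:
            if head == word:
                heads.append(rel + "/" + dep if with_word else rel)
            elif dep == word:
                deps.append(rel + "/" + head if with_word else rel)
        return {"headrels": "+".join(sorted(heads)),
                "deprels": "+".join(sorted(deps))}

    # Toy triples for 'band' in 'de man koopt een band voor zijn fiets'.
    triples = [("koopt", "obj1", "band"), ("band", "mod", "voor")]
    print(dependency_features(triples, "band"))
    print(dependency_features(triples, "band", with_word=True))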

The results in chapter 7 show that the addition of deep linguistic knowledge to a statistical WSD system for Dutch results in a significant rise in disambiguation accuracy compared with all results on the tuning data discussed so far. Dependency relations on their own already perform significantly better than the baseline, with the combination of the lemma and PoS of the ambiguous word together with dependency relations even outperforming the model using context information. The best results (on the tuning data), at 86.08%, are achieved including the lemma and the PoS as well as the dependency relations linked to the ambiguous words, in combination with the context lemmas.

In chapter 8 we report our results on the (unseen) Senseval-2 test data with the best feature models determined during tuning. Several conclusions can be drawn from the experiments conducted on the test data. First of all, adding structural syntactic information in the form of dependency relations instead of the PoS of the context leads to an error rate reduction of 8% for the word form model. Furthermore, the lemma-based approach outperforms the word form-based approach independently of the features included in the model. The best overall performance on the test data is achieved using the lemma-based approach with the feature model including information on the PoS of the ambiguous word form/lemma, its dependency relation labels, as well as the context lemmas. We can observe an error rate reduction of 10% with regard to the lemma-based model including PoS in context, and a reduction of 6% of errors with regard to the best model based on word forms.
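
The error rate reductions reported here are relative, not absolute: with old and new accuracies a_old and a_new, the reduction is computed on the error rates 1 - a. As a worked example with illustrative accuracies (not the actual system scores):

    \[ \mathrm{ERR} = \frac{(1 - a_{\mathrm{old}}) - (1 - a_{\mathrm{new}})}{1 - a_{\mathrm{old}}}, \qquad a_{\mathrm{old}} = 0.840,\; a_{\mathrm{new}} = 0.856 \;\Rightarrow\; \mathrm{ERR} = \frac{0.160 - 0.144}{0.160} = 10\%. \]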

Comparing our results on the test data to results obtained with a different system, using Memory-Based Learning (MBL) as a classification algorithm (Hendrickx et al., 2002), both the word form-based classifiers and the lemma-based classifiers from our system produce higher accuracy. This is mainly due to the fact that our feature model includes deep linguistic information in the form of dependency relations, whereas Hendrickx et al. include the PoS of the context. The lemma-based model actually leads to an error rate reduction of 10% compared to the MBL WSD system. Our maximum entropy system is thus state-of-the-art for Dutch word sense disambiguation, showing that the combination of building classifiers based on lemmas instead of word forms and including dependency relation labels as linguistic features (along with context lemmas) works best.

As a general conclusion, the results from our research suggest that in the case of a statistical disambiguation algorithm the combination of several orthogonal linguistic features yields the best results. This means that WSD for Dutch profits from various sources of linguistic knowledge. Thus, there is not a single best linguistic knowledge source, but rather a number of (carefully) selected features that work best in combination.

Especially the addition of deep linguistic knowledge greatly improves accuracy. In combination with an approach taking advantage of morphological information, the lemma-based approach, the best results for WSD of Dutch on the Senseval-2 data set are obtained. Our system achieves significantly higher disambiguation accuracy than any results for Dutch that have been reported in the literature up to now.

Samenvatting

The most important research question to which the work in this thesis tries to find an answer is which types of linguistic information are most useful for the lexical disambiguation of (Dutch) words. The structure of this thesis accordingly reflects the different levels of linguistic information that have been tested for their usefulness for lexical disambiguation. These levels are morphology, the word class of the ambiguous word, and the syntactic structure of the sentence in which the word occurs. Each type of linguistic knowledge is tested and evaluated individually in order to establish its value for lexical disambiguation. Finally, combinations of different types of linguistic knowledge are tested and evaluated as well.

The goal of this project was the development of a module that can automatically assign the correct meaning to an ambiguous word in a given context. This is also known as word sense disambiguation (WSD). The meaning assignment takes place on the basis of the information from the context of the ambiguous word. This information can consist of the words surrounding the word to be disambiguated as well as additional information such as syntactic class or structure, and with this knowledge a statistical language model is built. The model then predicts the correct meaning for a particular ambiguous word in a new context.

After the general introduction to WSD and an overview of the main research questions in chapter 1, chapter 2 gives an overview of earlier research in the field of WSD, divided according to the information sources and the information types used by the systems presented. Information sources refer to the primary resources used to extract information about the different meanings of words, while information types refer to the different kinds of linguistic knowledge that the systems use to find the correct meaning. This chapter also discusses the evaluation methodology itself, in particular the Senseval WSD evaluation rounds. A description of the general approach taken in this research concludes the introduction and literature overview.

Chapter 3 shows that the use of so-called pseudowords, which are often employed to circumvent the need for data manually annotated with senses, is not a valid substitute for data of real ambiguous words. The most important reason for this is that the “senses” of pseudowords consist of two (or more) clearly separate words, whereas real ambiguous words generally have senses and subsenses that are closely related to each other and that are for this reason harder to distinguish correctly, also for humans.

Chapter 4 introduces the experimental setup of the supervised, corpus-based WSD system. This introduction includes, among other things, a description of the corpus, the classification algorithm used for disambiguation, and its implementation. The first results on the tuning data with a leave-one-out approach are also presented, using only minimal features, such as the context surrounding the ambiguous word and the corresponding lemma. On the basis of these results, we conclude that maximum entropy (MaxEnt) as a classification algorithm for WSD performs better than the frequency-based baseline.

The results of the various experiments with the minimal features determine which settings can best be used when more types of linguistic knowledge are added to the system. In particular, it was investigated whether using a threshold for the number of training instances of each ambiguous word in the corpus offers an advantage. The results show that MaxEnt (in combination with smoothing with Gaussian priors) is robust enough to handle infrequent data. For this reason we do not use a frequency threshold in this research. In addition, we tested the effect of different context sizes (only context words in the same sentence as the ambiguous word are taken into account). These experiments show that a context of three words to the left and to the right of the ambiguous word leads to a better result than larger contexts, which confirms earlier results in the literature on WSD. The last important result of chapter 4 is that context lemmas combined with the relative position of the context with respect to the ambiguous word work better than context words and/or treating the context as a bag of words.

After the general introduction of the WSD system for Dutch and the experimental setup, chapter 5 introduces an approach to building classifiers which makes use of a first type of linguistic knowledge, namely morphological information. Instead of creating a classifier for each individual word form, classifiers are now constructed for the more general lemmas. An ambiguous word is subsequently classified on the basis of its lemma.

Lemmatization leads to a more compact and more general representation of the information by grouping together all inflected forms of an ambiguous word. More inflection in a language will result in greater compression and generalization of the data. The application of lemmatization ensures that each classifier has more training material at its disposal and that the resulting WSD system is more compact. By abstracting away from the word form, the system moreover becomes more robust.

A comparison between the lemma-based approach and the traditional word form-based approach on the Dutch Senseval-2 test data clearly shows that the use of lemmatization improves accuracy. The earlier results of a WSD system based on Memory-Based Learning (MBL) are on a par with those of the lemma-based approach when the same features are used. An important difference is that no parameter optimization has been applied (yet) to the system with lemmatization.

A second type of linguistic information that is tested for its value for WSD is part-of-speech (PoS), i.e. the syntactic class or word class of a word (chapter 6). The PoS of a potentially ambiguous word contains important information, because the Dutch Senseval-2 data has to be disambiguated morpho-syntactically and lexical-semantically at the same time. Two hypotheses are tested. On the one hand, we examined the influence of the quality of the PoS tagger on the accuracy of the WSD system with PoS information. The results confirm the expectation that the PoS tagger achieving the highest accuracy on its own also performs better than less accurate PoS taggers in an application-oriented evaluation. On the other hand, it was investigated whether explicitly adding features that encode a certain kind of knowledge increases disambiguation accuracy, or whether this information was already implicitly present in the model. The results clearly show that explicitly adding certain features improves the system.

On the one hand, the effect of features for the word class of the ambiguous word itself was measured; on the other hand, the effect of features for the syntactic categories of the words in the context. Both knowledge sources lead to significant improvements in the performance of the MaxEnt-based WSD system.

The third type of information, and the second kind of syntactic knowledge, used for disambiguation is information about syntactic dependency relations, also called dependencies (described in chapter 7). The implicit research question is whether deep linguistic knowledge helps in a WSD application. After an overview of earlier research on WSD systems that make use of syntactic information, dependency relations and their influence in the field of NLP are introduced, as well as Alpino, the dependency parser that was used to annotate the data. Two different feature settings with dependency relations are used. On the one hand, we test a configuration with two features, where the features contain only the names of the relations of the ambiguous word: one feature contains the head relations, while the other contains the dependent relations of the ambiguous word. On the other hand, we experiment with a configuration with the same two features, but now containing both the name of the relation and the word that is connected to the ambiguous word by means of that relation.
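
The two settings can be sketched as follows. This is our own schematic reading of the setup (the function name and triple format are assumptions), taking the parser output to be (head, relation, dependent) triples over lemmas:

    # Sketch of the two dependency feature settings. With with_words=False
    # only the relation names are kept (setting 1); with with_words=True
    # the word at the other end of each relation is included (setting 2).
    def dependency_features(triples, target, with_words=False):
        head_side, dep_side = [], []
        for head, relation, dependent in triples:
            if head == target:          # target acts as head of the relation
                head_side.append((relation, dependent))
            elif dependent == target:   # target acts as dependent
                dep_side.append((relation, head))
        if with_words:
            return (["headrel=%s:%s" % (r, w) for r, w in head_side]
                    + ["deprel=%s:%s" % (r, w) for r, w in dep_side])
        return (["headrel=%s" % r for r, _ in head_side]
                + ["deprel=%s" % r for r, _ in dep_side])

For an ambiguous noun such as bank in de bank staat naast de tafel, setting 1 might contribute a feature like deprel=su (the noun is the subject of staan), while setting 2 would add the related word as well, e.g. deprel=su:staan.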

The results in chapter 7 show that adding deep linguistic knowledge to a statistical WSD system for Dutch yields a significant improvement in disambiguation accuracy over all results obtained so far on the tuning data. The use of dependency relations alone already leads to a significantly better result than the baseline, and the combination of the lemma and the PoS of the ambiguous word together with dependency relations even outperforms the model with context information. The best results on the tuning data, 86.6%, are achieved with the lemma, the PoS and the dependency relations of the ambiguous word in combination with the lemmas in the context.

Chapter 8 discusses the results of the best feature models (based on the tuning experiments) on the (unseen) Senseval-2 test data. Several conclusions can be drawn from the experiments on the test data. First of all, adding structural syntactic information in the form of dependency relations instead of the PoS of the context leads to an error rate reduction of 8% for the wordform-based model. Moreover, the lemma-based approach works better than the wordform-based approach, independently of the features that are added to the model. The best results on the test data are achieved with lemmatization combined with the feature model containing the word class of the ambiguous wordform or lemma, the dependency labels and the context lemmas. This leads to an error rate reduction of 10% with respect to the lemma model with PoS of the context, and an error reduction of 6% with respect to the best model based on wordforms.
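
To avoid confusion with absolute accuracy gains: error rate reduction is to be read here as the usual relative measure. Writing $E_{\text{old}}$ and $E_{\text{new}}$ for the error rates of the two models being compared,

    \mathrm{ERR} = \frac{E_{\text{old}} - E_{\text{new}}}{E_{\text{old}}}

so a reduction of 10% means that one tenth of the errors made by the weaker model are eliminated, not that accuracy rises by 10 percentage points.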

When the results on the test data are compared with those of another existing system, which uses MBL as its classification algorithm (Hendrickx et al., 2002), we see that both the wordform-based and the lemma-based classifiers lead to higher accuracy. This is mainly due to the fact that our feature model contains, among other things, deep linguistic information in the form of dependency relations, whereas the system of Hendrickx et al. uses the PoS of the context. The lemma model leads to an error rate reduction of 10% in comparison with the MBL-based WSD system. The MaxEnt-based system is thus state-of-the-art for Dutch WSD, showing that the combination of classifiers based on lemmas instead of wordforms on the one hand, and the use of dependency labels as linguistic features (together with context lemmas) on the other, yields the best results.

By way of general conclusion, the results of this research suggest that for a statistical disambiguation algorithm the combination of several orthogonal linguistic features leads to the best results. This means that WSD for Dutch profits from different types of linguistic knowledge. It is not possible to single out one best type of linguistic knowledge; rather, a number of (carefully selected) features work best in combination.

Especially the addition of deep linguistic knowledge improves accuracy considerably. In combination with an approach that exploits the advantage of using morphological information, the lemma model, the best results for WSD of Dutch on the Senseval-2 dataset are achieved. This system performs significantly better than all results published in the literature to date.

Groningen dissertations in linguistics

Grodil

1. Henriëtte de Swart (1991). Adverbs of Quantification: A Generalized Quantifier Approach.

2. Eric Hoekstra (1991). Licensing Conditions on Phrase Structure.

3. Dicky Gilbers (1992). Phonological Networks. A Theory of Segment Representation.

4. Helen de Hoop (1992). Case Configuration and Noun Phrase Interpretation.

5. Gosse Bouma (1993). Nonmonotonicity and Categorial Unification Grammar.

6. Peter Blok (1993). The Interpretation of Focus: an epistemic approach to pragmatics.

7. Roelien Bastiaanse (1993). Studies in Aphasia.

8. Bert Bos (1993). Rapid User Interface Development with the Script Language Gist.

9. Wim Kosmeijer (1993). Barriers and Licensing.

10. Jan-Wouter Zwart (1993). Dutch Syntax: A Minimalist Approach.

11. Mark Kas (1993). Essays on Boolean Functions and Negative Polarity.

12. Ton van der Wouden (1994). Negative Contexts.

13. Joop Houtman (1994). Coordination and Constituency: A Study in Categorial Grammar.

14. Petra Hendriks (1995). Comparatives and Categorial Grammar.

15. Maarten de Wind (1995). Inversion in French.

16. Jelly Julia de Jong (1996). The Case of Bound Pronouns in Peripheral Romance.

17. Sjoukje van der Wal (1996). Negative Polarity Items and Negation: Tandem Acquisition.

18. Anastasia Giannakidou (1997). The Landscape of Polarity Items.

19. Karen Lattewitz (1997). Adjacency in Dutch and German.

20. Edith Kaan (1997). Processing Subject-Object Ambiguities in Dutch.

21. Henny Klein (1997). Adverbs of Degree in Dutch.

22. Leonie Bosveld-de Smet (1998). On Mass and Plural Quantification: The Case of French ‘des’/‘du’-NPs.

23. Rita Landeweerd (1998). Discourse Semantics of Perspective and Temporal Structure.

24. Mettina Veenstra (1998). Formalizing the Minimalist Program.

25. Roel Jonkers (1998). Comprehension and Production of Verbs in Aphasic Speakers.

26. Erik F. Tjong Kim Sang (1998). Machine Learning of Phonotactics.

27. Paulien Rijkhoek (1998). On Degree Phrases and Result Clauses.

28. Jan de Jong (1999). Specific Language Impairment in Dutch: Inflectional Morphology and Argument Structure.

29. H. Wee (1999). Definite Focus.

30. Eun-Hee Lee (2000). Dynamic and Stative Information in Temporal Reasoning: Korean Tense and Aspect in Discourse.

31. Ivilin Stoianov (2001). Connectionist Lexical Processing.

32. Klarien van der Linde (2001). Sonority Substitutions.

33. Monique Lamers (2001). Sentence Processing: Using Syntactic, Semantic, and Thematic Information.

34. Shalom Zuckerman (2001). The Acquisition of “Optional” Movement.

35. Rob Koeling (2001). Dialogue-Based Disambiguation: Using Dialogue Status to Improve Speech Understanding.

36. Esther Ruigendijk (2002). Case Assignment in Agrammatism: a Cross-linguistic Study.

37. Tony Mullen (2002). An Investigation into Compositional Features and Feature Merging for Maximum Entropy-Based Parse Selection.

38. Nanette Bienfait (2002). Grammatica-onderwijs aan allochtone jongeren.

39. Dirk-Bart den Ouden (2002). Phonology in Aphasia: Syllables and Segments in Level-specific Deficits.

40. Rienk Withaar (2002). The Role of the Phonological Loop in Sentence Comprehension.

41. Kim Sauter (2002). Transfer and Access to Universal Grammar in Adult Second Language Acquisition.

42. Laura Sabourin (2003). Grammatical Gender and Second Language Processing: An ERP Study.

43. Hein van Schie (2003). Visual Semantics.

44. Lilia Schürcks-Grozeva (2003). Binding and Bulgarian.

45. Stasinos Konstantopoulos (2003). Using ILP to Learn Local Linguistic Structures.

46. Wilbert Heeringa (2004). Measuring Dialect Pronunciation Differences using Levenshtein Distance.

47. Wouter Jansen (2004). Laryngeal Contrast and Phonetic Voicing: A Laboratory Phonology Approach to English, Hungarian and Dutch.

48. Judith Rispens (2004). Syntactic and Phonological Processing in Developmental Dyslexia.

49. Danielle Bougaïre (2004). L’approche communicative des campagnes de sensibilisation en santé publique au Burkina Faso: les cas de la planification familiale, du sida et de l’excision.

50. Tanja Gaustad (2004). Linguistic Knowledge and Word Sense Disambiguation.

Grodil

Secretary of the Department of General Linguistics
Postbus 716
9700 AS Groningen
The Netherlands

