
Automatic Error Detection in Annotated Corpora

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research
in

Computer Science with specialization in NLP

by

ANNAMANENI NARENDRA
200902038

[email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA

July 2017

Copyright © Annamaneni Narendra, 2017

All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled Automatic Error Detection in Annotated Corpora by ANNAMANENI NARENDRA, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date Adviser: Prof. Dipti Misra Sharma

To my Family

Acknowledgments

Firstly, I would like to express my deepest gratitude to my adviser Prof. Dipti Misra Sharma, without whom this thesis would not have been possible. I must sincerely thank her for constantly supporting me. She gave me a lot of freedom to work on the topics I am interested in. The way she works in the lab has always inspired me. She is a kind-hearted, loving person and a great human being. Our meetings always start with a smile, which relieves me a lot. She gave direction to my thesis with her deep insights in computational linguistics. I would like to thank Dr. Radhika Mamidi for her friendly attitude and suggestions on my work. Prof. Rajeev Sangal is a great inspiration to me with his deep insights in NLP and his approach towards life. He has great teaching abilities and I really enjoyed attending his lectures.

In the journey of my work, Riyaz has been a great mentor without whom this work would not have been finished. He really encouraged my work, and whenever I faced a problem he was always there to support me. I have learned a lot from him, and he is a humble and approachable person. He is a major contributor to my early published works, which gave me a lot of confidence. Special thanks to Arjun Reddy, Dr. Ravi Jampani, Dr. Theja Tulabandhula and Himani for helping me finish my thesis. I would like to thank Vishnu, my friend and lab mate, who always helped me. Special thanks to Anju, who helped me in the annotation process. I would like to acknowledge all my lab mentors Jayendra, Deepak and Puneeth for their help in the lab.

I thank the LTRC staff, who made my research journey at LTRC most comfortable: Srinivas sir for wonderful technical help, Rambabu sir for administrative and lab-related issues, and Mr. Lakshmi Narayan for general issues. Thanks to Appaji sir, Kishore and Byrappa sir for making my life easier at IIIT. My friends played a special role in my life. Thanks to Saikrishna, Rajeev, Kapil, Navya, Tushar, TVK, Pradyumna, Rahul, Kiran, Trinath, and Bhargav for their support during the days of my hostel life at IIIT. I would like to acknowledge my friends Sridhar, Dinesh, and Raja Subramanian who have been there with me during my professional life.

Finally, I must acknowledge my great parents Sreenivasulu and Suseela; without their support I would not have grown to achieve this milestone in my life. My teachers Narayana Swamy and Vasudeva Reddy paved the road that brought me to this great institution, IIIT. I would also like to acknowledge some special people in my family life: Sravani, Puneetha, and my great-grandmother Sarojamma.


Abstract

An annotated corpus is a linguistic resource which explicitly encodes information at the syntactic and semantic levels for each sentence. Annotated corpora play a crucial role in many applications of natural language processing (NLP). Error-free and consistent annotated corpora are vital for these applications. Creating annotated corpora is an expensive and time-consuming process. Errors or anomalies creep in due to human mistakes and sometimes because of multiple interpretations of the annotation guidelines. Maintaining the quality of the annotations is a challenging problem, because validating the annotated corpora and correcting these errors manually is expensive and time consuming. In particular, the validation process needs an expert's time to detect and correct these errors, which is expensive. Hence, experts need intelligent tools that automatically detect possible instances of errors in annotated corpora, which they can then validate quickly. Treebank annotation involves encoding information at the POS, morph, chunk and dependency levels. Annotation requires a domain-specific understanding of the language and of the dependency guidelines. Further, to validate the annotated corpora, we need experts in the language and the annotation guidelines.

In this work, we address the problem of treebank validation and propose novel approaches to detect errors automatically. To be specific, we address the issues at the dependency level of the annotation process, which is more vulnerable to errors due to the complex rules in the dependency annotation schema. In our solution, we use ensembling methods on the parsers' outputs. We hypothesize that the annotation and validation processes should proceed in parallel rather than waiting for the entire corpus to be created. Our tool provides annotators with error instances or inconsistent cases, so that they can clear the ambiguities in their understanding by reflecting on this small number of error instances. This process helps in an early understanding of the errors committed in the annotation process. We also address the problem of skewed data sets, which is common in Indian languages, by utilizing word embeddings. Later, we attempt to build tools to correct the dependency errors automatically.

Our work mainly investigates error detection using dependency parsers and is able to detect errors with F-scores of 71.18% and 42.19% for the available Hindi and Telugu treebanks, respectively. Our work also includes some preliminary attempts to correct the errors automatically, with which we have increased the baseline precision of the Hindi treebank from 88.59% to 92.29%.


Contents

1 Introduction
  1.1 Dependency Treebank Annotation
    1.1.1 Paninian Grammar Framework
    1.1.2 Dependency Annotation
    1.1.3 Treebank Representations
      1.1.3.1 SSF (Shakti Standard Format)
      1.1.3.2 CONLL Format
    1.1.4 Telugu Treebank
    1.1.5 Anomalies in Treebanks
  1.2 Contributions of thesis
  1.3 Thesis Outline

2 Related work

3 Dependency Parsers
  3.1 Dependency parsers
  3.2 MaltParser (A Transition-based Dependency Parser)
  3.3 MSTParser (A Graph-based Dependency Parser)
    3.3.1 Parsing Algorithm
    3.3.2 Learning and Feature Selection
  3.4 TurboParser
    3.4.1 Parsing
  3.5 Confidence score computation for parsers
    3.5.1 MaltParser
    3.5.2 MSTParser
    3.5.3 TurboParser
  3.6 Comparison of Malt, MST and Turbo Parsers

4 Error detection
  4.1 Error detection in annotated corpora
    4.1.1 Our approach
      4.1.1.1 Simple ensemble approach or Voting
        4.1.1.1.1 Error classification
      4.1.1.2 Extended simple ensemble approach or ranking
      4.1.1.3 Weighted voting
      4.1.1.4 Machine learning classifiers
    4.1.2 Data
    4.1.3 Experiments and Discussion
      4.1.3.1 Simple ensemble method
    4.1.4 Ranking Based on Confidence Scores
    4.1.5 Experiments based on Machine Learning methods
    4.1.6 Comparison with the prior art
    4.1.7 Conclusions and summary

5 Error Correction
  5.1 Introduction
  5.2 Error Correction Methods
    5.2.1 Correction based on parsers outputs
    5.2.2 Learning Models Approach
  5.3 Experiments and Discussion
    5.3.1 Correction based on parsers outputs
  5.4 Summary

6 Conclusions

Appendix A: TagSets
  A.1 POS Tagset
  A.2 Chunk Tagset
  A.3 Dependency Tag Set

Bibliography

List of Figures

1.1 Levels in Paninian Grammar Framework
3.1 Phrase Structure and Dependency Structure respectively for the English sentence "The big dog chased down the cat"
3.2 Sentence from Penn Treebank
3.3 Sequence of steps in dependency parsing using MaltParser
3.4 Features used by MSTParser, where xi is the head and xj the modifier
4.1 Algorithm to prioritize the errors into 4 classes
4.2 Algorithm to find label based on weighted scores
4.3 Objective function of CSSVM
4.4 Errors ranking based on the confidence score


List of Tables

4.1 Treebank Data
4.2 Simple strategy error detection results
4.3 Parsing Results
4.4 Ranking based on the confidence
4.5 CSSVM ML classifier results
4.6 Prior art

5.1 Error correction results based on parsed outputs
5.2 Error correction results based on learning models


Chapter 1

Introduction

An annotated corpus is grammatically tagged text which provides information at various syntactic and semantic levels in the context of natural language processing (NLP). Annotated corpora are created manually by annotators with linguistic knowledge, who assign tags to text at various levels such as parts of speech (POS), chunk, morphological and dependency. Alternatively, in semi-automatic annotation, semantic and syntactic information is marked by an automatic tool and then validated by an annotator.

The need for annotated corpora is widely acknowledged in the field of computational linguistics. The growing importance of annotated corpora for basic as well as advanced NLP applications in the last decade has led to an exponential increase in the creation of these linguistic resources for a wide range of languages. The linguistic knowledge encoded in annotated corpora is the only source of knowledge from which various downstream tools learn to interpret different contexts in text and act accordingly. This knowledge is used to create many crucial NLP applications, from spell checkers to machine translation. Most of these applications (sentence generation, machine translation, etc.) need high-quality annotated corpora, and even a few annotation errors can degrade their performance. For example, [36] identifies some specific types of inconsistently annotated coordinate phrases, shows that they cause problems for training and testing, and therefore cleans up these cases to improve performance. When lower-level linguistic annotations such as part-of-speech or syntactic annotation are used as input for further layers of annotation or processing, even a small set of errors can significantly affect a higher layer. In another case, [35] compares different Arabic parsing systems and finds that the one which initially had better results was actually worse when treebank errors were accounted for. It is therefore crucial that annotated corpora are free of anomalies (errors) and inconsistencies.

Annotated corpora are created either completely manually or by correcting automatically generated annotations. Both methods involve human intervention, which makes them vulnerable to errors. In general, two major reasons make annotated corpora error prone. First, different annotators understand and interpret the annotation guidelines in different ways. Second, since annotation is a time-consuming process, human errors creep in unconsciously due to fatigue or boredom of annotators.


Manual validation of annotated corpora is an obvious choice to improve their quality. To validate a corpus, each instance has to be carefully reviewed by an expert who can detect errors made by the annotators. But in general, an expert's time is expensive and it is difficult for an expert to spend a huge amount of time on manual validation. To address this problem, experts need tools that can assist them by automatically detecting possible instances of errors in the annotated corpora, so that they can review possible error instances quickly and correct them if there is indeed an error. These tools should have high recall so that the detected instances cover all the error instances. However, we can trade off a bit of precision, so that an expert can quickly reject the false positives, as long as the overall number of detected error instances is just a fraction of the annotated corpus. In this work, we propose various approaches to address the problem of automatic error detection. We have conducted experiments on Hindi and Telugu dependency annotated corpora to show the effectiveness of our proposed approaches. The nuances of dependency annotation and annotated corpora are explained in the following sections. In this document, we use the terms treebank and annotated corpus interchangeably, assuming that both refer to the same thing. The following section presents details of dependency annotation and the representation of the annotated corpora.

1.1 Dependency Treebank Annotation

In this section, we discuss the various levels of information in the dependency treebanks created for Indian languages and then we enlist the details of the Telugu treebank.

Dependency treebanks of Indian languages like Hindi and Telugu are developed based on the Paninian Grammar Framework, which we discuss in the following subsection. Dependency annotation mainly involves encoding linguistic information (at various levels of granularity) into a machine-readable format.

1.1.1 Paninian Grammar Framework

The Paninian Grammar Framework [9] is an early attempt to understand the process of language communication, developed for Sanskrit. A sentence is a way to express the thought in the mind of a speaker/writer such that the listener/reader is able to receive the information that the speaker/writer intended to convey. The Paninian Grammar Framework includes a rule system (grammar) to theoretically explain natural language communication. Indian languages are morphologically rich, and such inflectional languages have a relatively free word order. In free word order languages, position may not provide much information compared to fixed word order languages, where position is a primary source of information. A context-free grammar framework used to analyze positional languages is therefore not suitable for the analysis of free word order languages [63]. [54] argued about free word order and its relevance to Indian languages, with reference to Malayalam. Later, [55] addressed the set of issues surrounding multidimensionality in relation to word order in South Asian languages.


The Paninian Grammar Framework primarily considers information from the constituents of the sentences and the relations among them. The verbal root (dhatu) is the major element in the semantic model of the Paninian Grammar Framework. It denotes actions consisting of an activity (vyapara) and a result (phala). The result is the state which is reached on completion of an activity. An activity consists of actions carried out by participants (karakas) involved in the sentence. Vivaksha, or the viewpoint of the speaker, is also an important element, which denotes the speaker's intention towards the activity [9].

Figure 1.1 Levels in Paninian Grammar Framework

Figure 1.1 shows the semantic model of the Paninian Grammar Framework, which has four levels. The surface level is the uttered or written sentence. The next is the vibhakti level, at which there are local word groups together with case endings, preposition or post-position markers. The vibhakti level abstracts away many minor (including orthographic and idiosyncratic) differences among languages. The karaka level, on top of the vibhakti level, includes karaka relations and a few additional relations such as taadaarthya (or purpose). The topmost level relates to what the speaker has in his mind. This may be considered the ultimate meaning level that the speaker wants to convey. One can imagine several levels between the karaka and the ultimate (semantic) level, each containing more semantic information. Thus, the karaka level is one in a series of levels that has a relationship to semantics on the one hand and to syntactic information on the other [9].

Karaka relations are central to the Paninian Grammar Framework. They are syntactico-semantic relations between the verbals and the other constituents in a sentence. They capture a certain level of semantics that is syntactically reflected in the surface form of the sentence. There are six different types of karaka relations. These relations are not enough to capture all the actions and the participants, but they are sufficient to mark relations between a single verb and the constituents involved in the action. Thus, karaka relations specify the maximum necessary information relative to the verb. The six karaka relations are as follows.


1. k1: karta (akin to subject and agent, but different from them): the most independent participant in the action

2. k2: karma (roughly the theme or object): the one most desired by the karta

3. k3: karana (instrument): which is essential for the action to take place

4. k4: sampradaan (beneficiary): recipient or beneficiary of the action

5. k5: apaadaan (source): movement away or separation from a source

6. k7: adhikarana (location): location of the action in time and space

The Paninian Grammar Framework treats a sentence as a series of modifier-modified relations. A sentence is supposed to have a primary modified (the root of the dependency tree), which is typically the main verb of the sentence [8]. The other constituents of the sentence modify the verb and participate in the action specified by the verb. In a dependency analysis of a sentence based on the Paninian Grammar Framework, child-parent relations are denoted by the edge between them. The type of the karaka relation is given as the edge label, which specifies the relation between the child and the parent. Apart from the six karaka relations, [8] and [13] introduced more fine-grained labels in the framework to encode more information. These labels are enlisted in Appendix A.3. Information such as karakas (based on vibhaktis/post-positions) and the TAM (tense, aspect and modality) of the main verb seems very well suited for handling free word order languages. More details and the issues related to the Paninian Grammar Framework are addressed in [10] and [62]. In the following section, we elaborate on the dependency annotation of the Telugu language using the Paninian Grammar Framework.

1.1.2 Dependency Annotation

A dependency annotation schema for Indian languages like Hindi, Telugu, Bangla, Tamil, etc. has been proposed based on the Paninian Grammar Framework. A treebank is a multi-layered resource, meaning that different levels of information are encoded in different forms as mentioned earlier. These levels include the morpho-syntactic level (morph, POS tags), syntactico-semantic relations (dependency relations) and the lexical semantic level. In the following section, we describe the various levels of information in a treebank.

1. Part of speech (POS) tags: Parts-of-speech tagging is the process of marking up a word with a particular word category (part-of-speech tag). POS tags provide syntactic information related to the words in a sentence. Currently, we use the POS tags listed in the POS and chunk annotation guidelines [12], which are enumerated in Appendix A.1. POS taggers that can assign a tag to a word automatically are available for many languages; annotators can manually validate the tags later on.


2. Morph information: Morphological information corresponds to the syntactic information encoded in the treebank. Morph information consists of eight feature attributes, which are root, category, gender, number, person, case, post-position and tense-aspect-modality (TAM). Some of these feature attributes are applicable only for particular POS tags: tense, aspect and modality (TAM) information is annotated only for verbs; likewise, gender and person information is meaningful for nouns. Telugu is a morphologically rich language compared to many other Indian languages. Morph features are detailed in section 1.1.3.2. Automatic morphological analyzers are available which generate a morph analysis for each word automatically; annotators then validate the generated morph information.

3. Chunk tags: A chunk is a phrasal-level abstraction, a word or a group of words, that captures information without distorting the internal dependencies. Chunk boundaries are marked after assigning the POS tags. The chunk tags used in our annotation schema are listed under Appendix A.2. SSFParser (http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallowparser.php) generates the chunk boundaries for a given sentence automatically, which are later validated by the annotators.

4. Dependency labels: Dependency labels carry a major source of syntactico-semantic information. Dependency annotation is a time-consuming and intensive process. The list of dependency labels and the guidelines are identified based on the Paninian Grammar Framework to create a dependency structure and to assign inter-chunk dependency labels [9]. The complete set of labels (fine-grained and coarse-grained) is listed under Appendix A.3.

5. Other tags: Apart from the POS, morph, chunk and dependency level tags, additional tags are defined at the sentence level to distinguish various formulations of the sentence. Sentence-type feature attributes are marked to distinguish between sentence types such as interrogative, imperative and declarative. Voice type marks whether the sentence is a passive or an active construction.

1.1.3 Treebank Representations

In this section, we discuss the representation of a treebank based on the dependency annotation scheme described in section 1.1.2. Indian language dependency corpora adopt the Shakti Standard Format (SSF) proposed in [11]. SSF is a simple human-readable and machine-understandable format which can be extended to include any features encoding linguistic information at different granularity levels. This format can be converted to CONLL, a standard format detailed in section 1.1.3.2, for interchangeability between many NLP tools. In both formats, the lexicons or words of the example sentence are represented in wx-notation (http://www.cfilt.iitb.ac.in/Downloads.html). The wx-notation is a transliteration scheme for writing a script in Roman script; it defines a standard for the representation of Indian languages in Roman script. These



standards aim at providing a unique representation of Indian languages in the Roman alphabet, as detailed in [34]. The SSF and CONLL formats are detailed as follows:

1.1.3.1 SSF (Shakti Standard Format)

SSF allows the information in a sentence to be represented in the form of one or more trees, together with a set of attribute-value pairs associated with the nodes of the trees. For example, the following sentence from the Telugu treebank is represented in SSF format as shown.

<Sentence id='58'>
1      ((     NP     <fs af='AMXrapraxeS,n,n,sg,,,lo,lo' drel='k7p:VGF' name='NP'>
1.1    AMXrapraxeSlo    NNP
       ))
2      ((     NP     <fs af='nirvahaNa,n,n,sg,,d,0,0' drel='k1:VGF' name='NP2'>
2.1    vixyA         NN
2.2    nirvahaNa     NN
       ))
3      ((     NP     <fs af='SAKa,n,n,pl,,,xvArA,xvArA' drel='k3:VGF' name='NP3'>
3.1    praBuwva      NN
3.2    SAKalaxvArA   NN
       ))
4      ((     VGF    <fs af='jarugu,v,fn,sg,3,,wA,wA' name='VGF' voicetype='active'>
4.1    jaruguwuMxi   VM
4.2    .             SYM
       ))
</Sentence>

Each tree/sentence is made up of one or more related nodes/chunks. Every node has the following properties:

1. Address - referred to by property name ADDR

2. Token - accessed by attribute name TKN

3. Category - accessed by attribute name CAT

4. Others - used to store user-defined features, which are accessed through their feature names orattribute names.

As an example, one node/chunk of the example sentence shown above is detailed below.


Address   Token       Category   Attribute-Value Pairs
2         ((          NP         <fs af='nirvahaNa,n,n,sg,,d,0,0' drel='k1:VGF' name='NP2'>
2.1       vixyA       NN         <fs af='vixya,n,f,sg,,,0 kA,0 A' name='vixyA'>
2.2       nirvahaNa   NN         <fs af='nirvahaNa,n,n,sg,,d,0,0' name='nirvahaNa'>
          ))

In the sentence tag, 'id' refers to the sentence identity in the given annotated corpus; in the example sentence the 'id' value is '58'. Chunks and lexical words have basic properties like 'Address', 'Token', 'Category', and 'Attribute-Value Pairs'. The example sentence shown in this section has 4 chunks.

1. ’Address’ is for the purpose of tracking down the chunk element or the lexical word.

2. 'Token' is the actual word from the sentence, but at the chunk level it also specifies chunk boundaries using open and closed parentheses. The address and token values for the example chunk are '2' and '((', where the open parenthesis denotes the chunk's starting boundary.

3. Attribute features at the chunk level include 'name', 'head', and 'drel'. The attribute 'name' denotes the name of the chunk; it is unique for each chunk within the sentence and acts as an identifier of the chunk at the sentence level. 'NP2' is the value of the 'name' attribute in the example chunk. The 'head' attribute of the chunk denotes the representative lexical word of that chunk; for example, 'nirvahaNa' is the head token of the example chunk '2'. The dependency attribute is specified with the attribute key 'drel'. In the example sentence, the chunk with address '2' has the dependency attribute value 'k1:VGF'; this denotes that the current chunk is attached to the parent chunk whose 'name' attribute value is 'VGF' (the chunk with address value '4'), and 'k1' is the dependency label assigned to that attachment.

At the lexical word level, we use feature sets encoded in the <fs> tag. The feature set contains the attribute 'af', which denotes an abbreviated feature structure. It contains various lexical-level information, detailed in the example below. At the token level, we have a 'name' attribute similar to the 'name' attribute mentioned at the chunk level. The eight feature attributes of 'af' are as follows (a small parsing sketch follows the field list):

vixya,   n,    f,    sg,   ,     ,     0 kA,      0 A
root     cat   gen   num   per   cas   vib/tam    suf
(1)      (2)   (3)   (4)   (5)   (6)   (7)        (8)

1. root: Root form of the word

2. cat: Coarse-grained POS

3. gen: Masculine/Feminine/Neuter


4. num: Singular/Plural

5. per: First/Second/Third person

6. cas: Oblique/Direct case

7. vib/tam: (suffix/post-positions, etc.)/TAM (tense, aspect and modality)

8. suf: Suffix of the word
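To make the field layout concrete, the following Python sketch (a hypothetical helper written for this description, not part of the annotation tools used in this work) splits an 'af' value into the eight named fields listed above:

AF_FIELDS = ["root", "cat", "gen", "num", "per", "cas", "vib/tam", "suf"]

def parse_af(af_value):
    # e.g. parse_af("vixya,n,f,sg,,,0 kA,0 A")
    #      -> {'root': 'vixya', 'cat': 'n', 'gen': 'f', 'num': 'sg', ...}
    values = [v.strip() for v in af_value.split(",")]
    # pad in case trailing empty fields were dropped in the source
    values += [""] * (len(AF_FIELDS) - len(values))
    return dict(zip(AF_FIELDS, values))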

1.1.3.2 CONLL Format

CONLL is a widely accepted format for dependency-annotated corpora in the NLP community. All the state-of-the-art parsers like MaltParser, MSTParser and TurboParser take input in the CONLL format. Annotations are encoded in plain text files (UTF-8, using only the LF character as the line break) with three types of lines:

1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters;

2. Blank lines marking sentence boundaries.

3. Comment lines starting with hash (#).

The sentence shown in SSF format in section 1.1.3.1 is represented here in CONLL format:

1 AMXrapraxeSlo AMXrapraxeS NNP cat-n|gen-n|num-sg|pers-|case-|vib-lo|tam-lo|chunkId-NP|stype-|vtype- 4 k7p

2 nirvahaNa nirvahaNa NN cat-n|gen-n|num-sg|pers-|case-d|vib-0|tam-0|chunkId-NP2|stype-|vtype- 4 k1

3 SAKalaxvArA SAKa NN cat-n|gen-n|num-pl|pers-|case-|vib-xvArA|tam-xvArA|chunkId-NP3|stype-|vtype- 4 k3

4 jaruguwuMxi jarugu VM cat-v|gen-fn|num-sg|pers-3|case-|vib-wA|tam-wA|chunkId-VGF|stype-|vtype- 0 main

Sentences consist of one or more word lines, and each word line contains the following 10 columns:

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for tokens with multiple words.

2. FORM: Word form or punctuation symbol.

3. LEMMA: Lemma or stem of word form.

4. UPOSTAG: Universal part-of-speech tag drawn from our set of POS tags shown in A.1

5. XPOSTAG: Language-specific part-of-speech tag; underscore if not available.

6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.


7. HEAD: Head of the current token, which is either a value of ID or zero (0).

8. DEPREL: Universal Stanford dependency relation to the HEAD (root if HEAD = 0) or a defined language-specific sub-type of one.

9. DEPS: List of secondary dependencies (head-deprel pairs).

10. MISC: Any other annotation.

In the example sentence, the "UPOSTAG" column is not shown because of space constraints. The FEATS set includes the coarse POS category, gender and number (singular or plural); 'vib' denotes the vibhakti, a kind of suffix appended to the word, and 'tam' denotes the tense, aspect and modality (TAM), shown for each node in the sentence. Extended features include 'stype', denoting the sentence type such as interrogative, statement, etc., and 'vtype', specifying the voice type of the sentence (passive or active). The HEAD column of a word line is '4' in this example, specifying that the word is attached to the word line with ID value '4' in the sentence. DEPREL specifies the dependency relation labelling the attachment; for the second word line, the dependency relation is 'k1'. The words of the example sentence are represented in wx-notation, which is detailed under subsection 1.1.3. The last two column values are not considered for the Telugu and Hindi annotated corpora used in our experiments.
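As an illustration of how such files can be consumed, the following Python sketch (a minimal reader written for this description, not the exact tool used in this work) collects the word lines of each sentence into dictionaries keyed by the column names above, assuming the full tab-separated 10-column layout:

def read_conll(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):          # comment lines
                continue
            if not line.strip():              # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split("\t")
            current.append({
                "id": cols[0], "form": cols[1], "lemma": cols[2],
                "upostag": cols[3], "xpostag": cols[4], "feats": cols[5],
                "head": cols[6], "deprel": cols[7],
            })
    if current:                               # flush the last sentence
        sentences.append(current)
    return sentences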

1.1.4 Telugu Treebank

Telugu is the fifteenth most spoken language worldwide and the third most spoken language in India. Telugu has a rich history, with inscriptions and Telugu words found dating back to 400 BC to 100 BC [68]. Like many Indian languages, Telugu is a free word order language. Following are some examples illustrating the free word order in Telugu; the words or lexicons of the example sentences are represented using wx-notation as detailed under subsection 1.1.3. Presently, the Telugu treebank has forty thousand manually validated tokens in 2593 sentences.

(1) Ramuxu Siwaku puswakamunu ichaxu.
    'Ram' 'to Sita' 'book' 'gives'
    'Ram gives a book to Sita'

(2) Siwaku Ramudu puswakamunu ichaxu.
    'to Sita' 'Ram' 'book' 'gives'
    'Ram gives a book to Sita'

(3) puswakamunu Ramudu Siwaku ichaxu.
    'book' 'Ram' 'to Sita' 'gives'
    'Ram gives a book to Sita'

(4) Siwaku puswakamunu Ramudu ichaxu.
    'to Sita' 'book' 'Ram' 'gives'
    'Ram gives a book to Sita'

(5) Ramudu Siwaku puswakamunu ichaxu.
    'Ram' 'to Sita' 'book' 'gives'
    'Ram gives a book to Sita'

(6) Ramudu puswakamunu Siwaku ichaxu.
    'Ram' 'book' 'to Sita' 'gives'
    'Ram gives a book to Sita'

Telugu is an agglutinative language, and so case markers are agglutinated with most of the words in sentence realization. Because of this, Telugu is even more flexible as a free word order language and can have a huge number of syntactic variations compared to other South Asian languages.

1.1.5 Anomalies in Treebanks

As specified earlier, the treebank annotation process is expensive and time consuming. Annotators are trained to understand the annotation guidelines. As different people have various levels of understanding, the way they perceive these guidelines may not be the same even after a good amount of training, depending on the difficulty level of the annotation guidelines. Further, the process of annotation is repetitive, and humans are prone to make errors in such processes because of varied factors like fatigue, in spite of a correct understanding and interpretation of the annotation schema. Even typos and format errors may happen during annotation; these can be easily detected and later corrected using sanity checking tools (which we do not address in our work).

Based on the anomalies observed in the treebanks, the possible error types are POS tag errors, morph errors, chunk tag errors and dependency label errors. In our annotation schema, the POS tagging, morph analysis and chunk boundary identification guidelines are relatively easy, so a correct interpretation of these guidelines can be achieved after a good amount of training, whereas the dependency label annotation guidelines are challenging in most cases because of complex realizations. Traditionally, POS tag, morph and chunk errors are detected using rule based systems [4], where it was shown empirically that the rule based system does better than probabilistic models. The major errors that affect the quality of the treebank are in the syntactico-semantic level dependency labels or karaka relations. In this work, we focus on dependency label error detection, because such errors need a considerable amount of an expert's time to validate. Thus, to save the time and cost of the treebank validation process, we design smart tools to detect these possible error instances in the treebank.

1.2 Contributions of thesis

1. We proposed a novel, language-independent automatic error detection approach to detect errors in annotated corpora using state-of-the-art dependency parsers. We address the skewed data set problem (present in most Indian language treebanks) faced by existing statistical methods of error detection using word embeddings.


2. We have introduced an automatic error correction tool for Indian language treebanks.

1.3 Thesis Outline

Chapter 2 discusses previous work related to treebank anomaly detection.
Chapter 3 presents the dependency parsers used for the error detection tools.
Chapter 4 describes our approach and the methods designed to detect errors in treebanks, considering the skewed data set problem.
Chapter 5 provides details of early attempts to automatically correct the errors in treebanks of Indian languages.
Chapter 6 concludes the work and defines the scope for future directions, building on the current work.


Chapter 2

Related work

Anomaly detection in annotated corpora has been a major research concern for more than two decades. In early work, [7] mentions the importance of the consistency or quality of a tagged corpus. Their observation of high inconsistency in the manually tagged corpus led them to propose a solution to address these inconsistencies. [61] highlights the fact that the true accuracy of a classifier could be better or worse than reported, depending on the quality of the corpus used for the evaluation. Evaluating two taggers on the WSJ (Wall Street Journal) annotated corpus, they found tagging accuracy rates for ambiguous words of 91.35% and 92.82%. Given the estimated 3% error rate of the WSJ tagging [44], they argue that the difference in performance is not sufficient to establish which of the two taggers is actually better. Later, [66] proposed a tool to correct these inconsistencies in manually tagged corpora to enhance their quality. They proposed an automatic tagger to tag these syntactic annotations automatically and compared the output of the tool to the manually tagged data to detect the inconsistencies. They empirically found that the WSJ tagged data contains a considerable number of inconsistent instances. Later, [32] proposed a similar method to automatically detect and correct these errors based on the automatic taggers developed. These methods do not work with medium or small data sets; they require considerable amounts of data to assign tags accurately.

[28] proposed a method based on variation n-grams to detect POS tag errors in annotated corpora. They design an efficient way of computing all variation n-grams for a corpus and define heuristics (trust variation, distrust variation) for deciding whether a particular variation is an ambiguity or an error. They were able to detect a large number of tagging errors in the WSJ corpus with high precision. Later, [29] extended the variation n-gram method to handle discontinuous constituents. They evaluated the variation n-gram error detection method for discontinuous syntactic constituents on the TIGER corpus [15] and detected a large number of errors with high precision.

Thereafter, [41] and [39] report their experiences from work on an English treebank of spontaneous speech data. They summarize the methods for arriving at a consistent annotated treebank, mentioning style books, automatic partial parsers and inter-annotator agreement. The partial parsing approach proved to be the most useful, as the treebank was annotated by only one annotator. Further, they described an automatic consistency checker which, for a given phrase type, extracts all strings of words which


have been annotated with the given phrase type. [67] and [23] experimented with the hypothesis that certain words in the corpora are the major cause of inconsistency or errors. They applied a parser to a large text corpus and then split the corpus into two sections, one with those sentences parsed by the parser and the other with those sentences for which the parser failed. They identified the words or n-grams that occur in the sentences where the parser failed to parse and are not present in the sentences that are successfully parsed. They claimed that the reported list of words caused the major inconsistency or errors.

All the methods mentioned above are biased towards positional languages, where the position of a word is the primary source of information. Indian languages are free word order languages, where the primary source of information is the relation between the words in a sentence and the syntactic cues present in the sentence. N-grams are based on position, so they work poorly for free word order languages compared to positional languages. The large number of inflections of words in Indian languages makes it even more challenging to use n-gram variations, considering the limited annotated corpora or text available for Indian languages.

In the context of Indian languages, an earlier effort on error detection was made by [6], using a statistical module and a rule based post-processing module to detect anomalies. The statistical module used in the system is called FBSM (Frequency Based Statistical Module), which is based on the hypothesis that errors or inconsistent patterns are rare. They computed pattern frequencies and identified as errors the patterns whose frequencies were lower than a certain threshold. Later, [4] proposed a different statistical module called the PBSM (Probability Based Statistical Module) to overcome the data sparsity limitation present in FBSM. To increase the precision of the system, they used rules to prune out false positives. Thereafter, they proposed a hybrid system comprising a statistical module and a rule based system. [2] extended the previous work further by replacing PBSM with EPBSM (Extended Probability Based Statistical Module), in which they proposed rule based pre- and post-processing modules to further prune the erroneous cases which were misleading the statistical module. In the proposed system, they extracted features like POS tag, vib, tam, and the counts of the grandparent and siblings of the current node to form a feature vector, estimated the frequency or probability using maximum entropy methods, and applied a threshold on this probability to identify the errors. The major issue with their method is the skewed data set, which is very common for most Indian languages. Based on the results shown, the performance of the statistical system is poor when there is no supporting rule based system, and features like grandparent and sibling counts make the data even more sparse. Recently, [4] used MaltParser to parse the sentences and then classified the dependency labels into two sets, one with accurately parsed instances and the other with inaccurate ones. They then mapped the test set dependency relations into the accurate or inaccurate categories. They identified an annotated dependency label as an error either if the label fell into the accurate category and mismatched the MaltParser output, or if it fell into the inaccurate category and matched the MaltParser output. As dependency annotation is a complex process because of various factors such as varied contexts and complex sentence realizations, MaltParser alone is unable to capture all the scenarios of parsing. Parsers are sensitive to some of the realizations


of the sentences because of their inherent mathematical models. For example, MaltParser is not good at long-distance dependencies, while MSTParser captures them effectively. In our approach, we used various methods to detect the errors based on the outputs of three state-of-the-art parsers. We used these parsers' outputs extensively to detect the errors while addressing the above-mentioned issues, and we are able to detect the errors with high precision.

After detecting the errors, we attempt to auto-correct the errors in the annotated corpora. In the context of error correction, [24] proposed a method to correct parts of speech (POS) errors. The proposed method first detects the errors in the corpora using the variation n-gram method [28] and replaces the erroneous tags with new ones generated by an existing state-of-the-art POS tagger. The approach is based on the hypothesis that the taggers will learn the consistent patterns from the corpora. Later, they extended the method by grouping similar words into the same category and also proposed new complex ambiguity tags by merging multiple ambiguous tags into a single tag, for example RB/RP. [25] proposed a method to include the ambiguity tags in the tag set and empirically proved that such tags enhance the quality of the annotated corpora. Further, [26] experimented with dependency correction; they proposed context-aware feature-based models using memory based learning to predict the correct labels. Based on the empirical results, they realized that stacking up the context based features does not yield the required correction precision; hence, they tried generic models that constrain the feature models by introducing the ambiguity tag as a new feature. In our work, we tried out some heuristic based methods on the parsed outputs and then experimented with the methods proposed by [26] in the context of Indian language treebanks. To the best of our knowledge, this is the first attempt to correct errors in the annotated corpora of Indian languages.


Chapter 3

Dependency Parsers

In this chapter, we introduce the dependency parsers MaltParser, MSTParser and TurboParser. We leveraged these parsers to propose our methods for error detection and correction. Later, we detail the computation of confidence scores for the Malt and MST parsers in section 3.5. Section 3.6 outlines a comparison of these parsers.

Parsing has been a well-known problem in natural language processing for the past few decades. Parsing is the syntactic analysis of a natural language sentence, which may include semantic information, based on a formal grammar. Phrase structure grammar and dependency grammar are the notable formalisms in computational linguistics. Phrase structure grammar is based on constituency-based parse trees which distinguish between terminal and non-terminal nodes; it is a generative grammar which uses rewrite rules to structurally generate a sentence. Dependency parsing annotates the relations among the nodes in a sentence, relations which mainly carry semantic meaning. Dependency representations of sentences ([37] and [9]) model head-dependent syntactic relations as edges in a directed graph. Figure 3.1 shows the phrase structure and dependency structure of the example sentence 'The big dog chased down the cat'.

Figure 3.1 Phrase Structure and Dependency Structure respectively for the English sentence “The bigdog chased down the cat ”

Phrase structure grammar primarily gets its information from the position of a node in a sentence, whereas dependency grammar gets information from the node itself to assign relations between two nodes. Naturally, for free word order languages like Indian languages (Hindi, Telugu, Bengali, etc.) the dependency framework is better suited than phrase structure grammar, as proposed in [9], [37], [53] and [63]. Thus, in this work we consider the dependency framework for parsing Indian languages.


3.1 Dependency parsers

Parsing involves extracting information and encoding this information at different granularity levels. The first part of parsing is to identify the morph information of each word: for every word in the input sentence, a dictionary or a lexicon needs to be looked up and the associated grammar retrieved. The next step involves parts-of-speech (POS) tagging, assigning a word category to each word based on the context. Thereafter, words in the sentence need to be grouped (which we call local word grouping) to form verbals and nominals. After that, one needs to identify the modified-modifier word pairs in the sentence to assign dependency relations among these local word groups. Section 1.1.3.1 shows a sentence which has been dependency parsed with all the information encoded in SSF format, as detailed in Chapter 1. The task of identifying modifier-modified (parent-child) relations and assigning the proper dependency relation label is more difficult than the other tasks mentioned. For the task of dependency parsing, various algorithms have been proposed based on dynamic programming, constraints and deterministic parsing formalisms. Multiple dependency parsers are available based on different techniques. In our experiments we used MaltParser, MSTParser and TurboParser for dependency parsing of the treebanks; these three parsers are detailed in the following sections.

3.2 MaltParser (A Transition-based Dependency Parser)

MaltParser (http://www.maltparser.org/) is a transition-based deterministic parser ([56] and [58]). It can be classified as a data-driven parser generator: while a traditional parser generator constructs a parser given a grammar, a data-driven parser generator constructs a parser given a treebank. It is an implementation of inductive dependency parsing [56], where the syntactic analysis of a sentence amounts to the derivation of a dependency structure, and where inductive machine learning is used to guide the parser at non-deterministic choice points. The parsing algorithm contains two major components, described below.

1. A transition system for mapping sentences to dependency trees

2. A classifier for predicting the next transition for every possible system configuration

Given these two components, dependency parsing can be realized as a deterministic search through the transition system, guided by the classifier. With this technique, parsing can be performed in linear time for projective dependency trees and quadratic time for arbitrary (possibly non-projective) trees [57].

MaltParser implements various deterministic parsing algorithms like Nivre arc-eager, Nivre arc-standard, Covington incremental, etc. The arc-eager algorithm builds a labeled dependency graph in one left-to-right pass over the input. A configuration in the arc-eager projective system contains a stack holding partially processed tokens, an input buffer containing the remaining tokens, and a set of arcs representing the partially built dependency tree. The other algorithms are variations of the arc-eager



algorithm, detailed in [57]. There are four possible transitions, detailed below; a minimal code sketch of this transition system follows the list. Here we assume 'top' is the token on top of the stack and 'next' is the next token in the input buffer.

1. LEFT-ARC (r) : Add an arc labeled r from next to top; pop the stack.

2. RIGHT-ARC (r): Add an arc labeled r from top to next; push next to the stack.

3. REDUCE: Pop the stack.

4. SHIFT: Push next onto the stack.
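The following Python sketch illustrates these four transitions (our own minimal rendering of the arc-eager system described above, not MaltParser's implementation); the classifier that chooses the next transition at each configuration is omitted:

class ArcEagerState:
    """Configuration: a stack, an input buffer and a set of (head, label, dependent) arcs."""

    def __init__(self, num_words):
        self.stack = [0]                               # 0 is an artificial ROOT token
        self.buffer = list(range(1, num_words + 1))    # token indices of the sentence
        self.arcs = set()                              # (head, label, dependent) triples

    def _has_head(self, token):
        return any(dep == token for _, _, dep in self.arcs)

    def left_arc(self, label):
        # Add an arc labeled `label` from next to top; pop the stack.
        top, nxt = self.stack[-1], self.buffer[0]
        assert top != 0 and not self._has_head(top)
        self.arcs.add((nxt, label, top))
        self.stack.pop()

    def right_arc(self, label):
        # Add an arc labeled `label` from top to next; push next onto the stack.
        top, nxt = self.stack[-1], self.buffer.pop(0)
        self.arcs.add((top, label, nxt))
        self.stack.append(nxt)

    def reduce(self):
        # Pop the stack; only permitted once top already has a head.
        assert self._has_head(self.stack[-1])
        self.stack.pop()

    def shift(self):
        # Push next onto the stack.
        self.stack.append(self.buffer.pop(0))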

Figure 3.2 Sentence from Penn Treebank

Figure 3.3 Sequence of steps in dependency parsing using MaltParser

Consider the example English sentence 'Economic news had little effect on financial markets.' taken from the Penn Treebank [44]. Figure 3.2 shows the dependency tree for the sentence.

Figure 3.3 shows the sequence of steps for parsing the example sentence described in Figure 3.2 using the arc-eager algorithm. The proposed MaltParser algorithm can only parse projective (nested


or non-crossing) sentences. Assuming a unique root as the leftmost word in the sentence, a projective graph is one that can be written with all words in a predefined linear order and all edges drawn on the plane above the sentence, with no edge crossing another. Figure 3.2 shows a projective parse tree. To handle non-projective parse trees, [60] came up with a pseudo-projective method in which the training data is projectivized by a minimal transformation, lifting non-projective arcs one step at a time and extending the arc label of lifted arcs using the encoding scheme called HEAD.

Machine learning classifiers available for MaltParser include LIBSVM with a polynomial kernel, based on support vector machines, and LIBLINEAR. Input data is represented in the CONLL format detailed in section 1.1.3.2 of Chapter 1.

3.3 MSTParser (A Graph-based Dependency Parser)

MSTParser (https://sourceforge.net/projects/mstparser/) builds a complete graph over the words of a sentence, each word being a node. It gives weights to the edges of the graph based on previously learned contexts, and then chooses the maximum spanning tree as the output parse. MSTParser internally represents a generic directed graph G = (V, E) by its vertex set V = {v1, ..., vn} and a set E ⊆ [1:n] × [1:n] of pairs (i, j) of directed edges vi → vj. Each edge is weighted with a score s(i, j). The maximum spanning tree (MST) y ⊆ E of the graph G is defined as the set of edges that maximizes the value ∑(i,j)∈y s(i, j), such that every vertex in V appears in y.

Let us denote a generic sentence as x = x1 ... xn and a generic dependency tree for sentence x as y. Considering y as the set of tree edges, (i, j) ∈ y if there is a dependency in y from word xi to word xj. The sum of the scores of all edges in the tree is taken as the score of the dependency tree. In particular, the score of an edge is the dot product between a weight vector and a high-dimensional feature representation of the edge,

s(i, j) = w · f(i, j)

Thus, the score of a dependency tree y for sentence x is

s(x, y) = ∑(i,j)∈y s(i, j) = ∑(i,j)∈y w · f(i, j)

Considering an appropriate weight vector w as well as a feature representation, dependency parsing is the task of finding the highest-scoring dependency tree y for a given sentence x.

Let us assume that the weight vector w is known and thus we know the score s(i, j) of each possible edge. In the next sections, we present a method for learning the weight vector.
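To make the scoring concrete, the following Python sketch (a minimal illustration under the assumption of a sparse binary feature representation, not MSTParser's code) computes the edge score s(i, j) = w · f(i, j) and the tree score s(x, y):

def edge_score(w, features):
    # s(i, j) = w . f(i, j) for a sparse binary feature vector:
    # `features` is the list of feature names firing on edge (i, j),
    # `w` maps feature names to learned weights.
    return sum(w.get(feat, 0.0) for feat in features)

def tree_score(w, tree_edges, feature_fn):
    # s(x, y) = sum of w . f(i, j) over all edges (i, j) in the tree y.
    return sum(edge_score(w, feature_fn(i, j)) for (i, j) in tree_edges)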



3.3.1 Parsing Algorithm

To find the highest-scoring non-projective tree, one simply needs to search the entire space of spanning trees with no restrictions. MSTParser uses the efficient Chu-Liu-Edmonds algorithm, detailed in [17] and [30], to find the MST in a directed graph. Informally, the algorithm has each vertex in the graph greedily select the incoming edge with the highest weight. If a tree results, it must be the maximum spanning tree. If not, there must be a cycle. The procedure identifies a cycle, contracts it into a single vertex and recalculates the weights of edges going into and out of the cycle. It can be shown that a maximum spanning tree of the contracted graph is equivalent to a maximum spanning tree in the original graph [33]. Hence, the algorithm can recursively call itself on the new graph. To find the highest-scoring non-projective tree for a sentence x, the graph Gx is constructed and run through the Chu-Liu-Edmonds algorithm. The resulting spanning tree is the best non-projective dependency tree.
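The following Python sketch shows the greedy first step of this procedure and the cycle check that follows (our own simplification, assuming a complete graph over the sentence as described above; the contraction and recursion steps of Chu-Liu-Edmonds are omitted):

def greedy_incoming(scores, n, root=0):
    # Every non-root vertex picks its highest-scoring incoming edge.
    # scores[(i, j)] is the score of the directed edge i -> j; vertices are 0..n-1.
    best_head = {}
    for j in range(n):
        if j == root:
            continue
        candidates = [(scores[(i, j)], i) for i in range(n)
                      if i != j and (i, j) in scores]
        best_head[j] = max(candidates)[1]
    return best_head

def find_cycle(best_head):
    # Return the set of vertices on a cycle among the chosen edges,
    # or None if the chosen edges already form a tree rooted at the root.
    for start in best_head:
        path, v = [], start
        while v in best_head and v not in path:
            path.append(v)
            v = best_head[v]
        if v in path:                      # walked back into the path: a cycle
            return set(path[path.index(v):])
    return None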

3.3.2 Learning and Feature Selection

In the previous section, we saw how to find the dependency tree y for a given sentence x, given the high-dimensional feature representation of the edge f(i, j) and the weight vector w. In this section, we present how to compute the weight vector and the high-dimensional feature representation.

The weight vector w is computed using the Margin Infused Relaxed Algorithm (MIRA) [48], an online large-margin learning algorithm. The strength of the approach comes from a rich set of features over parsing decisions, as well as surface-level features relative to these decisions. For example, it incorporates features over the parts of speech of words occurring between and around a possible head-dependent relation. The accuracy is highly dependent on the defined feature set, since these features eliminate unlikely scenarios such as a preposition modifying a noun not directly to its left, or a noun modifying a verb with another verb occurring between them.
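As a rough illustration of the online large-margin idea (our paraphrase of a single-best MIRA update, not the exact MSTParser training code), each update moves the weights just enough that the gold tree outscores the predicted tree by a margin equal to the loss between them:

def mira_update(w, feats_gold, feats_pred, loss):
    # w, feats_gold, feats_pred: dicts mapping feature names to values;
    # loss: e.g. the number of words with an incorrect head in the prediction.
    keys = set(feats_gold) | set(feats_pred)
    diff = {k: feats_gold.get(k, 0.0) - feats_pred.get(k, 0.0) for k in keys}
    margin = sum(w.get(k, 0.0) * v for k, v in diff.items())   # s(x, gold) - s(x, pred)
    violation = loss - margin
    norm_sq = sum(v * v for v in diff.values())
    if violation > 0 and norm_sq > 0:
        tau = violation / norm_sq            # smallest step that satisfies the margin
        for k, v in diff.items():
            w[k] = w.get(k, 0.0) + tau * v
    return w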

MSTParser uses three different types of features, namely basic, extended, and second-order features. Consider an edge (i, j) ∈ y, where xi is the head and xj is the modifier. The basic set of features used by MSTParser is shown in Figure 3.4 (a) and (b). The unigram features provide information about the modifier and the head separately; the bigram features provide the conjoined information of the modifier and the head together.


Figure 3.4 Features used by MSTParser, where xi is the head and xj the modifier

3.4 TurboParser

TurboParser is an example of a constraint-based parser, which formulates non-projective dependency parsing as a polynomial-sized integer linear program. It can also encode prior knowledge as hard constraints and learn soft constraints from data. In particular, its model is able to learn correlations among neighboring arcs, word valency, and tendencies toward projective parses. Approximate parsers of this kind have been introduced based on belief propagation [65], dual decomposition [40], and multi-commodity flows ([46], [45]). These are all instances of turbo parsers: as shown by [47], the underlying approximations come from the fact that they run global inference in factor graphs while ignoring loop effects. In fact, [47] proposed a variational approximation bridging the gap between the solutions of [65] and [46].

3.4.1 Parsing

Let an input sentence be denoted by x and the set of possible dependency parses for x be denoted by Yx. A generic linear scoring function based on a feature vector representation g is used in parsing algorithms that seek to find:

parse*(x) = argmax_{y ∈ Yx} w^T g(x, y)    (1)

The score is parameterized by a vector w of weights, which are learned from data (most commonly using the Margin Infused Relaxed Algorithm (MIRA) [48]). The decomposition of the features into local “parts” is a critical choice affecting the computational difficulty of solving Eq. 1. The most aggressive decomposition leads to an “arc-factored” or “first-order” model, which permits an exact, efficient solution of Eq. 1 using spanning tree algorithms [49] or, with a projectivity constraint, dynamic programming [31]. Second- and third-order models have also been introduced, typically relying on approximations, since less-local features increase the computational cost, sometimes to the point of NP-hardness [50]. TurboParser attacks the parsing problem using a compact integer linear programming (ILP) representation of Eq. 1 [46], then employing alternating directions dual decomposition (AD3; [45]). This enables the inclusion of second-order features (e.g., on a word with its sibling or grandparent; [16]) and third-order features (e.g., a word with its parent, grandparent, and a sibling, or with its parent and two siblings;


[40]). For a collection of (possibly overlapping) parts for input x, Sx (which includes the union of all parts of all trees in Yx), we will use the following notation. Let

g(x, y) = ∑_{s ∈ Sx} f_s(x, y),    (2)

where f_s only considers part s and is nonzero only if s is present in y. In the integer linear programming (ILP) framework, each s has a corresponding binary variable zs indicating whether part s is included in the output. A collection of constraints relating the zs defines the set of feasible vectors z that correspond to valid outputs and enforces the agreement between parts that overlap. Many different versions of these constraints have been studied ([65], [46], [47]). A key attraction of TurboParser is that many overlapping parts can be handled, making use of separate combinatorial algorithms for efficiently handling subsets of constraints. For example, the constraints that force z to encode a valid tree can be exploited within the framework by making calls to classic arborescence algorithms ([17], [30]). As a result, when describing modifications to TurboParser, we need to explain the additional constraints and features imposed on parts. TurboParser provides three settings for training: basic, standard, and full.

1. Basic: enables arc-factored parts.

2. Standard: enables arc-factored parts, consecutive sibling parts and grandparent parts.

3. Full: enables arc-factored parts, consecutive sibling parts, grandparent parts, arbitrary sibling parts, and head bigram parts.

3.5 Confidence score computation for parsers

3.5.1 MaltParser

[38] proposed a method to assign a dynamic confusion score to the dependency labels attached to dependency trees, based on the uncertainty of choosing the best arc from the available actions. The entropy of assigning a label to the attachment is computed and used as the confusion score, which is on a scale of three. We convert it into a confidence score by subtracting the confusion score from three. This confidence score is used to detect errors in annotated corpora. The last column of the following sentence, represented in CONLL format, shows the confidence score assigned to each dependency label on an arc.

1 AMXrapraxeSlo AMXrapraxeS NNP cat-n|gen-n|num-sg|pers-|case-|vib-lo|tam-lo|chunkId-NP|stype-|vtype- 4 k7p 2.045

2 nirvahaNa nirvahaNa NN cat-n|gen-n|num-sg|pers-|case-d|vib-0|tam-0|chunkId-NP2|stype-|vtype- 4 k1 2.552

3 SAKalaxvArA SAKa NN cat-n|gen-n|num-pl|pers-|case-|vib-xvArA|tam-xvArA|chunkId-NP3|stype-|vtype- 4 k2 1.993

4 jaruguwuMxi jarugu VM cat-v|gen-fn|num-sg|pers-3|case-|vib-wA|tam-wA|chunkId-VGF|stype-|vtype- 0 main 2.99
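A small sketch of how such confidence-annotated output might be consumed downstream, assuming the confidence score is simply the last whitespace-separated column of each token line; the threshold of 1.5 anticipates the value used later in Section 4.1.1.2.

```python
def low_confidence_tokens(conll_lines, threshold=1.5):
    """Flag tokens whose dependency-label confidence falls below the threshold.

    Assumes each non-empty line is a whitespace-separated CONLL-style token row
    whose last column is the MaltParser confidence score (3 minus the confusion
    score) and whose second-to-last column is the dependency label.
    """
    flagged = []
    for line in conll_lines:
        cols = line.split()
        if not cols:
            continue
        token_id, form, label, conf = cols[0], cols[1], cols[-2], float(cols[-1])
        if conf < threshold:
            flagged.append((token_id, form, label, conf))
    return flagged
```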


3.5.2 MSTParser

An extension to the MSTParser [52] describes a method to calculate confidence scores for the dependency arcs in dependency parsing. We extended this method to give a confidence score for the dependency labeling in MSTParser. In this method, the confidence score is computed as a weighted linear combination of the KD-fix and Delta methods detailed in [52]. KD-fix is a probabilistic method which samples weight vectors from a Gaussian distribution whose mean parameters and covariance matrix are learned from the development data. The confidence score of each label assignment is the probability of this label induced from the distribution over parameters: it is computed as the ratio between the number of parse trees, among the k sampled parse trees generated using the k sampled models, that assign the final label to the edge, and the number k. This confidence score lies between 0 and 1. The Delta method generates the next best parse tree which is forced not to assign the actual label; the confidence score for the given label assignment is the difference between the parse tree scores of the actual parse tree and the alternatively generated parse tree, where the score of a parse tree is the sum of the scores of all its label assignments. These confidence scores are always positive and lie between 0 and 1. The last column in the following sentence, represented in CONLL format, shows the confidence score assigned to each dependency label on an arc.

1 AMXrapraxeSlo AMXrapraxeS n cat-n|gen-n|num-sg|pers-|case-|vib-lo|tam-lo|chunkId-NP|stype-|vtype- 4 k7p 0.417

2 nirvahaNa nirvahaNa n cat-n|gen-n|num-sg|pers-|case-d|vib-0|tam-0|chunkId-NP2|stype-|vtype- 4 k1 0.233

3 SAKalaxvArA SAKa n cat-n|gen-n|num-pl|pers-|case-|vib-xvArA|tam-xvArA|chunkId-NP3|stype-|vtype- 4 k2 0.35

4 jaruguwuMxi jarugu v cat-v|gen-fn|num-sg|pers-3|case-|vib-wA|tam-wA|chunkId-VGF|stype-|vtype- 0 main 0.9
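The agreement-ratio component of the KD-fix idea can be sketched as follows; in practice the sampled labellings come from k parser models whose weight vectors are drawn from the learned Gaussian, which is not reproduced here.

```python
def agreement_confidence(final_label, sampled_labels):
    """KD-fix-style confidence: the fraction of the k sampled parses that assign
    the same label to the edge as the final parse (a value between 0 and 1)."""
    k = len(sampled_labels)
    return sum(1 for lbl in sampled_labels if lbl == final_label) / k

# e.g. 7 of 10 sampled models agree with the final label 'k1'
print(agreement_confidence('k1', ['k1'] * 7 + ['k2'] * 3))  # 0.7
```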

3.5.3 TurboParser

We did not compute a confidence score for TurboParser, as it is a constraint-based parser.

3.6 Comparison of Malt, MST and Turbo Parsers

[51] compared the accuracy of MSTParser with MaltParser on a number of structural and linguistic dimensions. They observed that, though the two parsers exhibit indistinguishable accuracies overall, MSTParser tends to outperform MaltParser on longer dependencies as well as dependencies closer to the root of the tree (e.g. verb, conjunction and preposition dependencies), whereas MaltParser performs better on short dependencies and those further from the root (e.g. pronoun and noun dependencies). Since long dependencies and those near the root are typically the last to be constructed in transition-based parsing systems, it was concluded that MaltParser suffers from some form of error propagation due to its greedy parsing procedure. On the other hand, the richer feature representations of MaltParser led to improved performance in cases where error propagation has not occurred.


MaltParser's speed actually depends on the ML classification algorithm. MaltParser (LIBSVM, arc-eager) is slower compared to the Turbo and MST parsers.


Chapter 4

Error detection

In the creation of annotated corpora, dependency annotation is the major bottleneck, or rate-determining step, as mentioned earlier in chapter 1. In this chapter, we present different approaches in section 4.1 to automatically detect dependency anomalies in annotated corpora. We validate our approaches by conducting various experiments on the Hindi and Telugu treebanks and discuss the results obtained in section 4.1.3.

4.1 Error detection in annotated corpora

The errors that creep into annotated corpora are highly skewed, as erroneous instances are rare compared to normal ones. Our assumption is that such instances have little bearing on the statistical predictions that a parser makes: a parser will learn the consistent annotations in the training corpus and prune out the skewed inconsistencies. To this end, guided by comparisons reported in the literature, we have chosen three state-of-the-art statistical parsers, namely MaltParser, MSTParser and TurboParser, which use different learning and parsing strategies.

MaltParser uses a transition-based algorithm to parse a sentence by learning the next best action from the training data. The building blocks of MaltParser and its various algorithms are detailed in section 3.2. MSTParser searches for the maximum spanning tree in a directed graph using the various algorithms of [49], as described in section 3.3. TurboParser uses integer linear programming techniques to parse a sentence; more details are discussed in section 3.4. These three parsers use different learning algorithms and fundamentally different techniques to capture the information carried by linguistic dependency arcs. Hence, each parser has its own strengths and weaknesses in capturing various aspects of linguistic phenomena, as explained in section 3.6. We pool the predictions of these parsers in multiple novel ways to identify the errors in a treebank.


4.1.1 Our approach

We use various strategies, namely a simple ensemble approach, an extended simple ensemble approach, weighted voting, and ML classifiers, to combine the parser outputs and confidence scores and detect possible errors in treebanks. These methods are detailed in this section.

4.1.1.1 Simple ensemble approach or Voting

In this approach, we compare the parses given by the parsers and identify errors based on the parsers' agreement or disagreement on attachment and label. The error classification based on the ensemble of the Malt, MST and Turbo parser outputs is defined as follows.

4.1.1.1.1 Error classification The following are the four types of errors that we define.

1. Type 1: Type 1 errors are label errors defined based on a mismatch between the parsers' outputs and the human annotation. If the dependency label annotated by a human differs from the parsers' output, and the parsers do not differ among themselves, then the dependency label is treated as a possible error instance. In this case there is no disagreement on the attachment among the parsers and the human annotation.

2. Type 2: Type 2 errors are also label errors, based on ambiguity among the parsers' outputs and the human annotation. If the attachments marked by all the parsers and the human annotation agree, but the dependency labels marked by the parsers differ among themselves and all differ from the human annotation, then the annotated label is marked as a possible error. This is further extended such that, if any two parsers agree on the same label and differ from the label marked by the third parser and from the human annotation, the dependency label is also treated as a possible error. The intuition behind this error class is that disagreement among all the parsers and the human annotation suggests ambiguity in the given instance, and thus a potential human annotation error. The ambiguity may be due to the rare occurrence of the given instance or to limited context.

3. Type 3: Type 3 errors are attachment errors defined based on a mismatch between the parsers' outputs and the human annotation. If the attachment annotated by the human differs from the attachment given by the parsers, which agree among themselves, then the annotated attachment is treated as a possible error. The intuition here is that since all the parsers agree on the attachment based on the consistency of the treebank, the human annotated attachment may be either an exception or an error.

4. Type 4: Type 4 errors are also attachment errors, defined on the basis of ambiguity among the parsers' outputs and the human annotation. If the attachments marked by all the parsers differ among themselves and all of them differ from the human annotated attachment, then the attachment is treated as a possible error instance. This is extended to the case where any two parsers agree on the attachment and differ from the attachment marked by the third parser and from the human annotation; such an attachment is also treated as a possible error. The intuition here is that the difference between the attachments given by the parsers and by the annotator indicates ambiguity, and hence a possible error instance. A minimal sketch of this classification is given after this list.
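The following is a minimal sketch of the four-way classification, under one reading of the definitions above; parser outputs and the human annotation are given as (head, label) pairs, and the function name and representation are illustrative only.

```python
def classify_error(human, parses):
    """Classify one dependency edge as 'Type 1'..'Type 4' or None (not flagged).

    `human` and each element of `parses` are (head, label) pairs; `parses`
    holds the outputs of the three parsers (Malt, MST, Turbo).
    """
    h_head, h_label = human
    heads = [p[0] for p in parses]
    labels = [p[1] for p in parses]

    if len(set(heads)) == 1 and heads[0] == h_head:
        # attachments all agree: label errors (Type 1 / Type 2)
        if len(set(labels)) == 1:
            return "Type 1" if labels[0] != h_label else None
        majority = [l for l in set(labels) if labels.count(l) >= 2]
        if majority and majority[0] != h_label:
            return "Type 2"            # two parsers agree on a label unlike the annotation
        if not majority and h_label not in labels:
            return "Type 2"            # all labels differ from each other and from the annotation
        return None

    # attachments disagree somewhere: attachment errors (Type 3 / Type 4)
    if len(set(heads)) == 1:
        return "Type 3"                # parsers agree on a head unlike the annotation
    majority_head = [h for h in set(heads) if heads.count(h) >= 2]
    if majority_head and majority_head[0] != h_head:
        return "Type 4"
    if not majority_head and h_head not in heads:
        return "Type 4"
    return None
```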

4.1.1.2 Extended simple ensemble approach or ranking

In the simple ensemble approach, we compared the labels and attachments assigned by all three parsers. The current method extends it with the confidence score values generated by the parsers. The confidence score depicts the parser's confidence for each instance of the parse in the data: the higher the confidence score, the lower the ambiguity in parsing the given instance, meaning that the parser has knowledge of that kind of instance from the training data. A lower confidence score indicates ambiguity around the given instance or rarity in the training data. The confidence score allows us to identify possible error instances more accurately and classify them based on their accuracy values. In this method, we computed the Malt and MST parser confidence scores for each instance based on the existing methods described in section 3.5. These confidence score values are used to classify the errors detected using the simple ensemble approach into four classes, namely A, B, C and D, in decreasing order of accuracy. The algorithm, shown in Figure 4.1, prioritizes the detected errors assuming the best thresholds for MaltParser and MSTParser are Th_Malt=1.5 and Th_MST=0.5 respectively. These two thresholds are computed by maximizing the F-score of the detected A class and B class errors.

Figure 4.1 Algorithm to prioritize the errors into 4 classes
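One plausible reading of the prioritization in Figure 4.1 is sketched below, assuming class A when both parsers clear their thresholds, B or C when only one does, and D when neither does; the exact ordering is the one defined in the figure.

```python
TH_MALT, TH_MST = 1.5, 0.5   # thresholds tuned by maximizing the A/B class F-scores

def priority_class(malt_conf, mst_conf):
    """Assign a detected error to class A-D (decreasing expected precision).

    This is one plausible reading of Figure 4.1: both parsers confident -> A,
    exactly one confident -> B or C, neither confident -> D.
    """
    malt_ok = malt_conf >= TH_MALT
    mst_ok = mst_conf >= TH_MST
    if malt_ok and mst_ok:
        return "A"
    if malt_ok:
        return "B"
    if mst_ok:
        return "C"
    return "D"
```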


4.1.1.3 Weighted voting

Parsers have their own strengths and shortcomings in parsing different sentences, due to their inherently diverse technical methods. For example, MSTParser can detect long-distance modifiers better than MaltParser, whereas MaltParser is good at learning local dependencies. This comparison of the parsers is detailed in section 3.6. We computed the weight of each dependency label for a given parser by calculating the F-score of the parsed instances with that label. Later, we normalized these weights using the frequency of the corresponding labels, so that the computed weights are not biased by skewed data sets.

The overall voting score is computed using the algorithm shown in Figure 4.2. All the parses with a score lower than a certain threshold are identified as potential error instances.

Figure 4.2 Algorithm to find label based on weighted scores
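A sketch of the weighted voting step under the stated assumptions: each parser's weight for a label is its normalized per-label F-score, the label with the highest summed weight wins, and instances whose winning score falls below a threshold are flagged; the threshold value here is purely illustrative.

```python
from collections import defaultdict

def weighted_vote(parser_labels, weights, threshold=1.0):
    """Pick the label with the highest summed weight across parsers and flag
    the instance as a possible error if that score falls below the threshold.

    parser_labels: e.g. {'malt': 'k1', 'mst': 'k2', 'turbo': 'k1'}
    weights: weights[parser][label] = normalized per-label F-score of that parser.
    """
    scores = defaultdict(float)
    for parser, label in parser_labels.items():
        scores[label] += weights[parser].get(label, 0.0)
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label, best_score, best_score < threshold   # True -> possible error
```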

4.1.1.4 Machine learning classifiers

We pose the error detection problem as a classification problem with two classes, namely anomalous or erroneous instances as the positive class and correct annotations as the negative class. Empirically, we obtained the best results with the following feature set.

1. child-parent pair: From the parsed tree we choose the child-parent pair for which the attachment and the label are assigned. These words are represented using word embeddings (distributed word representations) obtained with the technique suggested in [3]. Word embeddings map the index of a word in a dictionary to a feature vector in a high-dimensional space. Every dimension contributes to multiple concepts, and every concept is expressed by a combination of a subset of dimensions. Such a mapping is learned by back-propagating the error of a task through the model to update a randomly initialized embedding [18]. For example, the words apple and orange both belong to the category of ‘fruits’, so one can expect their word vectors to be near each other, with a relatively small distance between them.

2. MaltParse vs Human annotation: This feature compares the label assigned by MaltParser to the label assigned by the human annotator; it takes value 1 if the labels match and 0 in case of a mismatch.

3. MSTParse vs Human annotation: This feature compares the parse label given by MSTParser and the human annotator. It takes value 0 if they differ, else it assumes value 1.

4. TurboParse vs Human annotation: This feature compares the parse labels assigned by TurboParser and the human annotator. It takes the values 0 and 1 in case of a mismatch and a match respectively.

5. MaltParse vs TurboParse: This feature indicates the differences between the parse labels assigned by the Malt and Turbo parsers. It assumes value 1 in case of a match and value 0 if they differ.

6. MaltParse vs MSTParse: This feature compares the parse labels given by MaltParser and MSTParser; it takes value 0 if they differ, else it assumes value 1.

7. TurboParse vs MSTParse: This feature indicates the differences between the parse labels assigned by the Turbo and MST parsers. It assumes value 1 in case of a match and value 0 if they differ.

8. Sentence length: The length of the sentence is an informative feature since, intuitively, short sentences are easier for human annotators to annotate than longer sentences, which contain complex constructions involving long-distance relations.

9. Frequency of the current pair and label: Intuitively, a high frequency of the child-parent-label combination indicates a correct annotation, whereas a low frequency indicates an anomaly. For larger data sets, this feature becomes more significant in detecting anomalies. A sketch of the stacked feature vector is given after this list.
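The following sketch shows how the feature vector for one child-parent pair might be stacked; the embedding vectors, parameter names and helper function are illustrative stand-ins rather than the exact representation used in the experiments.

```python
import numpy as np

def agree(a, b):
    """1 if two labels match, 0 otherwise (features 2-7)."""
    return 1.0 if a == b else 0.0

def build_features(child_vec, parent_vec, human, malt, mst, turbo,
                   sent_len, pair_label_freq):
    """Stack the features described above into one vector for the classifier."""
    comparisons = [
        agree(malt, human), agree(mst, human), agree(turbo, human),
        agree(malt, turbo), agree(malt, mst), agree(turbo, mst),
    ]
    return np.concatenate([
        child_vec, parent_vec,                      # 1. word embeddings of the pair
        np.array(comparisons),                      # 2-7. pairwise label agreements
        np.array([sent_len, pair_label_freq]),      # 8-9. sentence length, pair-label frequency
    ])
```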

These features are stacked for all the attachments in a treebank. A cost-sensitive support vector machine (CSSVM) classifier is used for the classification task. The Support Vector Machine (SVM) [20] is a well-proven method with guarantees based on statistical learning theory, and it can be used for both classification and regression tasks. What distinguishes an SVM from other methods is its ability to deal with high-dimensional data and the guarantee of a globally optimal solution. The anomaly detection problem has a class imbalance issue, as most of the class labels in the training data belong to the negative class, which indicates normal (correct) annotations. Since anomalies or errors are rare occurrences, a traditional classifier trained on such data will be biased towards the negative class, which is predominant. For example, if the classifier assigns the normal (negative) class label to every instance, the overall accuracy of the system appears high even though no instance is marked as an anomaly. Hence, we propose the use of a cost-sensitive SVM in which we can assign a weight to each class while learning the classifier.


Treebank   Sentences   Words/Sentence   Chunks/Sentence
Hindi      2869        12.82            7.95
Telugu     2593        5.53             3.95

Table 4.1 Treebank Data

We have chosen the radial basis function (RBF) kernel, which is generally useful in the case of an imbalanced dataset. The objective function, shown in Figure 4.3, minimizes the cumulative cost of misclassification.

Figure 4.3 Objective function of CSSVM
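A minimal sketch of the cost-sensitive setup, using scikit-learn's class_weight mechanism as a stand-in for the CSSVM objective in Figure 4.3; the data is synthetic, and the positive-class weight of 7 and gamma of 0.5 anticipate the values reported in Section 4.1.5.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in data: X holds stacked feature vectors, y is 1 for
# anomalies (rare) and 0 for correct annotations (predominant).
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = np.array([1] * 20 + [0] * 180)

# Cost-sensitive SVM: misclassifying the rare positive class costs more,
# which counters the class imbalance; RBF kernel as described in the text.
clf = SVC(kernel="rbf", gamma=0.5, class_weight={1: 7, 0: 1})
clf.fit(X, y)
print(clf.predict(X[:5]))
```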

4.1.2 Data

Experiments were conducted on the Telugu and Hindi treebank data sets. The Hindi treebank is a stable and large treebank available for Indian languages. Treebank details are shown in table 4.1. The Telugu treebank has forty thousand manually validated tokens and contains 2593 sentences; for the experiments, we used thirty thousand tokens as training data and ten thousand tokens as test data.

4.1.3 Experiments and Discussion

We stacked the outputs of the parsers described in chapter 3 to address the error detection problem. The following settings were used for the various parsers.

1. For MaltParser, we used the "nivreeager" algorithm, the "LIBSVM" machine learning algorithm and the features proposed in [64]. Since both the Hindi and the Telugu treebanks contain a considerable number of non-projective structures [14], we used the pseudo-projective algorithm proposed in [59].

2. For TurboParser, we used the settings proposed by [43] and ran the parser in basic mode for both the Hindi and the Telugu treebanks.

3. For MSTParser, the settings that performed best were second-order non-projective parsing with a beam width (k-best parses) of 5 and 10 iterations, as given in [5]. Apart from the basic parser settings, we also added a module to the MST and Malt parsers to give confidence scores for the dependency labels, as described in section 3.5.


Figure 4.4 Errors Ranking based on the Confidence score

Treebank   Parsers          T1    T2    T3    T4   Errors   D1    D2    D3    D4   Detected   Existing   Recall   Precision   F-score
Telugu     Malt-Turbo-Mst   292   344   0     32   668      108   100   0     7    215        388        55.41    32.18       40.71
Telugu     Malt-Turbo       523   191   44    1    759      177   57    7     1    242        388        62.37    31.88       42.19
Telugu     Malt-Mst         429   279   42    3    753      147   83    6     2    238        388        61.34    31.60       41.71
Telugu     Turbo-Mst        375   341   49    4    769      123   102   8     1    234        388        60.30    30.42       40.44
Hindi      Malt-Turbo-Mst   206   123   150   33   512      161   59    103   17   340        468        72.64    66.40       69.38
Hindi      Malt-Turbo       311   33    179   26   549      213   20    118   11   362        468        77.35    65.93       71.18
Hindi      Malt-Mst         233   103   186   28   550      179   44    114   13   350        468        74.78    63.63       68.76
Hindi      Turbo-Mst        246   115   187   33   581      174   45    113   15   347        468        74.14    59.72       66.15

Table 4.2 Simple strategy error detection results

4.1.3.1 Simple ensemble method

Various experiments were conducted as per the approach detailed in Section 4.1.1.1. Results are reported for all three parsers and also for different combinations of parser pairs. Table 4.2 shows the results of these experiments.

As shown in table 4.2, most of the Type 1 errors are gold errors, which supports our hypothesis that parsers agree only if an annotation is consistent enough in the treebank. Thus, the result favors the ensemble of parsers for error detection. Another observation is that some of the false positives in the set of Type 1 errors are very close cases; even for experts, some of these cases are hard to annotate correctly. Accurate annotation of these cases needs a thorough understanding of the annotation guidelines and also a deeper understanding of the language under study. One of the major issues in the creation of a treebank for any language is evaluating the annotators' understanding and training them on critical cases as per the annotation guidelines. The false positive cases found by our approach can be used to train and evaluate the annotators.

Treebank   Parser   LAS     UAS     LAcc
Hindi      Malt     78.23   90.74   80.77
Hindi      MST      65.68   86.68   68.02
Hindi      Turbo    77.23   89.88   79.78
Telugu     Malt     72.94   97.22   73.60
Telugu     MST      66.58   95.78   67.48
Telugu     Turbo    65.46   94.16   66.27

Table 4.3 Parsing Results


In the case of Type 2 errors, the number of gold errors found by our approach is not high. The observation of Type 2 false positives makes it clear that the validation time for these cases is low, as experts can easily ignore the unintuitive errors detected. However, the number of gold errors in Type 2 may vary for different corpora, depending on the annotators' understanding. For our dataset, the Type 2 false positives resulted from data sparsity; this type of error badly affects the precision of the system. Type 3 and Type 4 errors are detected based on the attachments. In dependency annotation, annotators are better at annotating attachments than labels. Supporting this intuition, attachment errors (Type 3 and Type 4) are fewer than label errors in the results shown. A Type 1 error example from the Telugu treebank is shown below. The words or lexical items of the sentence are in WX notation, as detailed in chapter 1. In this example, all the parsers and the human annotation agree on the same attachment for token 2. The dependency label assigned by all three parsers is 'vmod', but the human annotated label for the same attachment is 'k2'. Hence, as per our classification method, we identify this instance as a Type 1 error.

Human Annotation

1 vAdini vAdu PRP cat-pn|gen-m|num-sg|pers-3|case-|vib-nu|tam-nu 2 k2

2 pattukovadaM pattuko VM cat-v|gen-any|num-any|pers-any|case-|vib-adaM|tam-adaM 4 k2

3 nAcewa nenu PRP cat-pn|gen-any|num-sg|pers-1|case-|vib-ce|tam-ce 4 k1

4 kAlexu avvu VM cat-v|gen-any|num-any|pers-any|case-|vib-a lexu|tam-a lexu 0 main

Parsers Outputs

Malt 2 pattukovadaM pattuko VM 4 vmod 0.552

MST 2 pattukovadaM pattuko VM 4 vmod 0.617

Turbo 2 pattukovadaM pattukovadaM v 4 vmod

The results on the Hindi and Telugu treebanks shown in table 4.2 support various additional hypotheses and intuitions. Firstly, better results for Hindi, as compared to the Telugu treebank, are expected due to the stable and mature nature of the Hindi treebank. The parsing results listed in table 4.3 show that MaltParser and TurboParser have better parsing accuracy than MSTParser. On stacking all three parsers, the precision values for Telugu (32.18) and Hindi (66.40) are higher than for any combination of two parsers. This aligns with our intuition that all three parsers contribute towards accurate error detection, as each parser has different abilities in different parsing scenarios. The recall of error detection is higher for any two-parser combination than for all three parsers, which essentially means that stacking more parser outputs improves the precision of the detected errors at the cost of recall. Ideally, we need a high recall, as we can trade off a bit on precision. We also notice that the overall error detection results depend on the quality of the parsers: the performance values (precision, recall, and F-score) decrease as we take less confident parsers. For example, the Malt-Turbo combination gives better results than MST-Turbo, which is a relatively less confident pair than any of the other parser pairs.


Treebank   Parsers and Rank     T1    T2    T3   T4   Errors   D1   D2   D3   D4   Detected   Existing   Precision   Recall   F-score
Hindi      Malt-Turbo-Mst-A     105   21    82   7    215      89   13   60   4    166        197        77.21       84.26    80.58
Hindi      Malt-Turbo-Mst-B     49    27    32   8    116      34   14   23   4    75         112        64.66       66.96    65.79
Hindi      Malt-Turbo-Mst-C     21    24    12   7    64       14   13   9    4    40         66         62.5        60.61    61.54
Hindi      Malt-Turbo-Mst-D     31    51    24   11   117      24   19   11   5    59         93         50.43       63.44    56.20
Telugu     Malt-Turbo-Mst-A     51    1     2    1    55       21   0    0    1    22         34         40.0        64.71    49.44
Telugu     Malt-Turbo-Mst-B     41    6     8    0    55       18   1    2    0    21         34         38.19       61.77    47.19
Telugu     Malt-Turbo-Mst-C     61    44    3    0    108      17   15   1    0    33         63         30.56       52.38    38.60
Telugu     Malt-Turbo-Mst-D     139   293   16   2    450      52   84   2    1    139        257        30.89       54.09    39.32

Table 4.4 Ranking based on the confidence


4.1.4 Ranking Based on Confidence Scores

The objective of ranking these errors is to help in the process of correction. In a resource-poor environment, the detected errors with the highest precision and recall (F-score) are corrected first, so that the correction derives maximum benefit from the proposed error detection tool. As the empirical results show, the detected A-class errors have the highest precision and recall values in both the Telugu and the Hindi treebanks.

The empirical results in Table 4.4 show the F-scores for the ranked errors. A pie chart showing the various classes of errors for the Hindi and Telugu treebanks is shown in Figure 4.4; A-class errors are relatively more numerous than the other classes. Ranking of the detected errors is important for developing treebanks where expert resources are scarce: in such situations, the accurately detected errors have to be corrected first. Figure 4.4 also shows the number of errors detected for each class. We observe that the A-class results (precision and recall) are better than those of the other classes for both the Hindi and the Telugu treebanks.

4.1.5 Experiments based on Machine Learning methods

We used the CSSVM classifier for classification based on the features extracted as described in Section 4.1.1.4. The two classes are the positive (anomalies) and negative (normal) classes. Since the cost-sensitive SVM classifier requires a weight to be defined for each class, we learned the weights by running experiments on the development dataset, varying the weight of the positive class from 2 to 15 while keeping the weight of the negative class at 1. The F-score of the classifier was maximal at a positive class weight of 7. The other two parameters, gamma and alpha, were set to 0.5 based on experiments over the search space (0.1, 0.1) to (1, 1). We experimented on the Hindi and Telugu treebank datasets. For the Hindi treebank, 40k tokens are used for training, 15k tokens as test data and 5k for development. The Telugu treebank dataset is divided into 25k, 10k and 5k tokens for training, testing and development respectively.


Treebank   Precision   Recall   F-score
Hindi      53.46       75.0     62.42
Telugu     27.81       73.11    40.29

Table 4.5 CSSVM ML classifier results

Approach                                         Precision   Recall   F-score
FBSM                                             6.20        18.74    9.31
PBSM                                             24.05       57.06    33.83
PBSM (increased training data)                   0           0        0
EPBSM                                            28.88       73.42    41.45
EPBSM (increased training data)                  29.40       77.34    42.60
Overall hybrid system (rules and statistical)    30.31       78.64    43.75
Overall hybrid system (increased data)           30.50       81.49    44.38

Table 4.6 Prior art

The results shown in table 4.5 are comparable with the simple ensemble results shown in table 4.2. In general, machine learning methods rely on the data size and the quality of the features. In spite of the small size of our treebank data, the proposed approaches give good results.

4.1.6 Comparison with the prior art

Early contributions to error detection in Indian language treebanks include [6], [4] and [2]; these works are detailed in chapter 2. They present various error detection systems such as the frequency based statistical module (FBSM), where the frequencies of given child-parent relations, along with some features, are fed to a maximum entropy method to compute entropy scores, and the least frequent cases are classified as possible errors. Rule based systems are based on language-specific rules in which one can put constraints on the word, the POS tag and the existing dependency labels of parent and child. Later, [6], [4] and [2] replaced the frequency values with probability values and proposed the probability based statistical module (PBSM). They extended the PBSM with the rule based system to correct some of the errors before computing the probabilities for the statistical module. The proposed extended probability based system (EPBSM) is combined with the rule based system to obtain a hybrid error detection system. The results of the various proposed systems are shown in table 4.6.

Based on the results shown in Table 4.6, we can see that the simple ensemble method outperforms all these systems (FBSM, PBSM, and EPBSM) without using any extra language-specific rules. The F-score of the EPBSM system, which is considered the best statistical module, is lower than that of our simple ensemble approach. Similar trends hold for the precision and recall values as well, again without using any language-specific rules.


4.1.7 Conclusions and summary

In this chapter, we proposed three approaches for error detection in annotated corpora, namely (a) the simple ensemble method, (b) ranking based on the confidence score, and (c) the ML classifier. Empirically, we demonstrated the effectiveness of these methods by presenting their error detection performance in various experiments. Furthermore, we compared them with several existing tools in the literature.


Chapter 5

Error Correction

5.1 Introduction

We presented novel methods for dependency error detection in annotated corpora, as detailed in chapters 3 and 4. The detected errors are manually validated and corrected by experts to enhance the quality of the corpora. Since experts' time is costly, we need auto correction tools so that experts can spend their time on the complex error corrections. In this chapter, we try out various possibilities to correct the detected errors and also propose a method to auto correct the annotated corpora, bypassing the intermediate detection step. To the best of our knowledge, this is the first attempt to correct annotated corpora of Indian languages. Observation of the dependency errors detected in annotated corpora tells us that the errors arise firstly from ambiguity among the tags when choosing an appropriate tag in a given context; as more context is taken into account, the ambiguity in selecting the right tag diminishes. The other major part of these errors consists of human errors, which are due to annotator bias, boredom, and a lack of understanding of the annotation schema. Some of these errors are easily correctable based on the predictions of tools that model the consistency of annotation in the corpora.

The main motivation for automatic error correction tools is to complement and assist the process of manual validation of the detected errors by experts. Such tools offer a quick way to enhance the quality of the corpus, but there are some major issues with auto correction as well. Firstly, it modifies the original annotation information whenever the auto correction is not 100 percent accurate, which is usually the case. Secondly, since auto correction tools use the consistency properties of the corpora to correct errors, they may induce new errors that are consistent with the corpora and difficult to eliminate later. The first issue can be addressed by maintaining the original annotations in different versions of the corpora or by adding the corrected information under a new tag. As the purpose of auto corrected corpora is mainly to assist the manual validation and correction process, these issues are not a problem as long as we do not treat the fully auto corrected corpus as a gold standard. The major goal of auto correction is to distinguish between the errors that the tool can accurately auto correct and those that need human intervention. Early works on auto correction were proposed by [24] and [26], which correct POS and dependency errors in the corpora respectively.

5.2 Error Correction Methods

In this section, we analyze the correction of dependency errors in Indian language dependency treebanks. As detailed in chapter 1, Indian languages have relatively free word order compared to positional languages like English; hence, directional or positional features are of little help in learning the syntactic patterns of the language. Dependency error correction involves detecting and correcting the attachment and label errors in annotated corpora. In this chapter, we propose two different approaches. Firstly, we pool the parser outputs and apply a heuristic to auto correct the dependency label errors in the corpora, and we later extend this method to improve the accuracy of correction by using the confidence scores available for the parsed outputs. Secondly, we address the correction problem as a classification problem using simple contextual features of the data. To evaluate these methods, we used the test data set for which we have gold standard annotations. As seen from the results presented in chapter 4 under section 4.1.1.1, attachment errors are rare compared to label errors in dependency treebank annotation; even in automatically parsed corpora, attachment errors are fewer, as can be observed from the parsing results presented in table 4.3. The proposed approaches are as follows.

5.2.1 Correction based on parsers outputs

After detecting the errors using the parser outputs as detailed in chapter 4, we extended these methods to automatically correct the dependency errors in the corpora. As stated earlier in chapter 4 under section 4.1, instances where all the parsers agree on a dependency label that differs from the human annotation are detected as Type 1 errors. To correct these errors, we assign the dependency label on which all three parsers agreed. The intuition here is that the parsers model the consistency of annotation in the data very well; hence, if all three parsers, which have different mathematical modeling capabilities, assign the same label, it is a good candidate for correction. Since Type 2 errors are due to ambiguity, we cannot use this information to correct them.

Later, the proposed method is extended using the confidence scores of the parsers to correct the errors more accurately. This method is based on the hypothesis that a higher confidence score indicates less confusion for a given instance. Hence, we correct the erroneous instances where both the Malt and MST parsers have low confusion and also agree on the same label as given by TurboParser. This constraint further increases the accuracy of the corrected errors. We used the same thresholds on the Malt and MST parsers to identify confident parses as listed in chapter 4 under section 4.1.1.2.
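The two correction heuristics can be sketched as follows, assuming each edge comes with the three parser labels and the Malt/MST confidence scores; the thresholds are the ones from Section 4.1.1.2, and the function shape is illustrative.

```python
TH_MALT, TH_MST = 1.5, 0.5

def correct_label(human_label, malt, mst, turbo, use_confidence=False):
    """Return the (possibly corrected) label for one dependency edge.

    malt and mst are (label, confidence) pairs, turbo is a label only
    (no confidence score is computed for TurboParser).
    """
    malt_label, malt_conf = malt
    mst_label, mst_conf = mst
    parsers_agree = malt_label == mst_label == turbo
    if not parsers_agree or malt_label == human_label:
        return human_label                       # nothing to correct
    if use_confidence and (malt_conf < TH_MALT or mst_conf < TH_MST):
        return human_label                       # parsers not confident enough
    return malt_label                            # all three parsers agree on a different label
```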


5.2.2 Learning Models Approach

In this approach, we cast the error correction problem as a classification problem in which we try to identify the problematic scenarios and predict the correct label. This is different from the dependency parsing problem, in which the entire sentence is parsed and labels are then assigned, as detailed in [58], [48] and [47]. Here, we represent the dependency data as child-parent pairs of words and assign a dependency relation label to each pair. We stack local context features such as POS tags and vibhakti values to accurately predict the label. Since we are not concerned with attachment errors, as described earlier, we need not consider the internal sentence structure. We use the memory-based learning (MBL) method to learn and classify the labels. Memory-based learning [22] is based on the notion that keeping all training instances in memory will aid the classification task. MBL has been used for a variety of classification tasks, including part-of-speech tagging, base noun phrase tagging, and parsing (see, e.g., [21], [42]). Little abstraction is done in storing the training data, and the test data is classified based on its similarity to the examples in memory, using a similarity metric.

We used local context-aware features such as the actual lexical items, root words, POS tags and vibhakti tags, and tried various combinations of these features to find the nearest correct dependency label. The results of the experiments are detailed in section 5.3.
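A sketch of the memory-based classification follows, using a generic k-nearest-neighbour classifier from scikit-learn as a stand-in for TiMBL (k = 2, as in the experiments); the feature values and one-hot encoding are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Each instance is a child-parent pair with local context features; the
# target is the dependency relation label. Values here are illustrative only.
train = [
    ({"child": "rAma", "parent": "KAwA", "child_pos": "NNP", "parent_pos": "VM", "vib": "0"}, "k1"),
    ({"child": "seba", "parent": "KAwA", "child_pos": "NN", "parent_pos": "VM", "vib": "0"}, "k2"),
]
vec = DictVectorizer()
X = vec.fit_transform([feats for feats, _ in train])
y = [label for _, label in train]

# k = 2 mirrors the default TiMBL setting used in the experiments.
knn = KNeighborsClassifier(n_neighbors=2).fit(X, y)
test = vec.transform([{"child": "mohana", "parent": "KAwA", "child_pos": "NNP", "parent_pos": "VM", "vib": "0"}])
print(knn.predict(test))
```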

5.3 Experiments and Discussion

For our experiments on error correction, we used the same Hindi dataset described earlier in table 4.1. To evaluate the error correction results, we define two metrics. The first is changed precision, the ratio between correctly changed instances and total changed instances. The second is correction precision on the test dataset, computed as the ratio between correctly annotated instances and total instances in the test dataset.
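Assuming access to the gold-standard labels of the test set, the two metrics can be written down as the following small sketch; the data structures are illustrative.

```python
def changed_precision(changes):
    """changes: list of (old_label, new_label, gold_label) for edited instances."""
    correct = sum(1 for _, new, gold in changes if new == gold)
    return correct / len(changes)

def corpus_precision(labels, gold_labels):
    """Proportion of instances whose label matches the gold standard
    (used both for the baseline and for the corrected test set)."""
    correct = sum(1 for lbl, gold in zip(labels, gold_labels) if lbl == gold)
    return correct / len(labels)
```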

5.3.1 Correction based on parsers outputs

In this approach, the parsed outputs of all three parsers, obtained using N-fold cross validation, are used to correct the errors. In the first experiment, we corrected the human annotation using the dependency label on which all three parsers agreed. In the extended method, out of the complete set of errors corrected by the earlier method, we corrected only the subset of instances on which the Malt and MST parsers are confident; these confident parses are identified by applying thresholds to the Malt and MST parser confidence scores. The results of these experiments are listed in table 5.1. We can observe that the changed precision is 70.31 percent, and that it increases when the confidence scores are used, in line with our intuition. The correction precision on the test dataset also increases compared to the baseline value of 88.59, and a similar improvement is seen when the confidence score values are used. The small drop in the correction precision on the test dataset after using the confidence scores indicates that the recall on the test dataset decreased.


Method                                          Changed Precision   Correction Precision on test data   Baseline Precision of test data
Malt, MST, Turbo parses                         70.31               92.29                               88.59
Malt, MST, Turbo parses and confidence scores   73.76               91.70                               88.59

Table 5.1 Error correction results based on parsed outputs

Features         Context   Changed Precision   Correction Precision on test data   Baseline Precision of test data
Word             -         29.98               81.96                               88.59
Word, POS        -         32.96               83.62                               88.59
Word, Vib        -         45.18               88.42                               88.59
Word, POS, Vib   -         44.93               88.56                               88.59

Table 5.2 Error correction results based on learning models

This means that the overall number of errors corrected after enforcing the confidence constraint is smaller. The confidence score constraint thus increases the accuracy of the corrected errors at the cost of recall.

5.3.2 Learning Models Approach

In our experiments, we used the default settings of TiMBL, where the value of k is 2. Using N-fold cross validation, we classified and corrected the entire corpus, with N chosen as 6. To evaluate the proposed method, we computed the performance of the classifier in auto correction on the test dataset. All possible combinations of the features were tried to auto correct the entire corpus. Table 5.2 shows the results of the experiments on the Hindi dataset.

The results shown in table 5.2 clearly indicate that using the word pairs without any context results in a huge loss of precision compared to the original corpus. As we can observe from the test dataset results, adding the POS tag information along with the word enhances the precision of the auto correction, but there is still a gap of around 5 percent between the baseline precision and the current result. When we stack the vibhakti feature instead of the POS tag, the results improve further and the precision comes close to the baseline precision. This validates the fact that the vibhakti feature plays a major role in determining the dependency relations in Indian language treebanks. Finally, when we stack all these features, the final precision on the test dataset is close to the original corpus precision.

Considering the pros and cons of the approaches presented, correction based on the parser outputs achieves a good changed precision value, but the coverage of the correction across the corpora is a concern, and the coverage decreases further when we use the confidence scores to enhance the precision of the corrections made. These heuristics are not sufficient to build robust tools that auto correct the entire corpus. We therefore also tried more generic and robust methods based on learning the error patterns in the corpora and correcting them automatically. However, the results obtained using the various features are not enough to push beyond the original precision of the corpus. To improve the precision of the corrected corpus, we further need to learn generic models that constrain the local feature-based models.


5.4 Summary

In this chapter, we try to lay the groundwork for building auto correction tools for Indian languages. These early attempts reveal that the path to building error correction tools differs from that of dependency parsing. The proposed methods try to push auto correction towards the baseline accuracy without using any kind of heuristics.


Chapter 6

Conclusions

In this work, we address the issues in the validation process of annotated corpora. The validation process is time consuming and expensive. Additionally, it needs experts' time rather than annotators', which makes it even more expensive, and it is difficult to find the expert resources to finish the validation process. Since experts' time is expensive and valuable, we need tools that reduce the expert time required to validate annotated corpora by automatically detecting the possible anomalies or errors in the treebank rather than going through the entire corpus. We proposed a solution using existing state-of-the-art parsers to detect the errors in annotated corpora. A closer observation shows that some of the detected errors can actually be corrected automatically using simple heuristics on the parsed outputs; such auto correction tools further reduce the experts' validation time. We therefore explored ways to auto correct the errors in annotated corpora, leveraging the parser outputs to correct the detected errors based on a heuristic, and thereafter experimented with feature-based learning models to address the auto correction of the corpora.

Our approaches are based on the intuition that parsers learn the patterns in the treebank extensively through their inherently complex techniques, and that each parser has different capabilities for learning various aspects of the annotated corpora. In our approaches, we leverage the parser outputs to detect errors in annotated corpora. Firstly, we tried the simple ensemble approach for error detection by comparing the parsers' labels with the human annotation; the intuition is that the parsers agreeing on a label that differs from the human annotation is a possible error instance, while disagreement of all the parsed labels among themselves and with the human annotation points to ambiguity in the instance. Later, we ranked the errors based on the confidence score assigned to each instance in the annotated corpora. Lastly, we proposed an ML classifier approach which classifies each instance in the annotated corpora based on features extracted from the various parser outputs.

The simple ensemble approach gives promising results on both treebanks: the F-score values of the proposed tool on the Telugu and Hindi treebanks are 40.71 and 69.38 respectively. The machine learning classifier results are in line with the simple ensemble approach, and ML classifiers perform better and more accurately as the size of the treebank grows. The ML classifier clearly outperforms the prior work on error detection.


The proposed approaches are language independent, as they do not use any language-specific information.

To auto correct the errors detected in the corpora, we proposed a parsed-output based approach which increased the accuracy on the test data set. Later, we leveraged the confidence scores of the Malt and MST parsers to achieve better precision for the suggested corrections. The major issue with this correction method is that it is not robust enough to correct all the errors, as it implicitly tries to correct only a subset of the detected errors. Hence, we went on to experiment with the context-aware feature-based models approach, in which we try to learn the problematic instances of the corpora.

In the process of error detection, our approaches outperformed the state-of-the-art statistical systems. A major limitation of the error detection method is that the parsers' accuracy determines the quality of the error detection. In the context of auto correction, the proposed approaches are only a starting point, so there is a lot of scope for future work. We experimented with different feature-based models; one could explore how these models can complement each other to push the error correction precision further. Deep learning based models may also yield good results in this context; in particular, unsupervised auto-encoders 1 may help in identifying the problematic cases in annotated corpora.

1https://en.wikipedia.org/wiki/Autoencoder


Related Publications

Annamaneni Narendra, Riyaz Ahmad Bhat, and Dipti Misra Sharma. Ensembling Dependency Parsers for Treebank Error Detection. In Proceedings of the Twelfth Workshop on Treebanks and Linguistic Theories (TLT12), pages 120-132, 2013.


Appendix A

TagSets

A.1 POS Tagset

The following is the list of POS and Chunk tags being used for Hindi treebank (under development).

Sl No. Category Tag name Example

1.1 Noun NN

1.2 NLoc NST

2. Proper Noun NNP

3.1 Pronoun PRP

3.2 Demonstrative DEM

4 Verb-finite VM

5 Verb Aux VAUX

6 Adjective JJ

7 Adverb RB *Only manner adverb

8 Post position PSP

9 Particles RP bhI, to, hI, jI, hA.N, na,

10 Conjuncts CC bole (Bangla)

11 Question Words WQ

12.1 Quantifiers QF bahut, tho.DA, kam (Hindi)

12.2 Cardinal QC

12.3 Ordinal QO

12.4 Classifier CL

13 Intensifier INTF

14 Interjection INJ

15 Negation NEG

16 Quotative UT ani (Telugu), mAne (Hindi)


17 Sym SYM

18 Compounds *C

19 Reduplicative RDP

20 Echo ECH

21 Unknown UNK

A.2 Chunk Tagset

Sl. No Chunk Type Tag Name Example

1 Noun Chunk NP Hindi: ((merA nayA ghara)) NP my new house

2.1 Finite Verb Chunk VGF Hindi: mEMne ghara para khAnA ((khAyA VM)) VGF

2.2 Non-finite Verb Chunk VGNF Hindi: mEMne ((khAte khAte VM)) VGNF ghode ko dekhA

2.3 Infinitival Verb Chunk VGINF Bangla: bindu Borabela ((snAna karawe)) VGINF BAlobAse

2.4 Verb Chunk (Gerund) VGNN Hindi: mujhe rAta meM ((khAnA VM)) VGNN acchA lagatA hai

3 Adjectival Chunk JJP Hindi: vaha laDaZkI hE ((suMdara JJ sI RP)) JJP

4 Adverb Chunk RBP Hindi: vaha ((dhIre-dhIre RB)) RBP cala rahA thA

5 Chunk for Negatives NEGP Hindi: ((binA)) NEGP ((kucha)) NP ((bole)) VG ((kAma)) NP ((nahIM calatA)) VG

6 Conjuncts CCP Hindi: ((rAma)) NP ((Ora)) CCP ((SyAma)) NP

7 Chunk Fragments FRAGP Hindi: rAma (jo merA baDZA bhAI hE) ne kahA.

8 Miscellaneous BLK

For a complete description, see the guidelines: http://ltrc.iiit.ac.in/MachineTrans/publications/technicalReports/tr031/posguidelines.pdf

A.3 Dependency Tag Set

The following is the list of core dependency tags that are being used.


No. Tag Name Tag description Example

1.1 k1 karta (doer/agent/subject) rAma bETA hE

1.2 pk1 prayojaka karta (Causer) Az ne bacce ko KanA KilAyA

1.3 jk1 prayojya karta (causee) mAz ne AyA se bacce ko KAnAKilavAyA

1.4 mk1 madhyastha karta (mediator-causer) mAz ne AyA se bacce ko KAnAKilavAyA

1.5 k1s vidheya karta (karta samanadhikarana) rAma buxXimAna hE

2.1 k2 karma (object/patient) rAma rojZa eka seba KAwA hE

2.2 k2p Goal, Destination rAma Gara gayA

2.3 k2g gauna karma (secondary karma) ve loga gAMXIjI ko bApU BI kahawehEM

2.4 k2s karma samanadhikarana (object complement) rAma mohana ko buxXimAna samaJawA hE

3 k3 karana (instrument) rAma ne cAkU se seba kAtA

4.1 k4 sampradaana (recipient) rAma ne mohana ko Kira xI

4.2 k4a anubhava karta (Experiencer) muJako rAma buxXimAna lagawA hE

5.1 k5 apaadaana (source) rAma ne cammaca se katorI se KiraKAyI

5.2 k5prk prakruti apadana (‘source material’ in verbs denoting change of state) jUwe camade se banawe hEM

6.1 k7t kaalaadhikarana (location in time) rAma xilli meM rahawA hE

6.2 k7p deshadhikarana (location in space) mejZa para kiwAba hE

6.3 k7 vishayaadhikarana (location elsewhere) ve rAjanIwi para carcA kara rahe We

7 k*u saadrishya (similarity) [[k1u]] rAXA mIrA jEsI sunxara hE

8.1 r6 shashthi (possessive) sammAna kA BAva

8.2 r6-k1, karta or karma of a conjunct [r6-k1] kala manxira kA uxGAtanahuA

8.3 r6v (’kA’ relation between a noun and a verb) rAma ke eka betI hE

9 adv kriyaavisheshana (‘manner adverbs’ only) vaha jalxI jalxI liKA rahA WA

10 sent-adv Sentential Adverbs isake alAvA, BakaPA (mAovAxI) kerAmabacana yAxava ko giraPZawArakara liyA gayA

11 rd prati (direction) sIwA gAzva kI ora jA rahI WI


12 rh hetu (cause-effect) mEne mohana kI vajaha se kiwAbaKArIxI

13 rt taadarthya (purpose) mEne mohana ke liye kiwAba KArIxI

14.1 ras-k* upapada sahakaarakatwa (associative) rAma apane pIwAji ke sAWa bAjZAragayA

14.2 ras-neg Negation in Associatives rAma pIwAjI ke binA gayA

15 rs relation samanadhikaran (noun elaboration) bAwa yaha hE ki vo kal nahIM AyegA

16 rsp relation for duratives 1990 se lekara 2000 waka BArawa kIpragawi wejZa rahI

17 rad Address words mAz muJe kala xillI jAnA hE

18 nmod relc, jjmod relc, rbmod relc Relative clauses, jo-vo constructions merI bahana [jo xillI meM rahawI hE] kala A rahI hE

19 nmod Noun modifier (including participles) pedZa para bETI cidZiyA gAnA gArahI WI

20 vmod Verb modifier vaha KAwe hue gayA

21 jjmod Modifiers of the adjectives halkI nIlI kiwAba

22 pof Part of relation rAma ravi kI prawIkSA kara rahA WA.

23 ccof Conjunct of relation rAma seba KAwA hE Ora sIwA xUXapIwI hE

24 fragof Fragment of BAkaPA (mAovAxI) ke rAmabacanayAxava ko giraPZawAra kara liyAgayA

25 enm Enumerator Apa apanA kara samaya se xe sakawehEM

For a complete description, see: http://ltrc.iiit.ac.in/MachineTrans/research/tb/DS-guidelines/DS-guidelines-ver2-28-05-09.pdf


Bibliography

[1] R. Agarwal, B. R. Ambati, and D. M. Sharma. A hybrid approach to error detection in a treebank and its

impact on manual validation time. Linguistic Issues in Language Technology, 7(1), 2012.

[2] B. Agrawal, R. Agarwal, S. Husain, and D. M. Sharma. An automatic approach to treebank error detection

using a dependency parser. In Computational Linguistics and Intelligent Text Processing, pages 294–303.

Springer, 2013.

[3] R. Al-Rfou, B. Perozzi, and S. Skiena. Polyglot: Distributed word representations for multilingual NLP. In

Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192,

Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

[4] B. R. Ambati, R. Agarwal, M. Gupta, S. Husain, and D. M. Sharma. Error detection for treebank validation.

Asian Language Resources collocated with IJCNLP 2011, page 23, 2011.

[5] B. R. Ambati, P. Gadde, and K. Jindal. Experiments in Indian language dependency parsing. Proceedings

of the ICON09 NLP Tools Contest: Indian Language Dependency Parsing, pages 32–37, 2009.

[6] B. R. Ambati, M. Gupta, S. Husain, and D. M. Sharma. A high recall error identification tool for Hindi

treebank validation. In LREC, 2010.

[7] J. P. Baker. Consistency and accuracy in correcting automatically tagged data. Garside et al, pages 243–250,

1997.

[8] R. Begum, S. Husain, A. Dhwaj, D. M. Sharma, L. Bai, and R. Sangal. Dependency Annotation Scheme

for Indian Languages. In IJCNLP, pages 721–726. Citeseer, 2008.

[9] A. Bharati, V. Chaitanya, R. Sangal, and K. Ramakrishnamacharyulu. Natural language processing: a

Paninian perspective. Prentice-Hall of India, New Delhi, 1995.

[10] A. Bharati and R. Sangal. Parsing free word order languages in the Paninian framework. In Proceedings

of the 31st annual meeting on Association for Computational Linguistics, pages 105–111. Association for

Computational Linguistics, 1993.

[11] A. Bharati, R. Sangal, and D. M. Sharma. SSF: Shakti Standard Format guide. Language Technologies

Research Centre, International Institute of Information Technology, Hyderabad, India, pages 1–25, 2007.

[12] A. Bharati, D. M. Sharma, L. Bai, and R. Sangal. AnnCorra: Guidelines for POS and chunk annotation for

Indian languages. Technical report, IIIT Hyderabad, 2006.

47

[13] A. Bharati, D. M. Sharma, S. Husain, L. Bai, R. Begam, and R. Sangal. AnnCorra: Treebanks for Indian languages, guidelines for annotating Hindi treebank, 2009.

[14] R. A. Bhat and D. M. Sharma. Non-projective structures in Indian language treebanks. In Proceedings of the 11th Workshop on Treebanks and Linguistic Theories (TLT11), pages 25–30, 2012.

[15] S. Brants, S. Dipper, S. Hansen, W. Lezius, and G. Smith. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, volume 168, 2002.

[16] X. Carreras. Experiments with a higher-order projective dependency parser. In EMNLP-CoNLL, pages 957–961, 2007.

[17] Y. J. Chu and T.-H. Liu. On the shortest arborescence of a directed graph. Scientia Sinica, 14(10):1396, 1965.

[18] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM, 2008.

[19] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, 1990.

[20] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[21] W. Daelemans, J. Zavrel, K. Van der Sloot, and A. Van den Bosch. TiMBL: Tilburg Memory Based Learner, version 2.0, Reference Guide, 1999.

[22] W. Daelemans, J. Zavrel, K. van der Sloot, and A. Van den Bosch. TiMBL: Tilburg Memory-Based Learner. Tilburg University, 2004.

[23] D. De Kok, J. Ma, and G. Van Noord. A generalized method for iterative error mining in parsing results. In Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks, pages 71–79. Association for Computational Linguistics, 2009.

[24] M. Dickinson. From detecting errors to automatically correcting them. In EACL, 2006.

[25] M. Dickinson. Determining ambiguity classes for part-of-speech tagging. In Proceedings of RANLP-07, Borovets, Bulgaria, 2007.

[26] M. Dickinson. Correcting dependency annotation errors. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 193–201. Association for Computational Linguistics, 2009.

[27] M. Dickinson and W. D. Meurers. Detecting errors in part-of-speech annotation. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics - Volume 1, pages 107–114. Association for Computational Linguistics, 2003.

[28] M. Dickinson and W. D. Meurers. Detecting inconsistencies in treebanks. In Proceedings of TLT, volume 3, pages 45–56, 2003.

[29] M. Dickinson and W. D. Meurers. Detecting errors in discontinuous structural annotation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 322–329. Association for Computational Linguistics, 2005.


[30] J. Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards B, 71(4):233–240, 1967.

[31] J. M. Eisner. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 340–345. Association for Computational Linguistics, 1996.

[32] E. Eskin. Automatic corpus correction with anomaly detection. In Proceedings of NAACL, pages 148–153, 2000.

[33] L. Georgiadis. Arborescence optimization problems solvable by Edmonds' algorithm. Theoretical Computer Science, 301(1):427–437, 2003.

[34] R. Gupta, P. Goyal, and S. Diwakar. Transliteration among Indian languages using WX notation. In KONVENS, pages 147–150, 2010.

[35] N. Habash, R. Gabbard, O. Rambow, S. Kulick, and M. P. Marcus. Determining case in Arabic: Learning complex linguistic behavior requires complex linguistic features. In EMNLP-CoNLL, pages 1084–1092, 2007.

[36] D. Hogan. Coordinate noun phrase disambiguation in a generative parsing model. Association for Computational Linguistics, 2007.

[37] R. Hudson. Word Grammar. Blackwell, Oxford, 1984.

[38] S. Jain and B. Agrawal. A dynamic confusion score for dependency arc labels. In IJCNLP, pages 1237–1242, 2013.

[39] K. Kaljurand. Checking treebank consistency to find annotation errors, 2004.

[40] T. Koo and M. Collins. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11. Association for Computational Linguistics, 2010.

[41] V. Kordoni. Strategies for annotation of large corpora of multilingual spontaneous speech data. In Proc. of Workshop on Multilingual Corpora: Lin. Citeseer, 2003.

[42] S. Kübler. Memory-Based Parsing, volume 7. John Benjamins Publishing, 2004.

[43] P. Kukkadapu, D. K. Malladi, and A. Dara. Ensembling various dependency parsers: Adopting Turbo parser for Indian languages. In 24th International Conference on Computational Linguistics, page 179, 2012.

[44] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[45] A. F. Martins, N. A. Smith, P. M. Aguiar, and M. A. Figueiredo. Dual decomposition with many overlapping components. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 238–249. Association for Computational Linguistics, 2011.

[46] A. F. Martins, N. A. Smith, and E. P. Xing. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1, pages 342–350. Association for Computational Linguistics, 2009.

[47] A. F. Martins, N. A. Smith, E. P. Xing, P. M. Aguiar, and M. A. Figueiredo. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 34–44. Association for Computational Linguistics, 2010.

[48] R. McDonald, K. Crammer, and F. Pereira. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 91–98. Association for Computational Linguistics, 2005.

[49] R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530. Association for Computational Linguistics, 2005.

[50] R. McDonald and G. Satta. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies, pages 121–132. Association for Computational Linguistics, 2007.

[51] R. T. McDonald and J. Nivre. Characterizing the errors of data-driven dependency parsing models. In EMNLP-CoNLL, pages 122–131, 2007.

[52] A. Mejer and K. Crammer. Are you sure? Confidence in prediction of dependency tree edges. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 573–576. Association for Computational Linguistics, 2012.

[53] I. A. Mel'čuk. Dependency Syntax: Theory and Practice. SUNY Press, 1988.

[54] K. P. Mohanan. Grammatical relations and clause structure in Malayalam. The Mental Representation of Grammatical Relations, 504:589, 1982.

[55] K. P. Mohanan and T. Mohanan. Issues in word order in South Asian languages: Enriched phrase structure or multidimensionality. Theoretical Perspectives on Word Order in South Asian Languages, (50):153, 1994.

[56] J. Nivre. Inductive Dependency Parsing of Natural Language Text. PhD thesis, School of Mathematics and Systems Engineering, Växjö University, 2005.

[57] J. Nivre. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553, 2008.

[58] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135, 2007.

[59] J. Nivre, J. Hall, J. Nilsson, G. Eryigit, and S. Marinov. Labeled pseudo-projective dependency parsing with support vector machines. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 221–225. Association for Computational Linguistics, 2006.


[60] J. Nivre and J. Nilsson. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 99–106. Association for Computational Linguistics, 2005.

[61] L. Padró and L. Màrquez. On the evaluation and comparison of taggers: the effect of noise in testing corpora. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 997–1002. Association for Computational Linguistics, 1998.

[62] M. Pedersen, D. Eades, S. K. Amin, and L. Prakash. Relative clauses in Hindi and Arabic: A Paninian dependency grammar analysis. In Proceedings of the Twentieth International Conference on Computational Linguistics, Geneva, pages 17–24, 2004.

[63] S. M. Shieber. Evidence against the context-freeness of natural language. Springer, 1985.

[64] K. Singla, A. Tammewar, N. Jain, and S. Jain. Two-stage approach for Hindi dependency parsing using MaltParser. Training, 12041(268,093):22–27, 2012.

[65] D. A. Smith and J. Eisner. Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 145–156. Association for Computational Linguistics, 2008.

[66] H. Van Halteren. The detection of inconsistency in manually tagged text, 2000.

[67] G. Van Noord. Error mining for wide-coverage grammar engineering. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 446. Association for Computational Linguistics, 2004.

[68] Wikipedia. Telugu language — Wikipedia, the free encyclopedia, 2016. [Online; accessed 18-November-2016].


