Language Resources and Tools for the Creation of a Bulgarian Treebank Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff BulTreeBank Project LML, Bulgarian Academy of Sciences (www. bultreebank.org) Workshop on Balkan Language Resources and Tools 2003 21 November 2003 Thessaloniki, Greece
Page 1: Language Resources and Tools for the Creation of a Bulgarian Treebank Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff.

Language Resources and Tools for the Creation of a Bulgarian Treebank

Kiril Simov, Petya Osenova,

Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff

BulTreeBank Project

LML, Bulgarian Academy of Sciences

(www.bultreebank.org)

Workshop on Balkan Language Resources and Tools 2003

21 November 2003 Thessaloniki, Greece

Plan of the talk

• Preliminary Notes

• BulTreeBank Language Resources and Tools

• The integration architecture of the resources and tools

• Conclusion and Future work

Financial Support

BulTreeBank is a joint project between

Seminar für Sprachwissenschaft, Eberhard-Karls-Universität, Tübingen, Germany

and

Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria

The project is funded by the Volkswagen-Stiftung, Germany

Expected Results

• A set of Bulgarian sentences marked up with detailed syntactic information

• A core set of sentences designated inside the treebank

• A linguistically interpreted text archive for Bulgarian

• A reliable partial grammar for automatic parsing of phrases in Bulgarian

• Software modules for compiling, manipulating and exploring the language resources

Preliminary notes (1)

We rely on two prerequisites during the process of our treebank creation:

– integration of the pre-processing components

– an adequate annotation scheme

Preliminary notes (2)

Integration is performed with the help of the following techniques:

– Looking-forward strategy

  • Adaptive mechanism

  • Additive mechanism

– Looking-backward strategy

– Creation of a gold standard

Language Resources

• Text archive

• Morphological dictionary

• Gazetteers

• Valence dictionary

• Semantic dictionary

• Treebank

The BulTreeBank Text Archive

• A collection of linguistically interpreted texts from different genres (target size: 100 million words)

• About 72 million running words are converted into XML documents, marked up in conformance with the TEI guidelines

• 10 million running words are morphologically analyzed

• Over 1 000 000 words are morphosyntactically disambiguated by hand

The morphological dictionary

• Published as a book – Popov, Simov and Vidinska, 1998

• It covers the grammatical information of about 100 000 lexemes (1 600 000 word forms) and serves as a basis for the morphological analyzer

• The problem of the unknown words: open classes (names, abbreviations) and derivational models (diminutives etc)

The Gazetteers

• Gazetteers of names consisting of 15 000 words – Bulgarian and foreign person names, locations from the whole world, organizations, and others

• Gazetteers of the most frequent abbreviations consisting of 1500 acronyms and graphical abbreviations

• Gazetteers of 300 most frequent introductory expressions and parentheticals. This is considered to be a step towards a basic list of collocations

The Valence Dictionary

• It consists of 1000 verbs and their valence frames

• The frames of the most frequent verbs are compared to the corpus data and revised where necessary (new frames added, old ones deleted or made more fine-grained)

• The semantic restrictions over the arguments are extracted and matched against the SIMPLE ontology (see the Semantic Dictionary)

Lexical Entry of the Valence Dictionary

Verb, its transitivity and aspect

Meaning

I. Frame (the arguments that the verb requires)

S(ubject) + P(redicate) + O2(indirect object) | C(lause)

II. Morphology of the verb's arguments

S(ubject)=N,PerPron

III. Semantics of the arguments

S(ubject) is a person

IV. Examples of the verb's usage
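
A lexical entry of this shape could be held in a small record type. The Python sketch below is only an illustration: the class name `ValenceEntry`, its field names, the placeholder verb and the reading of `O2 | C` as two alternatives for the last slot are all assumptions, not the project's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class ValenceEntry:
    """Hypothetical container mirroring the parts of a lexical entry."""
    verb: str
    transitive: bool
    aspect: str
    meaning: str
    # I. Frame: one list per slot, holding the alternatives for that slot.
    frame: list = field(default_factory=list)
    # II. Morphology of the arguments, e.g. {"S": ["N", "PerPron"]}.
    morphology: dict = field(default_factory=dict)
    # III. Semantic restrictions on the arguments, e.g. {"S": "person"}.
    semantics: dict = field(default_factory=dict)
    # IV. Examples of the verb's usage.
    examples: list = field(default_factory=list)

    def allows(self, slot_index, label):
        """Return True if `label` is an accepted alternative for the slot."""
        return label in self.frame[slot_index]


# Placeholder values only; no real dictionary entry is reproduced here.
entry = ValenceEntry(
    verb="VERB",
    transitive=False,
    aspect="imperfective",
    meaning="placeholder meaning",
    frame=[["S"], ["P"], ["O2", "C"]],   # S + P + O2 | C
    morphology={"S": ["N", "PerPron"]},  # S(ubject)=N,PerPron
    semantics={"S": "person"},           # S(ubject) is a person
)
```

Keeping the alternatives of each slot in a list makes it easy to compare a frame against corpus data, since a frame can be made more fine-grained by editing a single slot.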

The Semantic Dictionary

• Classification of the most frequent nouns with respect to the ontological hierarchy of SIMPLE without specifying the synonymic relations between them (3 000 nouns)

• The proper names from the gazetteers are also mapped to the ontological hierarchy of SIMPLE

The Treebank

• Core set of sentences (1 500 sentences) - extracted mainly from Bulgarian grammars and processed manually --> highest quality

• Treebank (6 000 sentences) - extracted mainly from the corpus and pre-processed automatically before being treated manually

Core set of sentences: Example of a Pragmatic Adjunct

A Corpus Sentence: an example of dependents realisation

The Tools

• Morphological analyzer

• Disambiguator(s)

• Partial grammars

– sentence splitter

– named-entity recognition module

– chunkers

Morphological Analyzer

• Assigns all possible analyses to the tokens

• Implemented in CLaRK System as a regular grammar

• Works together with the ‘token classification’ strategy and with the gazetteers
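
The idea of assigning all possible analyses to a token can be sketched as a plain lookup. This is not the CLaRK regular grammar itself: the Python below swaps in a dictionary lookup, and the names `LEXICON`, `GAZETTEER`, `analyze` and the `UNKNOWN` label are assumptions made for illustration. The tags are copied from the annotated examples in this talk.

```python
# Toy lexicon: lower-cased word form -> every candidate tag.
LEXICON = {
    "човек": ["Ncmsi"],
    "с": ["R"],
    "опит": ["Ncmsi", "Vppt+cv--smi"],
    "и": ["C"],
    "богато": ["Ansi", "D"],
    "минало": ["Ansi", "Ncnsi", "Vppt+caosni"],
}

# Toy gazetteer consulted when the lexicon has no entry (assumed fallback).
GAZETTEER = {"българия": ["Ncfsi"]}


def analyze(token):
    """Return all candidate tags for a token, without choosing among them."""
    key = token.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    if key in GAZETTEER:
        return list(GAZETTEER[key])
    # Unknown words (names, abbreviations, derived forms) are only flagged.
    return ["UNKNOWN"]
```

Returning every candidate and leaving the choice to a later step keeps the analyzer and the disambiguator independent of each other.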

Disambiguator(s)

• Rule-based disambiguator - a preliminary version of a rule-based morpho-syntactic disambiguator, encoded as a set of constraints within the CLaRK system --> 80 % coverage

• Neural-network-based disambiguator (Simov and Osenova 2001). Its accuracy is 95.25 % for part-of-speech tagging and 93.17 % for complete morpho-syntactic disambiguation

After the MorphoSyntactic Analysis and Disambiguation

<w><ph>Човек</ph> <aa>Ncmsi</aa><ta>Ncmsi</ta></w>
<w><ph>с</ph> <aa>R</aa><ta>R</ta></w>
<w><ph>опит</ph> <aa>Ncmsi;Vppt+cv--smi</aa><ta>Ncmsi</ta></w>
<w><ph>и</ph> <aa>C</aa><ta>C</ta></w>
<w><ph>богато</ph> <aa>Ansi;D</aa><ta>Ansi</ta></w>
<w><ph>минало</ph> <aa>Ansi;Ncnsi;Vppt+caosni</aa><ta>Ncnsi</ta></w>
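
Markup of this kind can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`. It assumes, judging only from the example, that `aa` holds all candidate analyses separated by `;` and that `ta` holds the selected tag. The wrapping `<s>` element and the function name `read_tokens` are additions made here so that the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# The example above, wrapped in an assumed <s> root element.
FRAGMENT = """<s>
<w><ph>Човек</ph> <aa>Ncmsi</aa><ta>Ncmsi</ta></w>
<w><ph>с</ph> <aa>R</aa><ta>R</ta></w>
<w><ph>опит</ph> <aa>Ncmsi;Vppt+cv--smi</aa><ta>Ncmsi</ta></w>
<w><ph>и</ph> <aa>C</aa><ta>C</ta></w>
<w><ph>богато</ph> <aa>Ansi;D</aa><ta>Ansi</ta></w>
<w><ph>минало</ph> <aa>Ansi;Ncnsi;Vppt+caosni</aa><ta>Ncnsi</ta></w>
</s>"""


def read_tokens(xml_text):
    """Return (form, candidate tags, selected tag) for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("w"):
        form = w.findtext("ph")
        candidates = w.findtext("aa").split(";")
        selected = w.findtext("ta")
        tokens.append((form, candidates, selected))
    return tokens


tokens = read_tokens(FRAGMENT)
# Tokens for which the analyzer proposed more than one tag.
ambiguous = [form for form, candidates, _ in tokens if len(candidates) > 1]
```

In this sentence three of the six tokens are ambiguous, and in each case the selected tag is one of the proposed candidates.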

Named-entity recognition

Based on the information from the gazetteers and on RE rules:

• numerical expressions

• names

• abbreviations

• special symbols
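
A combination of gazetteer lookup and regular-expression rules might look like the following Python sketch. The patterns, the labels `NUM`, `ABBR` and `SYM`, and the three-entry name list are assumptions for illustration (the `NE-Loc` and `NE-Pers` labels are taken from the gazetteer example in this talk). The project's actual rules are written in the CLaRK system and are not reproduced here.

```python
import re

# Assumed patterns, one per category listed above.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?%?")        # 2003, 95,25%
ABBREVIATION = re.compile(r"(?:[^\W\d_]{1,4}\.)+")  # short letter runs with dots
SPECIAL = re.compile(r"[^\w\s]+")                # punctuation and symbols

# Tiny stand-in for the name gazetteers.
NAMES = {"България": "NE-Loc", "Димитър": "NE-Pers", "Калчев": "NE-Pers"}


def classify(token):
    """Return a category label for the token, or None if no rule applies."""
    if token in NAMES:                 # gazetteer information comes first
        return NAMES[token]
    if NUMBER.fullmatch(token):
        return "NUM"
    if ABBREVIATION.fullmatch(token):
        return "ABBR"
    if SPECIAL.fullmatch(token):
        return "SYM"
    return None
```

Checking the gazetteer before the patterns means a listed name is never reclassified by a more general rule.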

After the application of Gazetteers

<np sort="NE-Org">
  <w><ph>Бъдеще</ph><ta>Ncnsi</ta></w>
  <pp>
    <w><ph>за</ph><ta>R</ta></w>
    <w sort="NE-Loc"><ph>България</ph><ta>Ncfsi</ta></w>
  </pp>
</np>

<np sort="NE-Pers"> <w>Димитър</w><w>Калчев</w></np>
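
The `sort` attribute makes the recognised entities easy to collect, including an entity nested inside another one. The Python sketch below assumes a wrapping `<text>` root so that the two fragments parse together. The function names `word_form` and `entities` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two fragments above, wrapped in an assumed <text> root element.
FRAGMENT = (
    '<text>'
    '<np sort="NE-Org"><w><ph>Бъдеще</ph><ta>Ncnsi</ta></w>'
    '<pp><w><ph>за</ph><ta>R</ta></w>'
    '<w sort="NE-Loc"><ph>България</ph><ta>Ncfsi</ta></w></pp></np>'
    '<np sort="NE-Pers"><w>Димитър</w><w>Калчев</w></np>'
    '</text>'
)


def word_form(w):
    """Return the word form, whether or not it is wrapped in <ph>."""
    ph = w.find("ph")
    if ph is not None:
        return ph.text.strip()
    return (w.text or "").strip()


def entities(xml_text):
    """Return (sort, text) pairs for every element carrying a sort attribute."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():           # document order, nested elements included
        sort = elem.get("sort")
        if sort is None:
            continue
        words = [elem] if elem.tag == "w" else list(elem.iter("w"))
        found.append((sort, " ".join(word_form(w) for w in words)))
    return found


result = entities(FRAGMENT)
```

Here the location is reported both on its own and as part of the organisation name that contains it.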

Chunkers: General Assumptions

• Deals with non-recursive constituents

• Relies on a clear-indicator strategy

• Delays the attachment decisions

• Ignores semantic information

• Aims at accuracy, not coverage

Chunkers

• NP chunker

  – after preposition NPs

  – “sure” non-recursive NPs

• VP chunker

  – Analytical wordforms

  – “Da” constructions

  – Verb clitics

• PP chunker, AP chunker, Clausal chunker
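
The "after preposition" case of the NP chunker can be pictured with a short Python sketch. It treats a preposition as a clear indicator and brackets only the adjectives and the noun that follow it directly, leaving everything else unchunked. The function name and the assumption that tags beginning with `R`, `A` and `N` mark prepositions, adjectives and nouns are inferred from the tagged examples in this talk. The real chunkers are CLaRK grammars, not Python code.

```python
def np_after_preposition(tagged):
    """Return (start, end) spans of simple NPs that directly follow a preposition.

    `tagged` is a list of (form, tag) pairs; `end` is exclusive.
    """
    spans = []
    i = 0
    while i < len(tagged):
        if tagged[i][1].startswith("R"):           # clear indicator: preposition
            j = i + 1
            while j < len(tagged) and tagged[j][1].startswith("A"):
                j += 1                             # optional adjectives
            if j < len(tagged) and tagged[j][1].startswith("N"):
                spans.append((i + 1, j + 1))       # bracket adjectives + noun
                i = j + 1
                continue
        i += 1
    return spans


# Disambiguated tokens from the morphosyntactic example in this talk.
sentence = [
    ("Човек", "Ncmsi"), ("с", "R"), ("опит", "Ncmsi"),
    ("и", "C"), ("богато", "Ansi"), ("минало", "Ncnsi"),
]
spans = np_after_preposition(sentence)
chunks = [[form for form, _ in sentence[a:b]] for a, b in spans]
```

The coordinated phrase after the conjunction is deliberately left alone, in line with the aim of accuracy over coverage and with delaying attachment decisions.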

After the application of some Chunk Grammars

• Common NP chunks

  – [един човек] от [града] (‘one man from town-the’)

• Name NP chunks: NEpers, NEloc etc.

  – [Министерство на културата] (‘Ministry of Culture’)

• Complex NP chunks

  – [нашето [Министерство на културата]] (‘our Ministry of Culture’)

• Analytical verb forms

  – [да [му я даде]] (‘to him her give-3p, sg’), i.e. ‘to give it to him’

Integration of the resources and tools

• The order of application

• Mutual dependence

• Quantitative and qualitative expansion

The principle of cascadedness
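
Cascadedness can be pictured as a fixed sequence of stages, each consuming the output of the previous one. The Python sketch below is a toy under stated assumptions: the stage names, the three-stage order and the rule of resolving only unambiguous tokens are illustrative choices, and the small lexicon reuses tags from the annotated example in this talk.

```python
# Toy lexicon: lower-cased word form -> candidate tags.
LEXICON = {"човек": ["Ncmsi"], "с": ["R"], "опит": ["Ncmsi", "Vppt+cv--smi"]}


def tokenize(text):
    """Stage 1: split on whitespace and create one record per token."""
    return [{"form": form} for form in text.split()]


def analyze(tokens):
    """Stage 2: attach all candidate tags (depends on stage 1 output)."""
    for tok in tokens:
        tok["aa"] = LEXICON.get(tok["form"].lower(), [])
    return tokens


def resolve_unambiguous(tokens):
    """Stage 3: select a tag only when exactly one candidate exists."""
    for tok in tokens:
        tok["ta"] = tok["aa"][0] if len(tok["aa"]) == 1 else None
    return tokens


CASCADE = [tokenize, analyze, resolve_unambiguous]


def run_cascade(stages, data):
    """Apply the stages in order, feeding each one the previous result."""
    for stage in stages:
        data = stage(data)
    return data


result = run_cascade(CASCADE, "Човек с опит")
```

Because each stage only adds information to the records it receives, a stage can be improved or replaced without touching the ones before it.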

Conclusion

• We described a set of basic language resources which are necessary for the creation of a Bulgarian treebank

• We outlined our tasks in the context of a ‘less-processed’ language (variety and flexibility of LRs and tools)

• It was shown that the creation of one type of resource (in our case - the treebank) can prompt the successful creation of other types of resources

Future tasks

• to use the LRs and tools as separate modules for applications such as Information Retrieval and Information Extraction

• to extend the basic language resources into a more elaborate set, richer in information and relations

• to continue testing and validating the resources

• to invest more in their evaluation

