BUILDING BULGARIAN NooJ RESOURCES

Date post: 14-Jan-2016
Page 1: BUILDING BULGARIAN NooJ RESOURCES

BUILDING BULGARIAN NooJ RESOURCES

SVETLA KOEVA

SVETLOZARA LESEVA

BORISLAV RIZOV

Page 2: BUILDING BULGARIAN NooJ RESOURCES

The project: Automatic information extraction based on semantic relations (RILA, a bilateral co-operation programme)

Objectives:

Reliable (exhaustive and precise) multilingual lexical resources for a variety of purposes, such as machine translation, information extraction and information retrieval.

Page 3: BUILDING BULGARIAN NooJ RESOURCES

Prerequisites for carrying out such a task:

Large-coverage linguistic resources such as comprehensive multilingual and monolingual dictionaries (designed according to certain criteria and stored in a format that ensures accessibility and manageability).

Ancillary (esp. disambiguation and recognition) resources.

An appropriate system for the storage and management of multilingual linguistic data, as well as for the implementation of task-related procedures.

Page 4: BUILDING BULGARIAN NooJ RESOURCES

Methodology

Systematization and unification of the existing INTEX resources, as well as their conversion into the established NooJ format.

Expansion and enhancement of the resources, aiming at ever higher precision and recall.

Creation of various new resources using the experience, resources and tools developed along the first two lines.
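
The precision and recall targets mentioned above can be made concrete with a small sketch. The data below are hypothetical toy items, not results from the project; the function only shows how the two measures are usually computed for a set of recognized items against a hand-checked list.

```python
def precision_recall(recognized, gold):
    """Return (precision, recall) of recognized items against a gold set."""
    recognized, gold = set(recognized), set(gold)
    true_positives = len(recognized & gold)
    # Precision: share of recognized items that are correct.
    precision = true_positives / len(recognized) if recognized else 0.0
    # Recall: share of correct items that were recognized.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: compounds proposed by a dictionary vs. a checked list.
recognized = {"bear market", "bull market", "stock swap", "market the"}
gold = {"bear market", "bull market", "stock swap", "hedge fund", "futures contract"}

precision, recall = precision_recall(recognized, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Three of the four proposed items are correct and three of the five expected items are found, so enlarging a dictionary tends to raise recall while spurious entries lower precision.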

Page 5: BUILDING BULGARIAN NooJ RESOURCES

Conversion of the lexical resources in DELA format to the .nod format:

Conversion of the BGD (Bulgarian Grammar Dictionary) [1] automata underlying the DELAF dictionaries to the .flx automata description.

Creation of automata for the existing dictionaries of compounds, since they have been stored in DELACF format.

[1] Koeva, S. Grammar Dictionary of Bulgarian. Description of the concept of organization of the linguistic data. Bulgarian Language 6, pp. 49-58.
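
One way to picture the conversion step is to read inflected-form entries and derive, for each lemma, the suffix operations that produce its forms. The sketch below is an illustration under stated assumptions, not the authors' converter: the entry layout `form,lemma.CODE:features` follows the general DELAF convention, the feature codes `s` and `p` are placeholders, and the `<B>` (delete last character) and `<E>` (empty string) strings are only modelled loosely on NooJ's operators. The real .flx syntax should be checked against the NooJ manual.

```python
import os
from collections import defaultdict


def parse_delaf_line(line):
    """Split a DELAF-style entry 'form,lemma.CODE:features' into its parts."""
    form, rest = line.strip().split(",", 1)
    lemma_and_code, _, features = rest.partition(":")
    lemma, _, code = lemma_and_code.partition(".")
    feature_list = features.split(":") if features else []
    return form, lemma, code, feature_list


def suffix_command(lemma, form):
    """Describe how to turn the lemma into the form: deletions, then a suffix."""
    prefix_len = len(os.path.commonprefix([lemma, form]))
    deletions = len(lemma) - prefix_len
    return "<B>" * deletions + form[prefix_len:] or "<E>"


def build_paradigms(delaf_lines):
    """Group the suffix commands of all forms under their (lemma, code)."""
    paradigms = defaultdict(list)
    for line in delaf_lines:
        form, lemma, code, features = parse_delaf_line(line)
        paradigms[(lemma, code)].append(
            (suffix_command(lemma, form), "+".join(features))
        )
    return dict(paradigms)


# Hypothetical sample entries (feature codes are placeholders).
SAMPLE = [
    "книга,книга.N:s",
    "книги,книга.N:p",
    "приятел,приятел.N:s",
    "приятели,приятел.N:p",
]

paradigms = build_paradigms(SAMPLE)
for (lemma, code), commands in paradigms.items():
    print(lemma, code, commands)
```

Lemmas that end up with the same list of commands share an inflection type, which is what allows one automaton to serve many dictionary entries.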

Page 6: BUILDING BULGARIAN NooJ RESOURCES

Conversion of the INTEX graphs into the NooJ format:

Preprocessing graphs:
    Compound conjunctions graphs.
    Abbreviations and elision graphs (with possible treatment in a dictionary), etc.

Recognition graphs developed along tasks involving automatic treatment of syntactic phenomena.

Page 7: BUILDING BULGARIAN NooJ RESOURCES

Expanding the compound-word dictionaries with new entries in a systematic way (covering large and diverse areas of the lexicon's inventory of compounds).

Establishing the resources to be used:
    The available specialised on-line dictionaries.
    The lexical-semantic database: the Bulgarian WordNet.

Developing automata for the inflection types in the established format.

Page 8: BUILDING BULGARIAN NooJ RESOURCES

Specifics:

Restricted paradigms for certain types of compounds (esp. domain-specific terms): pluralia tantum, singularia tantum, count forms, plural endings.

Invariable forms or forms that are not established in the Bulgarian language, esp. ones introduced into the language as transcriptions of mainly English terms (hedge, swap, bear market, bull market, etc.).

Page 9: BUILDING BULGARIAN NooJ RESOURCES

Compound extraction from the above-mentioned resources (enhanced complementarily):

Extraction of thematic compound dictionaries of terms, named entities and other compound lexemes (using semantic relations encoded in the database and employing inheritance for the task).

Employing NooJ as an environment for compound extraction, processing the obtained material with the already designed dictionaries, and encoding the appropriate candidates among the unrecognized tokens.
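
The two extraction steps above can be sketched as follows. This is a minimal illustration on a hypothetical toy fragment of a wordnet-like database; the synset identifiers, the `literals` and `hyponyms` field names and the helper functions are assumptions made for the example, not the structure of the Bulgarian WordNet or a NooJ API. Inheritance is modelled by walking the hyponymy relation down from a thematic root.

```python
from collections import deque

# Hypothetical toy data: each synset lists its literals (words) and its
# hyponyms (more specific synsets).
SYNSETS = {
    "market": {"literals": ["market"], "hyponyms": ["financial_market"]},
    "financial_market": {
        "literals": ["financial market"],
        "hyponyms": ["bear_market", "bull_market"],
    },
    "bear_market": {"literals": ["bear market"], "hyponyms": []},
    "bull_market": {"literals": ["bull market"], "hyponyms": []},
}


def thematic_compounds(root, synsets):
    """Collect multiword literals from the root synset and all its hyponyms."""
    seen, queue, compounds = {root}, deque([root]), []
    while queue:
        synset = synsets[queue.popleft()]
        # A literal containing a space is treated as a compound.
        compounds.extend(lit for lit in synset["literals"] if " " in lit)
        for child in synset["hyponyms"]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return compounds


def unrecognized(candidates, dictionary):
    """Keep the candidates that the existing dictionary does not cover yet."""
    return [c for c in candidates if c not in dictionary]


candidates = thematic_compounds("market", SYNSETS)
new_entries = unrecognized(candidates, {"bear market"})
print(candidates)
print(new_entries)
```

The breadth-first walk keeps the example short; any traversal that visits every hyponym once would serve the same purpose.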

Page 10: BUILDING BULGARIAN NooJ RESOURCES

Page 11: BUILDING BULGARIAN NooJ RESOURCES

Dictionary generation enhancement

Exploring large databases and spotting the inflection types of different head words using the existing automata:
    Using chiefly the Bulgarian WordNet, where head words of compounds are marked unambiguously.
    Using simple syntactic grammars (identifying NPs) to spot head words in the available domain-specific dictionaries of concepts and terms (more comprehensive with regard to the coverage of types of inflection).
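
A rough sketch of the head-word step: find the head of a compound with a very simple NP rule and look up its inflection class in the simple-word lexicon. Everything here is an assumption for illustration: the lexicon entries, the class names (placeholders, not actual BGD paradigm labels) and the rule "the head is the first noun", which only covers patterns such as adjective + noun and noun + preposition + noun.

```python
# Hypothetical simple-word lexicon: word form -> (part of speech, class).
LEXICON = {
    "работна": ("A", "A_CLASS_1"),
    "заплата": ("N", "N_FEM_1"),
    "министър": ("N", "N_MASC_1"),
    "на": ("PREP", None),
    "финансите": ("N", "N_PL_1"),
}


def head_word(compound, lexicon):
    """Return the first noun of the compound, or None if there is none."""
    for token in compound.split():
        pos, _ = lexicon.get(token, (None, None))
        if pos == "N":
            return token
    return None


def head_inflection_class(compound, lexicon):
    """Return (head word, inflection class of the head) for a compound."""
    head = head_word(compound, lexicon)
    return (head, lexicon[head][1]) if head else (None, None)


for compound in ["работна заплата", "министър на финансите"]:
    print(compound, "->", head_inflection_class(compound, LEXICON))
```

Where the head is already marked in the source, as the slide says of the Bulgarian WordNet, the `head_word` guess is unnecessary and only the lookup remains.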

Page 12: BUILDING BULGARIAN NooJ RESOURCES

Page 13: BUILDING BULGARIAN NooJ RESOURCES

Recognition enhancement

Development of morphological grammars covering certain classes of words not currently present in any dictionary, provided the source words are in the dictionary:

Personal feminine nouns: приятел (friend) - приятелка (girl friend)

Diminutive nouns: детенце (a small child), кученце (a small dog), etc.

Verbal nouns, etc.
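
The idea behind such morphological grammars can be sketched in a few lines: strip a derivational suffix and accept the word only if the remaining base is already in the dictionary. The two rules and the tiny dictionary below are assumptions built from the slide's own examples; a real grammar would need further constraints, since plain suffix stripping overgenerates.

```python
# Hypothetical dictionary of source words: lemma -> part of speech.
DICTIONARY = {"приятел": "N", "дете": "N", "куче": "N"}

# (suffix found on the derived word, label of the derived class)
DERIVATION_RULES = [
    ("ка", "personal feminine noun"),
    ("нце", "diminutive noun"),
]


def analyse(word, dictionary=DICTIONARY, rules=DERIVATION_RULES):
    """Return (base, label) pairs for every rule whose base is a known noun."""
    analyses = []
    for suffix, label in rules:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            # Accept the derivation only if the source word is in the dictionary.
            if dictionary.get(base) == "N":
                analyses.append((base, label))
    return analyses


for word in ["приятелка", "детенце", "кученце", "приятел"]:
    print(word, "->", analyse(word))
```

Requiring the base to be a dictionary noun is what keeps unknown strings that merely end in one of the suffixes from being recognized.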

Page 14: BUILDING BULGARIAN NooJ RESOURCES

Page 15: BUILDING BULGARIAN NooJ RESOURCES

Present-day and future directions:

Information retrieval, machine translation, etc.

Facilitating linguistic tasks by supplying the prerequisites (large resources as input data) for the exploration of linguistic phenomena and the validation of linguistic hypotheses on language material.

Education (facilitating the acquisition of knowledge and skills in NLP).
