
Hamburg, 22-11-2004 [email protected]

The Basic Language Resources Kit (BLARK)

Steven Krauwer

Utrecht Institute of Linguistics UiL OTS / ELSNET


Overview

• The BLARK Enterprise

• How to arrive at it

• The Dutch Language Union approach

• Refining the concept

• Defining a BLARK

• Main beneficiaries

• References

• Concluding remarks


The BLARK Enterprise

• Define the minimal set of language resources that is necessary to do any precompetitive R&D and professional education at all for a language (the Basic Language Resource Kit or BLARK)

• Determine for each language which components are already available

• Make a priority plan to complete the BLARK for each language

• Ensure funding to get the work done


What are the components of a BLARK?

• Lexicons (monolingual, multilingual, …)

• Corpora (language, speech; annotated, unannotated; mono- and multilingual; mono- and multimodal; …)

• Tools (annotation, exploration, …)

• Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …)

• …


What makes the BLARK Enterprise special?

• The idea is to make a common generic BLARK definition, in principle applicable to all languages

• The common definition will be based on the experience with different languages, and will prevent reinvention of wheels

• The common definition will ensure interoperability and interconnectivity (especially for multilingual or cross-lingual applications)


Other benefits

• Experience from other languages will help in making cost estimates

• Adoption of a BLARK common to all languages may help in persuading funders to support the creation of the BLARK

• Adoption of a common BLARK may facilitate porting of knowledge and expertise between languages


Words of caution

• A BLARK definition will evolve over time, as new applications, application environments and technologies emerge

• A BLARK definition should be seen as a template rather than a dictate, as different languages may have different specific requirements

• BLARK completion priorities may differ from language to language (on e.g. economic, social or political grounds)


How to define a BLARK and assign priorities

• Methodology proposed by the Dutch Language Union [DLU] (Binnenpoorte et al., LREC 2002):

– Identify a number of typical applications

– Determine for each of them which technologies (modules) are needed to build them, rated on the scale -, +, ++, +++

– Identify for each module which resources it requires, rated on the same scale (-, +, ++, +++)

– Assign the highest priority to the resources that support most applications
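
The four steps above can be sketched as a small aggregation over two rating tables. This is an illustration only, not the DLU's actual procedure: the application, module and resource names are hypothetical, the mapping of -, +, ++, +++ to 0..3 is an assumption, and so is the tie-breaking weight (module rating times resource rating).

```python
from collections import defaultdict

# Assumed numeric reading of the rating scale; the slides only give the symbols.
RATING = {"-": 0, "+": 1, "++": 2, "+++": 3}

# Hypothetical example data: application -> {module: rating}
APP_MODULES = {
    "dictation": {"speech recognizer": "+++", "lemmatizer": "+"},
    "information retrieval": {"lemmatizer": "++", "parser": "+"},
    "machine translation": {"parser": "+++", "lemmatizer": "++"},
}

# Hypothetical example data: module -> {resource: rating}
MODULE_RESOURCES = {
    "speech recognizer": {"speech corpus": "+++", "lexicon": "++"},
    "lemmatizer": {"lexicon": "+++", "annotated corpus": "+"},
    "parser": {"treebank": "+++", "lexicon": "++"},
}


def rank_resources(app_modules, module_resources):
    """Rank resources by the number of applications they support.

    Ties are broken by a summed weight (module rating * resource rating),
    which is an assumption of this sketch, and then alphabetically.
    """
    supported = defaultdict(set)  # resource -> applications it supports
    weight = defaultdict(int)     # resource -> accumulated weight
    for app, modules in app_modules.items():
        for module, module_rating in modules.items():
            m = RATING[module_rating]
            if m == 0:
                continue  # module not needed for this application
            for resource, resource_rating in module_resources.get(module, {}).items():
                r = RATING[resource_rating]
                if r == 0:
                    continue  # resource not needed for this module
                supported[resource].add(app)
                weight[resource] += m * r
    return sorted(
        supported,
        key=lambda res: (-len(supported[res]), -weight[res], res),
    )


if __name__ == "__main__":
    for rank, resource in enumerate(rank_resources(APP_MODULES, MODULE_RESOURCES), 1):
        print(rank, resource)
```

With the toy data, the lexicon comes first because every application reaches it through at least one module; real priorities depend entirely on the ratings a language community assigns.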


Proposed DLU priorities for NLP

1. treebank

2. robust parsers

3. tokenisation and named entity recognition

4. semantic annotations for the treebank

5. translation equivalents

6. evaluation benchmarks


Proposed DLU priorities for speech

1. automatic speech recognition

2. application-specific speech corpora

3. multi-media speech corpora

4. tools for transcription of speech data

5. speech synthesis

6. benchmarks for evaluation


Next steps by DLU

• Make a survey of what exists and to what extent it is available (0-9 availability score)

• Assign priorities (not just resources but also an infrastructure for maintenance and distribution)

• Secure funding from the Dutch and Flemish governments for a national programme

• Issue calls for proposals for collaborative resources projects (1st call closed Nov 2 2004)


Refining the concept

• Items not really covered by the DLU teams:

– definition vs specification

– availability

– quality

– quantity

– standards

– support

• Addressed in the NEMLAR project


Definition / specification

• It is not enough to say ‘a written language corpus’; what about:

– size (types, tokens)

– encoding

– annotation

– text types

– representativity

– domains

• i.e. we need full specs


Availability

• DLU: 0-9 scale, very impressionistic

• Our proposal: 3 dimensions

– accessibility

– cost

– modifiability

• to each we assign a penalty score (0 is best)


Accessibility

• 3 classes, with associated penalties

– (3) existing, but only company-internal

– (2) existing and freely usable for precompetitive research

– (1) existing and freely usable for all R&D


Cost

• 4 cost categories:

– (4) price over 10 keuro

– (3) price between 1 and 10 keuro

– (2) price between 100 and 1000 euro

– (1) less than 100 euro


Modifiability

• 3 categories

– (3) black box: you get the resource as it is, and you cannot change or even inspect its internals

– (2) glass box: you cannot change the resource, but you can see what is inside

– (1) open resources: freely manipulable


Comments on availability

• we can now express availability as a three-digit score (accessibility, cost, modifiability), which should be rather easy to assign objectively

• the lowest scores are the best

• if the accessibility score is 3, the other scores don’t mean very much
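
The three-digit availability score can be sketched as a small record type. This is a minimal illustration under stated assumptions: the names `Availability`, `cost_penalty`, `code` and `is_informative` are hypothetical, and the treatment of prices that fall exactly on a category boundary (100, 1000 or 10 000 euro) is a guess, since the categories as listed do not say.

```python
from dataclasses import dataclass


def cost_penalty(price_euro: float) -> int:
    """Map a price in euro to the cost penalty (1 = cheapest, 4 = most expensive).

    Boundary handling is an assumption of this sketch.
    """
    if price_euro < 100:
        return 1
    if price_euro <= 1000:
        return 2
    if price_euro <= 10_000:
        return 3
    return 4


@dataclass(frozen=True)
class Availability:
    accessibility: int  # 1..3, 3 = company-internal only
    cost: int           # 1..4, see cost_penalty
    modifiability: int  # 1..3, 3 = black box

    def __post_init__(self):
        # Reject values outside the penalty ranges listed on the slides.
        if not 1 <= self.accessibility <= 3:
            raise ValueError("accessibility must be 1, 2 or 3")
        if not 1 <= self.cost <= 4:
            raise ValueError("cost must be 1, 2, 3 or 4")
        if not 1 <= self.modifiability <= 3:
            raise ValueError("modifiability must be 1, 2 or 3")

    def code(self) -> str:
        """Return the three-digit score, e.g. '121'."""
        return f"{self.accessibility}{self.cost}{self.modifiability}"

    def is_informative(self) -> bool:
        """False when accessibility is 3, where the other digits mean little."""
        return self.accessibility < 3


if __name__ == "__main__":
    lexicon = Availability(accessibility=1, cost=cost_penalty(500), modifiability=1)
    print(lexicon.code(), lexicon.is_informative())
```

Keeping the three digits separate, rather than summing them, preserves the point made above that a company-internal resource is effectively unavailable whatever its price or openness.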


Quality

• We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it).

• Absolute: standard-compliance and soundness

• Relative: task-relevance and environment-relevance


Standard-compliance

• criterion: to what extent is the resource based on a common standard (formal or de facto)

• possible values (penalty based):

– (3) no standard

– (2) standard, but not fully compliant

– (1) standard and fully compliant


Soundness

• criterion: to what extent is the resource based on well-defined specifications

• values:

– (3) no specifications provided

– (2) specs provided, but not fully compliant

– (1) specs provided, fully compliant


Task-relevance

• criterion (relative): to what extent is the resource suited for a specific task X

• values (3 binary values):

– contains all information needed for X (yes/no)

– has the proper size for X (yes/no)

– based on a relevant selection of items for X (yes/no)


Environment-relevance

• criterion: to what extent is the resource interoperable with its environment (other resources)

• values (3 binary values):

– information matches (yes/no)

– size matches (yes/no)

– selection matches (yes/no)


Comments on quality

• We can now express absolute quality objectively in terms of a pair of scores (standard-compliance, soundness); this score can be assigned by the provider

• and relative quality (for our own purposes) in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user

• other attributes may be added as long as they can be objectively assigned
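
The split between provider-assigned and user-assigned scores can be made explicit in a record layout. The sketch below is one possible encoding, not part of the BLARK proposal itself: the class and field names are hypothetical, and the penalty range 1..3 follows the values listed for standard-compliance and soundness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbsoluteQuality:
    """Pair of penalty scores, assignable by the resource provider."""
    standard_compliance: int  # 1 = standard and fully compliant, 3 = no standard
    soundness: int            # 1 = specs provided and fully met, 3 = no specs

    def __post_init__(self):
        for name in ("standard_compliance", "soundness"):
            if not 1 <= getattr(self, name) <= 3:
                raise ValueError(f"{name} must be 1, 2 or 3")


@dataclass(frozen=True)
class Relevance:
    """Triple of yes/no answers, assignable only by the user."""
    information: bool  # information needed / information matches
    size: bool         # proper size / size matches
    selection: bool    # relevant selection / selection matches

    def all_met(self) -> bool:
        return self.information and self.size and self.selection


@dataclass(frozen=True)
class QualityRecord:
    absolute: AbsoluteQuality  # provider side
    task: Relevance            # user side, for a specific task X
    environment: Relevance     # user side, for the surrounding resources


if __name__ == "__main__":
    record = QualityRecord(
        absolute=AbsoluteQuality(standard_compliance=1, soundness=2),
        task=Relevance(information=True, size=True, selection=False),
        environment=Relevance(information=True, size=True, selection=True),
    )
    print(record.task.all_met(), record.environment.all_met())
```

The same provider-side pair can be shipped with the resource, while each user fills in the two triples for their own task and environment.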


Quantity

• The DLU team did not try to formulate any quantitative requirements

• We have tried to do this in the context of the NEMLAR project, see below for our tentative figures

• Statistical approaches can swallow any amount of resources, and minimal figures are very hard to find

• Our figure-finding exercise has been very much example-driven


Standards

• Very few formal standards exist (cf. Romary & Ide at the LREC 2004 workshop; Monachini et al., 2003)

• Evolving de facto standards include:

– Bottom-up work by committees (TEI)

– Top-down actions:

  • Projects aiming at standards (e.g. EAGLES, ISLE)

  • Example-setting R&D projects (e.g. WordNet, SpeechDat, Multext)

• Our position: any standard is better than no standard at all


Defining a BLARK

• Work carried out in the context of the NEMLAR project (www.nemlar.org), aimed at Arabic resources

• Work described here is based on project deliverables (see site), summarized in an article by Maegaard, Krauwer, Choukri and Damsgaard presented at the NEMLAR conference in Cairo (Sep 2004)

http://www.nemlar.org/


Approach adopted

• Same strategy as Dutch Language Union (applications => modules => resources)

• But with different results because of differences in social/economic situation and in language structure

• Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage, but project is still ongoing)

• Feedback is welcome!


Written resources (1)

• Lexicon:

– For all components: 40 000 stems with POS & morphology

– For sentence boundary detection: list of conjunctions and other sentence starters/stoppers

– For named entity recognition: 50 000 human proper names

– For semantic analysis: same 40 000, with subcategorization, shallow lexical semantic info; possibly a WordNet



• Bi-/Multilingual lexicon:

– Same size as monolingual

• Thesauri, ontologies, wordnets:

– Thesaurus subtree with ca. 200-300 nodes for each domain

– Ontologies and wordnets ideally same size as lexicon



• Corpora:

– For term extraction: 100 million words unannotated

– For small applications: 0.5 million words annotated

– For statistical POS tagger: 1-3 million (ann)

– Sentence boundary: 0.5-1.5 million (ann)

– Named entity (stat based): 1.5 million (ann)

– Term extraction: 100 million (ann)

– Co-reference resolution: 1 million (ann)

– WSD: 2-3 million (ann)



• Multilingual corpora:

– For alignment: 0.5 million (tagged)

• Multimodal corpora:

– For OCR (printed): ??

– For OCR (hand-written): ??


Spoken resources (1)

• Acoustic data:

– For dictation: 50-100 speakers, 20 min each, fully transcribed, plus 10 speakers for testing

– For telephony: 500 speakers uttering 50 different sentences (SpeechDat, OrienTel based)

– For embedded speech recognition: data similar to Speecon

– For broadcast news transcription: 50-100 hours well-annotated, plus 1000 hours of non-transcribed data; should come with 300 million words of non-annotated written text



• Acoustic data (cont’d):

– For conversational speech: data similar to CallHome/CallFriend from LDC

– For speaker recognition: 500 speakers for training, 3 minutes each, transcribed, plus 100 speakers for testing

– For language/dialect identification: data similar to CallFriend, or from Broadcast News (esp. for variants of Arabic)

– For speech synthesis: male and female speakers, 15 hours, using a read text, phonetically balanced

– For formant synthesis: same as above, with hand-labelled formants



• Multimodal corpora:

– For lip movement reading: similar to M2VTS, with some 50 faces

• Written corpora for speech technologies:

– General: 300 million words unannotated, preferably broadcast news or other press and media sources

– For phonetic lexicon and language models: 1-5 million words, annotated

– For Arabic: vowelized and non-vowelized corpus


What next? (1)

• Check definition and quantification for completeness and consistency, and correct where needed

• Try to provide specs for every single item

• Try to differentiate between general and Arabic in definitions and specs


What next? (2)

• For each language:

– Take the BLARK definition and specs

– Adapt to local conditions

– Make a survey of what exists and what has to be made

– Find the funds and build the BLARK for your language
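
The survey step for one language can be sketched as a comparison between required and existing sizes. The required figures below are the lower bounds of the tentative annotated-corpus sizes given earlier for written resources; the inventory of existing material and all function and variable names are hypothetical.

```python
# Minimum annotated-corpus sizes in words: lower bounds of the tentative
# figures listed under written resources.
REQUIRED_WORDS = {
    "statistical POS tagger": 1_000_000,
    "sentence boundary detection": 500_000,
    "named entity recognition": 1_500_000,
    "co-reference resolution": 1_000_000,
    "word sense disambiguation": 2_000_000,
}


def survey_gaps(required, existing):
    """Return {purpose: words still missing} for every unmet requirement."""
    gaps = {}
    for purpose, needed in required.items():
        have = existing.get(purpose, 0)  # nothing on record counts as zero
        if have < needed:
            gaps[purpose] = needed - have
    return gaps


if __name__ == "__main__":
    # Hypothetical inventory for some language.
    existing = {
        "statistical POS tagger": 1_200_000,
        "named entity recognition": 400_000,
    }
    for purpose, missing in survey_gaps(REQUIRED_WORDS, existing).items():
        print(f"{purpose}: {missing:,} words still to be produced")
```

A real survey would also record availability and quality for each existing item, since a corpus that is company-internal does not close a gap for precompetitive research.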


Prescriptive / descriptive

• Prescriptive:

– the BLARK definition tells you which ingredients you need

– the specification tells you what they should look like

• Descriptive:

– a BLARK instantiation comes with a description of its components


Main beneficiaries (1)

• academic and industrial researchers: material to try out ideas and conduct pilot studies

• industrial developers: only for generic activities, since specific applications require more user and domain orientation

• educators: material for experimental work by students in labs


Main beneficiaries (2)

• probably not the main languages in Europe (EN, FR, DE), as they are pretty well covered anyway

• mostly the languages that are not supported by a strong market (because of small size or poor economy)


References

• Binnenpoorte et al. at LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)

• ELRA Newsletter vol 3, n 2, 1998 (see also www.elsnet.org/blark.html)

• NEMLAR: see www.nemlar.org for

– Arabic BLARK Report

– NEMLAR presentation at Cairo conference

• Romary & Ide at LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)


Concluding remarks

• The BLARK aims at providing a common definition of the notion ‘minimal set of resources’

• It should help language communities to come closer to the idea of creating an equal playing field, in spite of market forces

• It should facilitate porting of expertise

• It is necessarily dynamic, as technologies evolve rapidly


Thanks!

Contact:

[email protected]
