Hamburg, 22-11-2004 [email protected] 1
The Basic Language Resources Kit (BLARK)
Steven Krauwer
Utrecht Institute of Linguistics UiL OTS / ELSNET
Overview
• The BLARK Enterprise
• How to arrive at it
• The Dutch Language Union approach
• Refining the concept
• Defining a BLARK
• Main beneficiaries
• References
• Concluding remarks
The BLARK Enterprise
• Define the minimal set of language resources that is necessary to do any precompetitive R&D and professional education at all for a language (the Basic Language Resource Kit or BLARK)
• Determine for each language which components are already available
• Make a priority plan to complete the BLARK for each language
• Ensure funding to get the work done
What are the components of a BLARK?
• Lexicons (monolingual, multilingual, …)
• Corpora (language, speech; annotated, unannotated; mono- and multilingual; mono- and multimodal; …)
• Tools (annotation, exploration, …)
• Modules (lemmatizers, parsers, speech recognizers, TTS, transcribers, translation, …)
• …
What makes the BLARK Enterprise special?
• The idea is to make a common generic BLARK definition, in principle applicable to all languages
• The common definition will be based on the experience with different languages, and will prevent reinvention of wheels
• The common definition will ensure interoperability and interconnectivity (especially for multilingual or cross-lingual applications)
Other benefits
• Experience from other languages will help in making cost estimates
• Adoption of a BLARK common to all languages may help in persuading funders to support the creation of the BLARK
• Adoption of a common BLARK may facilitate porting of knowledge and expertise between languages
Words of caution
• A BLARK definition will evolve over time, as new applications, application environments and technologies emerge
• A BLARK definition should be seen as a template rather than a dictate, as different languages may have different specific requirements
• BLARK completion priorities may differ from language to language (on e.g. economic, social or political grounds)
How to define a BLARK and assign priorities
• Methodology proposed by the Dutch Language Union [DLU] (Binnenpoorte et al, LREC 2002):
– Identify a number of typical applications
– Determine for each of them which technologies (modules) are needed to build it, and how important each one is (-, +, ++, +++)
– Identify for each module which resources it requires, rated on the same scale (-, +, ++, +++)
– Assign the highest priority to the resources that support the most applications
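The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the applications, modules, resources and ratings are invented placeholders rather than the DLU tables, and mapping the marks to 0..3, multiplying them along each path and using the summed weight as a tie-breaker is one possible reading of "supports most applications".

```python
# Importance marks from the slide mapped to numbers (an assumed mapping).
WEIGHT = {"-": 0, "+": 1, "++": 2, "+++": 3}

# Hypothetical example data: application -> module -> importance mark.
APP_MODULES = {
    "dictation": {"speech recognizer": "+++", "lemmatizer": "+"},
    "information retrieval": {"lemmatizer": "++", "parser": "+"},
    "machine translation": {"parser": "+++", "lemmatizer": "++"},
}

# Hypothetical example data: module -> resource -> importance mark.
MODULE_RESOURCES = {
    "speech recognizer": {"speech corpus": "+++", "lexicon": "++"},
    "lemmatizer": {"lexicon": "+++", "annotated corpus": "+"},
    "parser": {"treebank": "+++", "lexicon": "++"},
}


def rank_resources(app_modules, module_resources):
    """Return (resource, applications supported, total weight), best first."""
    apps_supported = {}
    total_weight = {}
    for app, modules in app_modules.items():
        for module, m_mark in modules.items():
            for resource, r_mark in module_resources.get(module, {}).items():
                weight = WEIGHT[m_mark] * WEIGHT[r_mark]
                if weight == 0:
                    continue  # a "-" on either step means "not needed here"
                apps_supported.setdefault(resource, set()).add(app)
                total_weight[resource] = total_weight.get(resource, 0) + weight
    ranking = [(r, len(apps_supported[r]), total_weight[r]) for r in apps_supported]
    # Most applications supported first; total weight, then name, break ties.
    ranking.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return ranking


RANKING = rank_resources(APP_MODULES, MODULE_RESOURCES)
for resource, n_apps, weight in RANKING:
    print(f"{resource}: supports {n_apps} applications, weight {weight}")
```

With this toy data the lexicon comes out on top because every application depends on it through at least one module; real tables would of course give a different ordering.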
Proposed DLU priorities for NLP
1. treebank
2. robust parsers
3. tokenisation and named entity recognition
4. semantic annotations for the treebank
5. translation equivalents
6. evaluation benchmarks
Proposed DLU priorities for speech
1. automatic speech recognition
2. application-specific speech corpora
3. multi-media speech corpora
4. tools for transcription of speech data
5. speech synthesis
6. benchmarks for evaluation
Next steps by DLU
• Make a survey of what exists and to what extent it is available (0-9 availability score)
• Assign priorities (not just resources but also an infrastructure for maintenance and distribution)
• Secure funding from the Dutch and Flemish governments for a national programme
• Issue calls for proposals for collaborative resources projects (1st call closed Nov 2 2004)
Refining the concept
• Items not really covered by the DLU teams:
– definition vs specification
– availability
– quality
– quantity
– standards
– support
• Addressed in the NEMLAR project
Definition / specification
• Not enough to say ‘a written language corpus’; what about:
– size (types, tokens)
– encoding
– annotation
– text types
– representativity
– domains
• i.e. we need full specs
Availability
• DLU: 0-9 scale, very impressionistic
• Our proposal: 3 dimensions
– accessibility
– cost
– modifiability
• to each we assign a penalty score (0 is best)
Accessibility
• 3 classes, with associated penalties
– (3) existing, but only company-internal
– (2) existing and freely usable for precompetitive research
– (1) existing and freely usable for all R&D
Cost
• 4 cost categories:
– (4) price over 10,000 euro
– (3) price between 1,000 and 10,000 euro
– (2) price between 100 and 1,000 euro
– (1) price below 100 euro
Modifiability
• 3 categories
– (3) black box: you get the resource as it is, and you cannot change or even inspect its internals
– (2) glass box: you cannot change the resource, but you can see what is inside
– (1) open resources: freely manipulable
Comments on availability
• we can now express availability in a 3 digit score (accessibility, cost, modifiability) which should be rather easy to assign objectively
• the lowest scores are the best
• if the accessibility score is 3, the other scores don’t mean very much
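A minimal sketch of how such a three-digit score could be recorded, using the class numbers exactly as listed on the accessibility, cost and modifiability slides. The names `Availability` and `cost_category`, the example prices, and the treatment of prices that fall exactly on a category boundary are assumptions made for illustration.

```python
from dataclasses import dataclass


def cost_category(price_eur):
    """Map a price in euro to the cost penalty (boundary handling is assumed)."""
    if price_eur < 100:
        return 1
    if price_eur <= 1000:
        return 2
    if price_eur <= 10_000:
        return 3
    return 4


@dataclass(frozen=True)
class Availability:
    accessibility: int   # 1 (free for all R&D) .. 3 (company-internal)
    cost: int            # 1 (cheapest) .. 4 (most expensive)
    modifiability: int   # 1 (open) .. 3 (black box)

    def __post_init__(self):
        limits = {"accessibility": 3, "cost": 4, "modifiability": 3}
        for name, upper in limits.items():
            value = getattr(self, name)
            if not 1 <= value <= upper:
                raise ValueError(f"{name} must be in 1..{upper}, got {value}")

    def code(self):
        """The three-digit score: accessibility, cost, modifiability."""
        return f"{self.accessibility}{self.cost}{self.modifiability}"

    def informative(self):
        """With accessibility 3 the other two digits say very little."""
        return self.accessibility < 3


# Hypothetical resources, for illustration only.
corpus = Availability(accessibility=1, cost=cost_category(250), modifiability=1)
tagger = Availability(accessibility=3, cost=cost_category(0), modifiability=3)
print(corpus.code(), corpus.informative())
print(tagger.code(), tagger.informative())
```

Keeping the three digits as separate fields rather than one number makes it harder to compare or add them as if they were a single scale, which the slides do not suggest doing.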
Quality
• We distinguish two types of quality: absolute (i.e. an inherent property of the resource) and relative (i.e. in relation to how you want to use it)
• Absolute: standard-compliance and soundness
• Relative: task-relevance and environment-relevance
Standard-compliance
• criterion: to what extent is the resource based on a common standard (formal or de facto)
• possible values (penalty based):
– (3) no standard
– (2) standard, but not fully compliant
– (1) standard and fully compliant
Soundness
• criterion: to what extent is the resource based on well-defined specifications
• values:
– (3) no specifications provided
– (2) specs provided, but not fully compliant
– (1) specs provided, fully compliant
Task-relevance
• criterion (relative): to what extent is the resource suited for a specific task X
• values (3 binary values):
– contains all information needed for X (yes/no)
– has the proper size for X (yes/no)
– based on a relevant selection of items for X (yes/no)
Environment-relevance
• criterion: to what extent is the resource interoperable with its environment (other resources)
• values (3 binary values):
– information matches (yes/no)
– size matches (yes/no)
– selection matches (yes/no)
Comments on quality
• We can now express absolute quality objectively in terms of a pair of scores (standard-compliance, soundness); this score can be assigned by the provider
• and relative quality (for our own purposes) in terms of two triples of yes/no answers (task-relevance, environment-relevance); this score can only be assigned by the user
• other attributes may be added as long as they can be objectively assigned
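The split between provider-assigned and user-assigned scores could be represented as follows. This is a sketch only: the type names, the helper `unmet_criteria` and the example values are assumptions, not part of the NEMLAR proposal.

```python
from typing import NamedTuple


class AbsoluteQuality(NamedTuple):
    """Assigned by the provider; penalties from 1 (best) to 3."""
    standard_compliance: int
    soundness: int


class Relevance(NamedTuple):
    """One triple of yes/no answers."""
    information: bool
    size: bool
    selection: bool


class RelativeQuality(NamedTuple):
    """Assigned by the user, for one specific task and environment."""
    task: Relevance
    environment: Relevance


def unmet_criteria(relative):
    """List the yes/no checks answered 'no', as 'aspect: criterion' strings."""
    failed = []
    for aspect in ("task", "environment"):
        triple = getattr(relative, aspect)
        for criterion, ok in triple._asdict().items():
            if not ok:
                failed.append(f"{aspect}: {criterion}")
    return failed


# Hypothetical scores for one resource and one intended use.
provider_score = AbsoluteQuality(standard_compliance=2, soundness=1)
user_score = RelativeQuality(
    task=Relevance(information=True, size=False, selection=True),
    environment=Relevance(information=True, size=True, selection=False),
)
print(provider_score)
print(unmet_criteria(user_score))
```

The absolute pair travels with the resource, while a new `RelativeQuality` record would be filled in for every task the resource is considered for.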
Quantity
• The DLU team did not try to formulate any quantitative requirements
• We have tried to do this in the context of the NEMLAR project, see below for our tentative figures
• Statistical approaches can swallow any amount of resources, and minimal figures are very hard to find
• Our figure-finding exercise has been very much example-driven
Standards
• Very few existing formal standards around, although some exist (cf Romary & Ide at LREC2004 workshop, Monachini et al, 2003)
• Evolving de facto standards include:
– Bottom-up work by committees (TEI)
– Top-down actions:
  • Projects aiming at standards (e.g. EAGLES, ISLE)
  • Example-setting R&D projects (e.g. WordNet, SpeechDat, Multext)
• Our position: any standard is better than no standard at all
Defining a BLARK
• Work carried out in the context of the NEMLAR project (www.nemlar.org), aimed at Arabic resources
• Work described here based on project deliverables (see site), summarized in article by Maegaard, Krauwer, Choukri, Damsgaard presented at NEMLAR conference in Cairo (Sep 2004)
Approach adopted
• Same strategy as Dutch Language Union (applications => modules => resources)
• But with different results because of differences in social/economic situation and in language structure
• Results follow, in terms of global definitions and tentative size indications (no specs provided at this stage, but project is still ongoing)
• Feedback is welcome!
Written resources (1)
• Lexicon:
– For all components: 40 000 stems with POS & morphology
– For sentence boundary detection: list of conjunctions and other sentence starters/stoppers
– For named entity recognition: 50 000 human proper names
– For semantic analysis: same 40 000, with subcategorization, shallow lexical semantic info; possibly a WordNet
Written resources (2)
• Bi-/Multilingual lexicon
– Same size as monolingual
• Thesauri, ontologies, wordnets:
– Thesaurus subtree with ca 200-300 nodes for each domain
– Ontologies and wordnets ideally same size as lexicon
Written resources (3)
• Corpora:
– For term extraction: 100 million words unannotated
– For small applications: 0.5 million words annotated
– For statistical POS tagger: 1-3 million (ann)
– Sentence boundary: 0.5-1.5 million (ann)
– Named entity (stat based): 1.5 million (ann)
– Term extraction: 100 million (ann)
– Co-reference resolution: 1 million (ann)
– WSD: 2-3 million (ann)
Written resources (4)
• Multilingual corpora:
– For alignment: 0.5 million (tagged)
• Multimodal corpora:
– For OCR (printed): ??
– For OCR (hand-written): ??
Spoken resources (1)
• Acoustic data:
– For dictation: 50-100 speakers, 20 min each, fully transcribed, plus 10 speakers for testing
– For telephony: 500 speakers uttering 50 different sentences (based on SpeechDat, OrienTel)
– For embedded speech recognition: data similar to Speecon
– For broadcast news transcription: 50-100 hours well-annotated, plus 1000 hours of non-transcribed data; should come with 300 million words of non-annotated written text
Spoken resources (2)
• Acoustic data (cont’d):
– For conversational speech: data similar to CallHome/CallFriend from LDC
– For speaker recognition: 500 speakers for training, 3 minutes each, transcribed, plus 100 speakers for testing
– For language/dialect identification: data similar to CallFriend, or from Broadcast News (esp for variants of Arabic)
– For speech synthesis: male and female speakers, 15 hours, using a read text, phonetically balanced
– For formant synthesis: same as above, with hand-labelled formants
Spoken resources (3)
• Multimodal corpora:
– For lip reading: similar to M2VTS, with some 50 faces
• Written corpora for speech technologies:
– General: 300 million words unannotated, preferably broadcast news or other press and media sources
– For phonetic lexicon and language models: 1-5 million words, annotated
– For Arabic: vowelized and non-vowelized corpus
What next? (1)
• Check definition and quantification for completeness and consistency, and correct where needed
• Try to provide specs for every single item
• Try to differentiate between general and Arabic in definitions and specs
What next? (2)
• For each language:
– Take the BLARK definition and specs
– Adapt to local conditions
– Make a survey of what exists and what has to be made
– Find the funds and build the BLARK for your language
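The survey step amounts to comparing the BLARK definition with an inventory of what a language already has. The sketch below uses a few of the tentative written-resource figures from the earlier slides as minimum sizes (the lower bound where a range is given); the inventory, the item labels and the function name `survey_gaps` are hypothetical.

```python
# Minimum sizes taken from the tentative written-resource figures above.
REQUIRED = {
    "lexicon (stems with POS & morphology)": 40_000,
    "human proper names": 50_000,
    "annotated corpus for statistical POS tagger (words)": 1_000_000,
    "unannotated corpus for term extraction (words)": 100_000_000,
}

# Hypothetical survey result for some language: what already exists.
AVAILABLE = {
    "lexicon (stems with POS & morphology)": 25_000,
    "annotated corpus for statistical POS tagger (words)": 1_200_000,
}


def survey_gaps(required, available):
    """Return {item: amount still missing} for every item below its minimum."""
    gaps = {}
    for item, minimum in required.items():
        missing = minimum - available.get(item, 0)
        if missing > 0:
            gaps[item] = missing
    return gaps


GAPS = survey_gaps(REQUIRED, AVAILABLE)
for item, missing in GAPS.items():
    print(f"{item}: {missing:,} still to be created")
```

A fuller survey would also record the availability and quality scores of each existing item, since a resource that exists only company-internally does not really fill a gap.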
Prescriptive / descriptive
• Prescriptive:
– the BLARK definition tells you which ingredients you need
– the specification tells you what they should look like
• Descriptive:
– a BLARK instantiation comes with a description of its components
Main beneficiaries (1)
• academic and industrial researchers: material to try out ideas and conduct pilot studies
• industrial developers: only for generic activities, since specific applications require more user and domain orientation
• educators: material for experimental work by students in labs
Main beneficiaries (2)
• probably not the main languages in Europe (EN, FR, GE) as they are pretty well covered anyway
• mostly the languages that are not supported by a strong market (because of small size or poor economy)
References
• Binnenpoorte et al at LREC 2002 (see also www.elsnet.org/dox/lrec2002-binnenpoorte.pdf)
• ELRA Newsletter vol 3, n 2, 1998 (see also www.elsnet.org/blark.html)
• NEMLAR: see www.nemlar.org for
– Arabic BLARK Report
– NEMLAR presentation at Cairo conference
• Romary & Ide at LREC 2004 (see also www.elsnet.org/lrec2004-roadmap/Romary-Ide.ppt)
Concluding remarks
• The BLARK aims at providing a common definition of the notion ‘minimal set of resources’
• It should help language communities to come closer to the idea of creating an equal playing field, in spite of market forces
• It should facilitate porting of expertise
• It is necessarily dynamic, as technologies evolve rapidly