N. Calzolari [FLaReNet], NEERI Workshop, Helsinki, September 2009
eContentplus
Standards: strength and limitations … LMF
Nicoletta Calzolari
glottolo@ilc.cnr.it
Fostering Language Resources Network
http://www.flarenet.eu
Historical notes
After the “Grosseto Workshop” (1985): a turning point
In Europe the so-called X-LEX projects: ACQUILEX, MULTILEX, GENELEX
and other lexical and text annotation/representation projects: NERC, ET-7, ET-10, DELIS
that saw the participation of many EU groups, linked by sharing similar approaches and visions
EAGLES, ISLE
Start: Zampolli breakfast meeting
EAGLES acronym … by Cencioni
Key issues: Do conditions exist for standardisation effort?
Reusability as key concept (true also today)
  To avoid duplication of efforts, costs, etc.
  To allow synergies, integration, exchange of data, ...
  To provide a model for new data creation & acquisition
Decide on “feasible” areas & state priorities (this is changing over time)
The feasibility of formulation of consensual standards as a strong sign of maturity in the field: we can’t propose standards if there are not enough results on which to base them
EAGLES was launched in ‘93
Some standard-related projects & initiatives
Defining standards/best practice:
TEI: creating standards for text annotation
NERC: creating the basis for bottom-up empirical harmonisation, based on extensive best-practice analysis
EAGLES: introducing a methodological model for standards work
ISLE: extending in topics & communities
LIRICS: preparing for international standards
ISO/TC 37/SC 4/WG 4: going to international standards: LMF … & many others
NEDO: porting to Asian languages
MultilingualWeb: new Thematic Network for relation with W3C
Some standard-related projects & initiatives (cont.)
Using standards/best practice:
MULTEXT & MULTEXT-EAST: applying to lexicons & text annotation, with EAGLES compliant specs
PAROLE-SIMPLE lexicons: morphology, syntax & semantics: operational specs & constraints betw. lexical descriptors (12 languages)
EuroWordNets: a de-facto best-practice
BOOTStrep: terminologies in Bio-domain: BioLexicon
KYOTO: in the environment domain
PANACEA: in a platform for LR acquisition
Some standard-related projects & initiatives (cont.)
Promoting standards/best practice:
INTERA: for a EU repository of language data
ENABLER: to link EU & national initiatives
ELRA: the EU LR association
LanguageGrid: Japanese infrastructure for LR services
CLARIN: LR standards for the Humanities & Social Sciences
FLaReNet: LR standards for Human Language Technologies
T4ME NoE: for an Open Resource Infrastructure
Main Results in Lexicon & Corpus WGs
First Phase (www.ilc.pi.cnr.it/EAGLES96/home.html)
Standard for morphosyntactic encoding of lexical entries, in a multi-layered structure, with applications for all the EU languages
Standard for subcategorisation in the lexicon: a set of standardised basic notions using a frame-based structure
Proposal for a basic set of notions in lexical semantics: focus on requirements of Information Systems and MT
Corpus Encoding Standard (CES) from TEI
Standard for morphosyntactic annotation of corpora, to ensure compatibility/interchangeability of concrete annotation schemata
Preliminary recommendations for syntactic annotation of corpora
Dialogue annotation, for integration of written and spoken annotation
Content vs. Format/Representation
Work on lexical description deals with two aspects:
  Linguistic description of lexical items (content)
  Formal representation of lexical descriptions (format)
EAGLES concentrated on linguistic content, not disregarding the formal representation of the proposal
TEI more on format/representation issues
In LMF: on the abstract meta-model
Flexibility in the Recommendations, e.g. Morphosyntax
Level   Information Type                           Recommendation
L-0     Part-of-Speech                             Obligatory
L-1     Morphosyntactic agreement features         Recommended
L-2     Language-specific (or refined) features    Optional
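The three levels in the table lend themselves to a mechanical check. The following is only a minimal Python sketch of that idea: the feature keys (`pos`, `agreement`, `language_specific`) and the report layout are assumptions made for illustration and are not taken from the EAGLES specifications.

```python
from enum import Enum


class Status(Enum):
    OBLIGATORY = "obligatory"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


# Levels as listed on the slide; the dictionary keys ("pos", "agreement",
# "language_specific") are hypothetical names chosen for this sketch.
LEVELS = {
    "pos": ("L-0", Status.OBLIGATORY),
    "agreement": ("L-1", Status.RECOMMENDED),
    "language_specific": ("L-2", Status.OPTIONAL),
}


def check_entry(entry: dict) -> dict:
    """Report missing obligatory info as errors and missing recommended info as warnings."""
    report = {"errors": [], "warnings": []}
    for feature, (level, status) in LEVELS.items():
        if feature in entry:
            continue
        if status is Status.OBLIGATORY:
            report["errors"].append(f"{level} {feature} is missing")
        elif status is Status.RECOMMENDED:
            report["warnings"].append(f"{level} {feature} is missing")
        # Missing optional (L-2) information is never reported.
    return report


full = {"pos": "noun", "agreement": {"gender": "feminine", "number": "singular"}}
minimal = {"pos": "noun"}
invalid = {"agreement": {"number": "plural"}}
```

Under these assumptions, `check_entry(minimal)` yields a warning but no error, while `check_entry(invalid)` yields an error because the L-0 information is absent.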
MERITS: Strengths (from EAGLES-ISLE)
Standardisation as a necessary component of any strategic programme to create a coherent market
Leading industrials & academics participated (> 150 EU groups)
Bottom-up community created standards
To avoid wasting time reinventing basic/consolidated knowledge
  May be true also for many “humanities” users, not interested in debates on specific lexical approaches
Work otherwise duplicated among many projects, done just once in a collaborative manner (overall cost-effectiveness)
Allows the field to be more competitive:
  Concentrate efforts on innovative areas
  Engage in new/advanced technology
Why Standards for Language Resources? (from EAGLES-ISLE)
To ensure:
interoperability of systems (& data), through compatible interfaces
reusability and integrability of components
training based on consensual technical specifications and models (“gold standards”)
evaluation & validation based on agreed criteria
transition from prototypes to HLT products
important for workflows
essential for a LR Infrastructure
for evaluation campaigns
The applications: requirements for systems & enabling technologies
Machine Translation
Information Extraction
Information Retrieval
Summarisation
Natural Language Generation
Word Clustering
Multiword Recognition + Extraction
Word Sense Disambiguation
Proper Noun Recognition
Parsing
Coreference
…
For HLT, knowledge of applications’ requirements is essential
The Multilingual ISLE Lexical Entry (MILE)
General methodological principles (from EAGLES)
Basic requirements for the design of the MILE:
Discover and list the (maximal) set of basic notions needed to describe the MILE (up to which level is standardisation feasible?)
Granularity
The leading principle: the edited union of existing lexicons/models (redundancy is not a problem)
Modular & layered
Allow for underspecification (& hierarchical structure)
MILE – Modularity: The building-block model
[Figure: Lexical Objects (syntactic frame, phrase, slot, Synfeature, Semfeature) and Lexical entry 1, Lexical entry 2, Lexical entry 3]
Independent, but interlinked, modules make it possible to express different dimensions of lexical entries
MILE Lexical Classes & Lexical Objects vs ISO LMF
Lexical Classes as the main building blocks of the lexical architecture
Building blocks allow two kinds of reusability:
  intra-lexicon reusability (within the same lexicon)
  inter-lexicon reusability (among different lexicons)
Define an ontology of lexical objects:
  represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
  specify the relevant attributes
  define the relations with other classes
  hierarchically structured
Done in LMF
To be done … (in ISOCat?)
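The two kinds of reusability can be illustrated with a small Python sketch. It is not the MILE or LMF data model: the class names, the slot notation and the example lemmas are all assumptions introduced here to show one lexical object being defined once and referenced from several entries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SyntacticFrame:
    """A reusable lexical object; the slot notation is invented for this sketch."""
    name: str
    slots: tuple


@dataclass
class LexicalEntry:
    lemma: str
    frames: list = field(default_factory=list)


# One frame object is defined once...
transitive = SyntacticFrame("transitive", ("subject:NP", "object:NP"))

# ...and referenced by several entries of the same lexicon (intra-lexicon reusability).
lexicon_a = [
    LexicalEntry("activate", [transitive]),
    LexicalEntry("inhibit", [transitive]),
]

# The same object can also be referenced from another lexicon (inter-lexicon reusability).
lexicon_b = [LexicalEntry("attivare", [transitive])]

# Collect the identities of all frame objects in use across both lexicons.
shared = {
    id(frame)
    for lexicon in (lexicon_a, lexicon_b)
    for entry in lexicon
    for frame in entry.frames
}
```

Here `shared` contains a single identity, since all three entries point at the same frame object rather than at copies of it.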
The MILE Data Categories: User-adaptability and extensibility
[Figure: MLC:SemanticFeature with instance_of links to Core and UserDefined values: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE, MAMMAL]
OK in ISOCat
MILE Lexical Data Category Registry: A library of pre-instantiated objects
Enables modular specification of lexical entities:
  eliminate redundancy
  identify lexical entries or sub-entries with shared properties
  create ready-to-use packages that can be combined in different ways
Can be used “off the shelf” or as a departure point for the definition of new or modified categories
ISOCat
ISO Profiles
ISO - LMF: Lexical Markup Framework
Designed to accommodate many models of lexical representation
Its pros:
  Meta-model: abstract high-level specification (ISO 24613)
  Data Category Registry: low-level specifications (ISO 12620)
  Not a monolithic model, rather a modular framework
  LMF library provides the hierarchy of lexical objects (with structural relations among them)
  Data Category Registry provides a library of descriptors to encode linguistic information associated to lexical objects (N.B. Data Categories can be also user-defined)
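The split between a structural model and a separate registry of descriptors, with room for user-defined categories, might be sketched as follows. This is a toy illustration in Python and not the ISO 12620 registry or its API: the class, its methods and the category names (`partOfSpeech`, `semanticFeature`) are assumptions made for the example.

```python
class DataCategoryRegistry:
    """Toy registry mapping a data category name to its allowed values."""

    def __init__(self, core: dict):
        # Core categories are fixed when the registry is created.
        self._core = {name: set(values) for name, values in core.items()}
        self._user = {}

    def define(self, name: str, values) -> None:
        """Add user-defined values without modifying the core set."""
        self._user.setdefault(name, set()).update(values)

    def is_valid(self, name: str, value: str) -> bool:
        """True if the value is known for this category, core or user-defined."""
        allowed = self._core.get(name, set()) | self._user.get(name, set())
        return value in allowed


# Hypothetical core category and values.
registry = DataCategoryRegistry({"partOfSpeech": ["noun", "verb", "adjective"]})

before = registry.is_valid("semanticFeature", "MAMMAL")
registry.define("semanticFeature", ["MAMMAL", "AGE"])  # user-defined extension
after = registry.is_valid("semanticFeature", "MAMMAL")
```

The value `MAMMAL` is rejected before the user-defined extension and accepted afterwards, while the core values stay untouched.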
ISO LMF
Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry
+ various extensions: Morphology, NLP Multilingual notations, NLP MWE pattern, NLP Paradigm class, NLP Semantic, MRD, NLP Syntax, Constraint Expression
Modular framework
LMF specs comply with UML modelling principles
An XML DTD allows implementation
Builds on EAGLES/ISLE
The field is mature
[Figure labels: NEDO Asian Lang., NICT Language-Grid Service Ontology, ICT KYOTO, LIRICS, LexInfo, New initiatives…]
Mapping experiment
Major best practices:
  OLIF
  PAROLE/SIMPLE
  LC-Star (Speech Lexicon)
  WordNet - EuroWordNet
  FrameNet
  BDef: formal database of lexicographic definitions derived from Explanatory Dictionary of Contemporary French
Entries from major existing lexicons mapped to LMF
  To prove that the model is able to represent many best practices
  To test the expressive potentialities, the adequacy of architectural model & linguistic objects
from Monica Monachini
BioLexicon: SIMPLE model & ISO-LMF standard
A unique large-scale computational lexicon in the biomedical domain in terms of coverage & typology of information
Populated with info from available biomedical resources
Semi-automatically populated from corpora: population toolkit available
Including both domain-specific & general language words
Rich linguistic information ranging over different levels of linguistic description
Conformant to international lexical representation standards
Designed to meet bio-Text Mining requirements
from Monica Monachini
<Sense rdf:ID="activate_2">
  <belongsToSynset rdf:resource="#activate"/>
  <hasSemanticRelation rdf:resource="#is_a_1"/>
  <hasSemanticRelation rdf:resource="#has_as_part_1"/>
  <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
  <hasSemanticFeature rdf:resource="#SF_chemistry"/>
  <hasSemanticFeature rdf:resource="#SF_process"/>
</Sense>
[Figure: Sense Representation. Sense (activate_2), Synset (activate), PredicativeRepresentation, SemanticFeature (SF_chemistry, SF_process), Collocation, SemanticRelation (is_a: [SenseID]; Typical_of: [SenseID] S_protein)]
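A sense description of this kind can be read back with standard XML tooling. The sketch below uses Python's `xml.etree.ElementTree`; it wraps the fragment in an `rdf:RDF` root so that the `rdf:` prefix resolves, and the default namespace `http://example.org/biolexicon#` is a placeholder assumed for the example, not the BioLexicon's actual namespace.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LEX = "http://example.org/biolexicon#"  # hypothetical namespace, not from the slide

DOC = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns="{LEX}">
  <Sense rdf:ID="activate_2">
    <belongsToSynset rdf:resource="#activate"/>
    <hasSemanticRelation rdf:resource="#is_a_1"/>
    <hasSemanticRelation rdf:resource="#has_as_part_1"/>
    <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
    <hasSemanticFeature rdf:resource="#SF_chemistry"/>
    <hasSemanticFeature rdf:resource="#SF_process"/>
  </Sense>
</rdf:RDF>
"""


def read_sense(xml_text: str) -> dict:
    """Collect the targets of each property of the first Sense element."""
    root = ET.fromstring(xml_text)
    sense = root.find(f"{{{LEX}}}Sense")
    summary = {"id": sense.get(f"{{{RDF}}}ID")}
    for child in sense:
        prop = child.tag.split("}", 1)[1]  # strip the namespace part of the tag
        target = child.get(f"{{{RDF}}}resource").lstrip("#")
        summary.setdefault(prop, []).append(target)
    return summary


sense = read_sense(DOC)
```

The resulting dictionary groups the referenced identifiers by property name, so the two semantic features and the three semantic relations of `activate_2` can be inspected separately.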
KYOTO SYSTEM
[Figure: KYOTO system architecture. Labels: Source Documents; Linear MAF/SYNAF; Linear SEMAF; Linear Generic FACTAF; Generic TMF; Term extraction (Tybot); Semantic annotation; Fact extraction (Kybot); Domain editing (Wikyoto); Wordnet; Domain Wordnet; LMF API; Ontology; Domain ontology; OWL API; Concept User; Fact User]
from Piek Vossen
A common representation format: WordNet - LMF
[Figure: UML class diagram with cardinalities (0..1, 1..1, 0..*, 1..*). Classes: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, MonolingualExternalRefs, MonolingualExternalRef, Synset, Definition, Statement, SynsetRelations, SynsetRelation, SenseAxes, SenseAxis, InterlingualExternalRefs, InterlingualExternalRef, Meta]
Data Categories
from Monica Monachini
Centralized WordNet DC Registry
A list of 85 sem.rels as a result of a mapping of the KYOTO WordNet grid
Inter-WN
Intra-WN
from Monica Monachini
SWN<fuego_3, llama_1>
09686541-n
<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
  id ID #REQUIRED
  relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
  ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
  externalSystem CDATA #REQUIRED
  externalReference CDATA #REQUIRED
  relType (at|plus|equal) #IMPLIED>
IWN<fuoco_1, fiamma_1>
00001251-n
WordNet-LMF multilingual level - Cross-lingual relations
WN3.0<fire_1 flame_1 flaming_1>
13480848-n
groups monolingual synsets corresponding to each other and sharing the same relations to English
link to ontology/(ies)
specifies the type of correspondence
from Monica Monachini
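As an illustration of the DTD fragment on this slide, the following Python sketch builds a `SenseAxis` element with `xml.etree.ElementTree`, respecting the declared element order and required attributes. The axis id `sa_001`, the `relType` value `eq_synonym` and the system name `WordNet3.0` are hypothetical, and treating the SWN and IWN synset ids as Targets with the WN3.0 id as the external reference is an assumption about how the slide's examples fit together.

```python
import xml.etree.ElementTree as ET

ALLOWED_REF_TYPES = ("at", "plus", "equal")  # enumerated in the DTD


def make_sense_axis(axis_id, rel_type, targets, external_refs=()):
    """Build a SenseAxis element following the element order of the DTD."""
    if not targets:
        raise ValueError("the DTD requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in targets:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            if ref_type not in ALLOWED_REF_TYPES:
                raise ValueError(f"relType must be one of {ALLOWED_REF_TYPES}")
            ET.SubElement(refs, "InterlingualExternalRef", {
                "externalSystem": system,
                "externalReference": reference,
                "relType": ref_type,
            })
    return axis


axes = ET.Element("SenseAxes")
axes.append(make_sense_axis(
    "sa_001",                      # hypothetical axis id
    "eq_synonym",                  # hypothetical relType value
    ["09686541-n", "00001251-n"],  # SWN and IWN synset ids shown on the slide
    [("WordNet3.0", "13480848-n", "equal")],  # system name is hypothetical
))
xml_text = ET.tostring(axes, encoding="unicode")
```

The sketch only mirrors the constraints visible in the fragment (at least one `Target`, the three allowed values of the external reference `relType`); it does not validate against the full WordNet-LMF DTD.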
LexInfo & Previous Models
LingInfo: modeling morphosyntactic decomposition of (complex) terms [Buitelaar et al. 2006]
LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
Lexical Markup Framework (LMF): ISO standardised model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
From Paul Buitelaar
LexInfo: Lexical Entry Sub-Categorization Frames
From Paul Buitelaar
MILE Lexical Model oriented towards an Open Distributed Lexical Infrastructure
Lexical Information Servers for multiple access to lexical information repositories
Enhance:
  user-adaptivity
  resource sharing
  cooperative creation of LR & LT
Develop integration and interchange tools
Beyond MILE: future work
Some steps for a “new generation” of LRs
From huge efforts in building static, large-scale, general-purpose LRs
To dynamic LRs rapidly built on-demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources
To Language Services
BUT
• Need of tools to make this vision operational & concrete
Interoperability
Lexical WEB & Content Interoperability
As a critical step for semantic mark-up in the SemWeb
[Figure: lexical resources (ComLex, NomLex, SIMPLE, SIMPLE-WEB, WordNets, Global WordNet GRID, FrameNet, BioLexicon, Lex_x, Lex_y) connected through LMF, with intelligent agents]
Standards for Interoperability: Enough??
A new paradigm of R&D in LRs & LT: Distributed Language Services
Open & distributed infrastructures for LRs & LT
Adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, based on sharing LRs & tools
Ability to build on previous achievements, results accessible to various systems, allowing effective cooperation of many groups on common tasks
Exchange and integrate information across repositories
Create new resources on the basis of existing ones
Compose new services on demand
…
A new scenario implying:
  content interoperability standards
  development of architectures enabling accessibility
  supra-national cooperation
A few Issues for discussion: “content”, guidelines, tools, priorities, ...
For Semantic Web and “content” interoperability: is the field ‘mature’ enough to converge also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?
For the standards to have impact, ensure their usability & gain industry support, focusing on requirements of industrial applications
To have Guidelines which are a “usable product” (to assist in creation or adaptation of lexicons, to share resources, …)
Facilitate acceptance of the standards by providing an open-source reference implementation platform & tools, related web services and test suites
Relation with Spoken language community
Define further steps necessary to converge on common priorities
Limits observed & needs of further work
For usability & operability of LMF: Data Categories (DC) & others:
From Japanese NEDO: DC not defined in LMF & LMF non operational
Asian, African DCs
Need of DC organised in profiles (easy to use): IsoCat & Profiles
Need of an ontology of DCs with structure/dependencies, and constraints
Otherwise the model remains too abstract, and doesn’t say anything on how to implement concretely the different layers
Link with Ontologies: relations Lexicons-Ontologies
Need of easy, user-friendly guidelines
Need of tools to make it operational, also for creating standard-compliant resources: more important than the model!
More dissemination, also with industry
Linguists may be (rightly for certain purposes) not interested
Younger colleagues not aware of the past work on standards
Need of operational definitions of interoperability
Need of stimuli also from EC to produce standard-compliant resources (unless differently motivated)
Strengths
Good set of methodological principles: Granularity of basic notions, …
Many languages already compliant with EAGLES morpho-syntax, etc.
Many projects today using LMF
Unified Lexicon experiment between Speechdat & Parole, at ELRA (possible because EAGLES compliant)
Web-services to access LRs based on standards
Web-based platforms for LR integration
An open infrastructure of LRT needs standards
New topics being constantly added: Time, Space, …
Future requirements & planning
To make LMF usable and operational:
  LMF User Guidelines with examples
  Mapping of commonly used lexicons into LMF
  Data categories for LMF lexicons
  Tools related to LMF, with particular reference to the Lexus tool
Need to address another layer:
  The ontological layer in a lexicon
  How lexicons and ontologies are linked and information mapped from each other
An open space in a wiki environment:
  to store guidelines, examples
  to allow broad discussion on these topics
  to ease dissemination of LMF
FLaReNet Mission: structure the area of LR & LT of the future
Worldwide Forum for LRs & LTs
Consolidate methods, approaches, common practices, architectures
Integrate so far partial solutions into broader infrastructures
A “roadmap”: a plan of coherent actions as input to policy development
  For the EU, national organisations & industry
  As a model for the LRs/LTs of the next years
  Strengthening the language product market, e.g. for new products & innovative services
Identifying areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
Indicating priorities
221 Individual Subscribers
81 Institutional Members from 31 countries
Some results from FLaReNet Vienna Forum: Interoperability Session
Promote knowledge of standards in the community
Define specifications for tools supporting standards
Support workshops/tutorials on how to use standards
Start focusing on standards for more consensual areas & develop for these a toolkit that can be used off-the-shelf, so that we can move on to tackling the larger problems
Identify “best practices” in standards wrt usability, usefulness, viability, outreach etc.
Adopt a model for tool & resource development based on open & collaborative development, where the community as a whole contributes components, modules, etc. to a common framework
Some results from FLaReNet Vienna Forum: International Cooperation
Standards & Interoperability: topics for cooperation
  A metadata catalogue should involve every party
  Common repositories for LRT universally & easily accessible
  Try to connect ongoing work done by many groups
  A shared repository of data formats, annotations (where to find the most frequently used and preferred schemes): major help to achieve standardisation
For a new world-wide language infrastructure
  Create the means to plug together different LR & LT, in a web-based resource and technology grid
  Access to LRT is critical: it involves, and has impact on, all the community
  With the possibility to easily create new workflows
  Create conditions to easily share and re-use technologies, to have more open (source) tools available for use also to under-funded groups
Special Highlight: Contribute to building the LREC2010 Map!
Time is ripe to launch an important initiative, the LREC2010 Map of Language Resources, Technologies and Evaluation.
The Map will be a collective enterprise of the LREC community, as a first step towards the creation of a very broad, community-built, Open Resource Infrastructure.
First in a series, it will become an essential instrument to monitor the field and to identify shifts in the production, use and evaluation of LRs and LTs over the years.
When submitting a paper (< 900!), from the START page fill in a very simple template to provide essential information about resources (in a broad sense, also technologies, standards, evaluation kits.) either used for the work described or a new result of your research
The Map will be disclosed at LREC, where some event(s) will be organised around this initiative
FLaReNet & the ORI (Open Resource Infrastructure) … at LREC
Join FLaReNet!
We invite all interested players in the field to express their interest in becoming part of the Network
How to join? To be part of the FLaReNet Network, fill in the form available on the project website (http://www.flarenet.eu)