Infrastructural Language Resources & Standards for Multilingual Computational Lexicons

  • Infrastructural Language Resources & Standards for Multilingual Computational Lexicons

    Nicoletta Calzolari, with many others

    Istituto di Linguistica Computazionale - CNR - Pisa

    [email protected]

    Pisa, September 2004

  • The ENABLER Mission

    Language Resources (LRs) & Evaluation: a central component of the linguistic infrastructure

    LRs are supported by national funding in National Projects

    Availability of LRs is also a sensitive issue, touching the sphere of linguistic and cultural identity, but also with economic and political implications

    The ENABLER Network of National initiatives aims at enabling the realisation of a cooperative framework:

    formulate a common agenda of medium- & long-term research priorities

    contribute to the definition of an overall framework for the provision of LRs

  • towards ...

    Only by:

    combining the strengths of different initiatives & communities

    making the best use of the modus operandi of the national funding authorities in different national situations

    responding to/anticipating the needs and priorities of R&D & industrial communities

    promoting the adoption of [de facto] standards and best practices

    with a clear distinction of tasks & roles for different actors

    can we produce the synergies, economies of scale, convergence & critical mass necessary to provide the infrastructural LRs needed to realise the full potential of a multilingual global information society

  • Lexicon and Corpus: a multi-faceted interaction

    L → C: tagging

    C → L: frequencies (of different linguistic objects)

    C → L: proper nouns, acronyms

    L → C: parsing, chunking

    C → L: training of parsers

    C → L: lexicon updating

    C → L: collocational data (MWE, idioms, gram. patterns ...)

    C → L: nuances of meanings & semantic clustering

    C → L: acquisition of lexical (syntactic/semantic) knowledge

    L → C: semantic tagging/word-sense disambiguation (e.g. in Senseval)

    C → L: more semantic information on LE

    C → L: corpus based computational lexicography

    C → L: validation of lexical models

    C → L, L → C, ...

  • ... Language as a Continuum

    Interesting - and intriguing - aspects of corpus use:

    descriptions cannot be based on a clear-cut boundary between what is admitted and what is not

    in actual usage, language displays a large number of properties that behave as a continuum, not as yes/no properties

    the same is true of the so-called rules: corpus evidence shows tendencies towards rules rather than precise rules

    it is difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary

    BUT Lexicon & Corpus are two viewpoints on the same linguistic object, even more so in a multilingual context

  • Extraction from texts vs. formal representation in lexicons

    It is difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary

    The rigour and lack of flexibility of formal representation languages cause difficulties when mapping NL word meaning, ambiguous and flexible by its very nature, into them

    No clear-cut boundary when analysing many phenomena: it's more of a continuum

    The same impression arises if one looks at examples of types of alternations: no clear-cut classes across languages or within one language

  • Correlation between different levels of linguistic description in the design of a lexical entry

    To understand word-meaning:

    Focus on the correlation between syntactic and semantic aspects

    But other linguistic levels - such as morphology, morphosyntax, lexical cooccurrence, collocational data, etc. - are closely interrelated/involved

    These relations must be captured when accounting for meaning discrimination

    The complexity of these interrelationships makes semantic disambiguation such a hard task in NLP

    Textual corpora as a device to discover and reveal the intricacy of these relationships

    Frame/SIMPLE semantics as a device to unravel and disentangle the complex situation into elementary and computationally manageable pieces

  • towards Corpus based Semantic Lexicons

    at least in principle, both in the design of the model & in the building of the lexicon (at least partially), with (semi-)automatic means

    Design of the lexical entry with a combined approach:

    theoretical: e.g. Fillmore Frame Semantics / Pustejovsky Generative Lexicon

    empirical: Corpus evidence

    even if there are not always sound and explicit criteria for classification according to frame elements/qualia relations/...

  • But they will never be complete

    Infrastructure of Language Resources (static):

    Semantic networks: Euro-/ItalWordNet

    Lexicons: PAROLE/SIMPLE/CLIPS

    TreeBanks

    ...

    Infrastructure of tools (dynamic):

    Lexical acquisition systems (syntactic & semantic) from corpora

    Robust morphosyntactic & syntactic analysers

    Word-sense disambiguation systems

    Sense classifiers

    ...

    International Standards

  • ItalWordNet Semantic Network [Italian module of EuroWordNet]

    ~ 50,000 lemmas organized in synonym groups (synsets), structured in hierarchies & linked by ~ 130,000 semantic relations:

    ~ 50,000 hyperonymy/hyponymy relations

    ~ 16,000 relations among different POS (role, cause, derivation, etc.)

    ~ 2,000 part-whole relations

    ~ 1,500 antonymy relations, etc.

    Synsets linked to the InterLingual Index (ILI = Princeton WordNet)

    Through the ILI, link to all the European WordNets (de facto standard) & to the common Top Ontology

    Possibility of plug-in with domain terminological lexicons (legal, maritime)

    Usable in IR, CLIR, IE, QA, ...

  • EuroWordNet Multilingual Data Structure

    [Figure: the Italian, Spanish, Dutch, English, French, German, Estonian and Czech wordnets (WN) connected through the ILI, together with the TOP ONTOLOGY (labels: Living, Animal, Human); example words: hond, dog, cane, perro]
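A minimal sketch of how an Inter-Lingual Index can connect language-specific wordnets: each wordnet maps its own words to an ILI record, and cross-lingual lookup goes through that record. The record identifier `ili:dog` and the dictionary layout are assumptions made for illustration; only the four example words come from the slide.

```python
# Hypothetical ILI record identifier (the slides say the ILI is based on
# Princeton WordNet, but give no identifiers).
ILI_DOG = "ili:dog"

# Each language-specific wordnet links its own words to ILI records.
wordnets = {
    "Dutch": {"hond": ILI_DOG},
    "English": {"dog": ILI_DOG},
    "Italian": {"cane": ILI_DOG},
    "Spanish": {"perro": ILI_DOG},
}


def equivalents(word, source_lang, target_lang):
    """Return the target-language words that share an ILI record with `word`."""
    ili = wordnets[source_lang].get(word)
    if ili is None:
        return []
    return sorted(w for w, rec in wordnets[target_lang].items() if rec == ili)


print(equivalents("cane", "Italian", "Dutch"))
```

No wordnet refers to another wordnet directly, so adding a language only requires linking its entries to the index.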

  • Synsets linked by Semantic Relations in ItalWordNet

    Synset: {casa, abitazione, dimora}

    Hyperonym: {edificio, ..}

    Hyponym: {villetta} {catapecchia, bicocca, ..} {cottage} {bungalow}

    Role_location: {stare, abitare, ...}

    Role_target_direction: {rincasare}

    Role_patient: {affitto, locazione}

    Mero_part: {vestibolo} {stanza}

    Holo_part: {casale} {frazione} {caseggiato}

    English: home, domicile, ..; house

    TOP Concepts: Object, Artifact, Building
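The synset above can be sketched as a small data structure: a set of lemmas plus typed relations pointing at other synsets. The class and field names are illustrative assumptions, not ItalWordNet's actual storage format, and members elided on the slide ("..") are simply left out.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A synset plus its outgoing semantic relations, keyed by relation name."""
    lemmas: tuple
    relations: dict = field(default_factory=dict)

    def related(self, relation):
        """Return the target synsets (tuples of lemmas) for one relation type."""
        return self.relations.get(relation, [])


# Values copied from the slide; elided members are omitted, not guessed.
casa = Synset(
    lemmas=("casa", "abitazione", "dimora"),
    relations={
        "hyperonym": [("edificio",)],
        "hyponym": [("villetta",), ("catapecchia", "bicocca"), ("cottage",), ("bungalow",)],
        "role_location": [("stare", "abitare")],
        "role_target_direction": [("rincasare",)],
        "role_patient": [("affitto", "locazione")],
        "mero_part": [("vestibolo",), ("stanza",)],
        "holo_part": [("casale",), ("frazione",), ("caseggiato",)],
    },
)

print(casa.related("mero_part"))
```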

  • Jur-WordNet

    With ITTG-CNR (Istituto di Teoria e Tecniche dell'Informazione Giuridica)

    Jur-WordNet: extension of ItalWordNet for the juridical domain

    Knowledge base for multilingual access to sources of legal information

    Source of metadata for semantic mark-up of legal texts

    To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.

  • Terminological Lexicon of Navigation & Sea Transportation

    Nolo

    Synsets: 1,614; Lemmas: 2,116; Senses: 2,232; Nouns: 1,621; Verbs: 205; Adjectives: 35; Proper Nouns: 236

  • PAROLE: Italian Syntactic Lexicon, 1996-98 (SGML)

    SIMPLE: Italian Semantic Lexicon, 1998-2000 (SGML)

    CLIPS: 2000-2004 (XML)

    morphology: 20,000 entries; syntax: 20,000 words; semantics: 10,000 senses

    phonology, morphology, syntax: 55,000 words; semantics: 55,000 senses

    PAROLE/SIMPLE: 12 harmonised computational lexicons

    http://www.ilc.cnr.it/clips/


  • machine language learning

    development of conceptual networks; linguistic learning; adaptive classification systems; information extraction; bootstrapping of grammars; linguistic change models; language usage models; bootstrapping of lexical information

  • Architecture for linguistic knowledge acquisition ... towards dynamic lexicons, able to auto-enrich

    [Diagram labels: user needs; lexicon model; lexica; terminology; unstructured text data; annotation tools; annotated data; machine learning for linguistic knowledge acquisition; LKG; cross-lingual information retrieval; multi-lingual information extraction; multi-lingual text mining]

  • Harmonisation: More & more Need of a Global View for Global Interoperability

    Integration/sharing of data & software/tools

    Need of compatibility among various components

    An exemplary cycle: Formalisms; Grammars; Software: Taggers, Chunkers, Parsers; Representation; Annotation; Lexicon; Corpora; Terminology; Software: Acquisition Systems; I/O Interfaces; Languages

  • A short guide to ISLE/EAGLES

    http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm

    Multilingual Computational Lexicon Working Group


  • Target: the Multilingual ISLE Lexical Entry (MILE)

    General methodological principles (from EAGLES):

    high granularity: factor out the (maximal) set of primitive units of lexical info (basic notions) with the highest degree of inter-theoretical agreement

    modular and layered: various degrees of specification possible

    explicit representation of info; allow for underspecification (& hierarchical structure)

    leading principle: edited union of existing lexicons/models (redundancy is not a problem)

    open to different paradigms of multilinguality

    oriented to the creation of large-scale & distributed lexicons

  • Paths to Discover the Basic Notions of MILE

    clues in dictionaries to decide on target equivalent

    guidelines for lexicographers

    clues (to disambiguate/translate) in corpus concordances

    lexical requirements from various types of transfer conditions & actions in MT systems

    lexical requirements from interlingua-based systems

  • Designing MILE

    Steps towards MILE:

    Creating entries (Bertagna, Reeves, Bouillon)

    Identifying the MILE Basic Notions (Bertagna, Monachini, Atkins, Bouillon)

    Defining the MILE Lexical Model (Lenci, Calzolari, etc.)

    Formalising MILE (Ide)

    Development of the ISLE Lexical Tool (Bel)

    ISLE & spoken language & multimodality (Gibbon)

    Metadata for the lexicon (Peters, Wittenburg)

    A case-study: MWEs in MILE (Quochi, Lenci, Calzolari)

    [Slide labels: the MILE Basic Notions; the MILE Lexical Model]

  • The MILE Basic Notions (the EAGLES/ISLE CLWG)

    Basic lexical dimensions & info-types relevant to establish multilingual links

    Typology of lexical multilingual correspondences (relevant conditions & actions)

    Identified by:

    creating sample multilingual lexical entries (Bertagna, Reeves)

    investigating the use of sense indicators in traditional bilingual dictionaries (Atkins, Bouillon)

  • The MILE Lexical Classes: Data Categories for Content Interoperability

    Francesca Bertagna (ILC-CNR, Pisa), Alessandro Lenci (Pisa University), Monica Monachini (ILC-CNR, Pisa), Nicoletta Calzolari (ILC-CNR, Pisa)

  • Overview

    MILE Lexical Model with Lexical Objects and Data Categories

    Mapping of existing lexicons onto MILE

    RDF schema and DC Registry for some pre-instantiated lexical objects, together with a sample entry from the PAROLE-SIMPLE lexicons in MILE

    Future

  • The MILE Lexical Model

    [Diagram labels: GENELEX Model; PAROLE-SIMPLE Lexicons; Multilingual Lexicons (EuroWordNet, etc.); Guidelines: syntactic, semantic lexicons; MILE Lexical Model; where after?]

  • The MILE Main Features

    A general architecture devised as a common representational layer for multilingual Computational Lexicons, both for hand-coded and corpus-driven lexical data

    Key features: Modularity; Granularity; Extensibility and openness; User-adaptability; Resource Sharing; Content Interoperability; Reusability

    Semantic Web technologies & standards applied to lexicon modelling

  • The MILE Lexical Model (MLM)

    The MLM core is the Multilingual ISLE Lexical Entry (MILE):

    a general schema for multilingual lexical resources

    a lexical meta-entry as a common representational layer for multilingual lexicons

    Computational lexicons can be viewed as different instances of the MILE schema

    [Diagram labels: MILE Lexical Model; lexicon#1; lexicon#2; lexicon#3]

  • MILE: the building-block model

    The MILE architecture is designed according to the building-block model:

    Lexical entries are obtained by combining various types of lexical objects (atomic and complex)

    Users design their lexicon by selecting and/or specifying the relevant lexical objects and combining them into lexical entries

    Lexical objects may be shared within the same lexicon (intra-lexicon reusability) or among different lexicons (inter-lexicon reusability)
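The building-block idea can be illustrated with a short sketch in which one lexical object is defined once and then combined into several entries, both within one lexicon and across two lexicons. The class names, the frame name and the lemmas are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntacticFrame:
    """A reusable lexical object: a named frame with its slots."""
    name: str
    slots: tuple


@dataclass
class LexicalEntry:
    """An entry built by combining lexical objects rather than redefining them."""
    lemma: str
    frame: SyntacticFrame


# One shared object (hypothetical name and slots).
transitive = SyntacticFrame("transitive", ("subj_NP", "obj_NP"))

# Intra-lexicon reuse: two entries of the same lexicon point at the same object.
lexicon_1 = [LexicalEntry("eat", transitive), LexicalEntry("read", transitive)]
# Inter-lexicon reuse: a second lexicon combines the same object into its own entry.
lexicon_2 = [LexicalEntry("mangiare", transitive)]

print(lexicon_1[0].frame is lexicon_2[0].frame)
```

Because entries hold a reference to the shared object, a correction to the frame is visible to every entry that uses it.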


  • Modularity in MILE

    multiple levels of modularity

    Horizontal organization, where independent but interlinked modules make it possible to express different dimensions of lexical entries

    [Diagram labels: multi-MILE; multilingual correspondence conditions]

  • The Mono-MILE

    Each monolingual layer within Mono-MILE identifies a basic unit of lexical description:

    morphological layer, MU: basic unit to describe the inflectional and derivational morphological properties of the word

    syntactic layer, SynU: basic unit to describe the syntactic behaviour of the MU

    semantic layer, SemU: basic unit to describe the semantic properties of the MU

  • The Mono-MILE

    Within each layer, a basic linguistic information unit is identified

    [Diagram label: MU]

  • Granularity in MILE

    Concerns the vertical dimension. Within a given lexical layer, varying degrees of depth of lexical description are allowed, from shallow to deep lexical representations

  • Defining the MLM

    The MLM is designed as an E-R model (MILE Entry Schema), which defines the lexical objects and the ways they can be combined into a lexical entry

    The MLM includes 3 types of lexical objects: MILE Lexical Classes (MLC); MILE Lexical Data Categories (MDC); MILE Lexical Operations (MLO)

  • The MILE Lexical Objects

    Within each layer, basic lexical notions are represented by lexical objects: MILE Lexical Classes (MLC); MILE Data Categories (MDC); Lexical operations

    They form an ontology of lexical objects, an abstraction over different lexical models and architectures

  • The MILE E-R diagrams

    The lexical objects are described with E-R diagrams, which define them and the ways they can be combined into a lexical entry

  • MILE Lexical Objects: Syntactic Layer

    [E-R diagram labels: MLC:SynU; hasSyntacticFrame, MLC:SyntacticFrame; hasFrameSet, MLC:FrameSet; composedby, MLC:Composition; correspondTo, MLC:SemU, MLC:CorrespSynUSemU; cardinalities: 1..*, *, *, *]

  • [Diagram labels: SyntacticFrame; Construction; Self; Slot; Slot; SynU; Function; Phrase; expanding one node.]

  • MILE Lexical Objects: Semantic Layer

    [E-R diagram labels: MLC:SemU; belongsToSynset, MLC:Synset; hasSemFrame, MLC:SemanticFrame; hasSemFeature, MLC:SemanticFeature; hasCollocation, MLC:Collocation; semanticRelation, MLC:SemU, MLC:SemanticRelation; cardinalities: *, 0..1, *, *, *]

  • MILE Lexical Objects: Synt-Sem Linking

    [E-R diagram labels: MLC:CorrespSynUSemU; hasSourceSynu, MLC:SynU; hasTargetSemu, MLC:SemU; hasPredicativeCorresp, MLC:PredicativeCorresp; IncludesSlotArgCorresp, MLC:SlotArgCorresp; cardinalities: 1, 1, 1, 0..*]

  • Syntax-Semantics Linking

    [Diagram labels: CorrespSynUSemU; PredCorresp; Slot0:Arg1; Slot1:Arg0]

  • Syntax-Semantics Linking

    SynU#1 (John gave the book to Mary): subj_NP, obj_NP, obl_PP_to

    SynU#2 (John gave Mary the book): subj_NP, obj_NP, obj_NP

    SemU#1, Semantic_Frame: GIVE; Arg1: Agent; Arg2: Theme; Arg3: Goal
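A minimal sketch of the linking shown on this slide: two syntactic units with different slot orders are mapped onto the arguments of one semantic unit. The slot labels and roles come from the slide; the concrete slot-to-argument pairs are an assumption based on the two example sentences, since the diagram's connecting lines did not survive.

```python
# Semantic unit shared by both syntactic units (roles as given on the slide).
SEMU_1 = {"frame": "GIVE", "args": {"Arg1": "Agent", "Arg2": "Theme", "Arg3": "Goal"}}

# Slots of each SynU in surface order.
SYNUS = {
    "SynU#1": ["subj_NP", "obj_NP", "obl_PP_to"],  # John gave the book to Mary
    "SynU#2": ["subj_NP", "obj_NP", "obj_NP"],     # John gave Mary the book
}

# Slot index -> argument correspondences (assumed, see the note above).
CORRESP = {
    "SynU#1": {0: "Arg1", 1: "Arg2", 2: "Arg3"},
    "SynU#2": {0: "Arg1", 1: "Arg3", 2: "Arg2"},
}


def roles(synu_id, fillers):
    """Map the fillers of a SynU's slots to the semantic roles of SemU#1."""
    if len(fillers) != len(SYNUS[synu_id]):
        raise ValueError("one filler per syntactic slot is required")
    mapping = CORRESP[synu_id]
    return {SEMU_1["args"][mapping[i]]: filler for i, filler in enumerate(fillers)}


print(roles("SynU#1", ["John", "the book", "Mary"]))
print(roles("SynU#2", ["John", "Mary", "the book"]))
```

Both calls assign the same fillers to the same roles, which is the point of keeping the correspondence separate from the syntactic and semantic descriptions.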

  • Syntax-Semantic Linking in SIMPLE

    [Diagram labels: SynU_migliorare; Frameset: Transitive structure (Slot0, Slot1), Intransitive structure (Slot0); SemU1_migliorare; SemU2_migliorare; CHANGE_OF_STATE; CAUSE_CHANGE_OF_STATE; PRED_migliorare (ARG0: Agent, ARG1: Patient); CorrespSynUSemU (isomorphic, non-isomorphic); SlotArgCorresp]

  • The Multilingual layer

    [E-R diagram labels: MultiCorresp; MUMUCorresp, hasMUMUCorr; SynUSynUCorresp, hasSynUSynuCorr; SemUSemUCorresp, hasSemUSemUCorr; SynsetMultCorresp, hasSynsetMultCorr; SemanticFrameMultCorresp, hasSemFrameCorr; cardinalities: 1..0 (five times)]

  • MILE approach to multilinguality

    Open to various approaches:

    transfer-based: monolingual descriptions are used to state correspondences (tests and actions) between source and target entries

    interlingua-based: monolingual entries linked to language-independent lexical objects (e.g. semantic frames, primitive predicates, etc.)

  • The Multi-MILE

    Multi-MILE specifies a formal environment to express multilingual correspondences between lexical items

    Source and target lexical entries can be linked by exploiting (possibly combined) aspects of their monolingual descriptions

    Monolingual lexicons act as pivot lexical repositories, on top of which language-to-language multilingual modules can be defined

  • The Multi-MILE

    Multi-MILE may include:

    Multilingual operations to establish transfer links between source and target mono-MILE

    Multilingual lexical objects, which enrich the source and target lexical descriptions but do not belong to the monolingual lexicons

    Language-independent lexical objects: primitive semantic frames, interlingual synsets, etc.; relevant for interlingua approaches to multilinguality

  • Multi-MILE

    IT_SemU_2 → En_SemU_1

    IT_SynU_2 → En_SynU_1

    IT_Slot_0 → EN_Slot_1

    IT_Slot_1 → EN_Slot_0

    AddFeature to source SemU: +HUMAN

    AddSlot to target SynU: MODIF [PP_with]

  • Multi-MILE

    IT Lexicon and EN Lexicon linked under multilingual conditions:

    dito → finger: modif(mano)

    dito → toe: modif(piede)

    run + PP_into ↔ entrare (to enter) + PP_di_corsa

  • MILE Lexical Classes

    Represent the main building blocks of lexical entries

    Formalize the MILE Basic Notions

    Define an ontology of lexical objects: represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.

    Similar to class definitions in OO languages: specify the relevant attributes; define the relations with other classes; hierarchically structured

  • MILE Lexical Classes: an ontology of lexical objects

  • MILE Lexical Data Categories

    MDC are instances of the MILE Lexical Classes

    Can be used off the shelf or as a departure point for the definition of new or modified categories

    Enable modular specification of lexical entities using all or parts of the lexical information in the repository

    Each MDC represents a resource uniquely identified by a URI

    Two types of MDC:

    Core MDC: belong to shared repositories (Lexical Data Category Registry); lexical objects and linguistic notions with wide consensus

    User-defined MDC: user-specific or language-specific lexical objects

  • The MILE Data Categories

    Instances of the MILE Lexical Classes are Data Categories

    MDC can belong to a shared repository or be user-defined

    [Diagram labels: MLC; Core MDC; User-defined MDC]

  • The MILE Data Categories: User-adaptability and extensibility

    [Diagram labels: MLC:SemanticFeature; instance_of; Core; UserDefined; values: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE, MAMMAL]
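As a rough sketch of the Core versus user-defined distinction, the snippet below stores data categories as URI-identified instances of one lexical class. Both namespaces are hypothetical placeholders, and the choice of which values count as Core is an assumption: the slide lists the values but the diagram grouping is lost.

```python
from dataclasses import dataclass

# Hypothetical namespaces; the slides do not give the registry's actual URIs.
CORE_NS = "http://example.org/mile/core#"
USER_NS = "http://example.org/my-lexicon#"


@dataclass(frozen=True)
class DataCategory:
    """An MDC: an instance of a MILE Lexical Class, identified by a URI."""
    uri: str
    instance_of: str


registry = {}


def register(name, lexical_class, namespace):
    """Create a data category and store it under its URI."""
    dc = DataCategory(namespace + name, lexical_class)
    registry[dc.uri] = dc
    return dc


def is_core(dc):
    """A category is core when its URI lives in the shared namespace."""
    return dc.uri.startswith(CORE_NS)


# Assumed split: HUMAN from the shared registry, MAMMAL added by one lexicon.
human = register("HUMAN", "MLC:SemanticFeature", CORE_NS)
mammal = register("MAMMAL", "MLC:SemanticFeature", USER_NS)

print(is_core(human), is_core(mammal))
```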

  • MILE Lexical Data Categories

    [Diagram labels: MLM:Feature; MLM:GrammaticalFunction]

  • MILE Lexical Operations

    They are used to state conditions and perform operations over lexical entries:

    Link syntactic slots and semantic arguments

    Constrain the syntax-semantic link

    Express tests and actions in the transfer conditions in the multi-MILE

    They provide the glue to link various independent intra-lexical and inter-lexical components

  • Multilingual Operations

    Source-to-target language transfer conditions can be expressed by combining multilingual operations

    Three types of multilingual operations:

    Multilingual correspondences: link a source lexical object (MU, SemU, SynU, semantic argument, syntactic slot) and a target lexical object (MU, SemU, SynU, semantic argument, syntactic slot)

    Add-operations: add lexical information relevant for the cross-lingual link, but not present in the source or target mono-MILE

    Constrain-operations: constrain the transfer link to some portions of source and target mono-MILE
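The three operation types can be sketched as plain functions over a link record. The example reuses the dito / finger / toe case and the +HUMAN feature from the Multi-MILE slides; the function names, the record layout and the `IT:`/`EN:` identifiers are assumptions for illustration, not the MILE formalisation.

```python
def correspond(source, target):
    """Multilingual correspondence: link a source object to a target object."""
    return {"source": source, "target": target, "constraints": [], "additions": []}


def constrain(link, attribute, value):
    """Constrain-operation: the link holds only if the context has this property."""
    link["constraints"].append((attribute, value))
    return link


def add(link, side, attribute, value):
    """Add-operation: attach information absent from the monolingual entries."""
    link["additions"].append((side, attribute, value))
    return link


# Italian 'dito' corresponds to 'finger' or 'toe' depending on its modifier.
links = [
    constrain(correspond("IT:dito", "EN:finger"), "modif", "mano"),
    constrain(correspond("IT:dito", "EN:toe"), "modif", "piede"),
]

# An add-operation on another link: +HUMAN is added to the source SemU.
extra = add(correspond("IT_SemU_2", "En_SemU_1"), "source", "feature", "+HUMAN")


def transfer(source, context):
    """Return the targets whose constraints are all satisfied by the context."""
    return [
        link["target"]
        for link in links
        if link["source"] == source
        and all(context.get(attr) == val for attr, val in link["constraints"])
    ]


print(transfer("IT:dito", {"modif": "piede"}))
```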

  • Defining the MLM

    [Diagram labels: MILE Entry Schema; MILE Lexical Classes; RDF/S Descriptions]

  • RDF Instantiation of the MLM

    [Diagram labels: Lexicon#1; Lexicon#2; Lexicon#3; Resources; Lexical Objects; Lexical Classes; Lexical Data Categories; Resources; Metadata]

  • MILE Lexical Model

    Ideal structure for rendering in RDF: a hierarchy of lexical objects built up by combining atomic data categories via clearly defined relations

    Proof of concept:

    Create an RDF schema for the MILE Lexical Model, version 1.2

    Instantiate MILE Lexical Data Categories

  • User-Adaptability and Resource Sharing in MILE

    Compatible with different models of lexical analysis: relational semantic models (e.g. WordNet); syntactic and semantic frames; ontology-based lexicons

    Compatible with different degrees of specification: deep lexical representations (e.g. PAROLE-SIMPLE); terminological lexicons

    Compatible with different paradigms of multilinguality: lexicons for Transfer-Based MT; interlingua-based lexicons

  • The MILE Lexical Model

  • RDF Instantiation of the MLM

    Enable universal access to sophisticated linguistic info

    Provide means for inferencing over lexical info

    Incorporate lexical information into the Semantic Web

    W3C standards: Resource Description Framework (RDF); Web Ontology Language (OWL)

    Built on the XML web infrastructure to enable the creation of a Semantic Web:

    web objects are classified according to their properties

    semantics of relations (links) to other web objects precisely defined

  • The RDF Schema

    Defines classes of objects (MLC) and their relations to other objects

    Like a class definition in Java, etc.

    Classes and properties in the schema correspond to the E-R model

    Can specify sub-classes/sub-properties and inheritance

  • Goals

    Lexical information will form a central component of semantic information

    Need a standardized, machine-processable format so that information can be used and merged with others

    Main task: get the data model right

    See Semantic Web

  • Advantages of RDF

    Modularity:

    Can create instances of bits of lexical information for re-use in a single lexicon or across lexicons

    Instances can be stored in a central repository for use by others

    Can use partial information or all of it

    Building-block approach to lexicon creation

    Web-compatible:

    RDF instantiation will integrate into the Semantic Web

    Inferencing capabilities

  • Example

    Three parts:

    RDF Schema for lexical entries: defines classes and properties, sub-classes, etc.

    Sample repository of RDF-instantiated lexical objects: three levels of granularity

    Sample lexicon entries: use repository information at different levels

  • Sample Repositories

    repository of enumerated classes for lexical objects at the lowest level of granularity: definition of sets of possible values for various lexical objects

    repository of phrases for common phrase types, e.g., NP, VP, etc.

    repository of constructions for common syntactic constructions

  • Enumerated classes

    Subj, Obj, Comp, Arg, Iobj

    tense, gender, control, person, aux

    have, be, subject_control, object_control, masculine, feminine

  • Sample LDCR for a Phrase Object


  • Sample LDCR entry for a Construction object


  • Full entry

    John ate the cake

  • Entry Using Phrase

    John ate the cake

  • Entry Using Construction

    John ate the cake


  • Semantic Representation

    The data model underlying RDF/UML, etc. is universal, abstract enough to capture all types of info

    Semantic representations:

    Registry of basic data categories

    meta-categories: addressee, utterance, etc.

    information categories: eyebrow movement, gestures, pitch

    Supporting ONTOLOGY of information categories

    Interpretative procedures yield another level of meaning representation

    Registry of categories

    UNINTERPRETED REPRESENTATION → INTERPRETATION PROCESS → INTERPRETED REPRESENTATION

  • MILE Lexical Data Category Registry (MDC)

    Instantiation of pre-defined lexical objects

    Extension of the shared class schema with lexicon-specific sub-classes and sub-properties

    Can be used off the shelf or as a departure point for the definition of new or modified categories

    Enables modular specification of lexical entities: eliminate redundancy; identify lexical entries or sub-entries with shared properties

  • MLC in RDF/S: features

    features are properties of lexical objects

    [Diagram labels: mlm:LexObject; mlm:Values; mlm:feature; mlm:SemValues; mlm:SynValues; rdfs:subClassOf; mlm:semFeature; mlm:synFeature; rdfs:subPropertyOf]
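A small sketch of what the sub-property declarations buy: under the RDFS sub-property rule, a statement made with a specific feature property also holds for the general one. The schema triples reflect one plausible reading of the diagram labels (an assumption, since the arrows are lost), and the `ex:` instance triple is invented for the example. The inference is written in plain Python rather than with an RDF library.

```python
# Assumed reading of the diagram: value classes specialise mlm:Values and
# feature properties specialise mlm:feature.
SCHEMA = [
    ("mlm:SynValues", "rdfs:subClassOf", "mlm:Values"),
    ("mlm:SemValues", "rdfs:subClassOf", "mlm:Values"),
    ("mlm:synFeature", "rdfs:subPropertyOf", "mlm:feature"),
    ("mlm:semFeature", "rdfs:subPropertyOf", "mlm:feature"),
]

# Hypothetical instance data: a syntactic unit carrying one syntactic feature.
DATA = [("ex:SynU_1", "mlm:synFeature", "ex:aux_have")]


def infer_superproperties(schema, data):
    """Apply the RDFS sub-property rule until no new triples appear."""
    sub_prop = {(s, o) for s, p, o in schema if p == "rdfs:subPropertyOf"}
    triples = set(data)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(triples):
            for sub, sup in sub_prop:
                if p == sub and (s, sup, o) not in triples:
                    triples.add((s, sup, o))
                    changed = True
    return triples


inferred = infer_superproperties(SCHEMA, DATA)
print(("ex:SynU_1", "mlm:feature", "ex:aux_have") in inferred)
```

A query for all `mlm:feature` statements therefore also finds values that were asserted with the more specific property.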

  • MLC in RDF/S syntactic features
  • MLC in RDF/S semantic features
  • Synsets in RDF/S

    [Diagram labels: mlm:Synset; mlm:word; rdfs:Literal; mlm:synsetRelation; mlm:Synset; mlm:gloss; rdfs:Literal; mlm:feature; mlm:Values]

    cf. also http://www.semanticweb.org/library/wordnet/wordnet-20000620.rdfs

  • Synsets in RDF/S

    Synset: this class formalizes the notion of synset as defined in WordNet (Fellbaum 1998).

    The WordNet hypernym relation

    The WordNet meronym relation

    relation between synsets

    different types of synset relations

  • WordNet 1.7 Synsets
  • Foundations of the Mapping Experiment


  • 1. The MILE building-block model

    The MILE Lexical Classes and the MILE Lexical Data Categories are the main building blocks of the MILE lexical architecture

    Building blocks allow two kinds of reusability: intra-lexicon reusability (within the same lexicon) and inter-lexicon reusability (among different lexicons)

  • How do building blocks work?

  • 2. MILE: a meta-entry

    MILE is:

    a general schema for multilingual lexical resources

    a lexical meta-entry, a common representational layer for multilingual lexicons

    Computational lexicons can be viewed as different instances of the MILE schema

    [Diagram labels: MILE; lexicon#1; lexicon#2; lexicon#3]

  • MILE and Content Interoperability

    This common, shared, compatible representation of lexical objects is particularly suited to:

    manipulate objects available in different lexical resources

    understand their deep semantics

    apply the same operations to lexical objects of the same type

    These are key elements of Content Interoperability

  • The Mapping Experiment: Why?

    It is a concrete experiment aimed at testing the expressive potential and capabilities of the MILE

    The idea is that if the MILE atomic notions, combined in different ways, suit the different visions underlying two lexicons such as FrameNet and NOMLEX:

    the MILE will come out strengthened

    its adoption as an interface between differently conceived lexical architectures can be pushed further

    key issues for content interoperability between resources can be addressed

  • The mapping scenarios

    High-level mapping of the objects of a lexicon into the objects of the abstract model: the native structure is maintained and no format conversion is performed

    Translating instances of lexical entries directly into MILE: MILE acts as a true interchange format

  • FrameNet to MILE


  • FrameNet-MILE: Observations

    The mapping is promising:

    Frame → Predicate (primitive)

    Frame Elements → Argument (enlarge the set of possible values)

    Lexical_Unit → SemU

    Link SemU-Predicate (obligatory) should become underspecified

    But:

    the lack of an inheritance mechanism in the Predicate does not allow representing the hierarchical organization of Frames and Sub-frames, temporal ordering among Frames, or subsumption relations among Frames

    We could add a new object PredicateRelation to allow for the description of relations occurring between predicates and sub-predicates
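The object-level mapping listed above can be sketched as a lookup table plus a conversion function. Only the three correspondences come from the slide; the function, the dictionary layout and the sample frame passed to it are illustrative assumptions and do not reproduce the actual experiment.

```python
# Object-level mapping stated on the slide.
FRAMENET_TO_MILE = {
    "Frame": "Predicate",
    "Frame Element": "Argument",
    "Lexical_Unit": "SemU",
}


def to_mile(frame_name, frame_elements, lexical_units):
    """Re-express one FrameNet-style frame as MILE-style objects."""
    predicate = {
        "type": FRAMENET_TO_MILE["Frame"],
        "name": frame_name,
        "arguments": [
            {"type": FRAMENET_TO_MILE["Frame Element"], "role": fe}
            for fe in frame_elements
        ],
    }
    # Each lexical unit becomes a SemU linked to the predicate.
    semus = [
        {"type": FRAMENET_TO_MILE["Lexical_Unit"], "id": lu, "predicate": frame_name}
        for lu in lexical_units
    ]
    return predicate, semus


# Illustrative input, not taken from the slides.
predicate, semus = to_mile("Giving", ["Donor", "Recipient", "Theme"], ["give.v"])
print(predicate["type"], [a["role"] for a in predicate["arguments"]], semus[0]["type"])
```

The sketch has no place for frame-to-frame relations, which mirrors the limitation noted on the slide and the proposed PredicateRelation object.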

  • [Diagram labels: MLC:SynU; MLC:SemU; MLC:SemanticFrame; TypeOfLink Agentnom; IncludedArg 0; MLC:Predicate; MLC:Argument; MLC:Argument; MLC:CorrespSynUSemU; :nom-type ((subject))]

  • NOMLEX-MILE: Observations

    The mapping is promising: notions represented in NOMLEX have a correspondent in MILE

    But they are expressed with two opposite lexical structures:

    In NOMLEX, lexical information is expressed in a very compact way, with no clear-cut boundaries between the levels of linguistic description

    In MILE, compressed info should be decompressed and spread over different MILE lexical layers and objects: SynU, SemU, SemanticFrame with its Predicate and relevant Arguments, to account for the incorporation of the Agent.

  • Lessons Learned from the mapping

    The results of the experiments are promising

    FrameNet offers the possibility of comparing two similar lexical models whose lexical objects do not perfectly overlap: a test of the adequacy of the linguistic objects

    NOMLEX gives the opportunity to work with two lexicons where linguistic notions correspond but are expressed with an opposite lexicon structure: a test of the adequacy of the architectural model

    The high granularity and modularity of MILE:

    allow compatibility with differently packaged linguistic objects

    allow the addition of new objects and relations without disrupting the general architecture

  • RDF and MILE: Why?

    Some reasons (from Nancy Ide et al. 2003):

    MILE, as a hierarchy of lexical objects built up by combining data categories via clearly defined relations, is an ideal structure for rendering in RDF

    The RDF mechanism, with its capacity to express named relations between objects, offers a web-based means to represent the MILE architecture

    RDF representation of linguistic information is an invaluable resource for language processing applications in the Semantic Web

    RDF description and instantiation is in line with the goal of ISO TC37 SC4

  • RDF Representation of MILE

    MILE was already supplied with:

    an RDF schema for the MILE Syntactic Layer

    an instantiation of pre-defined syntactic objects

    We increased the repository of shared lexical objects with the RDF description and (partial!) instantiations of the objects of the semantic and linking layers

    This has been carried out with the intent to:

    submit it within ISO TC37/SC4

    foster the adoption of MILE, by offering a library of ready-to-use RDF objects

  • An RDF Schema for the synt-sem linking

    Classes

    CorrespSynUSemU: this class links a SynU to a SemU

    PredicativeCorresp: this class contains the associations between the syntactic slots and the semantic arguments

    SlotArgCorresp: this class links a syntactic slot to a semantic argument

  • An RDF Schema for the synt-sem linking

    Properties

    hasSourceSynU

    hasTargetSemU

    hasPredicativeCorresp

    includesSlotArgCorresp
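
    A minimal sketch of how the linking classes and properties could be written down as schema triples, in Python rather than RDF/XML. The class and property names are those on the slides; the domain/range pairs are assumptions inferred from the names and the class descriptions, not taken from the published schema.

```python
# Hypothetical rendering of the synt-sem linking schema as plain triples.
# Names come from the slides; domains and ranges are ASSUMED.
CLASSES = {"SynU", "SemU", "CorrespSynUSemU", "PredicativeCorresp", "SlotArgCorresp"}

# property name -> (assumed domain class, assumed range class)
PROPERTIES = {
    "hasSourceSynU": ("CorrespSynUSemU", "SynU"),
    "hasTargetSemU": ("CorrespSynUSemU", "SemU"),
    "hasPredicativeCorresp": ("CorrespSynUSemU", "PredicativeCorresp"),
    "includesSlotArgCorresp": ("PredicativeCorresp", "SlotArgCorresp"),
}


def schema_triples():
    """Return the schema as (subject, predicate, object) triples."""
    triples = [(c, "rdf:type", "rdfs:Class") for c in sorted(CLASSES)]
    for prop, (domain, range_) in sorted(PROPERTIES.items()):
        triples.append((prop, "rdf:type", "rdf:Property"))
        triples.append((prop, "rdfs:domain", domain))
        triples.append((prop, "rdfs:range", range_))
    return triples


def check_statement(subject_type, prop, object_type):
    """True if a statement respects the assumed domain and range."""
    domain, range_ = PROPERTIES[prop]
    return subject_type == domain and object_type == range_


triples = schema_triples()
for triple in triples:
    print(triple)
```

    With such a table, a statement like (CorrespSynUSemU, hasSourceSynU, SynU) can be checked against the schema before it is added to an entry.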

  • The library of pre-instantiated objects

    Enables modular specification of lexical entities: eliminate redundancy; identify lexical entries or sub-entries with shared properties; create ready-to-use packages that can be combined in different ways.

    Can be used off the shelf or as a departure point for the definition of new or modified categories.

  • MDCR for some objects

  • A Sample Entry in MILE

    The entry is shown in two alternative forms: with the full specification of a lexical object PredicativeCorresp, and with an already instantiated object PredicativeCorresp.

    The advantage of the second form is that the object does not need to be specified in the entry and can be used and reused in other entries.

    Explore the potential of MILE for the representation of lexical data.

  • Sample full entry for amareV

    The full object PredicativeCorresp

  • the abbreviated entry

    Instantiated object PredicativeCorresp
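
    The difference between the full and the abbreviated entry can be sketched as follows. This is only an illustration in Python: the original entries were RDF and are not reproduced in this transcript, so every identifier other than amareV and the class/property names (the SynU/SemU ids, the slot and argument names, the library key) is hypothetical.

```python
# Hypothetical library of pre-instantiated objects (ids and contents invented).
LIBRARY = {
    "PredCorresp_example": {
        "type": "PredicativeCorresp",
        "includesSlotArgCorresp": [
            {"slot": "Slot0", "arg": "Arg0"},
            {"slot": "Slot1", "arg": "Arg1"},
        ],
    }
}

# Full entry: the PredicativeCorresp object is specified inline.
full_entry = {
    "id": "amareV",
    "hasSourceSynU": "SynU_example",
    "hasTargetSemU": "SemU_example",
    "hasPredicativeCorresp": {
        "type": "PredicativeCorresp",
        "includesSlotArgCorresp": [
            {"slot": "Slot0", "arg": "Arg0"},
            {"slot": "Slot1", "arg": "Arg1"},
        ],
    },
}

# Abbreviated entry: the same object is only referenced by its library id.
abbreviated_entry = {
    "id": "amareV",
    "hasSourceSynU": "SynU_example",
    "hasTargetSemU": "SemU_example",
    "hasPredicativeCorresp": "PredCorresp_example",
}


def resolve(entry, library):
    """Return a copy of the entry with a library reference expanded inline."""
    value = entry["hasPredicativeCorresp"]
    if isinstance(value, str):
        expanded = dict(entry)
        expanded["hasPredicativeCorresp"] = library[value]
        return expanded
    return entry


same = resolve(abbreviated_entry, LIBRARY) == full_entry
print(same)
```

    Once the reference is resolved the two forms carry the same information; the abbreviated one simply lets other entries point at the same library object.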

  • The RDF Schema, the DCR for MILE objects and the entries are available at www.ilc.cnr.it/clips/rdf/

  • and INTERA?

    INTERA Multilingual Terminological Lexica will follow and merge the two frameworks: the MILE and ISO TMF (Terminological Markup Framework)

  • Beyond MILE: future work

    MILE Lexical Model oriented towards an Open Distributed Lexical Infrastructure:

    Lexical Information Servers for multiple access to lexical information repositories

    Enhance user-adaptivity, resource sharing, cooperative creation

    Develop integration and interchange tools

  • Broadening MILE: ... other languages

    Ongoing enlargement to Asian languages (Chinese, Japanese, Korean, Thai, Hindi ...)

    Promote common initiatives between Asia & Europe (e.g. within the EU 6th FP)

    The creation of an Open Distributed Lexical Infrastructure, also supported by Asian institutions: AFNLP; University of Tokyo (Dept. of Computer Science); Korean KAIST and KORTERM; Academia Sinica (Taiwan)

    To valorise results & increase the visibility of LR & standardisation initiatives in a world-wide context, while concretely promoting the launch of a new common platform for multilingual LR creation & management

  • Using semantically tagged corpora to acquire semantic info and enhance Lexicons

    evaluate the disambiguating power of the semantic types of the lexicon

    assess the need to integrate lexicons with attested senses and/or phraseology

    identify the inadequacy of sense distinctions in lexicons

    check the actual frequency of known senses in different text types

    have a more precise and complete view of the semantics of a lemma: identify the most general senses; capture the most specific shifts of meaning

    Capture just the core, basic distinctions in a core lexicon.

    Corpus analysis must not lead to excessive granularity of sense distinctions, but should draw a distinction between: sense discrimination, to be kept under control through clustering (manually or automatically); and additional, more granular information (often of a collocational nature), which can/must be acquired/encoded within the broader senses, e.g. to help translation.

  • Dynamic lexicon

    Current computational lexicons (even WordNets) are static objects, still shaped on traditional dictionaries and suffering from the limitations induced by the paper medium.

    Thinking about the complex relationships between lexicon and corpus: towards a flexible model of dynamic lexicon, extending the expressiveness of a core static lexicon and adapting to the requirements of language in use as attested in corpora, with semantic clustering techniques, etc.

    Convert the extreme flexibility & multidimensionality of meaning into large-scale and exploitable (VIRTUAL?) resources: a Lexicon and Corpus together

  • What to annotate?

    Mix of:

    Word-sense annotation (implicit semantic markup)

    Semantic/conceptual markup

    Syntagmatic relations

    Dependency relations

    Semantic roles

  • Need for a common Encoding Policy?

    Agree on common policy issues? Is it feasible? Desirable? To what extent?

    This would imply, among others:

    analysis of needs, also applicative/industrial, before any large development initiative

    base semantic tagging on commonly accepted standards/guidelines?? Up to which level?

    Common semantic tagset: Gold Standard??

    build a core set of semantically tagged corpora, encoded in a harmonised way, for a number of languages??

    make annotated corpora available to the community at large

    involve the community, collect and analyse existing semantically tagged corpora

    devise a common set of parameters for analysis

  • A few Issues for discussion: MILE & lexicon standards

    More standardisation initiatives?

    MILE: a general schema for encoding multilingual lexical info, as a meta-entry, as a common representational layer

    Short- & medium-term requirements wrt standards for multilingual lexicons and content encoding, also industrial requirements

    Relation with the Spoken language community (see ELRA)

    Semantic Web standards & the needs of content processing technologies: importance of reaching consensus on (linguistic & non-linguistic) content, in addition to agreement on formats & encoding issues (words convey content & knowledge)

    Define further steps necessary to converge on common priorities

  • Broadening MILE: ... other communities

    NLP, lexicons, terminologies, ontologies, Semantic Web: a continuum?

    Knowledge management is critical. For content interoperability, there is a need to converge around agreed standards also for the semantic/conceptual level.

    Is the field mature enough to converge around agreed standards also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?

    Is the field of multilingual lexical resources ready to tackle the challenges set by the Semantic Web development?

    Foster better integration with: corpus-driven data; terminology/ontology/semantic web communities; multimodal & multimedial aspects

    Oriented towards open, distributed lexical resources: Lexical Information Servers for multiple access to lexical information repositories

  • A few Issues for discussion: NLP, lexicons, content, ontologies, Semantic Web: a continuum?

    Need for robust systems, able to acquire/tune multilingual lexical/linguistic/conceptual knowledge, to auto-enrich static basic resources

    Relation between lexical standards & acquisition & text annotation protocols

  • Target.. Multilingual Knowledge Management Technical Feasibility:

    Prerequisite: is a commonly agreed text/lexicon annotation protocol, also for the semantic/conceptual level, an achievable goal (to be able to automatically establish links among different languages)?

    Yes, at the lexical level

    More complex, for corpus annotation?

    EAGLES/ISLE

  • To make the Semantic Web a reality ...

    need to tackle the twofold challenge of content availability & multilinguality

    Natural convergence with HLT: multilingual semantic processing; ontologies; semantic-syntactic computational lexicons

  • enables a new role of Multilingual Lexicons: to become an essential component for the Semantic Web

    Language - & lexicons - are the gateway to knowledge

    Semantic Web developers need repositories of words & terms - & knowledge of their relations in language use & ontological classification

    The cost of adding this structured and machine-understandable lexical information can be one of the factors that delay its full deployment

    The effort of making available millions of words for dozens of languages is something that no small group is able to afford

    A radical shift in the lexical paradigm - whereby many participants add linguistic content descriptions in an open distributed lexical framework - is required to make the Web usable

  • Beyond MILE: next steps ... towards an Open Distributed Lexical Infrastructure

    Create a first repository of shared lexical entries extracted from different lexical resources & mapped to MILE (choosing e.g. lexical entries in areas related to the Olympic Games), to test mapping different lexicon models to MILE

    Provide a grid with all the ISLE Basic Notions, short descriptions, attributes and sub-elements, to be filled with the corresponding "notions"

    Create a list (Open Lexicon Interest Group)

    Enhance user-adaptivity, resource sharing, cooperative creation & management

    Lexical Information Servers for multiple access to lexical information repositories

    (Diagram labels: Language, Knowledge)

  • A new paradigm for a new generation of LR?

    New Strategic Vision

    towards a Distributed Open Lexical Infrastructure

    Focus on cooperation, also between different communities, for distributed & cooperative creation, management, etc. of Lexical Resources

    MILE as a common platform

    technical & organisational requirements

  • Beyond MILE: towards open & distributed lexicons

    Semantic Lexicon: URI = http://www.xxx

    Syntactic Constructions: URI = http://www.yyy

    Ontology: URI = http://www.zzz

    Monolingual/Multilingual Lexicon:

    Lex_object: semFeature, URI = http://www.xxx#HUMAN

    Lex_object: syntagmaNT, URI = http://www.zzz#NP

    corpora
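
    A small sketch of the idea on this slide: a lexicon entry that does not store its lexical objects locally but points to them by URI in separate repositories. It is written in Python with an in-memory dictionary standing in for the remote repositories; the placeholder URIs are the ones on the slide, while the repository contents, the entry and the function name are invented for illustration.

```python
from urllib.parse import urldefrag

# Stand-in for remote repositories, keyed by the slide's placeholder URIs.
# The stored descriptions are hypothetical.
REPOSITORIES = {
    "http://www.xxx": {"HUMAN": {"kind": "semFeature"}},
    "http://www.zzz": {"NP": {"kind": "syntagmaNT"}},
}

# A hypothetical lexicon entry that only holds references.
lexicon_entry = {
    "lemma": "example",
    "lex_objects": ["http://www.xxx#HUMAN", "http://www.zzz#NP"],
}


def resolve_lex_object(uri, repositories=REPOSITORIES):
    """Split a URI into repository and fragment, then look the object up."""
    base, fragment = urldefrag(uri)
    return repositories[base][fragment]


resolved = [resolve_lex_object(uri) for uri in lexicon_entry["lex_objects"]]
print(resolved)
```

    In a real distributed setting the lookup would be a network request to a Lexical Information Server rather than a dictionary access.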

  • A few issues for the future...Integration betw. WLR/SLR/MMR (see e.g. LREC)

    Integration betw. LRs & SemWeb

    Integration of Lexicons/Terminologies/Ontologies: towards Knowledge Resources

    Multilingual Resources: an open infrastructure

    Integration of Lexicon/Corpus (see e.g. Framenet)

    Parallel evolution of LRs & LTechnology

  • from Computational Lexicons to Knowledge Resources

    Unified framework for lexicons, ontologies, terminologies, etc.

    Towards an open, distributed infrastructure for lexical resources: Lexical Information Servers; flexible and extensible; integrated with multimodal and multimedial data; integrated with Web technology

    Related initiatives: INTERA, ICWLRE

  • with a world-wide participation looking for an appropriate call

    .. pushing to launch an Open & Distributed Lexical Infrastructure

    for content description and content interoperability,

    to make lexical resources usable within the emerging Semantic Web scenario

    for Language Resources & Semantic Web.

  • How to go to a framework allowing incremental creation/merging/

    How to:

    "organise" creation/acquisition of multilingual LRs: evaluate different models

    cope with/affect maintenance

    organise technology transfer among languages

    support BLARK (a commonly agreed list of minimal requirements for national LRs)

    launch an international initiative linking Semantic Web & LRs

    bootstrap this by "opening" a few LRs

    role of standards

  • Lexical WEB & Content Interoperability

    As a critical step for semantic mark-up in the SemWeb

    (Diagram labels: ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y, MILE; "with intelligent agents??")

  • A new paradigm for a new generation of LRs?

    Cross-lingual links

    The Italian PAROLE and SIMPLE lexicons constitute the basis for the CLIPS lexicon, which is being enlarged with a set of lexical units selected from the PAROLE corpus. At the end of the project, the CLIPS lexicon will consist of 55,000 lemmas encoded at the phonological, morphological and syntactic levels, and of 55,000 semantic units.

    Now, I would like to focus on two aspects of CLIPS: the link between syntactic and semantic information, and the way the information encoded in the Extended Qualia Structure can be exploited.

    Then, the predicate linked to the semantic units is related to the syntactic frame; more precisely, each semantic argument, with its bundle of information, is related to the corresponding frame position of the relevant syntactic structure.

    Hence, through the cause-change-typed semantic unit, the predicate is related to the transitive syntactic structure by means of a bivalent isomorphic relation holding between arguments and syntactic positions, while through the change-typed one it is linked to the intransitive structure through a non-isomorphic relation indicating that (ARG0:agent) does not map onto any syntactic position while (ARG1:patient) maps onto P0.
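
    The two kinds of argument-to-position relation described in these notes can be sketched as simple mappings. This is an illustrative Python sketch, not CLIPS code: the intransitive mapping follows the notes, whereas the P0/P1 positions for the transitive case, and the reading of "isomorphic" as one distinct position per argument, are assumptions.

```python
# Semantic argument -> syntactic frame position (None = not realised).
# Transitive positions P0/P1 are ASSUMED; the notes only say the relation
# is bivalent and isomorphic.
CAUSE_CHANGE_TRANSITIVE = {"ARG0:agent": "P0", "ARG1:patient": "P1"}

# Stated in the notes: the agent maps to no position, the patient maps to P0.
CHANGE_INTRANSITIVE = {"ARG0:agent": None, "ARG1:patient": "P0"}


def is_isomorphic(mapping):
    """True if every argument maps to its own distinct syntactic position."""
    positions = list(mapping.values())
    return None not in positions and len(set(positions)) == len(positions)


def unmapped_arguments(mapping):
    """Arguments with no syntactic realisation in this frame."""
    return [arg for arg, position in mapping.items() if position is None]


print(is_isomorphic(CAUSE_CHANGE_TRANSITIVE))
print(is_isomorphic(CHANGE_INTRANSITIVE))
print(unmapped_arguments(CHANGE_INTRANSITIVE))
```

    Under these assumptions the transitive mapping counts as isomorphic and the intransitive one does not, with ARG0:agent reported as the argument left without a syntactic position.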

