Post on 21-Jun-2015
Semantic Mapping in CLARIN Component Metadata
Matej Durco, Institute for Corpus Linguistics and Text Technology
matej.durco@assoc.oeaw.ac.at
Menzo Windhouwer, The Language Archive - DANS
menzo.windhouwer@dans.knaw.nl
MTSR 2013
Thessaloniki, Greece
Outline
CLARIN: a European infrastructure for language resources
Component Metadata Infrastructure (CMDI)
Semantic Mapping in CMDI
Semantic mapping in the CLARIN joint metadata domain
Conclusions and future work
CLARIN
CLARIN = Common Language Resources and Technology Infrastructure = a European ESFRI infrastructure project
Aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and advanced tools to discover, explore, exploit, annotate, analyze or combine them, independent of where they are located.
Building a networked federation of European data repositories, service centers and centers of expertise.
One pillar of this infrastructure is a joint metadata domain.
http://www.clarin.eu/
Component Metadata Infrastructure
Rationale for CMDI
Limitations of existing metadata schemas (OLAC/DCMI, IMDI, TEI header):
Inflexible: too many (IMDI) or too few (OLAC) metadata elements
Limited interoperability (both semantic and syntactic)
Problematic (unfamiliar) terminology for some sub-communities
Limited support for descriptions of LT tools and services
CMDI addresses this by:
Explicitly defined schema & semantics
User/project/community defined components
http://www.clarin.eu/cmdi/
CMDI - example
Let's describe a speech recording
Metadata Profile, assembled step by step from components:
TechnicalMetadata: Sample frequency, Format, Size, …
Language: Name, Id, …
Actor: Sex, Language, Age, Name, …
Location: Continent, Country, Address, …
Project: Name, Contact, …
The profile yields a metadata schema (W3C XML Schema)
The schema validates the metadata description (XML document)
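The component idea from this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the helper names `build_profile` and `describe`, the profile name `SpeechRecording` and the sample values are all hypothetical, and the XML produced is a simplified stand-in, not the actual CMDI record format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified component definitions: component name -> element names.
# The names follow the slide; real CMDI components carry much more information.
COMPONENTS = {
    "TechnicalMetadata": ["SampleFrequency", "Format", "Size"],
    "Language": ["Name", "Id"],
    "Actor": ["Sex", "Language", "Age", "Name"],
    "Location": ["Continent", "Country", "Address"],
    "Project": ["Name", "Contact"],
}


def build_profile(name, component_names):
    """Assemble a profile as an ordered list of reusable components."""
    return {
        "name": name,
        "components": [(c, COMPONENTS[c]) for c in component_names],
    }


def describe(profile, values):
    """Create an XML description; `values` maps (component, element) -> text."""
    root = ET.Element(profile["name"])
    for component, elements in profile["components"]:
        node = ET.SubElement(root, component)
        for element in elements:
            if (component, element) in values:
                ET.SubElement(node, element).text = values[(component, element)]
    return root


# Illustrative profile and values (assumptions, not taken from a real record).
profile = build_profile(
    "SpeechRecording",
    ["Language", "TechnicalMetadata", "Actor", "Location", "Project"],
)
record = describe(
    profile,
    {("Language", "Name"): "Dutch", ("TechnicalMetadata", "Format"): "audio/wav"},
)
print(ET.tostring(record, encoding="unicode"))
```

The point of the sketch is that the same `Language` or `Actor` component can be reused in any number of profiles, while each profile still determines one concrete document structure.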
CMDI - workflow
[Workflow diagram. Roles: metadata modeler, metadata creator, metadata curator, metadata user. Components: component registry & editor, metadata editor, local metadata repository, OAI-PMH data provider, OAI-PMH service provider, joint metadata repository, metadata catalogue, Relation Registry, search & semantic mapping, data.]
ISOcat
A CMD component, element or value should be linked to a ‘concept’, i.e., a URI that points to a semantic description
‘Concepts’ can be shared, indicating shared semantics
Current components mainly use:
Dublin Core elements or terms
ISOcat Data Categories
ISOcat (www.isocat.org) is an ISO 12620:2009 compliant Data Category Registry
It allows elaborate specifications, e.g., a definition, (alternative) names, examples, explanations, value domains (all in various languages)
It can be freely used by anyone, including for the creation of new data categories
The Athens Core group has created many metadata data categories inspired by OLAC, TEI Header and IMDI
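A minimal sketch of what linking elements to concepts buys: elements from different profiles that point to the same concept URI can be found together. The profile and component names, the table layout and the helper names are hypothetical, and the assignment of the Dictionary's Language element to DC-2484 is an assumption made for illustration; only the ISOcat URIs themselves appear in these slides.

```python
from collections import defaultdict

ISOCAT = "http://www.isocat.org/datcat/"

# (profile, component, element) -> concept URI.
# Profile and component names are hypothetical; which concept each element
# links to is an assumption for this sketch.
CONCEPT_LINKS = {
    ("SpeechRecording", "Language", "Name"): ISOCAT + "DC-2484",
    ("SpeechRecording", "Language", "Id"): ISOCAT + "DC-2482",
    ("Dictionary", "Dictionary", "Language"): ISOCAT + "DC-2484",
    ("Dictionary", "Dictionary", "Author"): ISOCAT + "DC-4115",
}


def elements_by_concept(links):
    """Invert the link table: concept URI -> list of element paths."""
    index = defaultdict(list)
    for path, uri in links.items():
        index[uri].append(path)
    return dict(index)


def shared_concepts(links):
    """Return the concepts that are used by more than one profile."""
    return {
        uri: paths
        for uri, paths in elements_by_concept(links).items()
        if len({profile for profile, _, _ in paths}) > 1
    }


shared = shared_concepts(CONCEPT_LINKS)
for uri, paths in shared.items():
    print(uri, "is shared by", paths)
```

Although the two elements are named differently (`Name` and `Language`) and sit in different structures, the shared URI marks them as carrying the same semantics.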
Semantic Mapping in CMDI
Semantic Registry
Language Name : A human understandable name of the language that ...
Language ID : Identifier of the language as defined by ISO 639 that …
Component Language: Name, Id, …
Component Dictionary: Language, Author, …
Semantic Mapping in CMDI
Due to the use of multiple ‘concept’ registries and the open nature of some of them, (almost) same-as relationships have to be specified
RELcat (under development) is a Relation Registry that allows these relationships to be stored in sets, which may be specific to a user or community
time coverage (isocat:DC-1502) relcat:subClassOf dc:coverage
language name (isocat:DC-2484) relcat:sameAs dc:language
language ID (isocat:DC-2482) relcat:sameAs dc:language
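How such relations could be used at search time can be sketched as a small query expansion. The triples below are one reading of the relations shown on this slide; the function `expand`, its `include_subclasses` switch and the in-memory list are hypothetical and say nothing about how RELcat itself stores or serves relation sets.

```python
# Relations from the slide, read as (subject, predicate, object) triples.
RELATIONS = [
    ("isocat:DC-1502", "relcat:subClassOf", "dc:coverage"),
    ("isocat:DC-2484", "relcat:sameAs", "dc:language"),
    ("isocat:DC-2482", "relcat:sameAs", "dc:language"),
]


def expand(concept, relations, include_subclasses=False):
    """Return the concept plus all concepts that should match the same query.

    sameAs is followed in both directions. subClassOf is only followed from
    the broader concept down to the narrower one, and only on request.
    """
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for subj, pred, obj in relations:
            neighbours = []
            if pred == "relcat:sameAs":
                if subj == current:
                    neighbours.append(obj)
                if obj == current:
                    neighbours.append(subj)
            elif pred == "relcat:subClassOf" and include_subclasses:
                if obj == current:
                    neighbours.append(subj)
            for neighbour in neighbours:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return seen


print(sorted(expand("dc:language", RELATIONS)))
print(sorted(expand("dc:coverage", RELATIONS, include_subclasses=True)))
```

Keeping relations in separate sets, as RELcat is described as doing, would let a user or community decide which equivalences a search should honour; in this sketch that would correspond to passing a different `relations` list.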
CMDI in CLARIN
                                  2011-01   2012-06   2013-01   2013-06
Profiles                               40        53        87       124
Components                            164       298       542       828
Elements                              511       893      1505      2399
Distinct Data Categories (DCs)        203       266       436       499
Metadata DCs                          277       712       774       791
% Elements w/o DCs                  24.7%     17.6%     21.5%     26.5%
CMD profiles for existing metadata schemas like OLAC/DCMI, TEI Header and META-SHARE have been created
Profiles differ a lot in structure:
Small and flat profiles with 5 to 10 elements
Large and complex profiles of up to 10 component levels with hundreds of elements
Around half a million CMD records are harvested from around 70 providers
http://catalog.clarin.eu/vlo/
CMD Semantic Mapping in CLARIN
791 metadata Data Categories
222 from Athens Core (recommended)
2 showcases (of very common concepts): Language and Name
SMC (Semantic Mapping Component) Browser http://clarin.aac.ac.at/smc-browser
Allows the metadata modeller to explore the semantic overlap between profiles, components and elements in an interactive graph
Language
LanguageID (http://www.isocat.org/datcat/DC-2482)
languageName (http://www.isocat.org/datcat/DC-2484)
Both are linked in the RelationRegistry with the Dublin Core term language: http://lux13.mpi.nl/relcat/set/cmdi (graph)
Together these ‘concepts’ are linked with 80 profiles
Other related language Data Categories could also be considered, e.g., sourceLanguage, languageMother
The Relation Registry allows these to be included to maximize recall when searching for a specific language
Name
Is a more ambiguous term, used by 72 CMD elements
12 different Data Categories are used by these elements:
resourceName (http://www.isocat.org/datcat/DC-2544)
resourceTitle (http://www.isocat.org/datcat/DC-2545)
author (http://www.isocat.org/datcat/DC-4115)
contact full name (http://www.isocat.org/datcat/DC-2454)
dcterms:Contributor
...
A naive search on ‘name’ would yield semantically very heterogeneous results; instead use:
The ‘concept’ links
Context, i.e., the enclosing components of an element
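The difference between a naive search and one that uses concept links and context can be sketched as follows. The element index, its context paths and the function names are hypothetical; which element carries which Data Category is an assumption for illustration, with only the ISOcat URIs taken from the slide.

```python
ISOCAT = "http://www.isocat.org/datcat/"

# Hypothetical element index: (enclosing components, element name, concept URI).
# None marks an element without a concept link.
ELEMENTS = [
    (("Resource",), "Name", ISOCAT + "DC-2544"),
    (("Resource",), "Title", ISOCAT + "DC-2545"),
    (("Project", "Contact"), "Name", ISOCAT + "DC-2454"),
    (("Actor",), "Name", None),
    (("Author",), "Name", ISOCAT + "DC-4115"),
]


def naive_search(elements, name):
    """Match on the element name only, ignoring semantics."""
    return [e for e in elements if e[1].lower() == name.lower()]


def semantic_search(elements, concept=None, context=None):
    """Match on the concept link and/or on an enclosing component."""
    hits = []
    for path, name, uri in elements:
        if concept is not None and uri != concept:
            continue
        if context is not None and context not in path:
            continue
        hits.append((path, name, uri))
    return hits


# The naive search mixes resource, contact, actor and author names.
print(len(naive_search(ELEMENTS, "name")))
# The concept link selects resource names only.
print(semantic_search(ELEMENTS, concept=ISOCAT + "DC-2544"))
# Context still works for an element that has no concept link.
print(semantic_search(ELEMENTS, context="Actor"))
```

Context acts as a fallback here: an element without a Data Category cannot be found through its concept, but its enclosing component still narrows down what kind of name it holds.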
Conclusion & future work
The CMD Infrastructure is very flexible with regard to metadata structures, but also provides an integrated semantic layer to achieve semantic interoperability
All the proper registries are in place and prove to be useful, e.g., in the central CLARIN catalogue
Users can search and navigate the metadata based on semantics and are not directly confronted with the structural diversity
Future work: sometimes more context is needed for disambiguation
However, for metadata modellers the perceived proliferation of reusable profiles and components can be a burden
The SMC browser already gives insight into (semantic) overlap and differences
Future work: statistics based on the instance data will also help to select among profiles and components