
Semantic Mapping in CLARIN Component Metadata.

M. Durco, M. Windhouwer. Semantic Mapping in CLARIN Component Metadata. In E. Garoufallou and J. Greenberg (eds.), Metadata and Semantics Research (MTSR 2013; mtsr2013.teithe.gr), CCIS Vol. 390, Springer, Thessaloniki, Greece, November 20-22, 2013.
Page 1: Semantic Mapping in CLARIN Component Metadata.

Semantic Mapping in CLARIN Component Metadata

Matej Durco
Institute for Corpus Linguistics and Text Technology
[email protected]

Menzo Windhouwer
The Language Archive - DANS
[email protected]

MTSR 2013
Thessaloniki, Greece

Page 2: Semantic Mapping in CLARIN Component Metadata.

Outline

CLARIN: a European infrastructure for language resources
Component Metadata Infrastructure (CMDI)
Semantic Mapping in CMDI
Semantic mapping in the CLARIN joint metadata domain
Conclusions and future work

Page 3: Semantic Mapping in CLARIN Component Metadata.

CLARIN

CLARIN = Common Language Resources and Technology Infrastructure, a European ESFRI infrastructure project

Aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form) and to advanced tools to discover, explore, exploit, annotate, analyze or combine them, independent of where they are located.

Building a networked federation of European data repositories, service centers and centers of expertise.

One pillar of this infrastructure is a joint metadata domain.

http://www.clarin.eu/

Page 4: Semantic Mapping in CLARIN Component Metadata.

Component Metadata Infrastructure

Rationale for CMDI

Limitations of existing metadata schemas (OLAC/DCMI, IMDI, TEI header):
  Inflexible: too many (IMDI) or too few (OLAC) metadata elements
  Limited interoperability (both semantic and syntactic)
  Problematic (unfamiliar) terminology for some sub-communities
  Limited support for LT tool & service descriptions

CMDI addresses this by:
  Explicitly defined schema & semantics
  User/project/community defined components

http://www.clarin.eu/cmdi/

Page 5: Semantic Mapping in CLARIN Component Metadata.

CMDI - example

Let's describe a speech recording

[Diagram: component TechnicalMetadata with elements Sample frequency, Format, Size]

Page 6: Semantic Mapping in CLARIN Component Metadata.

CMDI - example

Let's describe a speech recording

[Diagram: components Language (elements Name, Id) and TechnicalMetadata]

Page 7: Semantic Mapping in CLARIN Component Metadata.

CMDI - example

Let's describe a speech recording

[Diagram: components Language, TechnicalMetadata and Actor (Sex, Language, Age, Name)]

Page 8: Semantic Mapping in CLARIN Component Metadata.

CMDI - example

Let's describe a speech recording

[Diagram: components Language, TechnicalMetadata, Actor and Location (Continent, Country, Address)]

Page 9: Semantic Mapping in CLARIN Component Metadata.

CMDI - example

Let's describe a speech recording

[Diagram: components Language, TechnicalMetadata, Actor, Location and Project… (Name, Contact)]

Page 10: Semantic Mapping in CLARIN Component Metadata.

CMDI - example

Let's describe a speech recording

[Diagram: components Language, TechnicalMetadata, Actor, Location and Project combined into a Metadata Profile; Metadata schema (W3C XML Schema); Metadata description (XML document)]
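As a rough illustration of how a component-based profile relates to a metadata description in XML, the sketch below nests elements under their components. It is a toy under stated assumptions, not the official CMDI record layout: the PROFILE dictionary, the build_record helper, the root name SpeechRecording and all sample values are hypothetical, and only the component and element names come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: component name -> element names (names taken from the slides).
# This layout is an illustration only and does NOT follow the real CMDI schema.
PROFILE = {
    "Language": ["Name", "Id"],
    "TechnicalMetadata": ["SampleFrequency", "Format", "Size"],
}


def build_record(profile_name, values):
    """Build an XML record with one node per component and one child per filled element."""
    root = ET.Element(profile_name)
    for component, elements in PROFILE.items():
        comp_node = ET.SubElement(root, component)
        for element in elements:
            value = values.get((component, element))
            if value is not None:  # elements without a value are simply omitted
                ET.SubElement(comp_node, element).text = value
    return root


# Sample values are invented for the example.
record = build_record("SpeechRecording", {
    ("Language", "Name"): "Dutch",
    ("Language", "Id"): "nld",
    ("TechnicalMetadata", "Format"): "audio/x-wav",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Reusing a component such as Language in another profile would then amount to reusing its entry in the dictionary, which is the kind of reuse the component model is meant to encourage.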

Page 11: Semantic Mapping in CLARIN Component Metadata.

CMDI - workflow

[Diagram labels: metadata modeler; component registry & editor; metadata creator; metadata editor; metadata curator (twice); metadata user; metadata catalogue; Local metadata repository; OAI-PMH Data provider; OAI-PMH Service provider; Joint metadata repository; search & semantic mapping; Relation Registry; ISOcat; DATA]

Page 12: Semantic Mapping in CLARIN Component Metadata.

Semantic Mapping in CMDI

A CMD component, element or value should be linked to a ‘concept’, i.e., a URI that points to a semantic description
  ‘concepts’ can be shared, indicating shared semantics

Current components mainly use:
  Dublin Core elements or terms
  ISOcat Data Categories

ISOcat (www.isocat.org) is an ISO 12620:2009 compliant Data Category Registry
  allows elaborate specifications, e.g., a definition, (alternative) names, examples, explanations, value domains (all in various languages)
  can be freely used by anyone, including the creation of new data categories
  the Athens Core group has created many metadata data categories inspired by OLAC, TEI Header and IMDI
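The idea that shared ‘concept’ links indicate shared semantics can be sketched as a simple grouping of elements by concept URI. The element paths below are hypothetical, and leaving Dictionary/Author unlinked is purely for illustration; the two ISOcat URIs are the ones the slides give for language name and language ID.

```python
from collections import defaultdict

# Hypothetical element paths mapped to concept URIs.
# None marks an element without a concept link (left unlinked here only as an example).
CONCEPT_LINKS = {
    "Language/Name": "http://www.isocat.org/datcat/DC-2484",
    "Language/Id": "http://www.isocat.org/datcat/DC-2482",
    "Dictionary/Language": "http://www.isocat.org/datcat/DC-2484",
    "Dictionary/Author": None,
}


def elements_by_concept(links):
    """Group element paths by the concept URI they link to, skipping unlinked elements."""
    grouped = defaultdict(list)
    for path, concept in links.items():
        if concept is not None:
            grouped[concept].append(path)
    return dict(grouped)


# Concepts used by more than one element: these elements share their semantics.
shared = {
    concept: paths
    for concept, paths in elements_by_concept(CONCEPT_LINKS).items()
    if len(paths) > 1
}
print(shared)
```

Elements that differ in name and position (Language/Name versus Dictionary/Language) end up in the same group because they point to the same concept, which is what a semantic search can exploit.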

Page 13: Semantic Mapping in CLARIN Component Metadata.

Semantic Mapping in CMDI

[Diagram: elements of two components linked to entries in a Semantic Registry]

Components:
  Language: Name, Id
  Dictionary: Language, Author

Semantic Registry:
  Language Name : A human understandable name of the language that ...
  Language ID : Identifier of the language as defined by ISO 639 that …

Page 14: Semantic Mapping in CLARIN Component Metadata.

Semantic Mapping in CMDI

Due to the use of multiple ‘concept’ registries and the open nature of some of them, (almost) same-as relationships have to be specified

RELcat (under development) is a Relation Registry which allows these relationships to be stored in sets, which may be user or community specific

[Diagram: example relations]
  time coverage (isocat:DC-1502)   relcat:subClassOf   dc:coverage
  language name (isocat:DC-2484)   relcat:sameAs       dc:language
  language ID (isocat:DC-2482)     relcat:sameAs       dc:language
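A minimal sketch of how such a relation set might be used to broaden a search: starting from one concept, follow sameAs links in both directions and subClassOf links from the broader to the narrower concept. The triples mirror the three relations on this slide; the expand function and its traversal policy are assumptions made for illustration, not RELcat's actual interface.

```python
# Relation set as (subject, predicate, object) triples, taken from the slide.
RELATIONS = [
    ("isocat:DC-1502", "relcat:subClassOf", "dc:coverage"),
    ("isocat:DC-2484", "relcat:sameAs", "dc:language"),
    ("isocat:DC-2482", "relcat:sameAs", "dc:language"),
]


def expand(concept, relations):
    """Return the concept plus all concepts reachable from it.

    Assumed policy: sameAs is followed in both directions; subClassOf is
    followed only from the broader concept down to the narrower one.
    """
    found = {concept}
    todo = [concept]
    while todo:
        current = todo.pop()
        for subject, predicate, obj in relations:
            neighbours = []
            if predicate == "relcat:sameAs":
                if subject == current:
                    neighbours.append(obj)
                if obj == current:
                    neighbours.append(subject)
            elif predicate == "relcat:subClassOf" and obj == current:
                neighbours.append(subject)
            for neighbour in neighbours:
                if neighbour not in found:
                    found.add(neighbour)
                    todo.append(neighbour)
    return found


print(sorted(expand("dc:language", RELATIONS)))
print(sorted(expand("dc:coverage", RELATIONS)))
```

Because relations live in separate sets, a user or community could swap in a stricter or looser set without touching the profiles themselves.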

Page 15: Semantic Mapping in CLARIN Component Metadata.

CMDI in CLARIN

                                2011-01  2012-06  2013-01  2013-06
Profiles                             40       53       87      124
Components                          164      298      542      828
Elements                            511      893     1505     2399
Distinct Data Categories (DCs)      203      266      436      499
Metadata DCs                        277      712      774      791
% Elements w/o DCs                24.7%    17.6%    21.5%    26.5%

CMD profiles for existing metadata schemas like OLAC/DCMI, TEI Header and META-SHARE have been created

Profiles differ a lot in structure:
  Small and flat profiles with 5 – 10 elements
  Large and complex profiles of up to 10 component levels with hundreds of elements

Around half a million CMD records are harvested from around 70 providers

http://catalog.clarin.eu/vlo/

Page 16: Semantic Mapping in CLARIN Component Metadata.

CMD Semantic Mapping in CLARIN

791 metadata Data Categories
  222 from Athens Core (recommended)

2 showcases (of very common concepts):
  Language
  Name

SMC (Semantic Mapping Component) Browser
  http://clarin.aac.ac.at/smc-browser
  Allows the metadata modeller to explore the semantic overlap between profiles, components and elements in an interactive graph

Page 17: Semantic Mapping in CLARIN Component Metadata.

CMD Semantic Mapping in CLARIN

Language
  LanguageID (http://www.isocat.org/datcat/DC-2482)
  languageName (http://www.isocat.org/datcat/DC-2484)
  Linked in the Relation Registry with the Dublin Core term language
    http://lux13.mpi.nl/relcat/set/cmdi (graph)

Together these ‘concepts’ are linked with 80 profiles

Other related language Data Categories could be considered:
  sourceLanguage, languageMother

The Relation Registry allows them to be included to maximize the recall for a specific language

Page 18: Semantic Mapping in CLARIN Component Metadata.

CMD Semantic Mapping in CLARIN

Page 19: Semantic Mapping in CLARIN Component Metadata.

CMD Semantic Mapping in CLARIN

Name
  Is a more ambiguous term, used by 72 CMD elements
  12 different Data Categories are used by these elements:
    resourceName (http://www.isocat.org/datcat/DC-2544)
    resourceTitle (http://www.isocat.org/datcat/DC-2545)
    author (http://www.isocat.org/datcat/DC-4115)
    contact full name (http://www.isocat.org/datcat/DC-2454)
    dcterms:Contributor
    ...

A naive search on ‘name’ would yield semantically very heterogeneous results; instead use:
  The ‘concept’ links
  Context, i.e., the enclosing components of an element
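The difference between a naive search on ‘name’ and a search that uses concept links or context can be sketched as follows. The element paths are hypothetical, and leaving Actor/Name without a concept is only for illustration; the two Data Category URIs are those listed on this slide for resourceName and contact full name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CmdElement:
    path: str               # enclosing components followed by the element name
    concept: Optional[str]  # concept URI, or None when the element is unlinked


# Hypothetical elements that are all called "Name" but mean different things.
ELEMENTS = [
    CmdElement("Resource/Name", "http://www.isocat.org/datcat/DC-2544"),
    CmdElement("Project/Contact/Name", "http://www.isocat.org/datcat/DC-2454"),
    CmdElement("Actor/Name", None),
]


def naive_search(term):
    """Match on the element name only, ignoring semantics and context."""
    return [e for e in ELEMENTS if e.path.split("/")[-1].lower() == term.lower()]


def search_by_concept(concept):
    """Match elements linked to a specific concept URI."""
    return [e for e in ELEMENTS if e.concept == concept]


def search_by_context(term, component):
    """Match on the element name, restricted to a given enclosing component."""
    return [
        e for e in naive_search(term)
        if component in e.path.split("/")[:-1]
    ]


print([e.path for e in naive_search("name")])
print([e.path for e in search_by_concept("http://www.isocat.org/datcat/DC-2544")])
print([e.path for e in search_by_context("name", "Actor")])
```

The context filter still works for elements that lack a concept link, which is one reason to combine both signals rather than rely on concepts alone.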

Page 20: Semantic Mapping in CLARIN Component Metadata.

Conclusion & future work

The CMD Infrastructure is very flexible with regard to metadata structures, but also provides an integrated semantic layer to achieve semantic interoperability

All the proper registries are in place and prove to be useful, e.g., as used by the central CLARIN catalogue
  Users can search and navigate the metadata based on semantics and are not directly confronted with the structural diversity
  Future work: sometimes more context is needed for disambiguation

However, for metadata modellers the perceived proliferation of reusable profiles and components can be a burden
  The SMC browser already gives insight into (semantic) overlap and differences
  Future work: statistics based on the instance data will also help to select among profiles and components

