Bringing Europeana and CLARIN together:
Dissemination and exploitation of cultural heritage data
in a research infrastructure
Twan Goosen1 (CLARIN ERIC), Nuno Freire2, Clemens Neudecker3, Maria Eskevich1
1 CLARIN ERIC; 2 Europeana / INESC-ID; 3 Berlin State Library / Europeana Newspapers
Digital Infrastructures for Research (DI4R) 2017
Brussels, BE
30 November 2017
Europeana in six bullets
• Europeana is the European digital platform for cultural heritage that
• seeks to enable users to search and access knowledge in all the languages of Europe, either directly via its web portals, or indirectly via third-party applications leveraging its data service
• Europeana enables people to explore the digital resources of Europe's galleries, museums, libraries, archives and audiovisual collections
• working with partners and allies to develop frameworks, standards, strategy and policy relevant to digital cultural heritage, and to raise funds
• providing digital expertise and platforms for bringing cultural heritage to wider audiences
• championing the use of digitised cultural heritage in education, research and the creative industries through partnerships and international engagement campaigns
CLARIN in seven bullets
• CLARIN is the Common Language Resources and Technology Infrastructure
• ESFRI ERIC status since 2012, Landmark since 2016
• that provides easy and sustainable access for scholars in the humanities and social sciences and beyond
• to digital language data (in written, spoken, video or multimodal form)
• and advanced tools to discover, explore, exploit, annotate, analyse or combine them, wherever they are located
• through a single sign-on online environment
• and that serves as an ecosystem for knowledge sharing
CLARIN ERIC in members and centres
A consortium of:
• 19 members: AT, BG, CZ, DE, DK, DLU, EE, FI, GR, HU, IT, LT, LV, NL, NO, PL, PT, SE, SI
• 2 observers: FR, UK
• >40 centres
CLARIN & Europeana partnership in context of DSI
Digital Service Infrastructure (DSI): Creation of a complete, cohesive and integrated Digital Service Infrastructure
• DSI (01.2015 – 06.2016):
  – European Research Distribution Plan
  – Assessment of relevant datasets available from The European Library (TEL)
• DSI-2 (07.2016 – 08.2017):
  – Improvement of data quality and implementation of quality frameworks to improve metadata quality
  – Integration of Europeana data into CLARIN infrastructure
• DSI-3 (09.2017 – 08.2018):
  – Fostering content supply by optimising Europeana data and aggregation infrastructure
  – Improving (meta-)data and content quality
  – Fostering reuse of digital cultural heritage resources by improving content distribution mechanisms
  – Maintaining an international interoperable licensing framework
Steps towards CLARIN & Europeana interoperability
1) Incorporate Europeana metadata in the VLO
2) Open up full-text resources, such as those from Europeana Newspapers, through CLARIN's federated content search (FCS) mechanism
3) Exploit CLARIN's communication channels to increase awareness of Europeana within the community
4) Measure the impact of the dissemination of Europeana data
Metadata: access to cultural heritage
• Aggregation and exploitation of (meta)data about digitised objects from very different contexts.
• Europeana Data Model (EDM) as its model for interoperability of metadata, in line with the vision of linked open vocabularies
• Aggregation of metadata from resource providers (CLARIN centres and selected "external" parties)
• Virtual Language Observatory (VLO) provides a uniform experience and consistent workflow.
• Language Resource Switchboard (LRS) allows researchers to invoke tools with the selected resources directly from its user interface.
Challenge: CLARIN and Europeana do not share a common metadata model
The CLARIN data architecture: repositories
[Diagram: a repository at a CLARIN centre. Metadata describes both the language data (single text or recording, corpus, lexicon, wordnet, grammar, …) and the language tools (web application, web service, web service pipeline, stand-alone application, …).]
The CLARIN data architecture: harvesting
[Diagram: the metadata held in several repositories (each with language data, metadata and language tools) is copied into a central store of harvested metadata.]
The CLARIN data architecture: processing
The CLARIN data architecture: content search
[Diagram: (Federated) Content Search across several repositories: (1) enter query, (2) perform local search, (3) retrieve results, (4) show aggregated results.]
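The four numbered steps of the content search can be sketched in Python as a fan-out over in-memory endpoints. This is an illustration only: the `Endpoint` class, the sample documents and the substring matching are assumptions made for the sketch and do not reproduce CLARIN's FCS implementation, whose endpoints are remote services.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical stand-in for one centre's local search endpoint."""
    name: str
    documents: list[str]

    def search(self, query: str) -> list[str]:
        # (2) perform local search: naive case-insensitive substring match
        needle = query.lower()
        return [doc for doc in self.documents if needle in doc.lower()]


def federated_search(query: str, endpoints: list[Endpoint]) -> dict[str, list[str]]:
    # (1) the query is entered once and sent to every endpoint in parallel
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        # (3) retrieve the results of each local search
        hits = list(pool.map(lambda ep: ep.search(query), endpoints))
    # (4) aggregate the results, keyed by the endpoint they came from
    return {ep.name: found for ep, found in zip(endpoints, hits)}


endpoints = [
    Endpoint("centre-a", ["Historic newspaper page", "Dialect lexicon"]),
    Endpoint("centre-b", ["Newspaper corpus 1900-1910"]),
]
results = federated_search("newspaper", endpoints)
```

Keeping the results keyed by endpoint means the aggregated view can still show which centre each hit came from.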
The CLARIN data architecture: workflows
[Diagram: Web Service Pipelines operating on several repositories: (1) select input data, (2) construct pipeline, (3) execute, (4) use/analyse output data.]
Interoperability is key
• to the exchange of metadata
• to the exchange formats for the output of analytic tools
• to the options for supporting comparative research
CLARIN & Europeana interoperability highlights
• CLARIN's ingestion pipeline, which uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), was extended to retrieve a set of selected collections from Europeana and apply the conversion in the process.
• Several infrastructure components had to be adapted to accommodate the significant increase in the amount of data to be handled and stored.
  – Current status:
    • 775 Europeana datasets (e.g. Newspapers) now found in the VLO
    • 10K are technically suitable for processing with the LRS
  – Goal:
    • More records in the foreseeable future
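As a rough illustration of what harvesting over OAI-PMH involves, the Python sketch below only builds `ListRecords` request URLs. The `verb`, `metadataPrefix`, `set` and `resumptionToken` arguments come from the OAI-PMH protocol itself; the base URL, the `edm` prefix value, the set name and the token are placeholders assumed for the example and are not taken from the actual Europeana or CLARIN configuration.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org/oai"  # hypothetical endpoint, for illustration only


def list_records_url(metadata_prefix=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: on follow-up requests it
        # replaces metadataPrefix and set.
        params["resumptionToken"] = resumption_token
    else:
        if metadata_prefix is None:
            raise ValueError("metadata_prefix is required on the first request")
        params["metadataPrefix"] = metadata_prefix
        if set_spec is not None:
            params["set"] = set_spec  # restrict the harvest to one collection
    return f"{BASE_URL}?{urlencode(params)}"


# First request for one selected collection, then a follow-up page.
first_page = list_records_url(metadata_prefix="edm", set_spec="example_set")
next_page = list_records_url(resumption_token="token-123")
```

A harvester would repeat the follow-up request with each token returned by the server until no further token is issued.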
Metadata retrieval and conversion: OAI-PMH protocol
• Europeana:
  – EDM-structured Europeana metadata as RDF/XML documents
• CLARIN:
  – The harvester performs the conversion by applying an XSLT stylesheet that converts the RDF/XML metadata documents to Component Metadata (CMD)
  – Creation of a CMD profile for EDM in the CMDI Component Registry
  – Implementation of an XSLT stylesheet that produces instances of the corresponding schema on the basis of the EDM records.
  – Properties are defined as CMD elements in the order that they appear in the EDM specification, while object order is based on relevance.
  – Concept links are assigned to most components and elements.
  – Implemented conversion stylesheet: the header information and resource proxies (entities representing external documents) in the resulting record are produced on the basis of a list of static XPaths in the original document.
  – The record's payload is produced mostly by means of a straightforward crosswalk where the properties in the document are mapped to CMD components or elements of an equivalent name.
• Test harvest of 11 selected metadata sets:
  – Total of 3.2 million successfully retrieved and converted, schema-valid records
  – Full harvest and import of a sample of this size takes roughly 48 hours
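The conversion described above is an XSLT stylesheet; the sketch below restates only the "equivalent name" crosswalk idea in Python, as a minimal illustration. The sample record, the `Components` wrapper element and the flat output structure are simplifying assumptions and do not reproduce the actual CMD profile for EDM, its header, or its resource proxies.

```python
import xml.etree.ElementTree as ET

# Tiny EDM-like RDF/XML record, invented for this example.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:edm="http://www.europeana.eu/schemas/edm/">
  <edm:ProvidedCHO rdf:about="http://example.org/item/1">
    <dc:title>Example newspaper issue</dc:title>
    <dc:language>de</dc:language>
  </edm:ProvidedCHO>
</rdf:RDF>"""


def local_name(tag: str) -> str:
    # ElementTree stores qualified names as "{namespace}local"
    return tag.rsplit("}", 1)[-1]


def crosswalk(rdf_xml: str) -> ET.Element:
    """Map each entity to a component and each property to an element of the same name."""
    source = ET.fromstring(rdf_xml)
    target = ET.Element("Components")  # simplified wrapper, an assumption
    for entity in source:              # e.g. edm:ProvidedCHO
        component = ET.SubElement(target, local_name(entity.tag))
        for prop in entity:            # e.g. dc:title, in source order
            element = ET.SubElement(component, local_name(prop.tag))
            element.text = prop.text
    return target


cmd_payload = ET.tostring(crosswalk(SAMPLE), encoding="unicode")
```

For scale, 3.2 million records in roughly 48 hours corresponds to an average of about 18.5 records per second across harvest and import.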
Processing pipeline issues
• General lack of technical information available in the provided EDM (e.g. the media type for linked resources)
• Direct links to machine-processable resources are commonly missing
• Limited functionality provided by the tools that are connected to the LRS (e.g. language variability, resource types, accessibility)
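The first two issues can be pictured as a simple filter: a resource can only be passed on to a tool when both a direct link and a usable media type are known. The Python sketch below is hypothetical throughout; the `Resource` fields, the set of supported media types and the sample data are assumptions for illustration, not the checks actually performed by the VLO or the LRS.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of media types that some connected tool can accept.
SUPPORTED_MEDIA_TYPES = {"text/plain", "application/xml"}


@dataclass
class Resource:
    url: Optional[str]         # direct link to the resource, if provided
    media_type: Optional[str]  # media type, if stated in the metadata


def is_processable(resource: Resource) -> bool:
    """True only if a direct link and a supported media type are both known."""
    return resource.url is not None and resource.media_type in SUPPORTED_MEDIA_TYPES


resources = [
    Resource("https://example.org/a.txt", "text/plain"),
    Resource("https://example.org/b", None),  # media type missing
    Resource(None, "text/plain"),             # direct link missing
]
processable = [r for r in resources if is_processable(r)]
```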
Get in touch
https://www.europeana.eu
https://pro.europeana.eu
https://pro.europeana.eu/project/europeana-dsi-3