
D4.3 Second Integration and Enrichment Services Prototype

© EEXCESS consortium: all rights reserved page i

EEXCESS

Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources

Deliverable D4.3

Second Integration and Enrichment Services Prototype

Identifier: EEXCESS-D4.3-JR-DIG-Second-Integration-and-Enrichment-Services-Prototype_final.doc

Deliverable number: D4.3

Author(s) and company: Thomas Orgel, Stephan Ehgarter, Werner Bailer, Martin Höffernig, Silvia Russegger (JR-DIG)

Internal reviewers: Roman Kern (KC)

Work package / task: WP4

Document status: FINAL

Confidentiality: Public

Version 2015-10-31



History

Version Date Reason of change

1 2015-03-27 Document created (e.g. structure proposed, initial input…)

2 2015-10-27 Final changes before internal review

3 2015-10-29 Internal review

4 2015-10-31 Final version

Imprint

Full project title: Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources

Grant Agreement No: 600601

Work package leader: Thomas Orgel, JR-DIG

Project Co-ordinator: Silvia Russegger, JR-DIG

Scientific Project Leader: Michael Granitzer, Uni-Passau

Acknowledgement: The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 600601.

Disclaimer: This document does not represent the opinion of the European Community, and the European Community is not responsible for any use that might be made of its content.

This document contains material, which is the copyright of certain EEXCESS consortium parties, and may not be reproduced or copied without permission. All EEXCESS consortium parties have agreed to full publication of this document. The commercial use of any information contained in this document may require a license from the proprietor of that information.

Neither the EEXCESS consortium as a whole, nor a certain party of the EEXCESS consortium warrant that the information contained in this document is capable of use, nor that use of the information is free from risk, and does not accept any liability for loss or damage suffered by any person using this information.



Table of Contents

1 Executive Summary
2 Introduction
2.1 Purpose of this Document
2.2 Scope of this Document
2.3 Status of this Document
2.4 Related Documents
3 Problem to solve – User scenario
4 Solution Components
5 PartnerWizard
6 Refinement Mapping Configuration
6.1 Output Template Editing
6.2 Context Information Editing
6.3 Empty Fields Handling in Mapping Templates
6.4 Extended Project Configuration
7 Enrichment Services & improvements in the PartnerRecommender
8 Quality Assessment
8.1 State of the Art
8.2 Assessing Mapping Quality
8.3 Assessing Data Quality
8.3.1 Record statistics
8.3.2 Structuredness
8.3.3 Vocabulary accessibility
8.3.4 Machine readable rights metadata
8.3.5 Representation of data quality measurements
8.3.6 Implementation
8.3.7 How to use
8.3.8 Outlook
9 EEXCESS Data Model Update
9.1 Update of the data model
9.1.1 Europeana
9.1.2 KIM
9.1.3 ZBW
9.1.4 Wissenmedia
9.2 Parsers and vocabularies for approximate information
9.2.1 Parsing approximate time and place information
9.2.2 Vocabularies
10 Conclusions
11 References
12 Glossary



1 Executive Summary

EEXCESS aims to bring cultural, scientific and educational content to users, delivering additional information directly within the environment they are already working in. In doing so, EEXCESS aims to unlock the wealth of cultural, scientific and educational content and to improve the interconnectedness between these domains.

One important achievement of EEXCESS is to provide not only software components but also working prototypes. One prototype is a Chrome browser plugin that lists additional cultural and scientific information while the user is reading or editing a Wikipedia article.

For results from different institutions, and therefore different information systems, to be listed, a content provider has to complete the following tasks:

1) Create a mapping of local data to the EEXCESS data format

2) Include the data in the EEXCESS Framework by providing a PartnerRecommender on the local system that is integrated in the Federated Recommender of EEXCESS

Work package 4 of EEXCESS provides tools for both of these tasks. First versions of the features supporting these tasks were provided in autumn 2014 during the first testbed and evaluation phase. The testbed phase in particular led to new insights regarding usability, and as a result new features and a new tool were planned and implemented.

Creating the mapping was already an important task, which was completed and described in D4.2. Part of the first testbed phase was the use of the ConfigTool (described in D4.2) to make the mapping process as easy as possible. One result of the testbed and evaluation was that the mapping format had to be improved and slightly changed to meet the most important user needs. For instance, more detailed location data and date formats were needed to take full advantage of the visualisation improvements planned in work package 2. The EEXCESS data format was therefore slightly changed to include, for example, date categories that distinguish between find date and creation date. These changes also affected the ConfigTool, which must allow data providers to use these categories as easily and effectively as possible.

The second task listed above is needed for the data of a local system to be listed explicitly within the EEXCESS features, e.g. the Chrome browser plugin. To improve performance, a PartnerRecommender on the local system executes the queries coming from the FederatedRecommender. The FederatedRecommender merges the results coming from the local systems depending on user profile information; this includes, for instance, ranking the results according to the interests of the user. The FederatedRecommender is not part of this deliverable, as it is covered in work package 3.

One piece of feedback from the evaluation phase was that setting up a PartnerRecommender was rather complicated, as it required programming knowledge. We therefore invested effort in the design and implementation of a web-based tool, the so-called PartnerWizard, that creates a PartnerRecommender largely automatically. This work was conducted jointly with work package 3, whose part is reported in D3.3.

By simply entering configuration parameters into a web form, a first PartnerRecommender is built that can be tested against the specific partner’s data. The user can decide whether changes are needed to improve the result list. Once the partner is satisfied, the PartnerRecommender can be deployed on the development server of the EEXCESS consortium. The move from the development server to the stable, and therefore public, server is performed by the EEXCESS consortium in order to keep control over which data are available via EEXCESS and to make sure that the data access agreement has been signed by the new partner.

The following description covers the improvements and changes to the ConfigTool and serves as a first “handbook” for creating a PartnerRecommender with the PartnerWizard.

Once the data of a partner are accessible via EEXCESS tools, we consider it important to give feedback to the data provider regarding data quality. Investing effort in analysing the data files and result lists before and after the mapping process makes it possible to improve the recommendations and influences the ranking of single records. Work package 4 thus answers questions like “How can the heterogeneous information from various providers be harmonised to give the user uniform data access?”, “How can the result list be improved?” or “What has to be done to get my results listed and ranked as well?”

The aim of EEXCESS is not only to list results from local systems, although listing federated and ranked results is an important first step, but also to enrich this information depending on the content itself. For example, a user looking for information on the “Vorfriede von Leoben”, a peace treaty between Austria and France in the 18th century, might also be interested in similar events in Leoben, Styria, at a similar time. EEXCESS will enrich the results using keywords, structured vocabularies (for instance location terms in the GeoNames thesaurus) or blog entries. The features of this enrichment process were implemented and included in the recommender system. Results of this enrichment are also described in the present deliverable.



2 Introduction

2.1 Purpose of this Document

This deliverable provides the second prototype for the integration of cultural, scientific and educational data sources as well as the enrichment of those data sources from social media channels.

2.2 Scope of this Document

Deliverable 4.3 gives an overview of the different components of the EEXCESS Framework for data provision. It also lists and describes the tasks that content providers have to complete to get their data included in the results EEXCESS provides.

The document describes the development and implementation as foreseen in D1.2, which was based on the results of the first deployment and testbed phase of EEXCESS. Based on a defined use case (see chapter 3, Problem to solve – User scenario), the description in D4.3 should enable a new data provider to include her digitised objects, documents and images in the EEXCESS framework.

2.3 Status of this Document

This is the final version of D4.3. The project internal review was done by KNOW.

2.4 Related Documents

Before reading this document it is recommended to be familiar with the following documents:

D1.1 First Conceptual Architecture and Requirements Definition

D4.1 Integration and Enrichment Specifications and Analysis

D4.2 First Integration and Enrichment Services Prototype

D3.3 Second Federated Recommender Prototype



3 Problem to solve – User scenario

Persona: Helmut J.

Education: programmer

CV: After high school Helmut started working as an IT Administrator for a local museum.

Scenario Description:

The director of the museum where Helmut works has heard about EEXCESS and wants to bring the museum’s content into the EEXCESS ecosystem. He therefore gives Helmut the task of offering the content to EEXCESS. Helmut browses the EEXCESS website and finds different ways to act as a data provider for EEXCESS. The museum offers a searchable web site with its content, which also provides access via an API. Helmut therefore decides to use the PartnerWizard provided by EEXCESS.

Role of EEXCESS:

EEXCESS provides a template which the user can use to create a new PartnerRecommender. EEXCESS provides all necessary libraries in a public repository.

EEXCESS offers a WebApp, the so-called PartnerWizard, which uses the template to build a new PartnerRecommender for content that is available via an API.

EEXCESS also offers services through which the new data provider can hook into the EEXCESS ecosystem, as well as services to optimise the results (work carried out in work package 3).



4 Solution Components

To understand how a new data provider can be integrated into the EEXCESS Framework, we give a brief overview of the EEXCESS architecture, starting with the user and her client. For our first use case in EEXCESS the “client” is a Chrome browser, but this particular client can of course be replaced by any other type of client (e.g. a learning management system). The following list gives an overview of the tasks from the start of a user-driven search process to the final presentation of results.

1. The client detects an information need and creates a query.

2. The client sends this query to the Privacy Proxy, which invokes the Federated Recommender. This component knows which data providers are currently available.

3. The Federated Recommender calls all PartnerRecommenders.

4. Each PartnerRecommender sends its results back.

5. The results are integrated into one combined view and sent back to the client via the Privacy Proxy.

A more detailed description of the components and their interaction is provided in D1.2.

To add a new data provider we need to build a new PartnerRecommender which queries the partner data store and returns the results in the EEXCESS data format to the Federated Recommender.
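The request flow described above can be sketched in a few lines. This is only an illustration in Python, not the EEXCESS implementation (the server-side components are Java web applications); the function names, the stub partners and the record fields are all hypothetical, and the profile-based ranking done by the Federated Recommender is left out.

```python
from typing import Callable, Dict, List

# A partner recommender is modelled as a function: query -> list of result records.
PartnerRecommender = Callable[[str], List[Dict[str, str]]]


def federated_recommend(query: str,
                        partners: Dict[str, PartnerRecommender]) -> List[Dict[str, str]]:
    """Call every registered partner and merge the results into one list."""
    merged: List[Dict[str, str]] = []
    for name, recommend in partners.items():
        for record in recommend(query):
            # Tag each record with its provider so the client can show the source.
            merged.append({**record, "provider": name})
    return merged


def privacy_proxy(query: str,
                  partners: Dict[str, PartnerRecommender]) -> List[Dict[str, str]]:
    """Stand-in for the Privacy Proxy: forwards the query, relays the combined result."""
    return federated_recommend(query, partners)


# Two stub partners standing in for real partner data stores (hypothetical).
def museum_partner(query: str) -> List[Dict[str, str]]:
    return [{"id": "m-1", "title": f"Museum object about {query}"}]


def library_partner(query: str) -> List[Dict[str, str]]:
    return [{"id": "l-7", "title": f"Publication on {query}"}]


results = privacy_proxy("Leoben", {"museum": museum_partner, "library": library_partner})
```

Each stub returns records in one shared shape, which is the role the EEXCESS data format plays for real PartnerRecommenders.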

The PartnerRecommender needs to run on Apache Tomcat (http://tomcat.apache.org/) or another Java servlet container.

The following figure gives an overview of the EEXCESS components as specified and described in detail in D1.1 First Conceptual Architecture and Requirements Definition.



Figure 1: EEXCESS UML component architecture and system boundaries from Figure 3.1 D1.1

Not directly part of the EEXCESS architecture, but an important application especially for data providers, is the so-called “PartnerWizard”. The PartnerWizard is a stand-alone web application that makes it possible to create a PartnerRecommender without being a software development specialist. The input and output specifications required for a valid PartnerRecommender are defined via configuration forms.

In terms of the architecture diagram, the application stands outside the EEXCESS architecture and produces a ready-to-use PartnerRecommender. The following chapter explains its usage and the necessary configuration in detail.

A second application outside the core EEXCESS architecture is the “ConfigTool”. It was already explained in D4.2 and was developed further as specified and foreseen there. The ConfigTool is used to define the metadata mapping from the source data format of the data provider’s repository to the EEXCESS data format. Its output is a transformation file that is now used directly within the EEXCESS PartnerRecommender so that the “local” data can be displayed and used within the EEXCESS features. Refinements decided upon during the first testbed and deployment phase are described later in this document.



5 PartnerWizard

In the following section we show how Helmut J., the IT administrator from our use case, can easily create a PartnerRecommender and thereby provide his data to the EEXCESS framework and his users.

For the success of EEXCESS, the number of data providers is a very important measure. It is therefore critical to have simple, easy-to-follow procedures for getting data providers on board, especially those from the cultural domain. We decided to publish a guide on how APIs of new data providers can be included. This was already done during the testbed and evaluation phase in year 2.

One result of the evaluation of this guide was that a data provider needs a person with implementation skills to get the PartnerRecommender running. Many data providers cannot implement this method on their own. In order to become more attractive for the long-tail content providers that EEXCESS wants to address, we decided to offer a second, easier-to-handle method: the so-called PartnerWizard.

With the PartnerWizard we have improved the integration process by reducing its number of steps and its complexity. The build system we use for the server-side components of the EEXCESS Framework is Apache Maven (https://maven.apache.org/). With Maven it is possible to build the components using only the source and the Maven configuration; no additional action, such as downloading libraries, is required. Maven provides a feature called archetype, which enables a developer to provide templates for a software project. We have created such a Maven archetype, which enables developers to create the structure for a new PartnerRecommender with a single command. This command needs specific parameters to create the PartnerRecommender.



Figure 2: Maven archetype partner-recommender

Figure 2 shows the structure of the Maven archetype. The sub-folder “src/main/resources/archetype-resources” contains the root of the template. The other files and folders are needed by the Maven archetype. The parameters that the archetype can handle are defined in this structure.

During the generation process, Maven replaces placeholders in file names (surrounded by __) and in file contents with the values passed in the call of the archetype. Figure 3 shows an example source file with placeholders and how they are used in the archetype.

Figure 3: Maven archetype partner-recommender - source code
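The placeholder substitution can be illustrated with a small sketch. It mimics the two kinds of replacement in Python and is not Maven's own code: file names use the __name__ form, and file contents are assumed here to use the ${name} form of Maven's template syntax. The parameter names and values (partnerName, package) are hypothetical and are not taken from the published archetype.

```python
import re
from typing import Dict


def fill_file_name(name: str, params: Dict[str, str]) -> str:
    """Replace __key__ placeholders in a file or folder name."""
    return re.sub(r"__([A-Za-z][A-Za-z0-9]*)__",
                  lambda m: params.get(m.group(1), m.group(0)),
                  name)


def fill_content(text: str, params: Dict[str, str]) -> str:
    """Replace ${key} placeholders inside a file; unknown keys are left untouched."""
    return re.sub(r"\$\{([A-Za-z][A-Za-z0-9]*)\}",
                  lambda m: params.get(m.group(1), m.group(0)),
                  text)


# Hypothetical parameter values as they might be passed to the archetype call.
params = {"partnerName": "Kierling", "package": "eu.example.kierling"}

file_name = fill_file_name("__partnerName__Connector.java", params)
source = fill_content(
    "package ${package};\npublic class ${partnerName}Connector {}", params)
```

With these values the template file name becomes KierlingConnector.java, and the class and package names inside the file are filled in consistently.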



All dependent libraries are included and are retrieved from the public repository.

The source of the archetype is published on GitHub (https://github.com/EEXCESS/PartnerWizard/tree/master/archetype-partner-recommender). The compiled archetype is published on the public repository of the EEXCESS partner KNOW Center (https://nexus.know-center.tugraz.at/content/repositories/eexcess/).

As a next step of improvement, we created a web application hosted on our servers, in which the data provider fills out simple web forms with a few specific parameters to create the new PartnerRecommender.

Figure 4 shows the web form for entering the general information needed to build a valid Java project as well as the information needed for registration with the EEXCESS framework.



Figure 4: PartnerWizard - general parameters

Figure 5 shows the form for defining the URL of the API for the search call. There is an additional field for an example search term. When the user clicks the button “Call API Search”, the WebApp generates the service call and executes it. The response is shown below in the section API response. Once the user gets a response, he must define the XPath at which the single results are located in the response and then define mappings to the EEXCESS fields. These mappings must be entered as XPaths relative to the XPath defined above for the loop. The button “test” lets the user try out the entered XPaths on the actual result; the outcome is shown in the right column of the table.

Figure 5: PartnerWizard - parameters for search API
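The relationship between the loop XPath and the relative field XPaths can be illustrated as follows. This is a minimal sketch using Python's standard library, which supports only a subset of XPath; the sample response, the element names and the two mapped fields are invented for illustration and do not reflect any particular partner API or the full EEXCESS field list.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical API response; a real partner API defines its own structure.
API_RESPONSE = """
<response>
  <items>
    <item><identifier>obj-1</identifier><name>Peace of Leoben</name></item>
    <item><identifier>obj-2</identifier><name>Map of Styria</name></item>
  </items>
</response>
"""

# XPath locating the single results in the response (the "loop" XPath).
LOOP_XPATH = "./items/item"

# Target field -> XPath relative to each element matched by LOOP_XPATH.
FIELD_XPATHS = {"id": "./identifier", "title": "./name"}


def map_results(xml_text: str) -> List[Dict[str, str]]:
    """Apply the loop XPath, then evaluate each field XPath per result."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for item in root.findall(LOOP_XPATH):
        record = {field: item.findtext(path, default="")
                  for field, path in FIELD_XPATHS.items()}
        records.append(record)
    return records


records = map_results(API_RESPONSE)
```

Because the field XPaths are evaluated against each matched result element rather than against the document root, the same short expressions work for every entry in the result list.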



The entered search term is later used to automatically generate JUnit tests in the source project that the WebApp generates.

A similar web form is available to configure the detail call API. Figure 6 shows the parameters for configuring a detail call to the API, which retrieves more detailed data for the objects so that they can be used to best effect, especially in the visualisation tools of EEXCESS and in the content creation components (WordPress plugin, Moodle plugin). The field search term requires a unique identifier. This value must correspond to the value defined in the mapping from the search result for the field ID.

Figure 6: PartnerWizard – parameters for detail API

Once the user has finalised the configuration and is satisfied with the result, the new PartnerRecommender can be created with the button “Build the PartnerRecommender”, as shown in Figure 7. If the build succeeds, the WebApp shows the Maven logs. The WebApp converts the entered parameters into the Maven command shown in Figure 7. The user can also execute this command on his own machine.

Figure 7: PartnerWizard - build PartnerRecommender
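How form values could be turned into such a Maven command is sketched below. The goal archetype:generate and the -D properties used here are standard options of the Maven Archetype Plugin, but the concrete coordinates and values are placeholders: the actual archetype coordinates and the additional partner-specific parameters produced by the PartnerWizard are shown in Figure 7 and are not reproduced here.

```python
from typing import Dict, List


def build_maven_command(params: Dict[str, str]) -> List[str]:
    """Turn form values into an 'mvn archetype:generate' argument list."""
    # Non-interactive mode, so that all values come from the -D properties.
    command = ["mvn", "archetype:generate", "-DinteractiveMode=false"]
    for key, value in params.items():
        command.append(f"-D{key}={value}")
    return command


form_values = {
    # Archetype coordinates: placeholders, not the published coordinates.
    "archetypeGroupId": "org.example",
    "archetypeArtifactId": "archetype-partner-recommender",
    "archetypeVersion": "1.0",
    # Coordinates of the project to generate (hypothetical values).
    "groupId": "org.example.museum",
    "artifactId": "museum-partner-recommender",
    "version": "0.1",
}

command = build_maven_command(form_values)
print(" ".join(command))
```

Building the command as an argument list rather than one string avoids quoting problems if it is passed on to a process launcher.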

At the bottom of the screenshot there are two links. With the first, the user can download the compiled WebApp and deploy it on any Tomcat server that has access to the API and to the EEXCESS Framework. With the other, the user can create and download a zipped package of the generated source files, which can serve as a base for further changes or improvements.

The PartnerWizard can deploy the newly generated PartnerRecommender to our development environment, where it automatically registers with the FederatedRecommender on our development server. If the user has installed the EEXCESS Chrome extension for the development environment, he should then see recommendations from his data store. The URL for this version of the EEXCESS Chrome extension is given in the WebApp.

Figure 8: PartnerWizard - Deployment



Figure 9: generated sources

Figure 9 shows the structure of the generated sources of the PartnerRecommender.

Figure 10: PartnerWizard - configuration

Figure 10 shows the generated configuration of the PartnerRecommender. The parameter federatedRecommednerURI points to the endpoint of the FederatedRecommender with which the new PartnerRecommender registers. In this case it points to the same machine, because the PartnerWizard is deployed on the development server.

The new PartnerRecommender is already configured to connect to the FederatedRecommender on the same server, so that the new data source is used as soon as the PartnerRecommender has been deployed via the PartnerWizard and has registered with the FederatedRecommender.

To test the PartnerWizard we created two new PartnerRecommenders: one for the Rijksmuseum and another for a local museum, Museum Kierling. We also tested the PartnerWizard with an already existing partner, KIMPortal.

We plan to integrate the PartnerWizard and the Recommender Query Generation WebApp into a single WebApp, which enables the user to create a PartnerRecommender and, in a second step, to optimise the results with the Know-Center component described in D3.3.



6 Refinement Mapping Configuration

This section describes extensions of the metadata mapping configuration tool that improve its functionality and user experience. The extensions concern support for editing the output structure of a format in the configuration tool, editing metadata in different contexts (e.g., titles on collection, object or part level), handling empty fields in responses from PartnerRecommenders, and new functions for project configuration, i.e., building on existing projects and sharing (parts of) mappings.

6.1 Output Template Editing

A mapping project includes exactly one XML output template. Such a template describes the general output structure of the target format. The mapping instructions which are defined by the metadata mapping configuration tool are inserted in this template as additional XSL templates. Directly editing the output template is now supported by the metadata mapping configuration tool (cf. Figure 11). The corresponding menu entry is “Edit Mappings -> Output Structure”.

Figure 11: Editing the output template.

6.2 Context Information Editing

Editing context information is also supported by the metadata mapping configuration tool (cf. Figure 12). A context represents a specific level of a metadata format. For example, a metadata format includes descriptions about the decomposition of content into series and sub-series. These decomposition levels are represented by different contexts. A context can be associated with a concrete data type representation and is also part of template names in the output template (e.g. Series.UnitID in Figure 11).

A new context is added or an existing context is modified via “Edit Mappings -> Edit Contexts” (cf. Figure 12). In addition, an XPath locator can be attached to each existing context. This locator designates the position of the context in a concrete input metadata document. The XPath information is linked to the output structure via keywords in the output template. For example, during the mapping process the keyword “Context.Series” defined in the output template depicted in Figure 11 is replaced by the XPath row/field[@name='Verzeichnungsstufe'][text()='Serie']/.. defined in Figure 12.



Figure 12: Editing context information.
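The keyword replacement described above amounts to a textual substitution in the output template. The following sketch shows the idea in Python; the template fragment is a hypothetical XSL snippet written for illustration (the real output template is the one shown in Figure 11), and only the keyword “Context.Series”, the template name Series.UnitID and the XPath are taken from the text.

```python
from typing import Dict

# Context name -> XPath locator, as entered in the context editor.
CONTEXT_LOCATORS: Dict[str, str] = {
    "Series": "row/field[@name='Verzeichnungsstufe'][text()='Serie']/..",
}

# Hypothetical fragment of an output template; the real template may look different.
OUTPUT_TEMPLATE = (
    '<xsl:for-each select="Context.Series">'
    '<xsl:call-template name="Series.UnitID"/>'
    "</xsl:for-each>"
)


def resolve_contexts(template: str, locators: Dict[str, str]) -> str:
    """Replace each Context.<name> keyword with the XPath locator of that context."""
    for name, xpath in locators.items():
        template = template.replace(f"Context.{name}", xpath)
    return template


resolved = resolve_contexts(OUTPUT_TEMPLATE, CONTEXT_LOCATORS)
```

Only the keyword prefixed with “Context.” is replaced, so template names such as Series.UnitID, which merely contain the context name, stay unchanged.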

6.3 Empty Fields Handling in Mapping Templates

The behaviour of the mapping templates when they encounter metadata elements without values to transform is also configurable (via “Edit Mapping -> Edit Data Type Mappings”). For each mapping template, the user can select whether metadata elements with empty values should be ignored (cf. Figure 13). A possible extension would be the declaration of default values for such metadata elements; however, this functionality has not been fully implemented yet.

Figure 13: Handling empty fields in mapping templates.
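The effect of the ignore option can be sketched as follows. This is an illustrative Python model of the behaviour, not the XSL that the configuration tool generates; the field names and the per-field flag structure are assumptions made for the example.

```python
from typing import Dict, Optional


def apply_mapping(source: Dict[str, Optional[str]],
                  mapping: Dict[str, str],
                  ignore_empty: Dict[str, bool]) -> Dict[str, str]:
    """Map source fields to target fields, optionally skipping empty values."""
    target: Dict[str, str] = {}
    for source_field, target_field in mapping.items():
        # Treat missing, None and whitespace-only values as empty.
        value = (source.get(source_field) or "").strip()
        if not value and ignore_empty.get(target_field, False):
            continue  # leave the element out of the output entirely
        target[target_field] = value
    return target


# Hypothetical source record and mapping.
record = {"Titel": "Vorfriede von Leoben", "Datum": "  ", "Ort": None}
mapping = {"Titel": "title", "Datum": "date", "Ort": "place"}
ignore_empty = {"date": True, "place": False}

mapped = apply_mapping(record, mapping, ignore_empty)
```

In this example the empty date is dropped from the output, while the empty place is kept as an empty element because its mapping is not set to ignore empty values.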

6.4 Extended Project Configuration

Practical work with the metadata mapping configuration tool showed that some desirable options for a detailed project configuration are missing. Currently only predefined metadata formats can be selected, and there is no possibility to define a new metadata format. Furthermore, it is not possible to copy or delete an existing mapping project. In addition, it is impossible to designate a master mapping that can be reused in multiple mapping projects. In this context, a master mapping is the mapping from one specific format to the generic concept ontology. To summarize, we propose integrating the following features to enable an extended project configuration:

1. Add a new metadata format

2. Copy an existing mapping project

3. Delete an existing mapping project

4. Set a new master mapping

5. Edit an existing master mapping

6. Reference a master mapping in a mapping project



The availability of these features would lead to an extended project configuration. In Figure 14, the extended project configuration options are represented as a UML activity diagram. According to this diagram, the available options are to copy or delete a mapping project, add a new metadata format, create a new mapping project, and edit an existing mapping project. If a mapping project is to be copied or deleted, the corresponding action copy project or delete project is performed. A new metadata format is added via the actions set format name and set format namespace. When creating a new mapping project, a reference to an existing master mapping can optionally be set by the action reference source master mapping for the source format, or by the action reference target master mapping for the target format. If an existing mapping project is edited, references to any defined master mappings are updated first (action update referenced master mappings). Three editing options are then available. The first is to start the normal mapping creation process (action edit mappings), the second is to edit a referenced master mapping (action edit referenced master mapping), and the third is to set a new master mapping that has been defined in this mapping project (action set new master mapping). If a master mapping is referenced when the first option is chosen, the user has to decide whether the master mapping should be unreferenced, with the referenced mapping definitions integrated directly into the current mapping project (action unreferenced master mappings), or whether the mapping definition process should be aborted (action cancel editing).

Figure 14: Extended project configuration options represented as UML activity diagram.

Currently the described features are not available in the metadata mapping configuration tool. However, their integration is planned for the next iteration of our tool.



7 Enrichment Services & improvements in the PartnerRecommender

As pointed out in D4.2, the quality and the performance of the enrichment service were not sufficient for it to be activated in the last deployment. In the last period we have worked on these issues.

We have changed the process of the enrichment services so that fewer service calls are needed. For example, we now use a different call of the DBpedia Spotlight service, which takes the whole description of a record as input. This change results in significantly fewer service calls and in better matching of the results returned by the service.

We have also removed the Freebase service, because Google has shut it down. We use the DBpedia Spotlight service instead.

We have also made a change in the EEXCESS framework: metadata information has been removed from the recommend method and moved to a new service call, the so-called detail call. Clients use this call to get more detailed information about the items. The enrichment service now runs only during detail calls and no longer as part of the recommend method. This improves the performance of the recommend method, and the enrichment services are only called if the client needs more detailed information to visualise a record. The new detail call takes a list of objects as input and returns the same list with the objects including their metadata. This metadata also includes metadata generated by the EEXCESS enrichment services. In order to achieve acceptable response times, the PartnerRecommender uses a thread pool to process the list of objects in the detail call. In addition, we have added a timeout mechanism to prevent the processing of a single record from blocking the rest: if the timeout occurs, the PartnerRecommender skips the records which took too long and returns the others.
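The combination of a thread pool and a timeout can be illustrated with a minimal sketch. It is not the PartnerRecommender implementation: the function names, the pool size and the use of one shared deadline for the whole detail call are assumptions made for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def fetch_details(records, enrich, timeout_s=2.0, workers=8):
    """Enrich records in parallel and skip every record that misses the deadline.

    `enrich` is a hypothetical callable that returns the detailed record.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [(record, pool.submit(enrich, record)) for record in records]
    deadline = time.monotonic() + timeout_s  # assumption: one deadline per detail call
    detailed = []
    for record, future in futures:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            detailed.append(future.result(timeout=remaining))
        except FutureTimeout:
            future.cancel()  # only has an effect if the task has not started yet
        except Exception:
            pass  # assumed policy: a record whose enrichment fails is skipped as well
    pool.shutdown(wait=False)  # do not hold back the response for slow workers
    return detailed
```

Records that finish in time are returned in their original order; a slow record is left out of the response instead of delaying the others.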

This new version of the enrichment service is deployed and will be evaluated during this year’s evaluation phase.



8 Quality Assessment

8.1 State of the Art

A detailed review of the state of the art has been part of D4.2 and is thus not repeated in this document. However, assessment of metadata quality has become quite an active topic in different projects and initiatives dealing with cultural heritage and scientific content (e.g., Europeana defining data quality as one of their priorities for the coming five years⁵). Thus a summary of recent related work is presented in the following.

One observation from the data quality round table at the Europeana Tech 2015 conference is that the aggregation chain of Europeana is a challenge for data quality, and that quality improvement and enrichment should be done as close to the source as possible. This issue is addressed in the structure of the EEXCESS system, where source data quality assessment and enrichment are performed in the respective partner recommenders, and not after data has been combined. The report also highlights the need for addressing vocabulary mapping, and using open vocabularies (or at least providing mappings to them).

In the context of learning object metadata, [Nikolaos, 2014] propose a process for metadata quality assessment. They tested their process on a sample metadata set, which clearly shows the significant effort for expert assessment of metadata quality.

[Debattista, 2014b] propose an extensible framework for assessing the quality of Linked Open Data called Luzzu. One important contribution of their work is a data quality ontology (daQ), which is also used as an input to W3C’s work on this topic (cf. Section 8.3). [Trippel, 2014] propose a quality assessment framework using similar criteria as earlier works, but they calculate a single score over all these criteria. A shortcoming, however, is that the contributions of the individual criteria are weighted by an average calculated on a reference data set, i.e., a representative data set is needed, which is difficult to obtain in a setting such as EEXCESS where data providers are expected to be continuously added. A similar approach is described in [Reiche, 2014], also requiring parameters that can only be obtained from a static reference data set. In addition, this work proposes different visualisations of the obtained scores. The work of [Gavrilis, 2015] shares elements with all of these works. They propose an assessment framework called MQEM, but include a set of concrete metrics which yield numeric values. The overall score is determined as a weighted sum of these values.

When assessing the mapping quality of metadata, the quality issues already mentioned, such as completeness and consistency, can be considered. For example, in [Geerts, 2014], the notion and semantics for evaluating schema mappings and data repairing are introduced. In addition, processing provenance information enables the processing of further aspects of metadata. For example, when enriching multimedia at multiple stages, related creator information can be modelled or information about reliability and trustworthiness can be added. Provenance information is also used to model how data is transformed. A framework for analysing provenance data based on a graph representation is presented in [Cheah, 2014]. As a next step, this framework will also support the W3C PROV data model⁶.

Europeana has recently published a report from their task force on metadata quality [Dangerfield, 2015]. A number of its points are related to EEXCESS and highlight the relevance of the quality analysis work done in this project. Some of the points discussed concern the quality of contributed metadata and its assessment. Recommendation 1 asks for more transparent metadata processes, which can be achieved by using provenance metadata, as already implemented in the EEXCESS data model. Section 2.1 of the report advocates assigning URIs referencing terms from relevant online vocabularies, and stresses the importance of using open multilingual vocabularies for enrichment. This aspect is also reflected by checking the availability of vocabularies used in source metadata (cf. Section 8.3). Data providers and aggregators are asked to do more quality checks on data they deliver. This clearly calls for more automation of the quality assessment process.

Another important part of the Europeana metadata quality report discusses mapping (or crosswalk) related issues. Section 1.1 talks about the fact that metadata loss is often inevitable during transformations, and this is one reason why methods for analysing the quality of mapping results are being developed in EEXCESS. The report asks for the inclusion of qualifiers such as language tags, which may need to be converted or even inserted during the mapping process, as supported in our mapping configuration tool. Two points concern the sharing of (partial) mappings, and the options to create mappings on the basis of existing similar ones. In order to increase trust in metadata processes, recommendation 3 asks for documenting and sharing metadata crosswalks. Recommendation 12 states that templates shall be used to ensure a certain metadata structure is met. Support for sharing mappings has been added to the metadata mapping configuration tool (cf. Section 6.4), and new mappings can be customised based on such templates.

5 http://pro.europeana.eu/page/data-quality-etech15-roundtables

6 http://www.w3.org/TR/prov-primer/

The report recommends including context information (such as whether a resource is part of a collection). Metadata on different context levels is supported by the EEXCESS mapping approach, but is often missing in the provided data. The report stresses the importance of rights metadata, which indicate the options for reusing the published resources. Similar to assessing the use of controlled vocabularies for other fields, the EEXCESS quality assessment tools support checking rights metadata for references to known licenses (e.g. Creative Commons).

8.2 Assessing Mapping Quality

A reliable approach to assess the quality of a mapping is to compare a corresponding mapping result with an expert created reference. However, it is infeasible to implement this approach in an environment where a user creates, modifies, and evaluates mappings while expecting immediate mapping quality feedback. For this use case, providing a ground truth mapping result by an expert without delay in the mapping creation process is impossible. In addition, possible valid deviations between a mapping result and the ground truth may not be recognised depending on the complexity of the comparison method.

Our proposed approach for assessing the quality aims to meet the requirements of a tool-based mapping creation process by providing immediate feedback of the mapping quality. This approach is integrated in our metadata mapping configuration tool, which enables the creation of mappings between different XML-based metadata formats. The metadata mapping configuration tool is presented in D4.1. To summarise, the core part of the configuration tool is an intermediate conceptual representation of metadata properties, which serves as a hub for mapping metadata between different formats.

The mapping quality is assessed by performing a round trip mapping of a given metadata document and detecting differences between the original document and the corresponding mapping result. Thus, by comparing the presence, absence, and representation of specific metadata properties, statements about the mapping quality are made. For example, the absence of metadata properties in the mapping result indicates an impairment of the mapping quality and points to possible improvements in the mapping specification.

The term round trip mapping denotes that a metadata document is first mapped to an intermediate format and then mapped back to the original metadata format. As a consequence, the two metadata documents needed for the comparison task (original and mapping result) are represented in the same metadata format. Two different variants of round trip mappings are supported by our metadata mapping configuration tool. The first variant uses the internal conceptual representation of metadata properties of our tool as the intermediate format. In contrast, the second variant uses a specific target metadata format as the intermediate format. This is a more practical use case since two mappings between different concrete metadata formats are performed: one mapping from metadata format A to metadata format B and another one back from metadata format B to metadata format A.

Depending on the formats involved, the mappings may not be lossless: due to limitations of one of the formats, some loss of information or imprecision in the mappings may be expected. The acceptable loss needs to be specified by an expert (once per format, not per instance) and must be provided in machine-readable form in order to be accessible by the metadata mapping configuration tool.

For example, the possible mapping paths between concepts of the two specific metadata formats (KIM.Collect and the Dublin Core Metadata Initiative⁷) and the conceptual metadata representation of the configuration tool are depicted in Figure 15. When assessing a round trip mapping via the conceptual representation, there is no loss of information with respect to the involved metadata elements. For example, the KIM.Collect metadata elements Autor and Fotograf map to the conceptual metadata elements Author and Photographer and vice versa. In addition, there is a mapping path from these KIM.Collect metadata elements back to the element Hersteller (Producer). This would lead to a loss of information since Hersteller is not exactly the same as Autor or Fotograf. However, if a lossless mapping option is available, a possibly more general mapping will not be performed by the configuration tool. In contrast, when considering the Dublin Core metadata elements for the round trip mapping, a loss of information has to be accepted. Here the round trip mapping from the elements Autor and Fotograf to the Dublin Core element Creator and back to the KIM.Collect element Hersteller is the only available option.

7 http://dublincore.org/

Figure 15: Mapping paths between different metadata elements.
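The two round trip variants for the elements Autor and Fotograf can be illustrated with a small sketch. The dictionaries below only restate the mapping paths described in the text; representing mappings as dictionaries, the function names and the sample values are assumptions for illustration, since the configuration tool itself works on XML documents.

```python
# Hypothetical mapping tables that restate the paths described in the text.
TO_CONCEPT = {"Autor": "Author", "Fotograf": "Photographer"}
FROM_CONCEPT = {concept: element for element, concept in TO_CONCEPT.items()}
TO_DUBLIN_CORE = {"Autor": "Creator", "Fotograf": "Creator"}
FROM_DUBLIN_CORE = {"Creator": "Hersteller"}


def round_trip(record, forward, backward):
    """Map a record to the intermediate format and back to the source format."""
    result = {}
    for element, value in record.items():
        intermediate = forward.get(element)
        target = backward.get(intermediate)
        if target is not None:
            result.setdefault(target, []).append(value)
    return result


def changed_elements(original, mapped):
    """Return the source elements that are no longer present after the round trip."""
    return sorted(element for element in original if element not in mapped)


record = {"Autor": "A. Example", "Fotograf": "B. Example"}  # invented sample values
via_concepts = round_trip(record, TO_CONCEPT, FROM_CONCEPT)
via_dublin_core = round_trip(record, TO_DUBLIN_CORE, FROM_DUBLIN_CORE)
```

In this sketch the round trip via the conceptual representation keeps both elements, while the round trip via Dublin Core collapses both values into Hersteller, which is the expected loss discussed above.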

In order to check differences after performing the round trip mapping, statements about corresponding XPath locators of a metadata format are expressed. These locators are used to detect differences in a given metadata document before and after performing a round trip mapping. For this purpose, corresponding XPath locators are grouped and expressed using classes and properties of the newly defined comparison ontology (co).

Class co:LocatorAssociation is the main class of this ontology. Two subclasses of this class are defined to express locator associations for the round trip variant via meon concepts only (class co:LocatorAssociationMeon) and for the variant using a specific target format (class co:LocatorAssociationTarget). Corresponding XPath locators are described by the properties co:hasInputXPath (before mapping) and co:hasOutputXPath (after mapping).

In Figure 16, the corresponding XPath locators for assessing the concept Autor (cf. Figure 15) for the two round trip mapping variants are expressed. LC_1 is an instance of class co:LocatorAssociationMeon, while LC_2 is an instance of class co:LocatorAssociationTarget. Since the mapping variant via meon concepts only is lossless, the XPath locator does not change after performing the mapping. Thus the properties co:hasInputXPath and co:hasOutputXPath refer to the same value (/objects/object/Autor). In contrast to this mapping variant, the round trip mapping via a target format can be lossy. Locator association LC_2 refers to an acceptable loss of information after mapping. Here the values of the properties co:hasInputXPath and co:hasOutputXPath are different.

After all relevant locator associations of a metadata format are specified, the mapping quality assessment is performed. Based on comparison of the locator associations, missing values for elements and attributes in a given metadata document after performing a mapping are detected and reported. Again, such a comparison result is expressed by the comparison ontology. In Figure 17, an example of a comparison result is shown.

MC_1 is an instance of class co:MappingComparison. This instance refers to LC_2 (an instance of class co:LocatorAssociation) via the property co:hasLocatorAssociation. In addition, the comparison result of this locator association is described by MC_1 using the property co:hasResult. Since one XPath locator can address multiple elements in an XML document, the involved XPath locators are refined in order to refer to one specific element. These locators are represented using the properties co:hasUniqueInputXPath and co:hasUniqueOutputXPath. For example, the XPath /objects/object/Autor can address more than one Autor element, while the XPath /objects[1]/object[1]/Autor[1] addresses only one distinct Autor element.
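The refinement of a general XPath locator into unique locators can be sketched as follows. This is an illustration only: the function name is hypothetical, only simple child steps without namespaces or predicates are handled, and the sample document is invented.

```python
import xml.etree.ElementTree as ET


def unique_xpaths(root, xpath):
    """Expand an absolute XPath such as /objects/object/Autor into positional locators."""
    steps = xpath.strip("/").split("/")
    if root.tag != steps[0]:
        return []
    matches = [(root, f"/{root.tag}[1]")]
    for step in steps[1:]:
        expanded = []
        for element, path in matches:
            # The positional index counts siblings with the same tag, as in XPath.
            children = [child for child in element if child.tag == step]
            for index, child in enumerate(children, start=1):
                expanded.append((child, f"{path}/{step}[{index}]"))
        matches = expanded
    return [path for _, path in matches]


sample = ET.fromstring(
    "<objects>"
    "<object><Autor>A</Autor><Autor>B</Autor></object>"
    "<object><Autor>C</Autor></object>"
    "</objects>"
)
locators = unique_xpaths(sample, "/objects/object/Autor")
```

For the invented sample document, the general locator /objects/object/Autor expands to three unique locators, one per Autor element.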

Figure 16: Example of representing corresponding XPath locators when performing a round trip mapping.

Figure 17: Example of representing a mapping comparison.

The presented mapping quality assessment approach is integrated in the metadata mapping configuration tool. As described, two different assessment variants are implemented: One uses only the internal conceptual representation of metadata properties of the configuration tool for performing a round trip mapping while the second employs a specific target metadata format. The first variant should be lossless in the normal case while in the second some loss of information or imprecision in the mappings may be expected based on the expressiveness of the formats involved. Related locator associations are created, edited, and deleted in the mapping tool (cf. Figure 18). In addition, the results of the mapping quality assessment are presented (cf. Figure 19). In this view the desired round trip mapping variant is applied on a given metadata document and the defined locator associations are evaluated. The presence of these elements indicates a correct mapping process. The result of this evaluation can be a trigger to redefine the mapping instructions using the metadata mapping configuration tool.



Figure 18: Defining locator associations in the metadata mapping configuration tool.

Figure 19: Presenting the result of the mapping quality assessment in the metadata mapping configuration tool.

8.3 Assessing Data Quality

The approach for assessing data quality at input and after enrichment is an extension of the set of quality checks described in [Bellini, 2013]. In the EEXCESS system we cannot assume a single application profile (i.e., a single definition from one organisation of how a metadata format is used), against which completeness and accuracy can be checked, but the profile to be applied will depend on the type of content and the data provider (using the knowledge about the respective native data model).

In the remainder of this section, we describe the implemented quality measures and the representation of the quality measurements, and we present some results of applying them to data gathered during the testbed. For assessing the impact of the enriched input data, the same measures are applied, and the differences between the measurements before and after enrichment are analysed.

8.3.1 Record statistics

This set of measures concerns basic quality metrics as described in [Bellini, 2013], such as counting the number of records provided and the number of empty records, and normalising these numbers w.r.t. the number of records in the data from different providers.
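Such record statistics can be illustrated with a minimal sketch. It assumes that each record is available as a dictionary of field names and values; the function names and the key emptyRecords are hypothetical, while the other keys follow the metric names used in the example of Figure 20.

```python
from statistics import mean


def is_filled(value):
    """A field counts as non-empty if it holds more than white space."""
    return value is not None and str(value).strip() != ""


def record_statistics(records):
    """Compute basic record statistics for the records of one data provider."""
    field_counts = [len(record) for record in records]
    filled_counts = [sum(is_filled(v) for v in record.values()) for record in records]
    return {
        "numberOfRecords": len(records),
        "emptyRecords": sum(1 for count in filled_counts if count == 0),
        "meanFieldsPerRecord": mean(field_counts) if records else 0.0,
        "minFieldsPerRecord": min(field_counts, default=0),
        "maxFieldsPerRecord": max(field_counts, default=0),
        "meanNonEmptyFieldsPerRecord": mean(filled_counts) if records else 0.0,
    }
```

Dividing meanNonEmptyFieldsPerRecord by meanFieldsPerRecord would give one possible normalised fill degree per provider.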

8.3.2 Structuredness

We determine measures of the structuredness of values, for example of fields containing dates, names or dimensions of objects. The aim is not only to make a binary decision whether they are structured, but also to determine whether the format of the field can be inferred (e.g., using regular expressions).

In order to determine if a field is structured we extract a regular expression from all values of the respective field in a data set.

In Table 1 we show a small, but typical snippet of a data set containing structured values following a common pattern.

Table 1: data snippet with structuredness

Time of origin   Start time of origin   End time of origin   Height   Width
1902             1902.0000              1902.0000            43.0cm   2.5cm
1868             1868.0000              1868.0000            35.0cm   1.7cm
2002                                                         21.0cm   0.5cm
1904             1904.0000              1904.0000            47.0cm   2.7cm
1869             1869.0000              1869.0000            35.0cm   1.7cm
1870 - 1871      1870.0000              1871.0000            34.5cm   3.0cm
1872 - 1873      1872.0000              1873.0000            40.0cm   4.0cm
1874 - 1875      1874.0000              1875.0000            40.5cm   5.0cm
1876 - 1877      1876.0000              1877.0000            40.5cm   5.6cm
1878 - 1879      1878.0000              1879.0000            42.0cm   5.5cm
1880 - 1881      1880.0000              1881.0000            40.5cm   4.8cm
1882 - 1883      1882.0000              1883.0000            41.0cm   4.5cm
1884 - 1885      1884.0000              1885.0000            40.5cm   5.5cm
1886 - 1887      1886.0000              1887.0000            41.0cm   5.0cm
1888 - 1889      1888.0000              1889.0000            41.5cm   5.0cm
1890 - 1891      1890.0000              1891.0000            44.0cm   6.0cm
1892             1892.0000              1892.0000            44.3cm   2.5cm
1893             1893.0000              1893.0000            43.8cm   2.5cm

A large part of the related literature relies on labelled training sets to learn regular expressions. This is not practical for our application, thus we only consider approaches that allow unsupervised extraction of regular expressions. The algorithm needs to work on positive data only; however, the positive samples may be noisy. Such an approach, applied to inferring DTDs from XML documents, has been proposed in [Bex, 2010]. [Fernau, 2005] proposes an algorithm which serves our purpose well, starting from grouping characters by type and then inferring a tree structure of a regular expression. [Li, 2008] propose an approach that is guided by a prototype regular expression. This can be useful for some types of fields, where assumptions about possible patterns can be made. [Bartoli, 2012] propose an approach based on genetic programming; however, it may be computationally too expensive for our application.

Our approach contains several steps, which aim to speed up the process by performing extraction of regular expressions only if there is a high likelihood for actually having a common structure in the field values.

We preprocess field values by pruning white spaces. As a first measure, we determine a histogram of field lengths over the data set. Data with well-defined structure (e.g., dates, lengths) will show clear peaks in the histogram, which is an indicator for the structuredness of the field. We then detect characters such as hyphens, commas and periods, which are also indicators for structured data. We also scan the fields for the presence of SI unit abbreviations.
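The length histogram can be sketched in a few lines. The sketch assumes that pruning white space means trimming the values at both ends; the function names and the share of values covered by the most frequent lengths, used here as a peak indicator, are hypothetical choices for illustration.

```python
from collections import Counter


def length_histogram(values):
    """Histogram of field lengths after trimming surrounding white space."""
    trimmed = (value.strip() for value in values)
    return Counter(len(value) for value in trimmed if value)


def peak_share(histogram, top=2):
    """Share of all values that fall into the `top` most frequent lengths."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in histogram.most_common(top)) / total
```

For a date field such as the first column of Table 1, almost all values fall into two lengths (single years and year ranges), so the share is close to 1, whereas free text would spread over many lengths.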

We then implement the first step of the approach described in [Fernau, 2005], i.e., block-wise grouping and alignment. In this step, each character or digit gets replaced by an indicator of its type. Let $K$ be the set of different patterns detected, and $f(k_i)$ be the frequency of each pattern $k_i \in K$. We can then determine the structuredness measure as the normalized discrete entropy $H(K) / \sum_{i=1}^{|K|} f(k_i)$.
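The pattern extraction and the entropy computation can be illustrated as follows. This is a sketch under assumptions, not the prototype code: the type indicators (d for digits, a for letters, other characters kept as they are), the collapsing of repeated indicators into blocks, the use of base-2 logarithms and the division by the summed pattern frequencies reflect one reading of the description above.

```python
import math
from collections import Counter
from itertools import groupby


def char_type(ch):
    """Replace a character by an indicator of its type (assumed indicator set)."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "a"
    return ch  # punctuation and spaces are kept as they are


def pattern(value):
    """Block-wise grouping: runs of the same indicator collapse into one symbol."""
    types = (char_type(ch) for ch in value.strip())
    return "".join(key for key, _ in groupby(types))


def structuredness(values):
    """Return pattern frequencies, their entropy and the entropy divided by the total."""
    counts = Counter(pattern(value) for value in values if value.strip())
    total = sum(counts.values())
    if total == 0:
        return counts, 0.0, 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return counts, entropy, entropy / total
```

With these assumptions, 1902.0000 becomes d.d and 43.0cm becomes d.da. A field in which all values share one pattern has an entropy of 0, and the entropy grows as the values spread over more patterns.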

We have created a prototype implementation of our approach and use some sample data sets from our actual data providers. The approach will be verified in the upcoming testbed.

8.3.3 Vocabulary accessibility

We analyse whether references to controlled vocabularies using URIs are used in the data. For interoperability and linking with other data it is important that the terms of the vocabulary are accessible, i.e., are identified with URLs that can be resolved. Ideally, this URL does not only resolve to a human readable description, but to a machine readable definition, which can be used to relate the term to other data sources. We use content negotiation to perform this analysis. The basic idea of content negotiation is to serve the best variant for a resource, and to serve it based on:

- What variants are available, and what variants the server may prefer to serve

- What the client can accept, and with which preferences: in HTTP, this is done by the client which may send, in its request, Accept headers (Accept, Accept-Language and Accept-Encoding), to communicate its capabilities and preferences in Format, Language and Encoding, respectively⁸.

Agent-driven negotiation is realised by analysing the response of the server after receiving an initial request to the resource. We analyse the possibilities of content negotiation for the URIs in the dataset. The idea is to use agent-driven negotiation to gather information about the variants in which a server can serve the resource behind a URI. Possible variants are, e.g., RDF/OWL, XML, JSON or plain text.
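One simple way to find out which variants a server offers is to request the same URI with different Accept headers and to inspect the returned content type. The following sketch illustrates this with the Python standard library. It is a simplification and not the prototype code: the list of media types, the use of HEAD requests and the function names are assumptions.

```python
import urllib.request

# Assumed list of media types to probe; the prototype may use a different set.
ACCEPT_VARIANTS = ["application/rdf+xml", "application/json", "text/html", "text/plain"]


def build_probe(uri, media_type):
    """Build a HEAD request that asks the server for one specific variant."""
    return urllib.request.Request(uri, headers={"Accept": media_type}, method="HEAD")


def probe_variants(uri, timeout_s=5.0):
    """Map each requested media type to the content type the server answered with.

    Requires network access; unreachable resources and HTTP errors are reported as None.
    """
    served = {}
    for media_type in ACCEPT_VARIANTS:
        try:
            request = build_probe(uri, media_type)
            with urllib.request.urlopen(request, timeout=timeout_s) as response:
                served[media_type] = response.headers.get_content_type()
        except OSError:  # covers URLError, HTTPError and socket timeouts
            served[media_type] = None
    return served
```

A term URI for which the probe with application/rdf+xml returns an RDF content type offers a machine readable definition, while a URI that only returns text/html offers a human readable description only.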

8.3.4 Machine readable rights metadata

The approach for vocabulary quality assessment can also be applied to address some quality aspects of rights metadata, in particular, to determine if rights statements contain only free text or references to machine readable licenses. Similar to vocabularies, it can be checked if the license statement is accessible. In addition, it can be checked whether the license statement is one from a set of known statements, such as from the Creative Commons family of licenses.
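The distinction between free text and references to known licenses can be sketched as a simple classification. The category names, the function name and the URL pattern for Creative Commons statements are assumptions for illustration; checking whether the license URL is actually accessible is left out here.

```python
import re

# Assumed pattern for Creative Commons license and public domain statements.
CC_PATTERN = re.compile(
    r"^https?://creativecommons\.org/(licenses|publicdomain)/\S+$", re.IGNORECASE
)
URI_PATTERN = re.compile(r"^https?://\S+$", re.IGNORECASE)


def classify_rights(statement):
    """Classify a rights statement using hypothetical category names."""
    text = (statement or "").strip()
    if not text:
        return "missing"
    if CC_PATTERN.match(text):
        return "known-license"
    if URI_PATTERN.match(text):
        return "other-uri"
    return "free-text"
```

Counting these categories per data provider would show how many rights statements are machine readable.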

8.3.5 Representation of data quality measurements

We need to represent the results of quality assessment in a well-defined and machine-readable way, as we aim at automatically comparing assessment results at different points in time (as the data being assessed is dynamic due to the structure of the EEXCESS system) and at different points in the workflow (e.g., input data quality with quality after enrichment).

We make use of the Data Quality Vocabulary (DQV) [DQV, 2015] currently under development at the W3C, which builds on DaQ [Debattista, 2014]. Note that this specification is only in a working draft stage, thus it does not yet cover all aspects that may be needed, but using the specification serves to validate the proposed model and develop it further. The model allows defining data sets and particular distributions of them (e.g., snapshots sampled at a certain point in time), as well as quality metrics. Quality measures, defined as specialisations of observations from DaQ, describe the result of applying a certain metric to a certain distribution of a data set.

DQV covers the basic requirements in EEXCESS, and integrates smoothly with W3C PROV for representing provenance information. The proposed model has been found to be compact and intuitive to use. However, a number of aspects of the current draft were found to be incomplete for the application in EEXCESS. For metrics, daq:expectedDataType, which points to a simple data type, may be a bit narrow. For example, in our assessment framework we use metrics that apply to entire records. daq:requires is intended to point to external data needed by the metric (e.g. ground truth), but it may be useful to also have a mechanism to indicate that metrics depend on other metrics (e.g. aggregate results of metrics). Some metrics may need multiple outputs, for example, statistics of metrics (minimum, maximum, mean, median) may need to be expressed. Some metrics may also need input parameters (e.g., weights, thresholds), and for some metrics a binary value indicating whether a check with this metric is considered passed may be useful. The data model defined by the EBU Quality Control programme (EBU QC⁹), used for assessing quality of audiovisual media, may serve as an example for these extensions. The quality dimensions defined by DQV seem mostly appropriate. Only mapping quality metrics cannot be adequately represented, but may be seen as a specialisation of the accuracy dimension. However, nesting dimensions seems not to be supported in the current model.

An example document with quality measurements is shown in Figure 20.

8 https://www.w3.org/blog/2006/02/content-negotiation

9 https://tech.ebu.ch/qualitycontrol



<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:prov="http://www.w3.org/ns/prov#"
    xmlns:eexdaq="http://eexcess.eu/ns/dataquality/daq/"
    xmlns:dqv="http://www.w3.org/ns/dqv#"
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:dcat="http://www.w3.org/ns/dcat#"
    xmlns:daq="http://purl.org/eis/vocab/daq#">
  <daq:Metric rdf:about="eexdaq:metric#numberOfRecords"/>
  <daq:Metric rdf:about="eexdaq:metric#meanFieldsPerRecord"/>
  <daq:Metric rdf:about="eexdaq:metric#minFieldsPerRecord"/>
  <daq:Metric rdf:about="eexdaq:metric#maxFieldsPerRecord"/>
  <daq:Metric rdf:about="eexdaq:metric#meanNonEmptyFieldsPerRecord"/>
  <daq:Metric rdf:about="eexdaq:metric#meanNonEmptyFieldsPerDatafieldsPerRecord"/>
  <dcat:Dataset rdf:about="eexdaq:dataset#2015-10-14-17-30-26">
    <dct:title>My EEXCESS dataset</dct:title>
    <dcat:distribution>
      <dcat:Distribution rdf:about="eexdaq:dataset#KIMCollectDistribution2015-10-14-17-30-26">
        <dct:title>My EEXCESS KIMCollect dataset</dct:title>
        <prov:wasGeneratedBy rdf:resource="eexdaq:dataprovider#KIMCollect"/>
      </dcat:Distribution>
    </dcat:distribution>
  </dcat:Dataset>
  <dqv:QualityMeasure rdf:about="eexdaq:metric#numberOfRecordsKIMCollect2015-10-14-17-30-26">
    <daq:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">13</daq:value>
    <daq:computedOn rdf:resource="eexdaq:dataset#KIMCollectDistribution2015-10-14-17-30-26"/>
    <daq:metric rdf:resource="eexdaq:metric#numberOfRecords"/>
  </dqv:QualityMeasure>
  <dqv:QualityMeasure rdf:about="eexdaq:metric#meanFieldsPerRecordKIMCollect2015-10-14-17-30-26">
    <daq:value rdf:datatype="http://www.w3.org/2001/XMLSchema#double">33</daq:value>
    <daq:computedOn rdf:resource="eexdaq:dataset#KIMCollectDistribution2015-10-14-17-30-26"/>
    <daq:metric rdf:resource="eexdaq:metric#meanFieldsPerRecord"/>
  </dqv:QualityMeasure>
</rdf:RDF>

Figure 20: Data quality vocabulary example.

8.3.6 Implementation

We have built a prototype¹⁰ which uses data logged while calling the services of different data providers. Most services provide the data in XML. If a service provides the data in JSON, the data are transformed to XML by an internal service of the prototype. In particular, we analyse the input data and records, the number of returned fields per record, and empty and non-empty fields. We also count the mean number of links per record per provider and analyse which of the URIs in the dataset are reachable. As a result of the analysis, the prototype generates statistics of the measured values as well as charts.

In the remainder of this section, we report on the implementation details and results for the record statistics and vocabulary quality metrics.

10 Source code published at https://github.com/EEXCESS/data-quality




Record statistics

For testing our prototype we use a randomly selected subset of the data containing over 6,000 records from six data providers. Some data providers include a metadata field in their service response only if a value is present for the actual object. For this reason we calculate the mean number of submitted metadata fields per record for each data provider. The mean number of returned metadata fields varies between seven and 33 per record.

In Figure 21 we show the mean number of data fields per record for each data provider. Figure 22 shows the mean number of non-empty data fields per record for each provider, normalised with respect to the provider's overall number of metadata fields. This visualises the degree to which the records of a data provider are filled.

Figure 21: Input data quality - mean data fields/record

Figure 22: Input data quality - mean non empty data fields/record normalized

Vocabulary quality

We also analysed the use of controlled vocabularies of the data providers involved. Most of the used vocabularies have been created by the data providers but are publicly accessible. Records from those data providers which use non-public vocabularies must be translated to other vocabularies during mapping.

We collect the number of terms identified by URIs, and analyse whether the URIs are actually URLs resolving to a definition of the resource. In addition, we analyse the types of the resources provided, i.e., whether there is only a human readable description or a machine readable description is provided.

Figure 23 shows the mean number of vocabulary terms per record for each provider. The records from ZBW do contain references to controlled vocabularies, but not in the form of URIs, thus they have not been considered in this analysis. The data from DDB contained only a few URIs.



Figure 23: Input data quality - number of mean links per record per provider.

Furthermore, the accessibility of the URLs used to identify vocabulary terms in the dataset is checked. Table 2 shows that almost all URIs in the testbed dataset are accessible. The number of URLs that are not accessible may vary from run to run of the analysis tool depending on the availability of the servers, although in most cases an entire vocabulary is either accessible or not.

Table 2: Total links vs. reachable links in dataset

Provider | #URLs | #accessible URLs | Difference (links not accessible)
KIMCollect | 1188 | 1186 | 2
ZBW | 0 | 0 | 0
Europeana | 9905 | 9905 | 0
Wissenmedia | 808 | 808 | 0
Mendeley | 1542 | 1542 | 0
DDB | 7 | 0 | 7
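A check of this kind could look like the sketch below. It is not the implementation of the analysis tool: the function names, the use of HTTP HEAD requests and the rule "status below 400 counts as accessible" are assumptions made for illustration. Some servers reject HEAD requests, so a real check might need to fall back to GET.

```python
import http.client
import urllib.request


def is_accessible(url, timeout=10):
    """Return True if a HEAD request to the URL succeeds (assumed criterion)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts and malformed URLs.
        return False


def accessibility_table(urls_by_provider, check=is_accessible):
    """Build rows of (provider, #URLs, #accessible URLs, difference)."""
    cache = {}  # each distinct URL is requested only once
    rows = []
    for provider, urls in urls_by_provider.items():
        accessible = 0
        for url in urls:
            if url not in cache:
                cache[url] = check(url)
            if cache[url]:
                accessible += 1
        rows.append((provider, len(urls), accessible, len(urls) - accessible))
    return rows
```

Passing the check as a parameter makes it possible to test the counting logic without network access, and caching per URL limits the load on vocabulary servers that are referenced by many records.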

Regarding the figures for ZBW, it should be noted that the vocabularies used within ZBW are currently not linked to publicly available URIs. A possible improvement would be to include URIs on ZBW’s side so that these data can be used for the quality check.



Figure 24: Input data quality – Content Negotiation.

Figure 24 shows the possible variants in which a resource can be served for a DBpedia URI. This information can be found in the response header named “Link”. It is not always available; for example, for URIs from the provider KIMCollect the server does not provide it.

As part of the input data quality analysis we also determine for which hosts the information about the variants a server can offer is available and for which it is not.
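Reading the offered variants from a “Link” header could be sketched as follows. The parser is a deliberate simplification and does not cover the full header syntax (for example, a quoted parameter containing “<” would confuse it); the function names and the example header in the usage note are hypothetical.

```python
import re

# One link-value: "<uri>" followed by its parameters up to the next "<".
LINK_VALUE = re.compile(r'<([^>]+)>([^<]*)')
# One parameter: ; name="quoted value"  or  ; name=token
LINK_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(header):
    """Split a Link header into dictionaries with 'uri' plus its parameters."""
    links = []
    for match in LINK_VALUE.finditer(header):
        params = {}
        for name, quoted, bare in LINK_PARAM.findall(match.group(2)):
            params[name.lower()] = quoted or bare
        links.append({"uri": match.group(1), **params})
    return links


def alternate_types(header):
    """Media types of all links marked rel="alternate"; empty if none are given."""
    return sorted({
        link["type"]
        for link in parse_link_header(header)
        if link.get("rel") == "alternate" and "type" in link
    })
```

For a header such as `<http://example.org/data/Graz.rdf>; rel="alternate"; type="application/rdf+xml"`, `alternate_types` returns the list of offered media types; for a host that sends no “Link” header the empty string yields an empty list, which corresponds to the KIMCollect case described above.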

8.3.7 How to use

The analysis is performed by calling a Java application. The source code can be found in the GitHub repository (see footnote 11). The application can either be run as a JUnit test within Eclipse (testAppInputTestbedRandom100) or be exported as a runnable JAR file, e.g. named dataquality-demo.jar, and then be run as a Java application from the command line:

java -jar dataquality-demo.jar .\resources\input-testbed\random-100

Among other outputs, it generates the bar graphs mentioned above as well as CSV files and RDF/XML files containing information for further processing.

The quality analysis results can be visualised, e.g. as bar graphs. The input to this visualisation is the set of measurement results expressed using DQV. The visualisation is generated by an XSL transform, which creates an HTML page with jqPlot diagrams from the RDF/XML representation.

8.3.8 Outlook

In the remaining time of the project, extensions of the metadata metrics will be implemented. For vocabularies and licenses, management of whitelists of trusted resources will be introduced. This will provide more flexibility for handling trusted resources, and will also speed up the quality checks for records referencing them. Counting vocabularies from a list of trusted hosts will allow us to determine the reliability of the referenced data. It will also yield more detailed statistics about cases where different vocabularies are hosted on the same server.

The other aspect concerns the assessment of enrichment results, both by comparing measurements between input and enriched data records and by implementing specific metrics for enriched data that target the consistency of the enriched metadata. This will include handling cases where vocabularies are already referenced, but only by term IDs and not by complete URIs.

11 https://github.com/EEXCESS/data-quality



9 EEXCESS Data Model Update

The Europeana Data Model (EDM), which forms one pillar of the EEXCESS data model, has evolved together with the progress in implementing Europeana’s backend and contribution technologies. The evolution process of EDM and recent extensions are described in [Charles, 2015a]. Some of these, such as support for representing collections and more detailed rights metadata, are of high interest to some of the EEXCESS use cases. Another document [Charles, 2015b] outlines development plans for EDM in the near future. These include identification of intermediate providers (which will help to get better provenance metadata), representation of annotations (using the W3C Web Annotation standard, which is also being used in EEXCESS) and providing metadata for enabling re-use of content (a point that has been raised by EEXCESS test users, but cannot be solved in EEXCESS).

Apart from these issues, the development of prototypes and demonstrators in EEXCESS has shown the need for some refinements in the EEXCESS data model. These refinements concern two aspects of time and place representations:

Grouping time and place: As inherited from EDM, the EEXCESS data model supports properties for describing the time and place of an object, such as dc:date, dcterms:created or dcterms:spatial. Although time and place information is rather basic and sparse in many records received from providers, there may be a need not only to qualify time and place properties (e.g., time/place created), but also to link them. This calls for the concept of an event, as used for example in CIDOC CRM. During the evolution of EDM, edm:Event has been introduced, which can be used to address this issue.

Approximate time and place: Approximate time and place representations are frequently found in records (e.g. “Southern Germany”, “about 1800”, “Early Renaissance”). For visualisation it is important to give users an idea of whether points on a timeline or a map are precise or just approximations. If the visualisation tool is aware of the uncertainty, it can display the data appropriately. A related problem (though not a data modelling one) is parsing the large variety of textual (and potentially multilingual) representations of such approximate time and place specifications.

The following sections discuss the update of the EEXCESS data model and the tools and vocabularies for dealing with approximate time and place information.

9.1 Update of the data model

In addition to the date and time related data properties, the EEXCESS data model now makes use of the following classes defined in EDM [EDM, 2014]: edm:TimeSpan for time periods with start and end times, edm:Place with latitude and longitude (as WGS84 coordinates) and edm:Event, linking time and place when an event relevant to the provenance of the object occurred. The properties of these three classes (as defined in the Europeana core EDM resource, see footnote 12) include skos:label and skos:prefLabel, which can be used to reference controlled vocabularies in addition to (or in place of) the concrete numeric properties. This mechanism enables referencing one or multiple concepts for periods or regions, providing semantics to approximate time/spatial ranges that may otherwise seem arbitrary. The updated part of the EEXCESS data model is shown in Figure 25. In addition, we adopt the revised hierarchy of relations defined by EDM (see Figure 26).

12 https://github.com/europeana/corelib/wiki/EDMObjectTemplatesProviders



[Diagram: ore:Proxy edm:wasPresentAt edm:Event; edm:Event edm:occurredAt edm:TimeSpan; edm:Event edm:happenedAt edm:Place. Further labels in the diagram: dcterms:created, edm:currentLocation, wgs84_pos:lat, wgs84_pos:long.]

Figure 25: Part of the data model showing the time/place entities.

[Diagram: property hierarchy connected by rdfs:subPropertyOf, ordered from more specific to more generic. Properties shown: dcterms:spatial, edm:hasMet, dc:coverage, dc:relation, dcterms:temporal, edm:currentLocation, dcterms:created, dc:date, dcterms:issued.]

Figure 26: Revised hierarchy of temporal relations (as defined in EDM specification).

We have updated the mappings for different data providers, making use of these properties. In the remainder of this section we describe some examples from different EEXCESS data providers, showing an XML snippet of the record returned from the Partner Recommender and a diagram of the representation in the EEXCESS data model.

9.1.1 Europeana

The example contains a place with geo-coordinates.



<edmPlaceLatitude>

<e>47.06667</e>

</edmPlaceLatitude>

<edmPlaceLongitude>

<e>15.45</e>

</edmPlaceLongitude>

[Diagram: ore:Proxy edm:wasPresentAt edm:Event; edm:Event edm:happenedAt edm:Place; edm:Place with wgs84_pos:lat 47.07 and wgs84_pos:long 15.45.]
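Reading the coordinates from such a record could be sketched as follows. The element names are taken from the snippet above; the enclosing root element, the function name and the returned dictionary layout are assumptions for illustration and do not reflect the actual mapping implementation.

```python
import xml.etree.ElementTree as ET


def place_from_europeana(record_xml):
    """Extract WGS84 coordinates for an edm:Place from a record fragment.

    Assumes the fragment is wrapped in a single root element (hypothetical).
    Returns None if either coordinate is missing.
    """
    root = ET.fromstring(record_xml)
    latitude = root.findtext(".//edmPlaceLatitude/e")
    longitude = root.findtext(".//edmPlaceLongitude/e")
    if latitude is None or longitude is None:
        return None
    return {
        "wgs84_pos:lat": float(latitude),
        "wgs84_pos:long": float(longitude),
    }
```

Returning no place at all when one coordinate is missing avoids creating an edm:Place that a map visualisation could not position.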

9.1.2 KIM

The example contains a place with geo-coordinates and the date of the find.

<arr name="koordinaten_wgs84_0_wgs84_koordinate">

<double>47.5581188754631</double></arr>

<arr name="koordinaten_wgs84_1_wgs84_koordinate">

<double>7.59142116475896</double></arr>

<date name="datum_fund">1887-11-23T00:00:00Z</date>

[Diagram: ore:Proxy edm:wasPresentAt two edm:Event nodes. One event: edm:occurredAt edm:TimeSpan with edm:begin 1887-11-23. Other event: edm:happenedAt edm:Place with wgs84_pos:lat 47.558 and wgs84_pos:long 7.591. Also shown: edm:hasType Dbpedia:Excavation_(archaeology).]

9.1.3 ZBW

The record contains a year annotation and the date/time when the record was created. The time at which the record was created in the provider's system is not relevant for the purpose of EEXCESS and is thus not mapped.



<date>2013</date>

<econbiz_created>2013-12-16T10:22:12Z</econbiz_created>

[Diagram: ore:Proxy edm:wasPresentAt edm:Event; edm:Event edm:occurredAt edm:TimeSpan with edm:begin 2013-01-01 and edm:end 2013-12-31. Also shown: dcterms:created.]
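The expansion of a bare year into a time span covering the whole year, as in this example, can be sketched as follows. The function name and the dictionary keys are chosen for illustration only; the actual mapping code may differ.

```python
import datetime
import re


def year_to_timespan(year_text):
    """Map a four-digit year to the first and last day of that year.

    Returns None for anything that is not a plain four-digit year.
    """
    value = year_text.strip()
    if not re.fullmatch(r"\d{4}", value):
        return None
    try:
        begin = datetime.date(int(value), 1, 1)
        end = datetime.date(int(value), 12, 31)
    except ValueError:  # e.g. year 0000 is outside the supported range
        return None
    return {"edm:begin": begin.isoformat(), "edm:end": end.isoformat()}
```

With the value of the `<date>` element above, `year_to_timespan("2013")` yields the begin and end dates 2013-01-01 and 2013-12-31.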

9.1.4 Wissenmedia

The record contains a time annotation with a start and end date.

<tempstart-iso8601>1749-08-28</tempstart-iso8601>

<tempend-iso8601>1832-03-22</tempend-iso8601>

[Diagram: ore:Proxy edm:wasPresentAt two edm:Event nodes, labelled “E63 Begin of Existence” and “E64 End of Existence”. Each event: edm:occurredAt edm:TimeSpan. Values shown: edm:begin 1749-8-28, edm:end 1832-3-22. Also shown: dcterms:created.]

9.2 Parsers and vocabularies for approximate information

To bring the extensions of the model to life, we need tools for parsing approximate time and place information from different data providers, as well as vocabularies to represent time periods and types of events.

9.2.1 Parsing approximate time and place information

Approximate time and place information comes in two basic flavours: as terms (or references to vocabularies), or as an approximate range/region specified directly in the text. The first is simple to parse, but there may not be an unambiguous and generally accepted mapping to dates or coordinates. The main issue with the second is that there may be a wide range of textual representations, which are also language dependent. Thus, we decided to make use of an existing parsing library.
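The following sketch illustrates why hand-written rules for the second flavour do not scale. It handles only exact years and a handful of English “circa” prefixes; the pattern list, the class name and the tolerance of ten years are arbitrary assumptions, and the sketch is not related to the library chosen in EEXCESS.

```python
import re
from dataclasses import dataclass


@dataclass
class ApproxYearRange:
    begin: int
    end: int
    approximate: bool  # lets a visualisation mark the value as uncertain


EXACT = re.compile(r"^(\d{4})$")
CIRCA = re.compile(r"^(?:about|around|circa|ca\.?|c\.?)\s*(\d{4})$", re.IGNORECASE)


def parse_year(text, tolerance=10):
    """Parse '2013' or 'about 1800'; return None for everything else."""
    value = text.strip()
    match = EXACT.match(value)
    if match:
        year = int(match.group(1))
        return ApproxYearRange(year, year, False)
    match = CIRCA.match(value)
    if match:
        year = int(match.group(1))
        # Assumed tolerance: "about 1800" becomes 1790..1810.
        return ApproxYearRange(year - tolerance, year + tolerance, True)
    return None
```

Expressions such as “Early Renaissance” or any non-English wording are not covered and return None, which is the gap an existing multilingual tagger is meant to close.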

HeidelTime (see footnote 13) is a multilingual, cross-domain temporal tagger that extracts temporal expressions from documents. It can generate TimeML-annotated data (see footnote 14) from input data; TimeML is a markup language for temporal and event expressions. This output is then used to represent the time information appropriately in the EEXCESS data model.

13 http://heideltime.ifi.uni-heidelberg.de/heideltime

14 http://www.timeml.org



We analysed the output of HeidelTime for date/time values taken from the input data of different providers, as recorded during the testbed.

Provider ZBW

Input | Extracted temporal expressions (output as TimeML)
<date>2013</date> | 2013
<date>c 2014</date> | 2014
<econbiz_created>2014-05-16T10:06:14Z</econbiz_created> | 2014 05

Provider KIMPortal

Input | Extracted temporal expressions (output as TimeML)
<date name="datum_fund">1887-11-23T00:00:00Z</date> | 1887 11
<Beschreibung fieldId="129">Kalenderbild Dezember 1963 Brauerei ZIEGELHOF, farbiger Druck auf Karton; Gemälde, signiert: Fritz Pümpin 62 (unten links); Gleiches Motiv wird auf einem Kalendebild 30 Jahre später nochmals verwendet.</Beschreibung> | Dezember 1963; 30 Jahre später

Provider Wissenmedia

Input | Extracted temporal expressions (output as TimeML)
<tempstart-iso8601>1749-08-28</tempstart-iso8601> | 1749-08-28

The tables show the output of HeidelTime for exemplary input data. Years and plain dates are recognised, but full timestamps are not: for the timestamp inputs only fragments such as “2014 05” are extracted.
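Since the timestamps in the examples are well-formed ISO 8601 values, one conceivable workaround is to try a strict parser first and hand only the remaining values to a temporal tagger. The sketch below shows such a pre-check using only the Python standard library; it is an illustration under that assumption and not part of the EEXCESS implementation.

```python
from datetime import datetime


def parse_iso_value(text):
    """Parse an ISO 8601 date or timestamp; return None if it is not one."""
    value = text.strip()
    if value.endswith("Z"):
        # Older Python versions do not accept the "Z" suffix directly.
        value = value[:-1] + "+00:00"
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None
```

Values for which the function returns None, such as “c 2014” or free text, would still have to be passed to the tagger.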

9.2.2 Vocabularies

There are several vocabularies that can be used for representing time periods. Semium.org has defined a vocabulary for historical periods. However, the originally published version is no longer available; only a copy mirrored by Europeana remains (see footnote 15). The EDM TimeRange documentation also mentions Borys' time periods (see footnote 16), which suffer from the same problem. DBpedia defines two classes that are relevant for defining time periods: PeriodOfArtisticStyle (see footnote 17) and HistoricalPeriod (see footnote 18). All instances of these two classes are useful terms for describing time spans.

A related issue is vocabularies for historic place names. The LoCloud project has provided a historical place names service (see footnote 19), which can be used to link current place names, which can easily be geo-referenced, with historical names that may be found in metadata.

15 https://github.com/europeana/tools/blob/master/annocultor_solr4/converters/vocabularies/time/time.historical.rdf

16 http://annocultor.eu/time/

17 http://dbpedia.org/ontology/PeriodOfArtisticStyle

18 http://dbpedia.org/ontology/HistoricalPeriod

19 http://support.locloud.eu/tiki-index.php?page=LoCloud+Historical+Placenames+Microservice



10 Conclusions

With the PartnerWizard we have a new, easier way to integrate new data providers into the EEXCESS framework. This work is not yet finished, but tests during development showed that it is much easier to create a new PartnerRecommender with the PartnerWizard. Further improvements are planned, and integration with the WebApp for optimising the QueryGeneration is foreseen.

Once the PartnerWizard is finished, it will therefore become much easier to invite long-tail content providers that cannot afford the implementation work on their own. With this tool, dissemination will increase, especially in the long tail of the cultural domain. We have already started discussions with some important national archives in Austria.

While using the data from our data providers with the EEXCESS data model, we identified some improvements to the data model, e.g. more detailed time and geo information. We showed how this can be solved without violating the constraints of the EEXCESS data model, by reusing existing data models.

Regarding the quality assessment, we have added more detailed information about the input data quality, so that we now have an approach for detecting structuredness in the input data.

Regarding the enrichment services, the obstacle of low performance when using them within the EEXCESS framework could be overcome by outsourcing the execution of these procedures.

A close look at HeidelTime and intensive tests with our input data showed that its features are not feasible for our approach at the moment, as the results are not satisfactory for EEXCESS.

The approach used in the metadata extraction and enrichment services was adjusted, and first tests have shown that the quality of the enrichment results and the response times have improved.



11 References

[Bartoli, 2012] Bartoli, Davanzo, De Lorenzo, Mauri, Medvet, Sorio, Automatic Generation of Regular Expressions from Examples with Genetic Programming, ACM Genetic and Evolutionary Computation Conference (GECCO), 2012, Philadelphia (US)

[Bex, 2010] Geert Jan Bex, Wouter Gelade, Frank Neven, and Stijn Vansummeren. 2010. Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data. ACM Trans. Web 4, 4, Article 14 (September 2010)

[Charles, 2015a] Valentine Charles and Antoine Isaac. Enhancing the Europeana Data Model (EDM), Technical Report, Jun. 2015, http://pro.europeana.eu/publication/enhancing-the-europeana-data-model-edm

[Charles, 2015b] Valentine Charles and Antoine Isaac. Europeana – Core Service Platform, MS29: EDM development plan, Deliverable, Sept. 2015, http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/Europeana_DSI/Milestones/europeana-dsi-ms29-edm-development-plan.pdf

[Cheah, 2014] You-Wei Cheah and Beth Plale. Provenance quality assessment methodology and framework. J. Data and Information Quality, 5(3):9:1–9:20, December 2014.

[Dangerfield, 2015] Marie-Claire Dangerfield and Lisette Kalshoven (eds.). Report and Recommendations from the Europeana Task Force on Metadata Quality, May 2015, http://pro.europeana.eu/files/Europeana_Professional/Publications/Metadata%20Quality%20Report.pdf

[EDM, 2014] Definition of the Europeana Data Model v5.2.6, Dec. 2014

[Fernau, 2005] H. Fernau, Algorithms for learning regular expressions, in: S. Jain, H.-U. Simon, E. Tomita (Eds.), Algorithmic Learning Theory ALT 2005, volume 3734 of LNCS/LNAI, 2005, pp. 297–311.

[Gavrilis, 2015] D. Gavrilis, D.-N. Makri, L. Papachristopoulos, S. Angelis, K. Kravvaritis, C. Papatheodorou, P. Constantopoulos, Measuring Quality in Metadata Repositories, Research and Advanced Technology for Digital Libraries, 2015.

[Geerts, 2014] Floris Geerts, Giansalvatore Mecca, Paolo Papotti, and Donatello Santoro. Mapping and cleaning. In Proceedings of the 30th International Conference on Data Engineering, ICDE 2014, pages 232–243, 2014.

[DQV, 2015] Riccardo Albertoni, Christophe Guéret and Antoine Isaac (eds.), Data Quality Vocabulary, W3C First Public Working Draft 25 June 2015, http://www.w3.org/TR/2015/WD-vocab-dqv-20150625

[Debattista, 2014a] Debattista, J., Lange, C., & Auer, S. (2014). daQ, an Ontology for Dataset Quality Information. Linked Data on the Web (LDOW).

[Debattista, 2014b] J. Debattista, S. Londoño, C. Lange and Sören Auer, LUZZU – A Framework for Linked Data Quality Assessment, CoRR, abs/1412.3750, http://arxiv.org/abs/1412.3750, 2014.

[Li, 2008] Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular expression learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 21-30.

[Nikolaos, 2014] P. Nikolaos, M. Nikos and Sanchez-Alonso Salvador, Metadata quality in learning object repositories: a case study, The Electronic Library, 32(1):62-82, 2014.

[Reiche, 2014] K. J. Reiche, I. Schieferdecker and E. Höfig, Assessment and Visualization of Metadata Quality for Open Government Data. Proc. International Conference for E-Democracy and Open Government, 2014.




[Trippel, 2014] T. Trippel, D. Broeder, M. Durco and O. Ohren, Towards Automatic Quality Assessment of Component Metadata, Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, IS, 2014.



12 Glossary

Terms used within the EEXCESS project.

Partner Acronyms

JR-DIG JOANNEUM RESEARCH Forschungsgesellschaft mbH, AT

Uni Passau University of Passau, GE

Know Know-Center - Kompetenzzentrum für Wissenschaftsbasierte Anwendungen und Systeme Forschungs- und Entwicklungs Center GmbH, AT

INSA Institut National des Sciences Appliquées (INSA) de Lyon, FR

ZBW German National Library of Economics, GE

BITM BitMedia, AT

KBL-AMBL Kanton Basel Land, CH

CT Collections Trust, UK

MEN Mendeley Ltd., UK

WM wissenmedia, GE

Abbreviations

EC European Commission

EEXCESS Enhancing Europe’s eXchange in Cultural Educational and Scientific reSources

