
How to Describe a Dataset. Interoperability Issues, by Valeria Pesce

Date post: 15-Apr-2017
Upload: aims-agricultural-information-management-standards
Transcript
Page 1: How to Describe a Dataset. Interoperability Issues, by Valeria Pesce

How to describe a dataset. Interoperability issues

Valeria Pesce
Global Forum on Agricultural Research

Page 2

Definition of “dataset”

The term “dataset” has been defined in several ways, all of which further specify or extend the basic concept of “a collection of data”.

Definition given by the W3C Government Linked Data Working Group:

A dataset is “a collection of data, published or curated by a single source, and available for access or download in one or more formats”

The “instances” of the dataset “available for access or download in one or more formats” are called “distributions”. A dataset can have many distributions.

Examples of distributions include a downloadable CSV file, an API or an RSS feed.
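
As an illustration of the dataset / distribution relationship defined above, here is a minimal Python sketch. The class and field names are assumptions made for this example and are not taken from DCAT or from the slides; the URLs are placeholders under example.org.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One concrete way to get the data (hypothetical field names)."""
    access_url: str
    media_type: str  # e.g. "text/csv"


@dataclass
class Dataset:
    """A collection of data published or curated by a single source."""
    title: str
    publisher: str
    distributions: List[Distribution] = field(default_factory=list)


# One dataset, three distributions: a CSV file, an API and an RSS feed.
# All URLs are placeholders, not real endpoints.
rainfall = Dataset(
    title="Monthly rainfall",
    publisher="Example Institute",
    distributions=[
        Distribution("http://example.org/rainfall.csv", "text/csv"),
        Distribution("http://example.org/rainfall/api", "application/json"),
        Distribution("http://example.org/rainfall/feed", "application/rss+xml"),
    ],
)

media_types = [d.media_type for d in rainfall.distributions]
```

The point of the sketch is only that the descriptive metadata lives on the dataset, while access details live on each distribution.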

Page 3

Definition of “interoperability”

“Data interoperability is a feature of datasets - and of information services that give access to datasets - whereby data can easily be retrieved, processed, re-used, and re-packaged (“operated”) by other systems.”

Interim Proceedings of International Expert Consultation on “Building the CIARD Framework for Data and Information Sharing”, CIARD (2011)

Here “other systems” means software applications, so datasets have to be machine-readable.

Page 4

What applications need

Besides information common to any type of resource (name, author / owner, date…), applications have to find enough metadata about datasets to understand:

1. the specific coverage of the dataset (type of data, thematic coverage, geographic coverage)

2. the necessary technical specifications to retrieve and parse a distribution of the dataset (format, protocol etc.)

3. the conditions for re-use (rights, licenses)

4. the “dimensions” covered by the dataset (e.g. temperature, time, salinity, gene, coordinates)

5. the semantics of the dimensions (units of measure, time granularity, syntax, reference taxonomies)
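
The five needs listed above can be read as a completeness checklist for a metadata record. The following is a rough sketch under stated assumptions: the key names (`data_type`, `spatial`, `dimension_semantics` and so on) are hypothetical and do not come from any published vocabulary.

```python
from typing import Dict, List

# The five kinds of information, mapped to illustrative metadata keys.
# These key names are assumptions for this example, not a standard.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "coverage": ["data_type", "theme", "spatial"],
    "technical": ["format", "protocol"],
    "reuse": ["license"],
    "dimensions": ["dimensions"],
    "semantics": ["dimension_semantics"],
}


def missing_information(metadata: dict) -> Dict[str, List[str]]:
    """Return, per category, the keys that are absent or empty."""
    gaps = {}
    for category, keys in REQUIRED_KEYS.items():
        absent = [k for k in keys if not metadata.get(k)]
        if absent:
            gaps[category] = absent
    return gaps


# A made-up record that says what is measured but not how to read it.
record = {
    "name": "Soil salinity survey",
    "data_type": "observations",
    "theme": "soil",
    "spatial": "Kenya",
    "format": "text/csv",
    "license": "http://example.org/licenses/open",
    "dimensions": ["time", "salinity", "coordinates"],
}

gaps = missing_information(record)
```

For this record the check reports a missing protocol and missing dimension semantics, which are the kinds of gap the rest of the presentation discusses.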

Page 5

Partial answers in existing vocabularies

• DCAT vocabulary
- RDF vocabulary for describing any dataset
- Datasets can be standalone or part of a “catalog”
- Datasets are accessible through several “distributions”
- “Other, complementary vocabularies may be used together with DCAT to provide more detailed format-specific information. For example, properties from the VoID vocabulary can be used if that dataset is in RDF format.”

• VOID vocabulary
- RDF vocabulary for expressing metadata about RDF datasets

• (SDMX) DataCube vocabulary
- RDF vocabulary for describing statistical datasets
- Useful for attaching metadata about the “data structure” to any dataset that doesn’t follow a known published standard

Page 6

Coverage of a dataset

• This can be handled by common Dublin Core properties like subject and coverage.
• DCAT re-uses these DC properties.

Issue 1: No specific property for the type of data covered in a dataset

The values of these properties have to be understood by machines:
- The value should be standardized, possibly a URI
- The URI should be de-referenceable to a thing
- The thing should be part of an authority list / taxonomy

Issue 2: There is no authority vocabulary for types of data
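
The requirements on property values can be sketched as a simple check: is the value a URI, and does that URI belong to an authority list? The authority list below is hypothetical (placeholder example.org URIs), since, as noted above, no authority vocabulary for types of data exists. Whether the URI actually de-references is not tested here, because that would need a network request.

```python
from urllib.parse import urlparse

# Hypothetical authority list: URI -> label. A real list would come from a
# published taxonomy; these example.org URIs are placeholders.
DATA_TYPE_AUTHORITY = {
    "http://example.org/datatypes/observations": "Observations",
    "http://example.org/datatypes/statistics": "Statistics",
}


def is_http_uri(value: str) -> bool:
    """True if the value looks like an absolute http(s) URI."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_value(value: str) -> str:
    """Classify a metadata value by how machine-understandable it is."""
    if not is_http_uri(value):
        return "free text"
    if value not in DATA_TYPE_AUTHORITY:
        return "URI, not in authority list"
    return "ok"
```

For example, `check_value("statistics")` returns `"free text"`, while the same concept given as `"http://example.org/datatypes/statistics"` returns `"ok"`.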

Page 7

Conditions for re-use

• DCAT re-uses the license DC property at the level of distributions

• DCAT re-uses the rights DC property at both the level of dataset and the level of distribution

dc:license > dc:LicenseDocument

dc:rights > dc:RightsStatement

Page 8

W3C DCAT > DCAT AP

Page 9

DCAT core

Page 10

Technical properties

The necessary technical specifications to retrieve and parse a distribution of a dataset (format, protocol etc.)

• DCAT re-uses the DC format property.

Issue 1: No property for protocol

The values of these properties have to be understood by machines, possibly as URIs.

Issue 2: No comprehensive RDF authority lists for these values (partial: DC Types; non-RDF: IANA types)
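
To show why a machine-readable format value matters, here is a small sketch in which an application picks a parser from the declared media type of a distribution. The function name and the sample payloads are assumptions for illustration; only two IANA media types are handled.

```python
import csv
import io
import json


def parse_distribution(payload: str, media_type: str):
    """Parse a downloaded payload according to its declared media type."""
    if media_type == "text/csv":
        # Each row becomes a dict keyed by the CSV header.
        return list(csv.DictReader(io.StringIO(payload)))
    if media_type == "application/json":
        return json.loads(payload)
    # Without a known format value the application cannot proceed.
    raise ValueError(f"no parser registered for {media_type!r}")


rows = parse_distribution("year,yield\n2012,3.1\n2013,3.4\n", "text/csv")
```

If the format were given as free text (say, "spreadsheet"), the lookup would fail, which is the practical consequence of Issue 2.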

Page 11

VOID

VOID can help with the protocol metadata but only for RDF datasets:

- Property for data dump: dataDump
- Property for SPARQL endpoint: sparqlEndpoint
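
A sketch of how an application might use these two VOID properties to decide how to reach the data. The description dictionary, the `access_plan` helper and the example.org URLs are assumptions for this example; the query is sent as a `query` parameter on a GET request, and the endpoint is assumed to carry no query string of its own.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def access_plan(description: dict):
    """Prefer the SPARQL endpoint, fall back to the data dump."""
    if "sparqlEndpoint" in description:
        query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"
        return "sparql", sparql_get_url(description["sparqlEndpoint"], query)
    if "dataDump" in description:
        return "dump", description["dataDump"]
    return "unknown", None


# Placeholder values standing in for a parsed VOID description.
void_description = {
    "dataDump": "http://example.org/dumps/crops.nt",
    "sparqlEndpoint": "http://example.org/sparql",
}

kind, url = access_plan(void_description)
```

Preferring the endpoint over the dump is a design choice made for the sketch; an application that needs the full dataset might reverse the order.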

Page 12

“Dimensions” and their semantics

DCAT does not describe the dimensions of a dataset, except for a reference to a standard if the dataset dimensions can be defined by a formalized standard (e.g. an XML schema or an RDF vocabulary or an ISO standard)

dc:conformsTo > dc:Standard

Statistical vocabularies can help with the description of the dimensions

Page 13

SDMX: data structure and dimensions

SDMX: Statistical Data and Metadata Exchange

The data structure definition is a description of all the metadata needed to understand the data set structure. This includes:
• identification of the dimensions (Dimension) according to standard statistical terminology,
• the key structure (KeyDescriptor),
• the code-lists (CodeList) that enumerate valid values for each dimension,
• coded attribute (CodedAttribute), information about whether attributes are required or optional and coded or free text.

Given the metadata in the data structure definition, all of the data in the data set becomes meaningful.
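
A much simplified sketch of this idea: a data structure definition with dimensions, optional code lists and one measure, used to check whether an observation is interpretable. This is not the SDMX information model; the class names, the `problems` method and the sample codes are assumptions, and the key structure and attributes are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Dimension:
    name: str
    code_list: Optional[List[str]] = None  # None means any value is accepted


@dataclass
class DataStructure:
    dimensions: List[Dimension]
    measure: str

    def problems(self, observation: Dict[str, object]) -> List[str]:
        """List what stops an observation from matching this structure."""
        found = []
        for dim in self.dimensions:
            if dim.name not in observation:
                found.append(f"missing dimension {dim.name}")
            elif dim.code_list is not None and observation[dim.name] not in dim.code_list:
                found.append(f"{dim.name}: {observation[dim.name]!r} not in code list")
        if self.measure not in observation:
            found.append(f"missing measure {self.measure}")
        return found


# Illustrative structure: country is coded, year is free.
dsd = DataStructure(
    dimensions=[Dimension("country", ["KE", "TZ"]), Dimension("year")],
    measure="yield_t_ha",
)

ok = dsd.problems({"country": "KE", "year": 2013, "yield_t_ha": 3.4})
bad = dsd.problems({"country": "XX", "yield_t_ha": 3.4})
```

Here `ok` is an empty list, while `bad` reports a country code outside the code list and a missing year.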

Page 14

DataCube: simplified SDMX in RDF

Page 15

DataCube: simplified SDMX in RDF

Reference to a concept scheme

Page 16

DataCube: simplified SDMX in RDF

“Semantic role” of the property

Page 17

DataCube: simplified SDMX in RDF

“Semantic role” of the property
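
The diagrams on these DataCube slides are not part of the transcript. As one possible reading of the two captions, the sketch below generates Turtle in which the role of a property is expressed by its class (qb:DimensionProperty, qb:MeasureProperty, qb:AttributeProperty) and the concept scheme is referenced through qb:codeList. Those terms belong to the RDF Data Cube vocabulary; the `ex:` names and the helper function are assumptions for this example.

```python
PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix ex: <http://example.org/ns#> .\n"
)

# Role of a component property, expressed as its Data Cube class.
ROLES = {
    "dimension": "qb:DimensionProperty",
    "measure": "qb:MeasureProperty",
    "attribute": "qb:AttributeProperty",
}


def component_turtle(name, role, code_list=None):
    """Return a Turtle statement declaring one component property."""
    lines = [f"ex:{name} a {ROLES[role]}"]
    if code_list:
        # Points to the concept scheme that enumerates the valid values.
        lines.append(f"    qb:codeList ex:{code_list}")
    return " ;\n".join(lines) + " .\n"


turtle = (
    PREFIXES
    + "\n"
    + component_turtle("refArea", "dimension", "countries")
    + component_turtle("yield", "measure")
)
```

Printing `turtle` gives a small document that declares `ex:refArea` as a coded dimension and `ex:yield` as a measure.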

Page 18

Combining different vocabularies

Catalog: the directory

DATASET (DCAT model): Name, URL, Owner, Content type, Topic(s), Language, Metadata set(s), Data structure, Distribution(s), […]

DISTRIBUTION (DCAT model): Name, Protocol, Endpoint URL, Media type, Format, Size

DATA STRUCTURE (DataCube model): Dimensions, Attributes, Measures, Value lists

RDF dataset info (VOID properties): Vocabulary(ies), SPARQL endpoint, Data dump, Serialization format, Number of triples

If one or more known published metadata sets are used, just fill “metadata set(s)”; otherwise link to a “data structure” with custom “dimensions”.

If the media type is RDF or a SPARQL response, add the RDF dataset info.
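
The two rules in this combined model can be sketched as a small decision function. The block names and the function are assumptions for illustration; the set of media types is a partial list of IANA types for RDF serializations and SPARQL results, not an authority list.

```python
# Partial, illustrative list of RDF and SPARQL result media types.
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "application/sparql-results+xml",
    "application/sparql-results+json",
}


def blocks_to_fill(metadata_sets, media_types):
    """Decide which parts of the combined description are needed."""
    blocks = ["dataset", "distribution"]
    if not metadata_sets:
        # No known published metadata set: describe custom dimensions.
        blocks.append("data structure")
    if any(m in RDF_MEDIA_TYPES for m in media_types):
        # RDF or SPARQL output: add the VOID-style information.
        blocks.append("rdf dataset info")
    return blocks
```

For example, `blocks_to_fill([], ["text/csv"])` asks for a data structure, while a dataset that declares a metadata set and is served as `text/turtle` needs the RDF dataset info instead.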

Page 19

Tools for managing dataset metadata

• CKAN, maintained by the Open Knowledge Foundation: uses most of DCAT. Doesn’t describe dimensions. Also provides a global dataset hub called the Datahub.
• Dataverse, created by Harvard University: uses a custom vocabulary. Doesn’t describe dimensions.
• Commercial solutions
• Repositories and catalogs: OpenAIRE, DataCite (using re3data to search repositories) and Dryad use their own vocabularies.
• CIARD RING: uses full DCAT AP with some extended properties (protocol, data type) and local taxonomies with URIs mapped when possible to authorities. Next steps: adding DataCube properties for dimensions.

Page 20

Major outstanding issues

• Some missing properties in existing vocabularies: approach vocabulary owners OR extend vocabularies
• Missing vocabularies for protocols, formats: approach standardizing bodies? perhaps specific dataset formats?
• Need for more standardized semantics for dimensions: joint discussions with the RDA Data Type Registries WG?
• Lack of interoperability metadata in existing tools

Page 21

References

• W3C DCAT: http://www.w3.org/TR/vocab-dcat/
• DCAT AP: https://joinup.ec.europa.eu/asset/dcat_application_profile/asset_release/dcat-application-profile-data-portals-europe-final
• DataCube: http://purl.org/linked-data/cube#
• VOID: http://rdfs.org/ns/void-guide
• VIVO Datastar: http://sourceforge.net/projects/vivo/files/Datastar%20ontology/
• CERIF for datasets: https://cerif4datasets.wordpress.com/c4d-deliverables/
• CKAN: http://ckan.org/
• Datahub: http://datahub.io/
• DataCite: http://search.datacite.org/ui?q=subject%3Aagriculture
• Re3data: http://www.re3data.org
• Dryad: http://datadryad.org/
• OpenAIRE: https://www.openaire.eu/

Page 22

Thank you

Valeria Pesce
Global Forum on Agricultural Research

