
The W3C PROV standard: data model for the provenance of information, and enabler for trustworthy publication and exchange of open data

Date post: 10-May-2015
Upload: paolo-missier
Description:
Invited talk at the National Institute of Informatics (NII), Tokyo, July 2014
Transcript
Page 1

NII, Tokyo, July 2014 – Paolo Missier

The W3C PROV standard: data model for the provenance of information, and enabler for trustworthy publication and exchange of open data

Paolo Missier, PhD

School of Computing Science

Newcastle University

Newcastle upon Tyne, UK

NII, Tokyo, July 2014

Page 2

Motivation: generating and publishing genomics data

• Next Generation Sequencing at the forefront of genomics

• The number of DNA base pairs that can be sequenced per $ doubles every five months (2010)

• In the UK, the cost of sequencing a single patient sample is currently just under $1.5K and decreasing

• Genetic testing: from research method to clinical diagnostic tool

• Key technology: Whole-exome / Whole-genome processing pipelines (WEP/WGP)

• Key problem: assessing the reliability of the results

Goal of data processing and interpretation:

to rapidly identify genetic mutations across the entire genome, which:

• Have known associations to genetic diseases

• Are unknown but potentially deleterious

Especially important in the study of rare diseases

Page 3

Data publication and reuse in science/biology/genomics

Public, genome-wide gene expression data is potentially highly reusable

Rung, Johan, and Alvis Brazma. “Reuse of Public Genome-Wide Gene Expression Data.” Nature Reviews. Genetics 14, no. 2 (March 2013): 89–99. doi:10.1038/nrg3394.

But:

• Published data must be provably correct, trustworthy

Approximately half of the studies that use public gene expression data rely solely on existing data without adding newly generated data, and half of them use the public data in combination with new data.

Problem:

• A large WEP/WES space: many experimental configurations, many possible results

Page 4

Workflow for programming pipelines

Page 5

Multiple Workflow systems for implementing pipelines…

[1] Torri, Federica, Ivo D Dinov, Alen Zamanyan, Sam Hobel, Alex Genco, Petros Petrosyan, Andrew P Clark, et al. “Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows.” Genes 3, no. 3 (August 30, 2012): 545–575. doi:10.3390/genes3030545.

[2] Goecks, Jeremy, Anton Nekrutenko, and James Taylor. “Galaxy: A Comprehensive Approach for Supporting Accessible, Reproducible, and Transparent Computational Research in the Life Sciences.” Genome Biology 11, no. 8 (January 2010): R86. doi:10.1186/gb-2010-11-8-r86.

[3] Reid, Jeffrey, Andrew Carroll, Narayanan Veeraraghavan, and Mahmoud Dahdouli. “Launching Genomics into the Cloud: Deployment of Mercury, a next Generation Sequence Analysis Pipeline.” BMC Bioinformatics (2014).

• LONI Pipeline (UCLA, USA) [1]

• Galaxy [2]

• Mercury (Baylor College of Medicine, Houston, TX, USA) [3]

• e-Science Central (Newcastle, UK) [4]

[4] Watson, Paul, Hugo Hiden, and Simon Woodman. “E-Science Central for CARMEN: Science as a Service.” Concurrency and Computation: Practice and Experience 22, no. 17 (2010): 2369–2380. doi:10.1002/cpe.1611.

Page 6

Multiple Pipeline configurations

Many tools to choose from, multiple ways to configure each tool

From: Pabinger, Stephan, Andreas Dander, Maria Fischer, Rene Snajder, Michael Sperk, Mirjana Efremova, Birgit Krabichler, Michael R Speicher, Johannes Zschocke, and Zlatko Trajanoski. “A Survey of Tools for Variant Analysis of next-Generation Genome Sequencing Data.” Briefings in Bioinformatics (January 21, 2013): bbs086–. doi:10.1093/bib/bbs086.

Page 7

… and different configurations yield very different results

Outcomes are very sensitive to pipeline configuration

False positives, false negatives

The set of genetic mutations identified in one individual may vary greatly depending on the tools used

Also: tools evolve over time, causing longitudinal variations in results

Page 8

The Cloud-e-Genome project

Goal 1: provide mechanisms to rapidly and flexibly create new WEP pipelines, and to deploy them in a scalable way

Goal 2: provide clinicians with a tool for analysis and interpretation of human variants

• 2-year pilot project

• Funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC)

Challenge:

to deliver the benefits of WES/WGS technology to clinical practice

NGS data processing

Human variant interpretation for clinical diagnosis

Page 9

Implementing the pipeline using workflow technology

Page 10

Pipeline evolution

Pipeline: a set C = { c1 … cn } of components (tool wrappers)

Each ci has a configuration conf(ci) and a version v(ci)

What can change

1 – Tool version: v(ci) → v’(ci)

2 – Tool replacement / addition / removal: ci → c’i

3 – Configuration parameters: conf(ci) → conf’(ci)

…and why

• Technology / algorithm evolution
• Traditional GATK variant caller → GATK haplotype caller
• Does the interface change?
• Do the operational assumptions change?

E.g. GATK Variant Recalibrator requires large input data, so it is not suitable for targeted sequencing

Just for sequence alignment, Pabinger et al. in their survey (*) list 17 aligners, while for variant annotation they refer to over 70 tools

(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in bioinformatics, pp. bbs086–, Jan. 2013
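The three kinds of change listed on this slide can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Component` class, the `pipeline_changes` function, and the version numbers and parameter values are hypothetical, and components are matched by name, so a tool replacement shows up as one removal plus one addition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A tool wrapper c_i with its version v(c_i) and configuration conf(c_i)."""
    name: str
    version: str
    conf: tuple = ()  # sorted (parameter, value) pairs, kept hashable


def make_component(name, version, **conf):
    return Component(name, version, tuple(sorted(conf.items())))


def pipeline_changes(old, new):
    """Classify the differences between two pipelines (lists of components)."""
    old_by_name = {c.name: c for c in old}
    new_by_name = {c.name: c for c in new}
    changes = []
    for name in sorted(old_by_name.keys() | new_by_name.keys()):
        before, after = old_by_name.get(name), new_by_name.get(name)
        if before is None:
            changes.append(("added", name))
        elif after is None:
            changes.append(("removed", name))
        else:
            if before.version != after.version:
                changes.append(("version", name, before.version, after.version))
            if before.conf != after.conf:
                changes.append(("configuration", name))
    return changes


# Hypothetical version numbers and parameters, for illustration only.
v1 = [make_component("bwa", "0.7.5"), make_component("annovar", "2013", db="refGene")]
v2 = [make_component("bwa", "0.7.10"), make_component("annovar", "2013", db="ensGene")]
for change in pipeline_changes(v1, v2):
    print(change)
```

Matching by name keeps the sketch short; a real system would probably need a stable component identity that survives renaming.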

Page 11

How do you know published results are sound?

Mechanisms for data dissemination exist

Data journals

Data repositories

Data structures: Research Objects (from ResearchObject.org)

Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004.

… but they are not enough to meet two key requirements:

• Attribution of published data to its producers

• Verifiability and reproducibility of scientific results

Page 12

Role of provenance

Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)

Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)

• Provenance is evidence in support of clinical diagnosis

1. Why do these variants appear in the output list?

2. Why have you concluded they are disease-causing?

• Requires ability to trace variants through workflow execution

• Workflow managers provide this

“Why are these variants included in the results?”

“Why do these two results differ?”

Page 13

Why does provenance matter?

• To establish quality, relevance, trust

• To track information attribution through complex transformations

• To describe one’s experiment to others, for understanding / reuse

• To provide evidence in support of scientific claims

• To enable process analysis for debugging, improvement, evolution

Page 14

The W3C Working Group on Provenance: timeline

2009-2010: W3C Incubator Group on provenance

Chair: Yolanda Gil, ISI, USC

Main output: “Provenance XG Final Report” http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group

April 2011: W3C Working Group approved

Chairs: Luc Moreau, Paul Groth

April 2013: Proposed Recommendations finalised
- prov-dm: Data Model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints

...plus a number of non-prescriptive Notes

http://www.w3.org/2011/prov/wiki/

Page 15

PROV: scope and structure

source: http://www.w3.org/TR/prov-overview/

Recommendation track

Page 16

PROV Core Elements (graph depiction)

An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.

An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.

An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.


Page 17

Generation, Usage

Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.

Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity

PROV is based on a notion of instantaneous events, that mark transitions in the world

- generation, usage (and others)

Ordering constraints amongst events:

“generation of e must precede each of its usages”

“a can only use / generate e after it has started and before it has ended”
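The two ordering constraints quoted above can be checked mechanically once events carry comparable timestamps. The following Python sketch assumes all times sit on one numeric timeline, which is a simplification that PROV itself does not require; the function name and the data layout are hypothetical.

```python
def ordering_violations(activities, generations, usages):
    """Report violations of the two ordering constraints quoted above.

    activities:  {activity: (start, end)}
    generations: {entity: (activity, time)}
    usages:      [(activity, entity, time)]
    Assumes every time is a number on a single shared timeline.
    """
    problems = []
    for entity, (activity, t) in generations.items():
        start, end = activities[activity]
        if not start <= t <= end:
            problems.append(f"{activity} generated {entity} outside its lifetime")
    for activity, entity, t in usages:
        start, end = activities[activity]
        if not start <= t <= end:
            problems.append(f"{activity} used {entity} outside its lifetime")
        if entity in generations and t < generations[entity][1]:
            problems.append(f"{entity} used before it was generated")
    return problems


activities = {"drafting": (0, 10), "commenting": (12, 20)}
generations = {"draftV1": ("drafting", 9)}
print(ordering_violations(activities, generations, [("commenting", "draftV1", 13)]))
print(ordering_violations(activities, generations, [("commenting", "draftV1", 5)]))
```

The first call reports no problems; the second reports both a usage outside the activity's lifetime and a usage that precedes generation.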

Page 18

Concepts and relations

Generation of “draft v1” expressed as relation:

wasGeneratedBy(“draft v1”, ...)

Usage of “draft v1” by “commenting” expressed as relation:

used(“commenting”, “draft v1”, ...)

Page 19

PROV notation

document
  prefix prov <http://www.w3.org/ns/prov#>
  prefix ex <http://www.example.com/>

  entity(ex:draftComments)
  entity(ex:draftV1, [ ex:distr="internal", ex:status="draft" ])
  entity(ex:paper1)
  entity(ex:paper2)

  activity(ex:commenting)
  activity(ex:drafting)
  wasGeneratedBy(ex:draftComments, ex:commenting, 2013-03-18T11:10:00)
  used(ex:commenting, ex:draftV1, -)
  wasGeneratedBy(ex:draftV1, ex:drafting, -)
  used(ex:drafting, ex:paper1, -)
  used(ex:drafting, ex:paper2, -)
endDocument
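To make the shape of a PROV-N document concrete, here is a minimal Python sketch that renders statements held as tuples into the textual form shown above. The `prov_n` function is hypothetical and covers only positional arguments, with `None` rendered as the `-` placeholder; it is not a full PROV-N serializer and handles neither attributes nor relation identifiers.

```python
def prov_n(prefixes, statements):
    """Render (name, arg, ...) tuples as a PROV-N document string."""
    lines = ["document"]
    lines += [f"  prefix {p} <{iri}>" for p, iri in prefixes.items()]
    for name, *args in statements:
        rendered = ", ".join("-" if a is None else a for a in args)
        lines.append(f"  {name}({rendered})")
    lines.append("endDocument")
    return "\n".join(lines)


doc = prov_n(
    {"ex": "http://www.example.com/"},
    [
        ("entity", "ex:draftV1"),
        ("activity", "ex:drafting"),
        ("wasGeneratedBy", "ex:draftV1", "ex:drafting", None),
    ],
)
print(doc)
```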

Page 20

Same example: PROV-O notation (RDF/N3)

:draftComments a prov:Entity ;
    :distr "internal"^^xsd:string ;
    prov:wasGeneratedBy :commenting .

:commenting a prov:Activity ;
    prov:used :draftV1 .

:draftV1 a prov:Entity ;
    :distr "internal"^^xsd:string ;
    :status "draft"^^xsd:string ;
    :version "0.1"^^xsd:string ;
    prov:wasGeneratedBy :drafting .

:drafting a prov:Activity ;
    prov:used :paper1, :paper2 .

:paper1 a prov:Entity, "reference"^^xsd:string .

:paper2 a prov:Entity, "reference"^^xsd:string .

Page 21

Association, Attribution, Delegation: who did what?

An activity association is an assignment of responsibility to an agent for an activity, indicating that the agent had a role in the activity.

Attribution is the ascribing of an entity to an agent.

entity(ex:draftComments, [ ex:distr="internal" ])
activity(ex:commenting)
agent(ex:Bob, [ prov:type="mainEditor" ])
agent(ex:Alice, [ prov:type="srEditor" ])

wasAssociatedWith(ex:commenting, ex:Bob, -, [ prov:role="editor" ])
actedOnBehalfOf(ex:Bob, ex:Alice)
wasAttributedTo(ex:draftComments, ex:Bob)

Page 22

Same example: PROV-O notation (RDF/N3)

:Alice a prov:Agent, "ex:chiefEditor" ;
    :firstName "Alice" ;
    :lastName "Cooper" .

:Bob a prov:Agent, "ex:seniorEditor" ;
    :firstName "Robert" ;
    :lastName "Thompson" ;
    prov:actedOnBehalfOf :Alice .

:draftComments prov:wasAttributedTo :Bob .

:drafting a prov:Activity ;
    prov:wasAssociatedWith :Bob .

Page 23

Association and Attribution

Q.: what is the relationship between attribution and association?

This is defined as an inference rule in the PROV-CONSTR document

entity(e)
agent(Ag)
activity(a)

IF wasAttributedTo(e, Ag) THEN wasGeneratedBy(e, a) and wasAssociatedWith(a, Ag), for some activity a
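The attribution rule can be applied mechanically to a set of statements. The Python sketch below is an illustration under assumptions rather than the procedure defined in PROV-CONSTR: statements are plain tuples, the `infer_attribution` name is hypothetical, and the unknown activity is represented by a freshly minted placeholder such as `_:a1` whenever no existing activity already satisfies both conclusions.

```python
import itertools


def infer_attribution(statements):
    """Add the generation and association implied by each wasAttributedTo(e, ag)."""
    inferred = set(statements)
    fresh = itertools.count(1)
    for stmt in statements:
        if stmt[0] != "wasAttributedTo":
            continue
        _, e, ag = stmt
        generators = {s[2] for s in statements if s[0] == "wasGeneratedBy" and s[1] == e}
        associated = {s[1] for s in statements if s[0] == "wasAssociatedWith" and s[2] == ag}
        if generators & associated:
            continue  # an existing activity already witnesses the rule
        a = f"_:a{next(fresh)}"  # placeholder for the activity that must exist
        inferred.add(("wasGeneratedBy", e, a))
        inferred.add(("wasAssociatedWith", a, ag))
    return inferred


result = infer_attribution([("wasAttributedTo", "ex:draftComments", "ex:Bob")])
for statement in sorted(result):
    print(statement)
```

When the input already contains a generating activity associated with the same agent, the function leaves the statements unchanged.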

Page 24

Communication amongst activities

Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other.

activity(ex:commenting)
activity(ex:drafting)

wasInformedBy(ex:commenting, ex:drafting)

:drafting a prov:Activity .

:commenting a prov:Activity ; prov:wasInformedBy :drafting .

Page 25

Communication, generation, usage

activity(ex:commenting)
activity(ex:drafting)
entity(e)
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e, ex:drafting)
used(ex:commenting, e)

Q.: what is the relationship between communication, generation, and usage?

These are inference rules 5 and 6 in the PROV-CONSTR document
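One direction of this relationship is easy to compute: if an entity generated by one activity is used by another, the second activity was informed by the first. The Python sketch below shows only that direction, with hypothetical names and a plain set-of-pairs representation; the converse direction, which posits an unnamed entity for each wasInformedBy statement, is not implemented here.

```python
def infer_communication(generated_by, used):
    """Derive wasInformedBy pairs from generation and usage.

    generated_by: set of (entity, producing_activity)
    used:         set of (consuming_activity, entity)
    Returns a set of (informed_activity, informant_activity) pairs.
    """
    return {
        (consumer, producer)
        for entity, producer in generated_by
        for consumer, used_entity in used
        if used_entity == entity
    }


informed = infer_communication(
    generated_by={("e", "ex:drafting")},
    used={("ex:commenting", "e")},
)
print(informed)
```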

Page 26

Summary of the PROV Core model

Page 27

Derivation amongst entities

A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.

entity(ex:draftV1)
entity(ex:draftComments)
wasDerivedFrom(ex:draftComments, ex:draftV1)

Q.: what is the relationship between derivation, generation, and usage?

:draftComments a prov:Entity ; prov:wasDerivedFrom :draftV1 .

:draftV1 a prov:Entity .

Page 28

Relations may be given identifiers

entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, -)
used(use1; ex:commenting, ex:draftV1, -)

gen1 denotes a generation event

use1 denotes a usage event

General derivation relation:

wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)

Relation IDs make it possible to refer to relations in other relations

Page 29

Rendering N-ary relations in PROV-O

RDF is for binary relations; N-ary relations require reification

entity(ex:draftComments)
entity(ex:draftV1)
activity(ex:commenting)
wasGeneratedBy(gen1; ex:draftComments, ex:commenting, 2013-03-18T10:00:01)
used(use1; ex:commenting, ex:draftV1, -)

:draftComments a prov:Entity ;
    prov:qualifiedGeneration :gen1 .

:gen1 a prov:Generation ;
    prov:activity :commenting ;
    prov:atTime "2013-03-18T10:00:01+09:00"^^xsd:dateTime .

:commenting a prov:Activity ;
    prov:qualifiedUsage :use1 .

:use1 a prov:Usage ;
    :note "found comments useful" ;
    prov:atTime "2013-03-21T10:00:01+09:00"^^xsd:dateTime ;
    prov:entity :draftV1 .

Page 30

“Qualified relation” RDF pattern


Page 31

Plans: why was something done?

Most relation types have two arguments, each drawn from { Entity, Activity, Agent }

Derivation is one exception:

wasDerivedFrom(id; e2, e1, a, g2, u1, attrs)

Two other notable exceptions:

- Associations with a plan

- Delegation with an activity scope

wasAssociatedWith(id; a, ag, pl, attrs)

A plan is an entity that represents a set of actions or steps intended by one or more agents to achieve some goal

Page 32

Association with a plan

A plan plays a role in an association

Page 33

Plans are typed entities

activity(ex:_aProgramExecution, [ ex:execTime="22.5sec" ])
agent(ex:_aJVM, [ prov:type="JVM-6.0" ])
entity(ex:myCleverProgram, [ prov:type='prov:Plan', ex:label="Program 1" ])

wasAssociatedWith(ex:_aProgramExecution, ex:_aJVM, ex:myCleverProgram, [ prov:role="defaultRuntime", ex:accessPath="webapp" ])

A plan is an entity having prov:type = 'prov:Plan'

Page 34

Plan pattern as PROV-O

:_aProgramExecution a prov:Activity ;
    :execTime "22.5sec" ;
    prov:qualifiedAssociation [
        a prov:Association ;
        :accessPath "webapp" ;
        prov:agent :_aJVM ;
        prov:hadPlan :myCleverProgram ;
        prov:hadRole "defaultRuntime"
    ] .

:_aJVM a prov:Agent, "Java-6.0" .

:myCleverProgram a prov:Entity, prov:Plan .



Page 36

Delegation within an activity scope

Page 37

Real-world artifacts vs provenance entities

ref: http://www.w3.org/2001/sw/wiki/PROV-FAQ#Examples_of_Provenance

“What do I know about the car I see in this Cambridge street today?”

• It was bought by Joe in 2011

• Joe drove it to Boston on March 16th, 2013. The car now has 10,000 miles on it

• Joe drove it to Cambridge on March 18th, 2013.

“Same” car, but different provenance at each stage of its evolution


Page 38

Alternate-specialization pattern

Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time.

An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter.

...But this is still that car!

Semantic notes:

1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).

2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1).

3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).

Figure labels: “differing in their location”; “same owner, added location”
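The three semantic notes on this slide can be applied as a small closure computation. The Python sketch below implements exactly those three rules and nothing more; the function name and the entity names for the car example are hypothetical, and relations are held as sets of pairs.

```python
def close_alternates(specializations, alternates):
    """Apply the three semantic notes to sets of (e1, e2) pairs."""
    spec = set(specializations)
    # Note 3: specializationOf is transitive, so iterate to a fixed point.
    changed = True
    while changed:
        changed = False
        for a, b in list(spec):
            for c, d in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True
    alt = set(alternates) | spec           # Note 1: specialization implies alternate
    alt |= {(b, a) for (a, b) in alt}      # Note 2: alternate is symmetric
    return spec, alt


# Hypothetical entity names for the car example.
spec, alt = close_alternates(
    {("car_in_cambridge", "joes_car"), ("joes_car", "car")},
    set(),
)
print(sorted(spec))
print(sorted(alt))
```

The quadratic fixed-point loop is fine for a handful of entities; a larger graph would call for a proper transitive-closure algorithm.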

Page 39

Reserved attributes and types

A small set of reserved attributes, with some usage restrictions

Page 40

Bundles, provenance of provenance

A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed.

bundle pm:bundle1
  entity(ex:draftComments)
  entity(ex:draftV1)
  activity(ex:commenting)
  wasGeneratedBy(ex:draftComments, ex:commenting, -)
  used(ex:commenting, ex:draftV1, -)
endBundle
...
entity(pm:bundle1, [ prov:type='prov:Bundle' ])
wasGeneratedBy(pm:bundle1, -, 2013-03-20T10:30:00)
wasAttributedTo(pm:bundle1, ex:Bob)

Page 41

Bundles in PROV-O

Bundle definition (an RDF named graph):

ex:bundle1 {
    :draftComments a prov:Entity ;
        :status "blah" ;
        prov:wasGeneratedBy :commenting .

    :commenting a prov:Activity ;
        prov:used :draftV1 .

    :draftV1 a prov:Entity .
}

Bundle usage:

ex:bundle1 a prov:Entity, prov:Bundle ;
    prov:qualifiedGeneration [
        a prov:Generation ;
        prov:atTime "2013-03-20T10:30:00+09:00"^^xsd:dateTime
    ] ;
    prov:wasAttributedTo :Bob .

Page 42

Time, Events

- Provenance statements are combined by different systems

- An application may not be able to align the times involved to a single global timeline

Therefore, PROV minimizes assumptions about time

Instead, the PROV data model is implicitly based on a notion of instantaneous events, that mark transitions in the world (*)

Events:

- activity start, activity end

- entity generation, entity usage, entity invalidation

wasStartedBy(id; a2, e, a1, t, attrs)

wasEndedBy(id; a2, e, a1, t, attrs)

(*) PROV-CONSTR http://www.w3.org/TR/prov-constraints/#events (non-normative)

Page 43

From “scruffy” provenance to “valid” provenance

- Are all possible temporal partial orderings of events equally acceptable?

- How can we specify the set of all valid orderings?

More generally, how do we formally define what it means for a set of provenance statements to be valid?

PROV defines a set of temporal constraints that ensure consistency of a provenance graph

Page 44

Exploiting provenance: why do my results differ from yours?

Run pipeline version V1 → Variant list VL1

V1 → V2:
• Replace BWA version
• Modify Annovar configuration parameters

Run pipeline version V2 → Variant list VL2

Variant list VL1 vs Variant list VL2: ??

• DDIFF (data differencing)

• PDIFF (provenance differencing)

Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
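As a rough illustration of the data-differencing side of this question, the Python sketch below compares two variant lists as sets. It is not the DDIFF algorithm from the cited paper, only a plain set comparison; the function name, the tuple layout (chromosome, position, reference allele, alternate allele) and the sample variants are all hypothetical.

```python
def variant_diff(vl1, vl2):
    """Split two variant lists into shared and run-specific variants."""
    a, b = set(vl1), set(vl2)
    return {
        "only_in_VL1": sorted(a - b),
        "only_in_VL2": sorted(b - a),
        "common": sorted(a & b),
    }


# Made-up variants: (chromosome, position, reference allele, alternate allele)
VL1 = [("chr1", 12345, "A", "G"), ("chr2", 500, "C", "T")]
VL2 = [("chr1", 12345, "A", "G"), ("chr7", 42, "G", "A")]
diff = variant_diff(VL1, VL2)
for label, variants in diff.items():
    print(label, variants)
```

A difference found this way says which variants changed but not why; explaining the cause is where provenance differencing comes in.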

Page 45

PDIFF - overview

Figure: workflows WA and WB

Page 46

The corresponding provenance traces

Page 47

Delta graph computed by PDIFF

PDIFF helps determine the impact of variations in the pipeline

Page 48

Provenance of Linked Open Data resources

Goal: to establish an LD-compliant association between an LD resource and a description of its provenance

• Where does the provenance of an LD resource live?

• How can it be accessed?

Why?

1. To enable LD search and discovery
• By indexing data by its provenance
• Ex. “Find all resources for which Alice is an author which contain data derived from dataset D”

2. To enable reasoning about quality/reliability of the LD resource
• Predicates and rules over provenance
• Ex. “if D has been derived from any of {A,B,C} and Alice is one of the authors, then score X”
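The first example query can be answered by following derivation links transitively and filtering on attribution. The Python sketch below works over in-memory sets of pairs instead of an RDF store or a SPARQL endpoint; the function names and the resource names R1, R2, R3 and X are hypothetical.

```python
def derived_from(resource, target, derivations):
    """True if `resource` was derived from `target` in one or more steps.

    derivations: set of (derived, source) pairs, as in wasDerivedFrom.
    """
    seen, frontier = set(), [resource]
    while frontier:
        current = frontier.pop()
        for child, parent in derivations:
            if child == current and parent not in seen:
                if parent == target:
                    return True
                seen.add(parent)
                frontier.append(parent)
    return False


def find_resources(author, dataset, attributions, derivations):
    """Resources attributed to `author` that derive from `dataset`."""
    return sorted(
        r for r, agent in attributions
        if agent == author and derived_from(r, dataset, derivations)
    )


attributions = {("R1", "Alice"), ("R2", "Bob"), ("R3", "Alice")}
derivations = {("R1", "X"), ("X", "D"), ("R2", "D")}
print(find_resources("Alice", "D", attributions, derivations))
```

Here only R1 qualifies: it is attributed to Alice and reaches D through the intermediate resource X.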

Page 49

Provenance of Linked Open Data resources: how

Three mechanisms:

1. Provenance Access and Query (PROV-AQ), a W3C Working Group Note in the PROV family of documents

2. Embedding provenance statements within the resource itself
• E.g. the “Nanopublication” model

3. Packaging data + provenance as a Research Object

Page 50

1. Provenance pingback and query service

Image reproduced from: De Nies, Tom, Robert Meusel, Kai Eckert, Dominique Ritze, and Anastasia Dimou. “A Lightweight Provenance Pingback and Query Service for Web Publications.” In Procs. IPAW 2014. Cologne, Germany: Springer, 2014.

Objective: to decouple publishing of content and of its provenance (as LOD)

Scenario:

• Publishers publish content resources, are not responsible for provenance
  • E.g. Mendeley, ResearchGate, etc.

• Authors publish provenance, are not responsible for publishing content
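PROV-AQ lets a content publisher advertise provenance and pingback endpoints through HTTP Link headers whose relation types live in the PROV namespace. The Python sketch below extracts such links from a header value. It is a deliberately simplified parser that assumes `rel` is the first parameter of each link; the `prov_links` function and the example.org URIs are hypothetical.

```python
import re

PROV = "http://www.w3.org/ns/prov#"
# Simplified pattern: <target-uri>; rel="relation-uri"
LINK = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def prov_links(link_header):
    """Group the target URIs of PROV link relations by relation name."""
    found = {}
    for uri, rel in LINK.findall(link_header):
        if rel.startswith(PROV):
            found.setdefault(rel[len(PROV):], []).append(uri)
    return found


header = (
    '<http://example.org/provenance/paper1>; rel="http://www.w3.org/ns/prov#has_provenance", '
    '<http://example.org/pingback/paper1>; rel="http://www.w3.org/ns/prov#pingback"'
)
print(prov_links(header))
```

In the scenario above, an author who finds the pingback URI can notify the publisher of provenance held elsewhere, so the publisher never has to host it.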

Page 51

2. Provenance Embedding

The nanopublication model is an example of provenance embedding within a published RDF document

From nanopub.org:

A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author. Individual nanopublications can be cited by others and tracked for their impact on the community.

Page 52

Nanopublication: example

Assertion: an “association” between a gene and a genetic disorder. The strength of this association is given by a statistical p-value. See nanopub.org for details

{
  : a nanopub:Nanopublication ;
    nanopub:hasAssertion :NanoPub_1_Assertion ;
    nanopub:hasProvenance :NanoPub_1_Provenance .
  :NanoPub_1_Provenance nanopub:hasAttribution :NanoPub_1_Attribution ;
    nanopub:hasSupporting :NanoPub_1_Supporting .
  :NanoPub_1_Assertion a nanopub:Assertion .
  :NanoPub_1_Provenance a nanopub:Provenance .
  :NanoPub_1_Attribution a nanopub:Attribution .
  :NanoPub_1_Supporting a nanopub:Supporting .
}
:NanoPub_1_Assertion {
  :Association_1 a sio:statistical-association ;
    sio:has-measurement-value :Association_1_p_value ;
    sio:refers-to ...
}
:NanoPub_1_Attribution {
  :pav:authoredBy res_a, reS_b.
  :NanoPub_1_Assertion pav:createdBy ...;
}
:NanoPub_1_Supporting {
  :Association_1 opm:wasDerivedFrom gene_disease_concept_profiles_1980_2010...;
    opm:wasGeneratedBy gene_disease_concept_profiles_matching_1980_2010; .
}

Page 53

3. Research Objects for data and provenance packaging

Research Objects (ROs) are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.

A Research Object is a combination of:

• Aggregation (reusing Object Reuse and Exchange [ORE])

• Annotation (reusing the Annotation Ontology [AO])

• RO ontologies

From the Wf4Ever EU project

See also: Belhajjame K, Corcho O, Garijo D, Zhao J, Missier P, Newman DR, Palma R, Bechhofer S et al.: Workflow-Centric Research Objects: A First Class Citizen in the Scholarly Discourse. In proceedings of the ESWC2012 Workshop on the Future of Scholarly Communication in the Semantic Web (SePublica2012), Heraklion, Greece, May 2012

Page 54

Links to resources cited in the talk

• The PROV Data Model (PROV-DM): www.w3.org/TR/prov-dm/

• A primer on PROV with a simple running example: http://www.w3.org/TR/prov-primer/

• LD and PROV:
  • Nanopublications: nanopub.org
  • Research Objects: researchobject.org
  • The Wf4Ever project: www.wf4ever-project.org
  • PROV Access and Query conventions (PROV-AQ): http://www.w3.org/TR/prov-aq/

• Visualising provenance using PROV-O-Viz: http://provoviz.org/
  • PROV-O-Viz video:
  • PROV-O-Viz IPAW’14 paper preprint: http://dare.ubvu.vu.nl/handle/1871/51388
  • Reference: Hoekstra, Rinke, and Paul Groth. “PROV-O-Viz - Understanding the Role of Activities in Provenance.” In Procs. IPAW 2014. Springer, 2014.
