Camp 4-data workshop presentation

Date post: 11-May-2015
Upload: paolo-missier
Description:
A presentation at the CAMP-4-DATA workshop, Sept. 6, Lisbon: http://dcevents.dublincore.org/IntConf/index/pages/view/camp-4-data
Transcript
Page 1: Camp 4-data workshop presentation

Provenance Central: More Mileage from Provenance Metadata

Bertram Ludäscher, UC Davis, USA

Paolo Missier, Newcastle University, UK

Members of the DataONE Provenance Working Group

CAMP-4-DATA workshop @IPres 2013, Sept. 6, 2013

Lisbon, Portugal

Friday, 6 September 13

Page 2: Camp 4-data workshop presentation

Outline

• A foundation for provenance management: the PROV data model
– From the W3C; a Recommendation as of spring 2013
– Generic, extensible model

• The role of provenance in the DataONE project
– Provenance enables search and discovery, reuse, reproducibility
– PBase: provenance warehousing
– Integration with the DataONE architecture
– Provenance mining: the social life of research data

Page 3: Camp 4-data workshop presentation

PROV: scope and structure

Figure: the PROV family of documents (Recommendation track, plus PROV-Dictionary).

Source: http://www.w3.org/TR/prov-overview/

Page 5: Camp 4-data workshop presentation

PROV Core Elements (graph depiction)

Entity: An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.

Activity: An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.

Agent: An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.

Figure (example graph): activities drafting and commenting; entities paper1, paper2, draftv1, draftcomments; agents Alice and Bob; relations used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf; time axis from remote past to recent past.

Example attributes in the figure: distribution=internal, status=draft, version=0.1; ex:role=main_editor; type=person, ex:role=sr_editor; prov:role=editor; time=...
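
The three core element kinds and the relations named on this slide can be sketched as a small in-memory graph. This is only an illustrative sketch: the `Node` and `ProvGraph` classes are hypothetical helpers, not part of any PROV library, and the way the example nodes are wired together is an assumption, since the slide's figure layout is not available in this transcript. The direction of each relation follows PROV-DM (an activity used an entity; an entity wasGeneratedBy an activity; an activity wasAssociatedWith an agent; an agent actedOnBehalfOf another agent).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A PROV element; kind is 'entity', 'activity', or 'agent'."""
    name: str
    kind: str


@dataclass
class ProvGraph:
    """Hypothetical helper: stores relations as (subject, relation, object) triples."""
    triples: set = field(default_factory=set)

    def add(self, subject: Node, relation: str, obj: Node) -> None:
        self.triples.add((subject, relation, obj))

    def objects(self, subject: Node, relation: str) -> set:
        """All objects reachable from subject through one edge of the given relation."""
        return {o for s, r, o in self.triples if s == subject and r == relation}


# Node names come from the slide; the wiring below is assumed for illustration.
paper1 = Node("paper1", "entity")
paper2 = Node("paper2", "entity")
draftv1 = Node("draftv1", "entity")
draftcomments = Node("draftcomments", "entity")
drafting = Node("drafting", "activity")
commenting = Node("commenting", "activity")
alice = Node("Alice", "agent")
bob = Node("Bob", "agent")

g = ProvGraph()
g.add(drafting, "used", paper1)
g.add(drafting, "used", paper2)
g.add(draftv1, "wasGeneratedBy", drafting)
g.add(commenting, "used", draftv1)
g.add(draftcomments, "wasGeneratedBy", commenting)
g.add(drafting, "wasAssociatedWith", alice)
g.add(commenting, "wasAssociatedWith", bob)
g.add(bob, "actedOnBehalfOf", alice)

# Which activity generated the draft, and what did that activity use?
generating_activity = g.objects(draftv1, "wasGeneratedBy")
sources = {e.name for a in generating_activity for e in g.objects(a, "used")}
```

Keeping relations as plain triples makes it easy to load the same data into a graph store later.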

Page 6: Camp 4-data workshop presentation

Summary of the PROV Core model

– PROV-DC mapping available
– Recent Tutorial @EDBT’13 (June, 2013) [1]
• Model, Constraints, Applications

[1] Missier, Paolo, Khalid Belhajjame, and James Cheney. “The W3C PROV Family of Specifications for Modelling Provenance Metadata.” In Procs. EDBT’13 (Tutorial). Genova, Italy: ACM, 2013.

Page 7: Camp 4-data workshop presentation

PROV-DM relations at a glance


Page 8: Camp 4-data workshop presentation

Context: ProvWG@DataONE

• DataONE: Data Observation Network for Earth
– 5-year NSF DataNet data preservation project (current phase)
– Provides a large-scale, federated data infrastructure to the Earth Sciences community

• Provenance Working Group
– Active until July 2014 (current phase; an extension is being considered)
– One or two interns per year since 2010
– One dedicated researcher (postdoc) since 2012
– 12 core members, plus additional guest members on a rotation
• Specific focus on the provenance of workflow-based e-science data

Page 9: Camp 4-data workshop presentation

DataONE collaboration scenario - 2012

Alice’s workflow generates benchmark climate data for model comparison.

Input is retrieved from DataONE to generate an output file.

Page 10: Camp 4-data workshop presentation

DataONE collaboration scenario - 2012

The workflow, provenance, and other metadata are uploaded to DataONE.

A data package is created and indexed.

Page 11: Camp 4-data workshop presentation

Searching

Bob: Search based on keywords in the metadata
➡ including provenance terms

Bob discovers Alice’s workflow. He may be able to execute it again.

Page 12: Camp 4-data workshop presentation

PBase and DataONE

Figure (architecture), component labels: System Metadata; Science Metadata; Science Data; Provenance; Extract-Align-Augment Metadata; Curation; Index; Identifiers / Text fields; Graph Structure; Search API; ProvExplorer; Internal Metadata Index; Repository; PBase / D-PROV; Querying.

– Provenance traces in PBase linked to DataONE packages
– Provenance traces indexed for searching

Page 13: Camp 4-data workshop presentation

DataOne Provenance components I: D-PROV

Figure: D-PROV data model.

• Regions: Workflow Land, Trace Land
• Elements: Scientific Experiment, Workflow, Actor, Data Container, Run, Invocation, Data
• Relations: isComposedOf, isExecutedAs, isRealizedBy, isInstanceOf, isContainedIn, holdsDataIn, in, out, executes, manages, genBy, uses, derivedFrom, triggers
• Keys and attributes:
– db_wf_id, wf_id, name
– db_run_id: db_wf_id || run_id, run_id
– db_invoc_id: db_run_id || invoc_id, invoc_id
– db_data_id: db_run_id || data_id, data_id
– type: input | output | intermediate

D-PROV extends PROV: it connects trace metadata to workflow structure.

Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.

Page 14: Camp 4-data workshop presentation

DataOne Provenance components I: D-PROV

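Figure (example D-PROV graph): workflow wf with tasks T1 and T2; output port op1 and input port ip1; workflow invocation wfInv; task invocations T1Inv and T2Inv; data item d. Relations shown: isTaskOf, hasOutputPort, hasInputPort, dataLink, wasAssociatedWith, wasStartedBy, onOutPort, onInPort.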
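
One way to read "connects trace metadata to workflow structure" is that a data flow observed in a trace can be checked against the ports and data links declared in the workflow. The sketch below illustrates that idea with plain triples. The node and relation names are taken from the slide, but which node each edge connects is an assumption (the figure layout is lost in this transcript), and `unlinked_flows` is a hypothetical helper rather than part of D-PROV.

```python
# Workflow structure and trace, as (subject, relation, object) triples.
# Edge endpoints are assumed for illustration.
TRIPLES = {
    # workflow structure
    ("T1", "isTaskOf", "wf"),
    ("T2", "isTaskOf", "wf"),
    ("T1", "hasOutputPort", "op1"),
    ("T2", "hasInputPort", "ip1"),
    ("op1", "dataLink", "ip1"),
    # execution trace
    ("wfInv", "wasAssociatedWith", "wf"),
    ("T1Inv", "wasAssociatedWith", "T1"),
    ("T2Inv", "wasAssociatedWith", "T2"),
    ("T1Inv", "wasStartedBy", "wfInv"),
    ("T2Inv", "wasStartedBy", "wfInv"),
    ("d", "onOutPort", "op1"),
    ("d", "onInPort", "ip1"),
}


def unlinked_flows(triples):
    """Return (data, out_port, in_port) flows that no declared dataLink explains."""
    out_ports = {(s, o) for s, r, o in triples if r == "onOutPort"}
    in_ports = {(s, o) for s, r, o in triples if r == "onInPort"}
    links = {(s, o) for s, r, o in triples if r == "dataLink"}
    return sorted(
        (data, out_port, in_port)
        for data, out_port in out_ports
        for other, in_port in in_ports
        if data == other and (out_port, in_port) not in links
    )


consistent = unlinked_flows(TRIPLES)

# A trace in which d also shows up on a port that op1 is not linked to.
suspicious = unlinked_flows(TRIPLES | {("d", "onInPort", "ip9")})
```

A PROV-only trace could not support this check, because ports and data links are workflow-level structure.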


Page 17: Camp 4-data workshop presentation

DataOne Provenance components II: PBase

• R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
• In-house components
• Neo4J loader
• Graph storage
– Neo4J graph DBMS [AllegroGraph] [Graph-*]
• Query layer
– Cypher (Neo, declarative) [Gremlin (procedural)]
– Can we do better? Scaling graph queries
• Indexing
– Can we do better than the built-in Neo indexing?
• Analytical services
– To be developed

Page 18: Camp 4-data workshop presentation

Baseline provenance queries in PBase

Ancestors and descendants (lineage): [2,3]
– Which datasets were involved in the production of data at node “e33”?
– Reachability: was task “e11_miny” involved in producing data at node “e38”?
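
Both lineage questions reduce to graph traversal over dependency edges. The following is a minimal sketch using breadth-first search over an adjacency dictionary; it is not how PBase or the cited systems implement these queries. Only the node names "e33", "e38", and "e11_miny" come from the slide; every other node and every edge is made up for illustration.

```python
from collections import deque

# node -> nodes it directly depends on (hypothetical toy graph)
DEPENDS_ON = {
    "e38": ["e11_miny"],
    "e11_miny": ["e33"],
    "e33": ["e20_load"],
    "e20_load": ["e05", "e07"],
}
IS_DATA = {"e38", "e33", "e05", "e07"}  # the remaining nodes are tasks


def ancestors(depends_on, start):
    """All nodes that start transitively depends on (start itself excluded)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in depends_on.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def was_involved(depends_on, source, target):
    """Reachability: did source contribute, directly or not, to target?"""
    return source in ancestors(depends_on, target)


datasets_behind_e33 = sorted(ancestors(DEPENDS_ON, "e33") & IS_DATA)
task_involved = was_involved(DEPENDS_ON, "e11_miny", "e38")
```

Descendant queries work the same way on the reversed edges.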

Execution analysis: [3]
– Which tasks did not execute to completion for execution X of a given workflow?
– Find all inputs [outputs] of a given workflow across all its executions
– Given a data item, find all workflows / tasks that have used it as input
– Suppose we discover that service S has a bug: which data products were impacted by it?
– How many times was task T activated across a pool of workflow executions?
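
Several of these questions can be answered from a flat list of invocation records. The sketch below assumes a hypothetical record layout (run, task, service, inputs, outputs) that is not taken from PBase, with made-up identifiers. It covers three of the queries above: impact of a buggy service, tasks that used a data item as input, and activation counts.

```python
from collections import Counter

# Hypothetical invocation records from two runs of a workflow.
INVOCATIONS = [
    {"run": "run1", "task": "T1", "service": "S", "inputs": ["d0"], "outputs": ["d1"]},
    {"run": "run1", "task": "T2", "service": "S2", "inputs": ["d1"], "outputs": ["d2"]},
    {"run": "run2", "task": "T1", "service": "S", "inputs": ["d3"], "outputs": ["d4"]},
    {"run": "run2", "task": "T3", "service": "S3", "inputs": ["d5"], "outputs": ["d6"]},
]


def impacted_by(invocations, service):
    """Data produced by the service, plus everything derived from that data."""
    tainted = set()
    changed = True
    while changed:  # repeat until no new data item becomes tainted
        changed = False
        for inv in invocations:
            dirty = inv["service"] == service or tainted.intersection(inv["inputs"])
            new = set(inv["outputs"]) - tainted
            if dirty and new:
                tainted |= new
                changed = True
    return tainted


def tasks_using(invocations, data_item):
    """Tasks that consumed the given data item as input."""
    return sorted({inv["task"] for inv in invocations if data_item in inv["inputs"]})


def activations(invocations, task):
    """Number of times a task was invoked across all recorded runs."""
    return Counter(inv["task"] for inv in invocations)[task]


impacted = impacted_by(INVOCATIONS, "S")
```

The fixed-point loop in `impacted_by` is deliberately simple; on large traces a traversal over an indexed graph would be the natural replacement.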

Provenance differencing: [4]
– Why do the results from two executions of the same workflow differ?
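
As a rough illustration of the question only, and not the algorithm of [4], two traces of the same workflow can be compared by taking the set difference of their statements. The sketch assumes each trace is a set of (task, relation, value) triples in which data values carry a version or content hash; all identifiers below are made up.

```python
def diff_traces(trace_a, trace_b):
    """Statements that appear in only one of the two traces."""
    return sorted(trace_a - trace_b), sorted(trace_b - trace_a)


# Two hypothetical runs: T1 behaves identically, T2 reads different parameters.
RUN_A = {
    ("T1", "used", "input@v1"),
    ("T1", "generated", "mid@h1"),
    ("T2", "used", "mid@h1"),
    ("T2", "used", "params@v1"),
    ("T2", "generated", "out@h7"),
}
RUN_B = {
    ("T1", "used", "input@v1"),
    ("T1", "generated", "mid@h1"),
    ("T2", "used", "mid@h1"),
    ("T2", "used", "params@v2"),
    ("T2", "generated", "out@h9"),
}

only_a, only_b = diff_traces(RUN_A, RUN_B)
```

Here the difference points at T2: it read a different parameter version and therefore produced a different output, while everything upstream matches.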

Attribution: [5]
– Who was responsible for this {data {usage, production}, service invocation}?
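
For data production, an attribution query can be sketched as a walk over PROV relations: from the data item to the activity that generated it, then to the associated agents, then along any delegation chain. `responsible_agents` is a hypothetical helper, and the example triples are assumed wiring using names from the PROV example earlier in the deck.

```python
def responsible_agents(triples, entity):
    """Agents associated with the generating activities, plus those they acted for."""
    activities = {o for s, r, o in triples if s == entity and r == "wasGeneratedBy"}
    agents = {o for s, r, o in triples if s in activities and r == "wasAssociatedWith"}
    frontier = set(agents)
    while frontier:  # follow actedOnBehalfOf until no new agent appears
        frontier = {
            o for s, r, o in triples if s in frontier and r == "actedOnBehalfOf"
        } - agents
        agents |= frontier
    return agents


# Assumed example wiring, for illustration only.
TRACE = {
    ("draftcomments", "wasGeneratedBy", "commenting"),
    ("commenting", "wasAssociatedWith", "Bob"),
    ("Bob", "actedOnBehalfOf", "Alice"),
    ("draftv1", "wasGeneratedBy", "drafting"),
    ("drafting", "wasAssociatedWith", "Alice"),
}

who = responsible_agents(TRACE, "draftcomments")
```

Subtracting already-known agents from the frontier keeps the loop finite even if the delegation edges contain a cycle.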

[2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. "On implementing provenance-aware regular path queries with relational query engines." In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.

[3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. "Datalog as a lingua franca for provenance querying and reasoning." In Workshop on the theory and practice of provenance (TaPP). 2012.

[4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience, 2013

[5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun, and Ilkay Altintas. "Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository." International Journal of Digital Curation 7, no. 1 (2012): 139-150.

Page 19: Camp 4-data workshop presentation

Application - The social life of research data

• We know all about searching in the publications space
– Who else is working on problems similar to mine?
– Which results are available?

• In the data and process space:
1. Search and discovery
• Who else has used the {datasets, services, workflows, ...} I am using?
– How do others rate them?
• Who used my {datasets, services, workflows, ...}? How were they used?
2. Reuse, reproduction, validation
• Can I reproduce these results?
– using the same exact method
– using a variation of the method
• How do I apply this method to my data?
• ...

Social provenance for community building

Page 20: Camp 4-data workshop presentation

From Pull (client queries) to Push (notifications)

• Uncovering latent connections amongst services / data / people:
– Ranking, clustering, association rules
– Requires new similarity metrics

• A recommender system for scientists
– Analytics layer activated when new traces are added

• Challenges:
– How large a corpus of provenance graphs is needed?
– How global should the community be?
• Little new to discover in a small community
– Requires graphs with rich attribution / association relations
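
The push model can be sketched as an analytics hook that runs whenever a trace is added and notifies scientists whose resource usage overlaps. The slide calls for new similarity metrics; the sketch below substitutes plain Jaccard similarity as a familiar stand-in, and the `TraceStore` class, the threshold value, and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class TraceStore:
    """Hypothetical store that runs a similarity check on every new trace."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.used = {}           # scientist -> set of dataset/service/workflow ids
        self.notifications = []  # (recipient, message) pairs

    def add_trace(self, scientist, resources):
        self.used.setdefault(scientist, set()).update(resources)
        self._analyse(scientist)  # push: analytics triggered by the new trace

    def _analyse(self, scientist):
        mine = self.used[scientist]
        for other, theirs in self.used.items():
            if other == scientist:
                continue
            score = jaccard(mine, theirs)
            if score >= self.threshold:
                message = f"{scientist} uses similar resources (Jaccard {score:.2f})"
                self.notifications.append((other, message))


store = TraceStore(threshold=0.5)
store.add_trace("Alice", {"ds1", "ds2", "wf1"})
store.add_trace("Bob", {"ds1", "ds2"})   # overlaps with Alice: 2 of 3 resources
store.add_trace("Carol", {"ds9"})        # no overlap, so no notification
```

With only three scientists the single match is unsurprising, which echoes the slide's concern that little is discovered in a small community.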

Figure (PBase layers): Graph storage; Query layer; indexing; Analytical services.

Page 21: Camp 4-data workshop presentation

Credits

Current members of the DataONE Provenance Working Group:

In the USA:
Bertram Ludaescher, UC Davis (co-lead)
Victor Cuevas Vicenttin, UC Davis (DataONE postdoc researcher)
Saumen Dey, UC Davis (researcher)
Parisa Kianmajd, UC Davis (intern)
Juliana Freire, NYU-Poly
David Koop, NYU-Poly
Fernando Chirigati, NYU-Poly
Shawn Bowers, Gonzaga University
Ilkay Altintas, SDSC/UCSD
Karthik Ram, UC Berkeley
Yolanda Gil, USC - ISI
Yaxing Wei, ORNL
Dave Vieglais, DataONE Technical Lead

In the UK:
Paolo Missier, Newcastle University
James Cheney, University of Edinburgh
Khalid Belhajjame, University of Manchester

