Provenance Central: More Mileage from Provenance Metadata
Bertram Ludäscher, UC Davis, [email protected]
Paolo Missier, Newcastle University, [email protected]
Members of the DataONE Provenance Working Group
CAMP-4-DATA workshop @ iPRES 2013, Sept. 6, 2013
Lisbon, Portugal
Friday, 6 September 13
Outline
• A foundation for provenance management: the PROV data model
– From the W3C; a Recommendation as of spring 2013
– Generic, extensible model
• The role of provenance in the DataONE project
– Provenance enables search and discovery, reuse, reproducibility
– PBase: provenance warehousing
– Integration with the DataONE architecture
– Provenance mining: the social life of research data
PROV: scope and structure
[Figure: the PROV family of documents, showing the Recommendation track, plus PROV-Dictionary. Source: http://www.w3.org/TR/prov-overview/]
PROV Core Elements (graph depiction)
Entity: a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
Activity: something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
Agent: something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
[Figure: example PROV graph on a timeline from remote past to recent past. Activities: drafting, commenting. Entities: paper1, paper2, draftv1, draftcomments. Agents: Alice, Bob. Relations: used, wasGeneratedBy, wasAssociatedWith, actedOnBehalfOf. Attributes shown: distribution=internal, status=draft, version=0.1; ex:role=main_editor; type=person, ex:role=sr_editor; prov:role=editor; time=...]
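The three core element types and the relations between them can be sketched in plain Python. This is only an illustration, not the PROV data model's official serialization: the `ProvGraph` class and `build_example` function are hypothetical names, and the exact edges between the drafting/commenting nodes are an assumption, since the slide's graph layout is not recoverable from the text. The relation signatures (which element type may appear on each side) follow PROV-DM.

```python
from collections import namedtuple

Relation = namedtuple("Relation", "kind subject object")


class ProvGraph:
    """Tiny in-memory store for PROV core elements (illustrative only)."""

    # (subject type, object type) for the four core relations used here
    SIGNATURES = {
        "used": ("activity", "entity"),
        "wasGeneratedBy": ("entity", "activity"),
        "wasAssociatedWith": ("activity", "agent"),
        "actedOnBehalfOf": ("agent", "agent"),
    }

    def __init__(self):
        self.kinds = {}      # node id -> "entity" | "activity" | "agent"
        self.attrs = {}      # node id -> dict of attributes
        self.relations = []  # list of Relation tuples

    def node(self, ident, kind, **attrs):
        self.kinds[ident] = kind
        self.attrs[ident] = attrs

    def relate(self, kind, subject, obj):
        # Reject relations whose endpoints have the wrong element type.
        expected = self.SIGNATURES[kind]
        actual = (self.kinds[subject], self.kinds[obj])
        if actual != expected:
            raise ValueError(f"{kind} expects {expected}, got {actual}")
        self.relations.append(Relation(kind, subject, obj))


def build_example():
    """Build a graph with the node names from the slide (edges assumed)."""
    g = ProvGraph()
    for ident in ("paper1", "paper2", "draftv1", "draftcomments"):
        g.node(ident, "entity")
    g.node("drafting", "activity")
    g.node("commenting", "activity")
    g.node("Alice", "agent", type="person")
    g.node("Bob", "agent", type="person")

    g.relate("used", "drafting", "paper1")
    g.relate("used", "drafting", "paper2")
    g.relate("wasGeneratedBy", "draftv1", "drafting")
    g.relate("used", "commenting", "draftv1")
    g.relate("wasGeneratedBy", "draftcomments", "commenting")
    g.relate("wasAssociatedWith", "drafting", "Alice")
    g.relate("wasAssociatedWith", "commenting", "Bob")
    g.relate("actedOnBehalfOf", "Bob", "Alice")
    return g
```

Checking endpoint types at insertion time is a design choice of this sketch; it keeps malformed statements such as an agent "using" an entity out of the store.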
Summary of the PROV Core model
– PROV-DC mapping available
– Recent Tutorial @EDBT’13 (June, 2013) [1]
• Model, Constraints, Applications
[1] Missier, Paolo, Khalid Belhajjame, and James Cheney. “The W3C PROV Family of Specifications for Modelling Provenance Metadata.” In Procs. EDBT’13 (Tutorial). Genova, Italy: ACM, 2013.
PROV-DM relations at a glance
Context: ProvWG@DataONE
• DataONE: Data Observation Network for Earth
– 5-year NSF DataNet data preservation project (current phase)
– Provides a large-scale, federated data infrastructure to the Earth Sciences community
• Provenance Working Group
– Active until July 2014 (current phase; an extension is being considered)
– One or two interns per year since 2010
– One dedicated researcher (postdoc) since 2012
– 12 core members, additional guest members on a rotation
• Specific focus on the provenance of workflow-based e-science data
DataONE collaboration scenario - 2012
Alice’s Workflow: generates benchmark climate data for model comparison
Input is retrieved from DataONE to generate an output file
The workflow, provenance, and other metadata are uploaded to DataONE
A data package is created and indexed
Searching
Bob: search based on keywords in the metadata
➡ including provenance terms
Bob discovers Alice’s workflow. He may be able to execute it again.
PBase and DataONE
[Figure: PBase within the DataONE architecture. Labels: System Metadata; Extract-Align-Augment Metadata; Science Data; Search API; Science Metadata; Provenance; Curation; Index; Identifiers / Text fields; Graph Structure; ProvExplorer; Internal Metadata Index; Repository; PBase / D-PROV; Querying]
– Provenance traces in PBase linked to DataONE packages
– Provenance traces indexed for searching
DataONE Provenance components I: D-PROV
[Figure: D-PROV model, divided into "Workflow Land" and "Trace Land". Nodes: Scientific Experiment, Workflow, Actor, Data Container, Run, Invocation, Data. Relations: isComposedOf, isExecutedAs, isRealizedBy, isInstanceOf, isContainedIn, holdsDataIn, in, out, executes, manages, genBy, uses, derivedFrom, triggers. Attributes: db_wf_id, wf_id, name; db_run_id (db_wf_id || run_id), run_id; db_invoc_id (db_run_id || invoc_id), invoc_id; db_data_id (db_run_id || data_id), data_id; type: input | output | intermediate]
D-PROV extends PROV - Connects trace metadata to workflow structure
Missier, Paolo, Saumen Dey, Khalid Belhajjame, Victor Cuevas, and Bertram Ludaescher. “D-PROV: Extending the PROV Provenance Model with Workflow Structure.” In Procs. TAPP’13. Lombard, IL, 2013.
[Figure: D-PROV example graph. Nodes: wf, T1, T2, op1, ip1, wfInv, T1Inv, T2Inv, d. Relations: isTaskOf, hasInputPort, hasOutputPort, dataLink, wasAssociatedWith, wasStartedBy, onOutPort, onInPort]
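To illustrate how connecting trace metadata to workflow structure pays off at query time, here is a minimal sketch that stores D-PROV-style statements as triples and walks from a data item to the tasks that produced and consumed it. The node and relation names are taken from the slide; the direction and endpoints of each edge are an assumption, and the helper functions (`objects`, `subjects`, `producer_tasks`, `consumer_tasks`) are hypothetical names, not part of any D-PROV tooling.

```python
# Each statement is (subject, relation, object). Edge layout is assumed.
TRIPLES = [
    # workflow structure
    ("T1", "isTaskOf", "wf"),
    ("T2", "isTaskOf", "wf"),
    ("T1", "hasOutputPort", "op1"),
    ("T2", "hasInputPort", "ip1"),
    ("op1", "dataLink", "ip1"),
    # execution trace
    ("wfInv", "wasAssociatedWith", "wf"),
    ("T1Inv", "wasAssociatedWith", "T1"),
    ("T2Inv", "wasAssociatedWith", "T2"),
    ("T1Inv", "wasStartedBy", "wfInv"),
    ("T2Inv", "wasStartedBy", "wfInv"),
    # the bridge between the two: data observed on ports
    ("d", "onOutPort", "op1"),
    ("d", "onInPort", "ip1"),
]


def objects(subject, relation, triples=TRIPLES):
    """All o such that (subject, relation, o) is stated."""
    return {o for s, r, o in triples if s == subject and r == relation}


def subjects(relation, obj, triples=TRIPLES):
    """All s such that (s, relation, obj) is stated."""
    return {s for s, r, o in triples if r == relation and o == obj}


def producer_tasks(data, triples=TRIPLES):
    """Tasks owning an output port on which the data item appeared."""
    tasks = set()
    for port in objects(data, "onOutPort", triples):
        tasks |= subjects("hasOutputPort", port, triples)
    return tasks


def consumer_tasks(data, triples=TRIPLES):
    """Tasks owning an input port on which the data item appeared."""
    tasks = set()
    for port in objects(data, "onInPort", triples):
        tasks |= subjects("hasInputPort", port, triples)
    return tasks
```

With plain PROV, the trace alone would say which invocation generated `d`; the port statements additionally say where in the workflow definition that happened.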
DataONE Provenance components II: PBase
• Translators into D-PROV: R ➞ DProv, T ➞ DProv, V ➞ DProv, eSc ➞ DProv, Tr ➞ DProv, K ➞ DProv
• Architecture: Neo4J loader, graph storage, query layer, indexing, analytical services
• Annotations:
– In-house components
– Neo4J graph DBMS [AllegroGraph] [Graph-*]
– Cypher (Neo, declarative) [Gremlin (procedural)]: can we do better? Scaling graph queries
– Can we do better than the built-in Neo indexing?
– To be developed
Baseline provenance queries in PBase
Ancestors and descendants (lineage): [2,3]
– Which datasets were involved in the production of data at node “e33”?
– Reachability: was task “e11_miny” involved in producing data at node “e38”?
Execution analysis: [3]
– Which tasks did not execute to completion for execution X of a given workflow?
– Find all inputs [outputs] of a given workflow across all its executions
– Given a data item, find all workflows / tasks that have used it as input
– Suppose we discover that service S has a bug: which data products were impacted by it?
– How many times was task T activated across a pool of workflow executions?
Provenance differencing: [4]
– Why do the results from two executions of the same workflow differ?
Attribution: [5]
– Who was responsible for this {data {usage, production}, service invocation}?
[2] Dey, Saumen, Víctor Cuevas-Vicenttín, Sven Köhler, Eric Gribkoff, Michael Wang, and Bertram Ludäscher. "On implementing provenance-aware regular path queries with relational query engines." In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pp. 214-223. ACM, 2013.
[3] Dey, Saumen, Sven Köhler, Shawn Bowers, and Bertram Ludäscher. "Datalog as a lingua franca for provenance querying and reasoning." In Workshop on the theory and practice of provenance (TaPP). 2012.
[4] Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience, 2013
[5] Missier, Paolo, Bertram Ludäscher, Saumen Dey, Michael Wang, Tim McPhillips, Shawn Bowers, Michael Agun, and Ilkay Altintas. "Golden Trail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository." International Journal of Digital Curation 7, no. 1 (2012): 139-150.
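The lineage, reachability, and impact questions above all reduce to transitive closure over the dependency edges of a trace. The sketch below shows this on a toy graph in plain Python; it is not how PBase implements these queries (the slides name Neo4J and Cypher for that). The node names "e33", "e38", and "e11_miny" come from the slide, while the other nodes ("e30", "e31", "e12_agg") and all edges are hypothetical.

```python
# DEPENDS_ON[x] lists the nodes x was directly derived from or produced by.
# The graph shape is invented for illustration.
DEPENDS_ON = {
    "e11_miny": ["e30", "e31"],  # task reads two datasets
    "e33": ["e11_miny"],         # dataset generated by the task
    "e12_agg": ["e33"],          # hypothetical downstream task
    "e38": ["e12_agg"],          # final data product
}
KIND = {
    "e30": "data", "e31": "data", "e33": "data", "e38": "data",
    "e11_miny": "task", "e12_agg": "task",
}


def closure(start, edges):
    """All nodes reachable from start by following edges (start excluded)."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invert(edges):
    """Reverse every edge so that closure() walks towards descendants."""
    reverse = {}
    for node, parents in edges.items():
        for parent in parents:
            reverse.setdefault(parent, []).append(node)
    return reverse


def ancestor_datasets(node):
    """Lineage: datasets involved in producing the given node."""
    return {n for n in closure(node, DEPENDS_ON) if KIND[n] == "data"}


def was_involved(task, node):
    """Reachability: did the task contribute to the given node?"""
    return task in closure(node, DEPENDS_ON)


def impacted_data(task):
    """Impact analysis: data products downstream of a (buggy) task."""
    downstream = closure(task, invert(DEPENDS_ON))
    return {n for n in downstream if KIND[n] == "data"}
```

On this toy graph, `ancestor_datasets("e33")` returns the two input datasets, `was_involved("e11_miny", "e38")` is true, and `impacted_data("e11_miny")` returns both e33 and e38.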
Application - The social life of research data
• We know all about searching in the publications space
– who else is working on problems similar to mine?
– which results are available?
• In the data and process space:
1. Search and discovery
• Who else has used the {datasets, services, workflows,...} I am using?
– how do others rate them?
• Who used my {datasets, services, workflows,...}? How were they used?
2. Reuse, reproduction, validation
• Can I reproduce these results?
– using the same exact method
– using a variation of the method
• How do I apply this method to my data?
• ...
Social provenance for community building
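The "who else has used what I am using?" question can be answered from usage records alone, provided each usage is attributed to an agent. The following is a minimal sketch under that assumption; the `USAGE` records, the agent and resource names, and the `co_users` function are all hypothetical and stand in for what would be extracted from attributed provenance traces.

```python
from collections import defaultdict

# (agent, resource) pairs, as they might be derived from used /
# wasAssociatedWith statements in a provenance store. Invented data.
USAGE = [
    ("alice", "dataset:climate_benchmark"),
    ("alice", "workflow:model_compare"),
    ("bob", "dataset:climate_benchmark"),
    ("bob", "service:regrid"),
    ("carol", "service:regrid"),
]


def co_users(me, usage):
    """Map every other agent to the resources they share with `me`."""
    mine = {resource for agent, resource in usage if agent == me}
    shared = defaultdict(set)
    for agent, resource in usage:
        if agent != me and resource in mine:
            shared[agent].add(resource)
    return dict(shared)
```

Here `co_users("alice", USAGE)` reports that bob shares the climate benchmark dataset with alice, while carol, who has nothing in common with her, does not appear.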
From Pull (client queries) to Push (notifications)
• Uncovering latent connections amongst services / data / people:
– Ranking, clustering, association rules
– Requires new similarity metrics
• A recommender system for scientists
– Analytics layer activated when new traces are added
• Challenges:
– How large a corpus of provenance graphs is needed?
– How global should the community be?
• Little new to discover in a small community
– Requires graphs with rich attribution / association relations
[Figure: PBase layers: graph storage, query layer, indexing, analytical services]
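The push model can be sketched as a hook that runs whenever a new trace is added and reports other agents whose resource usage has become similar. The slide calls for new similarity metrics; the sketch below substitutes plain Jaccard similarity as a stand-in, and the `ProvenanceCorpus` class, its `threshold` parameter, and the sample resource names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class ProvenanceCorpus:
    """Keeps, per agent, the set of resources seen in their traces."""

    def __init__(self, threshold=0.5):
        self.resources_by_agent = {}
        self.threshold = threshold

    def add_trace(self, agent, resources):
        """Record a trace; return [(other_agent, score)] worth notifying.

        This plays the role of the analytics layer being activated
        when a new trace arrives.
        """
        updated = self.resources_by_agent.setdefault(agent, set())
        updated |= set(resources)
        notifications = []
        for other, theirs in self.resources_by_agent.items():
            if other == agent:
                continue
            score = jaccard(updated, theirs)
            if score >= self.threshold:
                notifications.append((other, score))
        # Most similar agents first.
        return sorted(notifications, key=lambda item: -item[1])
```

A usage example: after alice adds a trace over `{"d1", "d2"}` and bob adds one over `{"d1", "d2", "d3"}`, bob's call returns a notification for alice, because two of the three resources in the union are shared. In a small corpus few pairs cross the threshold, which is the "little new to discover in a small community" concern in concrete form.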
Credits
Current members of the DataONE Provenance Working Group:
In the USA:
Bertram Ludaescher, UC Davis (co-lead)
Victor Cuevas Vicenttin, UC Davis (DataONE postdoc researcher)
Saumen Dey, UC Davis (researcher)
Parisa Kianmajd, UC Davis (intern)
Juliana Freire, NYU-Poly
David Koop, NYU-Poly
Fernando Chirigati, NYU-Poly
Shawn Bowers, Gonzaga University
Ilkay Altintas, SDSC/UCSD
Karthik Ram, UC Berkeley
Yolanda Gil, USC - ISI
Yaxing Wei, ORNL
Dave Vieglais, DataONE Technical Lead
In the UK:
Paolo Missier, Newcastle University
James Cheney, University of Edinburgh
Khalid Belhajjame, University of Manchester