Understanding Collaborative Studies Through Interoperable Workflow Provenance

Ilkay Altintas1,2, Manish Kumar Anand1, Daniel Crawl1, Shawn Bowers3, Adam Belloum2, Paolo Missier4, Bertram Ludascher5, Carole A. Goble4, Peter M.A. Sloot2

1 San Diego Supercomputer Center, University of California, San Diego, USA, {altintas, mkanand, crawl}@sdsc.edu

2 Computational Science, University of Amsterdam, The Netherlands, {A.S.Z.Belloum, p.m.a.sloot}@uva.nl

3 Department of Computer Science, Gonzaga [email protected]

4 School of Computer Science, University of Manchester, Manchester, UK, {pmissier, carole.goble}@cs.man.ac.uk

5 UC Davis Genome Center, University of California, [email protected]

Abstract. The provenance of a data product contains information about how the product was derived, and is crucial for enabling scientists to easily understand, reproduce, and verify scientific results. Currently, most provenance models are designed to capture the provenance of a single run, typically executed by a single user. However, a scientific discovery is often the result of the methodical execution of many scientific workflows with many datasets produced at different times by one or more users. Further, to promote and facilitate the exchange of information between multiple workflow systems supporting provenance, the Open Provenance Model (OPM) has been proposed by the scientific workflow community. In this paper, we describe a new query model that captures implicit user collaborations. We show how this model maps to OPM and helps to answer collaborative queries, e.g., identifying combined workflows and contributions of users collaborating on a project based on the records of previous workflow executions. We also adopt and extend the high-level Query Language for Provenance (QLP) with additional constructs, and show how these extensions allow non-expert users to express collaborative provenance queries against this model easily and concisely. Furthermore, we adopt the Provenance Challenge 3 (PC3) workflows as a collaborative and interoperable usecase scenario, where different stages of the workflow are executed in three different workflow environments: Kepler, Taverna, and WS-VLAM. Through this usecase, we demonstrate how we can establish and understand collaborative studies through interoperable workflow provenance.

1 Introduction

As scientific knowledge grows and the number of studies that require access to knowledge from multiple scientific disciplines increases, the complexity of scientific problems grows as well. To cope with this complexity, scientists use computational methods that are evolving almost daily. However, the basic scientific method remains the same while being continuously transformed from manual to automated with the advances in computer science and technology. This gradual shift from manual process execution to automation of repeatable patterns resulted in the creation of scientific workflow systems.

Fig. 1. Component architecture of a typical scientific research infrastructure (external data, service and computational infrastructure not shown).

Scientific workflow management systems [1,2,3,4] are critical to the way a modern scientist conducts studies today, making technological advances more approachable through integrative interfaces and abstractions for underlying computational and data resources. Scientific workflow systems allow scientists to develop formal, customizable, reusable and extensible definitions for all or part of a scientific process and execute them efficiently. In addition, using scientific workflows to perform computational experiments on data makes it possible to maintain the provenance of that data [5]. Typically designed iteratively by a user and run multiple times by one or more users, a scientific workflow yields provenance that is a rich source for conducting similar future scientific studies [6]. However, this is still only a partial solution to the modern scientific process that relies on multi-disciplinary collaborative teams working on different parts of scientific studies. Currently, provenance support in workflow systems is mostly designed to capture the information related to a single workflow run by a user. On the other hand, the collaborative process often involves design and execution of multiple workflows [7], where different members of a team conceptualize their contributions as workflows and make them available through a common infrastructure. A scientific discovery is the result of methodical execution of many of these workflows with many datasets at different times by one or more users.

Community portals [8], virtual laboratories [9], and Web 2.0-based social networking and sharing environments [10] are popular platforms to establish a common infrastructure where community members can contribute data, workflows and projects through their user spaces under generic governance rules. Workflows could be executed multiple times by one or more scientists, potentially from an end-user interface that combines several workflows. In addition, the executed workflows use data from external data resources and the scientific outputs are saved in data repositories, optionally along with intermediate results and the process provenance. A typical set of components for such an infrastructure is illustrated in Fig. 1.

Fig. 2. Different observables of shared data, workflows and their runs in a typical scientific research project: (a) data ({d1, d2, d3}) published by users in {u1, u6}; (b) ready-to-run workflows ({wf1 .. wf5}) published by users in {u2, u4, u5}; (c) flow graph for published workflow runs (customized through their parameters) and related data ({d1 .. d10}) in user spaces ({u1, u2, u3}), separated by horizontal dashed lines.

The discussion in this paper lies at the heart of these sharing and execution practices in e-science projects, where the overall execution of a set of workflows can result in an overarching model of the scientific process leading to data artifacts. All the observables in such a project within a three-dimensional space of users, workflows and data are illustrated in Fig. 2. The history of workflow runs in different user spaces {u1, u2, u3} depicted in Fig. 2(c) shows the usage of the published data in Fig. 2(a) and the workflows in Fig. 2(b). Users who performed the workflow runs or used published data start an “implicit collaboration”. In Fig. 2(c), a run node identifies the provenance of a previous workflow run and the fine-grained data dependencies are shown by dashed links between data nodes. One can identify the flow of workflow executions leading to a data artifact that is published as a “scientific discovery” by chaining together the runs performed by users.

Fig. 3. Different views generated by modeling and analysis of runs in a typical scientific research project: (a) data dependency view; (b) run dependency view; (c) overall non-transitive directed implicit user ({u1 .. u6}) collaboration view.

A goal of this approach is to extend the current single-workflow and single-user targeted provenance approach to a number of workflow runs within a controlled environment such as a website community portal for sharing data and workflows. In this paper, we assume that the data store is publicly shared between users. Using this extended information, one can generate views of data dependencies, related workflow executions and user collaborations, as seen in Fig. 3(a), Fig. 3(b) and Fig. 3(c) respectively. In addition, it becomes possible to answer queries for potential acknowledgements of a scientific result and the correction trail of a faulty data item. Another important goal is to propose and demonstrate an architecture that facilitates the interoperability of different workflow systems through provenance of workflow runs and related data. We assume the model of provenance is shared between different workflow systems and provides a global repository of data artifact identifiers, i.e., an artifact produced by one workflow system and consumed by another can be uniquely identified. The design of this repository is not in the scope of this paper. This new approach puts user actions and collaborations in the center of the conducted research, independent of the computational technologies used to generate results.

Contributions. We investigate the implicit user collaborations in a QLP-based [11] query model that maps to OPM [12] using observables in an e-science infrastructure (Fig. 2) and for generating views on top of them (Fig. 3). This approach links OPM graphs for workflow runs that have an input or output data dependency and helps to answer queries such as identifying data connections between workflow runs and contributions of users collaborating on a project based on the records of past executions. We adopt and extend a high-level query language for provenance, QLP, to express complex collaborative provenance queries. We also establish a mapping between QLP and OPM. Furthermore, through the PC3 (http://twiki.ipaw.info/bin/view/Challenge/) usecase scenario, we demonstrate the feasibility of our approach and how it can lead to the development of systems that increase interoperability and reusability of workflow results by integrating provenance coming out of different workflow systems, in turn enhancing efficiency in modern collaborative research.

Outline. The organization of this paper is as follows. In Section 2, we introduce the concept of collaborative views and queries over interoperable provenance data. Section 3 explains QLP and the extensions we build on top of QLP that map to OPM constructs, along with the QLP expressions of queries defined in Section 2. A feasibility study for the explained techniques is provided in Section 4, based on PC3 workflows. We review background work in Section 5 and conclude in Section 6.

2 Building Collaborative Views

The lifecycle of scientific workflows, which includes the design, execution, sharing, and management of data and provenance products, depends not only on the workflow itself, but also on the overall scientific research infrastructure and scientific collaborations within which scientists use these workflows. In this section, we introduce the concept of collaborative views based on the provenance of workflows and user actions within a scientific infrastructure (see Fig. 2). We also present example queries that are enabled by our provenance model, including those that allow scientists to determine implicit collaborative relationships. The basic relations we use to develop collaborative views are shown in Fig. 4. We first describe these relations, and then show how they enable the construction of both standard and collaborative provenance views.

The relation Run(r, w) states that r is a run (i.e., execution) of a workflow w (shown using rounded boxes in Fig. 4). We assume every run is of exactly one workflow. Each run r can take zero or more data artifacts d_in as input according to the relation Input(r, d_in), which states that d_in was input to r. Each run r can also have zero or more data artifacts d_out as outputs according to the relation Output(r, d_out), which states that d_out was an output of r. A data artifact can be an output of at most one run, but can be used as an input to zero or more runs.

Fig. 4. The main entities and edges of the collaborative provenance model.

Although not shown directly in Fig. 4, we assume a relation DerivedFrom(d_out, r, d_in) for capturing causal dependencies between input and output data items of a run r. Given a fact DerivedFrom(d_out, r, d_in), we say that d_out was derived by r using d_in. Each derivation also implies that d_in was an input to r, i.e., Input(r, d_in), and d_out was an output of r, i.e., Output(r, d_out). This constraint is captured in first-order logic (FO) as

∀ d_in, r, d_out (DerivedFrom(d_out, r, d_in) → Input(r, d_in) ∧ Output(r, d_out)).
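The constraint above, together with the earlier assumption that an artifact is an output of at most one run, can be checked mechanically. The following is a minimal sketch, assuming the relations are held as Python sets of tuples; the function names and the sample facts are hypothetical and are not taken from the figures.

```python
def derivation_violations(derived_from, inputs, outputs):
    """DerivedFrom facts (d_out, run, d_in) lacking Input(run, d_in) or Output(run, d_out)."""
    return {
        (d_out, run, d_in)
        for d_out, run, d_in in derived_from
        if (run, d_in) not in inputs or (run, d_out) not in outputs
    }


def multiply_produced(outputs):
    """Artifacts that appear as an output of more than one run."""
    producers = {}
    for run, artifact in outputs:
        producers.setdefault(artifact, set()).add(run)
    return {artifact for artifact, runs in producers.items() if len(runs) > 1}


# Hypothetical facts: Input(r, d) and Output(r, d) stored as (run, artifact) pairs.
inputs = {("r1", "d1"), ("r1", "d2")}
outputs = {("r1", "d4")}
# The second derivation mentions d5, which is never declared as an output of r1.
derived_from = {("d4", "r1", "d1"), ("d5", "r1", "d2")}

violations = derivation_violations(derived_from, inputs, outputs)
duplicates = multiply_produced(outputs)
```

Here only the derivation of d5 is reported, since Output(r1, d5) is missing.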

We define the relation DDep(d_out, d_in) as the set of all immediate data dependencies given by DerivedFrom, where d_out is said to depend on d_in. We can easily compute DDep using the following Datalog rule.

DDep(d_out, d_in) :- DerivedFrom(d_out, r, d_in).

We write DDep∗ to denote the reflexive and transitive closure of the DDep relation.

The Used and Produced relations (as shown in Fig. 4) are variants of Input and Output that additionally imply a derivation relationship. These relations are defined as views over Input and Output using the following Datalog rules.

Used(r, d_in) :- DerivedFrom(d_out, r, d_in).
Produced(r, d_out) :- DerivedFrom(d_out, r, d_in).

The first rule states that a data artifact d_in was used by a run r if some output d_out of r was derived from d_in. The second rule states that a data artifact d_out was produced by a run r if it was derived from an input d_in of r. Note that these relations do not explicitly link the inputs and outputs of a derivation, which is only done through the DerivedFrom relation.
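The views defined so far can be sketched in a few lines. This is an illustration only, assuming DerivedFrom is stored as a set of (d_out, run, d_in) triples; the sample facts are hypothetical, and the reflexive part of the closure is taken over the artifacts that occur in some derivation.

```python
# Hypothetical DerivedFrom facts as (d_out, run, d_in) triples.
derived_from = {
    ("d4", "r1", "d1"),
    ("d5", "r1", "d2"),
    ("d7", "r2", "d4"),
}


def ddep(facts):
    """DDep(d_out, d_in) :- DerivedFrom(d_out, r, d_in)."""
    return {(d_out, d_in) for d_out, _run, d_in in facts}


def used(facts):
    """Used(r, d_in) :- DerivedFrom(d_out, r, d_in)."""
    return {(run, d_in) for _d_out, run, d_in in facts}


def produced(facts):
    """Produced(r, d_out) :- DerivedFrom(d_out, r, d_in)."""
    return {(run, d_out) for d_out, run, _d_in in facts}


def ddep_star(facts):
    """Reflexive and transitive closure of DDep (naive fixpoint iteration)."""
    edges = ddep(facts)
    nodes = {d for edge in edges for d in edge}
    closure = edges | {(d, d) for d in nodes}
    changed = True
    while changed:
        new_pairs = {(a, c) for a, b in closure for b2, c in closure if b == b2}
        changed = not new_pairs <= closure
        closure |= new_pairs
    return closure
```

With these facts, d7 depends on d1 through d4, while d7 and d2 remain unrelated. The fixpoint loop is chosen for brevity; it is not meant to scale to large repositories.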

We define the relation RDep(r2, r1) as the set of all immediate run dependencies, where r2 is said to depend on r1. The RDep relation is defined as the following view in Datalog.

RDep(r2, r1) :- Output(r1, d1), Used(r2, d1).

Specifically, a run dependency is established between a run r2 and a run r1 whenever an output of r1 is used by r2 to derive a data artifact.
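As a small worked example of this rule, the join between Output and Used can be written as a set comprehension. The facts below are hypothetical, and both relations are assumed to be sets of (run, artifact) pairs.

```python
def rdep(outputs, used):
    """RDep(r2, r1) :- Output(r1, d1), Used(r2, d1)."""
    return {
        (r2, r1)
        for r1, d_out in outputs
        for r2, d_in in used
        if d_out == d_in
    }


# Hypothetical facts: r1 outputs d4 and d5; r3 uses d4; r4 uses d9,
# which no recorded run produced.
outputs = {("r1", "d4"), ("r1", "d5"), ("r3", "d7")}
used = {("r3", "d4"), ("r4", "d9")}

run_dependencies = rdep(outputs, used)
```

Only r3 depends on r1 here; r4 has no run dependency because d9 is not an output of any run in the example.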

We assume the relation Published(u, w) records the case when a user u published a workflow w to the workflow repository; similarly, Published(u, d) records the case when u published a data artifact d to the shared data store (see Fig. 2). A user u may also perform (i.e., execute and then publish) a workflow run r, which is captured by the relation Performed(u, r). When a user performs a run, we assume all outputs of the run are published to the data store, which is captured by the following FO constraint.

∀ u, r, d (Performed(u, r) ∧ Output(r, d) → Published(u, d)).

As shown in Fig. 4, we consider three variants of collaboration, which we define as views using the following Datalog rules.

WFCollab(u2, u1) :- Published(u1, w), Run(r, w), Performed(u2, r).

DataCollab(u2, u1) :- Published(u1, d1), DDep∗(d2, d1), Used(r, d2), Performed(u2, r).

RunCollab(u2, u1) :- Performed(u1, r1), Output(r1, d1), DDep∗(d2, d1), Used(r2, d2), Performed(u2, r2).

The first rule states that a workflow collaboration (WFCollab) is established between two users whenever the first user publishes a workflow that is executed by the second user. The second rule states that a data collaboration (DataCollab) is established between two users whenever the first user publishes a data artifact that is directly or indirectly (through zero or more derivation relationships) used by a run that is performed by the second user.1 The third rule states that a run collaboration (RunCollab) is established between two users whenever the first user performs a run whose output is used (again, either directly or indirectly through zero or more derivations) in a run performed by the second user. Note that it is easy to show that a run collaboration implies a data collaboration, since performing a run implies publishing each data item that is output by the run.

A collaboration dependency CDep(u2, u1) between two users is established whenever they participate in one of the three collaborations defined above (where u2 depends on u1). The CDep relation is easily defined in Datalog as follows.

CDep(u2, u1) :- WFCollab(u2, u1).
CDep(u2, u1) :- DataCollab(u2, u1).
CDep(u2, u1) :- RunCollab(u2, u1).
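The three collaboration views and their union can be sketched as joins over sets of tuples. The sketch below is illustrative: the function names and the sample facts are hypothetical, the run collaboration reads its outputs from the same run that the first user performed, and the closure is reflexive over the artifacts that occur in some derivation.

```python
def reflexive_transitive_closure(pairs):
    """Naive fixpoint computation of the reflexive and transitive closure."""
    nodes = {n for pair in pairs for n in pair}
    closure = set(pairs) | {(n, n) for n in nodes}
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


def collaborations(run_of, derived_from, outputs, published_wf, published_data, performed):
    """Evaluate WFCollab, DataCollab, RunCollab and CDep as pairs (u2, u1)."""
    used = {(r, d_in) for _d_out, r, d_in in derived_from}
    ddep_star = reflexive_transitive_closure({(o, i) for o, _r, i in derived_from})

    wf = {(u2, u1)
          for u1, w in published_wf
          for r, w2 in run_of if w2 == w
          for u2, r2 in performed if r2 == r}
    data = {(u2, u1)
            for u1, d1 in published_data
            for d2, d1b in ddep_star if d1b == d1
            for r, d2b in used if d2b == d2
            for u2, r2 in performed if r2 == r}
    run = {(u2, u1)
           for u1, r1 in performed
           for r1b, d1 in outputs if r1b == r1
           for d2, d1b in ddep_star if d1b == d1
           for r2, d2b in used if d2b == d2
           for u2, r2b in performed if r2b == r2}
    return {"WFCollab": wf, "DataCollab": data, "RunCollab": run, "CDep": wf | data | run}


# Hypothetical scenario: u2 runs wf1 (published by u1) on d1 (published by u1);
# u3 runs wf2 (published by u4) on d4, the output of u2's run.
views = collaborations(
    run_of={("r1", "wf1"), ("r2", "wf2")},
    derived_from={("d4", "r1", "d1"), ("d7", "r2", "d4")},
    outputs={("r1", "d4"), ("r2", "d7")},
    published_wf={("u1", "wf1"), ("u4", "wf2")},
    # d4 and d7 are published by their performers, per the FO constraint above.
    published_data={("u1", "d1"), ("u2", "d4"), ("u3", "d7")},
    performed={("u2", "r1"), ("u3", "r2")},
)
```

In this scenario the single run collaboration (u3, u2) also appears among the data collaborations, which matches the observation that a run collaboration implies a data collaboration. The rules do not exclude a user collaborating with themselves, and neither does the sketch.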

1 Here we use the fact that DDep∗ is reflexive.

Table 1. Example queries across workflow executions and collaborations

Q1 Which data artifacts were used explicitly or implicitly to generate data artifact d?

Q2 Which runs were used in the generation of a data artifact d?

Q3 If data artifact d is detected to be faulty, which runs were affected by d?

Q4 Which users depended on data artifact d?

Q5 Which user collaborations were involved in the derivation of artifact d2 from artifact d1?

Q6 Who are the potential acknowledgements for a publication of a data artifact d?

Each of the views shown in Fig. 3 can be reconstructed from the provenance model described here. The relations DDep and RDep defined above can be used to construct the standard data and run dependency graphs shown in Fig. 3(a) and 3(b), respectively. More importantly, using the CDep relation, we can also construct user collaboration views (i.e., the collaboration dependency graph) as in Fig. 3(c). With these three dependency graphs, it becomes possible to answer both standard provenance queries and queries that involve user collaborations. In the following section we extend the model presented here (with respect to the three dependency graphs) to additionally support lineage-based path queries. Our approach provides a simple mechanism for filtering dependency graphs to answer provenance queries such as those in Table 1.

3 Expressing Collaborative Queries

We use QLP (the Query Language for Provenance) [11] for expressing lineage queries, and in particular, to filter the dependency graphs described in Section 2. In general, answering standard provenance questions (including those of Table 1) requires the generation of recursive queries over lineage graphs. Such queries are often complex to express and expensive to evaluate [12,13,14,15]. QLP provides a simple, declarative, path-based language (similar, e.g., to XPath) for expressing such queries, and optimization techniques have been developed that make answering QLP queries over large provenance repositories feasible [16]. QLP queries work over sets of lineage edges (e.g., represented by the DerivedFrom relation). A QLP path query p can be viewed as a filter that selects matching paths within the lineage graph induced by the underlying edges. The result of a QLP query is the set of edges along matching paths of the induced graph. Thus, QLP is a closed language that returns a subset of a given set of lineage edges. Closed languages such as QLP have a number of benefits, including the ability to construct views, “incremental” querying, and visualization [17].

QLP queries are expressed and evaluated against a selected provenance view, which can be a single workflow run, the entire repository of runs, or the provenance view resulting from a previous query. In the collaborative provenance query scenario, users can use QLP expressions to filter the various dependency views of Fig. 3. Below we present the basic constructs of QLP and show how QLP can be used to filter dependency graphs (and subsequently answer the queries of Table 1; see Table 2).

Lineage-preserving path queries (examples)
  * .. en             Lineage graph that resulted in nodes in en
  en .. *             Lineage graph for nodes derived from nodes in en
  en1 .. en2          Lineage graph for paths from nodes in en1 to nodes in en2
  en1 .. ri .. en2    Lineage graph for paths from nodes in en1 to en2 passing through run ri

Functions over lineage path queries
  exists(p)           True if the selected view contains a path defined by path query p
  runs(p)             The runs of the lineage graph returned by path query p
  workflows(p)        The workflows of the lineage graph returned by path query p
  artifacts(p)        The data nodes of the lineage graph returned by path query p
  inputs(p)           The source nodes of the lineage graph returned by path query p
  outputs(p)          The sink nodes of the lineage graph returned by path query p

Views over lineage path queries
  DATA-DEP(p)         Data dependencies (Fig. 3(a)) of the lineage graph selected by path query p
  RUN-DEP(p)          Run dependencies (Fig. 3(b)) of the lineage graph selected by path query p
  COLLAB-DEP(p)       Collaborations (Fig. 3(c)) of the lineage graph selected by path query p

Fig. 5. Basic QLP constructs and functions, where en is a node expression comprised of either a data artifact identifier, a run identifier, a data artifact type (denoting the set of artifacts having that type), or a workflow (denoting the set of runs of the workflow). We use p to denote a QLP path query, and ri to denote a run.

In the scenario illustrated by Fig. 2, the lineage information is recorded at a “coarse-grain” level, where only the lineage relationships between inputs and outputs of a run are stored. In the following, we restrict the underlying lineage model of QLP to be over workflow runs, as opposed to the standard use of QLP that supports queries over individual processes within runs (thus modeling lineage at a “fine-grain” level).

Fig. 5 introduces some of the basic constructs and functions of QLP, together with the extensions described here, including the DATA-DEP, RUN-DEP, and COLLAB-DEP functions. As a simple example of a QLP path query, the expression “* .. d7” returns lineage edges denoting paths starting from any node in the lineage graph and ending at node d7. Similarly, the query “d2 .. *” returns lineage edges denoting paths starting at node d2 and ending at any node in the lineage graph. Both “ends” of a path can be fixed in QLP, e.g., the query “d5 .. d9” returns all edges on paths in the lineage graph that start at d5 and end at d9. QLP queries can restrict paths to include intermediate objects, e.g., the query “#r2 .. d6 .. #r5 .. *” returns the set of lineage edges denoting paths that start at run r2, go through artifact d6 followed by (via one or more lineage edges) run r5, and end at any node.
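The behaviour of the three simplest query forms can be illustrated with a small reachability-based filter. This is a sketch under stated assumptions, not the QLP evaluation strategy of [16]: it handles only data-artifact endpoints (no run or type expressions), the function names are hypothetical, and the lineage graph is a made-up example rather than the one in Fig. 2(c).

```python
def reachable(start_nodes, step):
    """All nodes reachable from start_nodes (inclusive) via the adjacency map step."""
    seen, frontier = set(start_nodes), list(start_nodes)
    while frontier:
        node = frontier.pop()
        for nxt in step.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def path_query(lineage, source=None, target=None):
    """Lineage triples on paths 'source .. target'; None plays the role of '*'."""
    forward, backward = {}, {}
    for d_out, _run, d_in in lineage:
        forward.setdefault(d_in, set()).add(d_out)
        backward.setdefault(d_out, set()).add(d_in)
    nodes = set(forward) | set(backward)
    downstream = nodes if source is None else reachable({source}, forward)
    upstream = nodes if target is None else reachable({target}, backward)
    # An edge lies on a matching path if its input is reachable from the source
    # and its output can still reach the target.
    return {
        (d_out, run, d_in)
        for d_out, run, d_in in lineage
        if d_in in downstream and d_out in upstream
    }


# Hypothetical lineage triples (d_out, run, d_in).
lineage = {
    ("d4", "r1", "d1"),
    ("d5", "r1", "d2"),
    ("d7", "r3", "d4"),
    ("d9", "r4", "d5"),
}

into_d7 = path_query(lineage, target="d7")          # "* .. d7"
from_d2 = path_query(lineage, source="d2")          # "d2 .. *"
d5_to_d9 = path_query(lineage, source="d5", target="d9")  # "d5 .. d9"
```

Each result is again a set of lineage triples, which reflects the closure property: the output of one query can serve as the view for the next.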

3.1 Filtering Dependency Views using QLP

The DATA-DEP, RUN-DEP, and COLLAB-DEP functions construct data, run, and collaboration dependency graphs, respectively, that result from evaluating a QLP query over the current provenance view. Thus, these functions, unlike the DDep, RDep, and CDep relations defined in Section 2, create views purely out of lineage relations.

Filtering Data Dependency Views. We write v(p) to denote the set of lineage edges of the form 〈d2, r, d1〉 ∈ L returned after evaluating a QLP path query p over a set of lineage edges L [16]. We directly use this evaluation to define the DATA-DEP function as follows.

DATA-DEP(p) := {〈d2,d1〉 | ∃r : 〈d2,r,d1〉 ∈ v(p)}

As shown in Table 2, we can use the DATA-DEP function to answer Q1 of Table 1, which returns the subset of the data-dependency graph that ends at artifact d. Note that the DATA-DEP function computes a subset of the DDep relation restricted to lineage edges.

Filtering Run Dependency Views. Similarly, to construct a filtered run-dependency graph, we again use the evaluation function as follows.

RUN-DEP(p) := {〈r2,r1〉 | ∃d1,d2,d3 : 〈d3,r2,d2〉 ∈ v(p)∧〈d2,r1,d1〉 ∈ v(p)}
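Both projections can be sketched directly from a query result. The following is a minimal illustration, assuming v(p) is already available as a set of (d_out, run, d_in) triples; the function names and the sample result are hypothetical.

```python
def data_dep(v_p):
    """DATA-DEP(p): project each lineage triple onto (d_out, d_in)."""
    return {(d_out, d_in) for d_out, _run, d_in in v_p}


def run_dep(v_p):
    """RUN-DEP(p): (r2, r1) whenever r2 derives something from an artifact that r1 derived."""
    derived_by = {}  # artifact -> runs that derived it within v(p)
    for d_out, run, _d_in in v_p:
        derived_by.setdefault(d_out, set()).add(run)
    return {
        (r2, r1)
        for _d3, r2, d2 in v_p
        for r1 in derived_by.get(d2, ())
    }


# Hypothetical query result, e.g. for a query of the form "* .. d7".
v_p = {("d7", "r3", "d4"), ("d4", "r1", "d1")}

data_edges = data_dep(v_p)
run_edges = run_dep(v_p)
```

Because run_dep only looks at triples in v(p), an output that was recorded for a run but never derived from an input inside the query result cannot create a run dependency.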

Note that each output of a run within a lineage graph returned by a QLP query is required to be dependent on some input (since only derivation edges are considered). Thus, the run dependencies returned by the RUN-DEP function have the additional constraint that each output is dependent on some input (within the query result) of the run. This can be viewed as restricting the RDep relation to only selecting from Produced edges instead of Output edges. We can use the RUN-DEP function to answer Q2 and Q3 of Table 1, as shown in Table 2.

Filtering User Collaboration Views. Let DATA-DEP∗(p) be the set of edges of the reflexive and transitive closure of the edges returned by DATA-DEP(p). We define the COLLAB-DEP function as:

COLLAB-DEP(p) := C-DEP_WF(p) ∪ C-DEP_DATA(p) ∪ C-DEP_RUN(p),

where the functions C-DEP_WF, C-DEP_DATA, and C-DEP_RUN are defined as follows.

C-DEP_WF(p) := {〈u2, u1〉 | ∃ r, w : r ∈ runs(p) ∧ Run(r, w) ∧ Published(u1, w) ∧ Performed(u2, r)}

C-DEP_DATA(p) := {〈u2, u1〉 | ∃ d1, d2, d3, r : Published(u1, d1) ∧ Performed(u2, r) ∧ 〈d2, d1〉 ∈ DATA-DEP∗(p) ∧ 〈d3, r, d2〉 ∈ v(p)}

C-DEP_RUN(p) := {〈u2, u1〉 | ∃ d0, d1, d2, d3, r1, r2 : Performed(u1, r1) ∧ 〈d1, r1, d0〉 ∈ v(p) ∧ 〈d2, d1〉 ∈ DATA-DEP∗(p) ∧ 〈d3, r2, d2〉 ∈ v(p) ∧ Performed(u2, r2)}

The COLLAB-DEP function can be used to answer queries Q4-Q6 of Table 1, as shown in Table 2.
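To make the definition concrete, the sketch below evaluates the three components of COLLAB-DEP over a given query result. It is an illustration under stated assumptions: v(p) is supplied as a set of (d_out, run, d_in) triples, the base relations are sets of pairs, and all names and sample facts are hypothetical.

```python
def closure_star(pairs):
    """Reflexive and transitive closure over the nodes that occur in pairs."""
    nodes = {n for pair in pairs for n in pair}
    star = set(pairs) | {(n, n) for n in nodes}
    while True:
        extra = {(a, d) for a, b in star for c, d in star if b == c} - star
        if not extra:
            return star
        star |= extra


def collab_dep(v_p, run_of, published_wf, published_data, performed):
    """COLLAB-DEP(p) as the union of its workflow, data and run components."""
    dep_star = closure_star({(d_out, d_in) for d_out, _r, d_in in v_p})
    runs = {r for _d_out, r, _d_in in v_p}
    performers = {}  # run -> users who performed it
    for user, run in performed:
        performers.setdefault(run, set()).add(user)

    wf = {(u2, u1)
          for r, w in run_of if r in runs
          for u1, w2 in published_wf if w2 == w
          for u2 in performers.get(r, ())}
    data = {(u2, u1)
            for u1, d1 in published_data
            for d2, d1b in dep_star if d1b == d1
            for _d3, r, d2b in v_p if d2b == d2
            for u2 in performers.get(r, ())}
    run = {(u2, u1)
           for d1, r1, _d0 in v_p
           for u1 in performers.get(r1, ())
           for d2, d1b in dep_star if d1b == d1
           for _d3, r2, d2b in v_p if d2b == d2
           for u2 in performers.get(r2, ())}
    return wf | data | run


# Hypothetical result of a query of the form "d4 .. *": only the edge leaving d4 survives.
v_p = {("d7", "r2", "d4")}
dependencies = collab_dep(
    v_p,
    run_of={("r1", "wf1"), ("r2", "wf2")},
    published_wf={("u1", "wf1"), ("u4", "wf2")},
    published_data={("u1", "d1"), ("u2", "d4"), ("u3", "d7")},
    performed={("u2", "r1"), ("u3", "r2")},
)
```

In this example u3 depends on u4 (the publisher of wf2) and on u2 (the publisher of d4). The publisher of d1 is not reported, because the filtered view no longer contains the edge from d1 to d4.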

3.2 Relation Between the Collaborative Model and OPM

The Open Provenance Model (OPM) [12] has emerged from the e-science community, and has evolved as a standard representation to facilitate the exchange of information between multiple provenance systems. OPM is based on a model and set of inference rules for directed acyclic provenance graphs, which represent causal dependencies between data products and processes. OPM defines three primary entities (nodes): (1) Artifacts: immutable pieces of data; (2) Processes: actions or series of actions performed on or caused by artifacts; and (3) Agents: entities that act as catalysts of processes. OPM also defines five primary types of causal dependencies (edges) that comprise provenance graphs: (1) used: a process used artifact(s); (2) wasGeneratedBy: an artifact was generated by a process; (3) wasTriggeredBy: a process was triggered by another process(es); (4) wasDerivedFrom: an artifact was derived from another artifact(s); and (5) wasControlledBy: a process was controlled by an agent.

Table 2. Example queries expressed using the dependency functions defined for QLP.

Q1 DATA-DEP( * .. d )

Q2 RUN-DEP( * .. d )

Q3 RUN-DEP( d .. * )

Q4 COLLAB-DEP( d .. * )

Q5 COLLAB-DEP( d1 .. d2 )

Q6 COLLAB-DEP( * .. d )

The collaborative model, e.g., as in Fig. 4, roughly contains the same entities and five causal dependencies of OPM. A lineage (i.e., DerivedFrom) relation is of the form 〈d_out, r, d_in〉 for data nodes (artifacts in OPM) d_in and d_out and run (process in OPM) r. For example, in Fig. 3(b) 〈d4, r1, d1〉 is a lineage relation stating that artifact d1 was used by the run r1 to produce artifact d4 (i.e., artifact d4 wasDerivedFrom artifact d1), and artifact d4 wasGeneratedBy workflow run r1. Adjacent lineage relations, e.g., 〈d7, r3, d4〉 and 〈d4, r1, d1〉, state that run r3 wasTriggeredBy run r1. Similarly, users in the collaborative model can be viewed as a form of agents in OPM, where Performed edges are similar to wasControlledBy edges in OPM. To the best of our knowledge, OPM does not provide support for recording when users publish data and workflows, which is essential in the collaborative model proposed here for creating the various types of user collaborations described in Section 2.
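The correspondence described above can be sketched as a translation from lineage triples and Performed facts to OPM-style edges. The tuple encoding (edge type, effect, cause) and the function name are hypothetical choices for illustration; they are not an OPM serialization format.

```python
def lineage_to_opm(lineage, performed):
    """Translate lineage triples (d_out, run, d_in) and Performed(user, run) into OPM-style edges."""
    edges = set()
    for d_out, run, d_in in lineage:
        edges.add(("used", run, d_in))
        edges.add(("wasGeneratedBy", d_out, run))
        edges.add(("wasDerivedFrom", d_out, d_in))
    # Adjacent lineage relations: the later run consumed an artifact the earlier run derived.
    for _d3, later_run, d_mid in lineage:
        for d_out, earlier_run, _d1 in lineage:
            if d_out == d_mid:
                edges.add(("wasTriggeredBy", later_run, earlier_run))
    for user, run in performed:
        edges.add(("wasControlledBy", run, user))
    return edges


# The two adjacent lineage relations from the text, with hypothetical performers.
lineage = {("d7", "r3", "d4"), ("d4", "r1", "d1")}
performed = {("u2", "r1"), ("u3", "r3")}

opm_edges = lineage_to_opm(lineage, performed)
```

Published facts have no counterpart in this translation, in line with the remark that OPM does not record when users publish data and workflows.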

4 Feasibility Study: Provenance Challenge 3 Workflows

In this section, we explain an adopted variation of the PC3 workflows to demonstrate our approach, show that collaborative queries within this usecase are feasible with our QLP extensions, and describe an architecture for its implementation.

Usecase. The workflows selected for PC3 are part of an image-processing pipeline in the Pan-STARRS (http://pan-starrs.ifa.hawaii.edu) project. A next generation panoramic telescope surveys the sky looking for asteroids or comets that may impact the Earth. The telescope may generate several TBs of data nightly, which must be reduced and stored into an object data management framework that is publicly accessible by astronomers. Based on this usecase, the main PC3 workflow ingests CSV files containing readings from the telescope into an SQL database and the plotting workflow creates histograms of the ingested data.

Fig. 6. Conceptual process for the Provenance Challenge 3

To build a collaborative workflow environment, we executed the fragments of these PC3 workflows in three different workflow systems as shown in Fig. 6. In this scenario, Taverna [18] performs the initialization and pre-loading checks, WS-VLAM [19] loads the CSV files into the database and updates the column counts, and Kepler [20] creates the histograms. We chose this division of the PC3 workflows to evenly and logically divide the tasks among the workflow engines.

An example history of observables and actions within this usecase is shown in Table 3. In this scenario, all three workflow engines use the same MySQL database when executing their subset of the PC3 workflows. In the pre-load tasks, Taverna verifies the contents of the input CSV files and creates the tables in the database. Next, WS-VLAM reads the contents of the CSV files into these tables, and verifies the row counts and data values. Finally, Kepler produces histograms from these data. For example, the second row refers to run1 performed by u2 using wf_Preload published by u1. In run1, u2 used d_J062941 as an input and the run produced d_J062941−1 as its output.

Collaborative PC3 Queries. The following are example queries on Table 3 expressedusing the QLP functions defined in Section 3.

Q1. What data contributed to d_histogram?
DATA-DEP( * .. d_histogram )

Q2. If d_J062942−2 is determined to be faulty, what other data products may be faulty?
DATA-DEP( d_J062942−2 .. * )

Q3. What runs contributed to the generation of d_J062941−2?
RUN-DEP( * .. d_J062941−2 )

Q4. Which users contributed workflows that produced d_histogram?
COLLAB-DEP( * .. d_histogram )

Table 3. The publish and run observables in the interoperable PC3 scenario. The contents of the table shall be read as follows: e.g., the second row refers to run1 performed by u2 using wf_Preload published by u1. In run1, u2 used d_J062941 as an input and the run produced d_J062941−1 as its output.

User  Action     Object        Used (workflow)  Used (data)    Produced
u1    Published  wf_Preload
u2    Performed  run1          wf_Preload       d_J062941      d_J062941−1
u3    Performed  run2          wf_Preload       d_J062942      d_J062942−2
u4    Published  wf_Load
u5    Published  wf_Visualize
u2    Performed  run3          wf_Load          d_J062941−1    d_J062941−2
u3    Performed  run4          wf_Visualize     d_J062941−2    d_histogram
u3    Published  d_histogram
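As a worked example, the rows of Table 3 can be transcribed into lineage triples and used to evaluate Q1, Q3, and the workflow part of Q4. This is a sketch under stated assumptions: identifiers are written as plain strings with an ASCII hyphen, the helper `upstream` is a hypothetical stand-in for evaluating a query of the form "* .. d", and only the workflow component of COLLAB-DEP is computed.

```python
# Lineage triples (d_out, run, d_in) transcribed from the Performed rows of Table 3.
lineage = {
    ("d_J062941-1", "run1", "d_J062941"),
    ("d_J062942-2", "run2", "d_J062942"),
    ("d_J062941-2", "run3", "d_J062941-1"),
    ("d_histogram", "run4", "d_J062941-2"),
}
run_of = {"run1": "wf_Preload", "run2": "wf_Preload", "run3": "wf_Load", "run4": "wf_Visualize"}
published_wf = {"wf_Preload": "u1", "wf_Load": "u4", "wf_Visualize": "u5"}
performed = {"run1": "u2", "run2": "u3", "run3": "u2", "run4": "u3"}


def upstream(lineage, target):
    """Lineage triples on paths ending at target, i.e. v(* .. target)."""
    selected, frontier = set(), [target]
    while frontier:
        node = frontier.pop()
        for edge in lineage:
            if edge[0] == node and edge not in selected:
                selected.add(edge)
                frontier.append(edge[2])
    return selected


# Q1: DATA-DEP( * .. d_histogram )
v_hist = upstream(lineage, "d_histogram")
q1 = {(d_out, d_in) for d_out, _run, d_in in v_hist}

# Q3: RUN-DEP( * .. d_J062941-2 )
v_load = upstream(lineage, "d_J062941-2")
q3 = {(r2, r1) for _d3, r2, d2 in v_load for d1, r1, _d0 in v_load if d1 == d2}

# Q4 (workflow component only): pairs (performer, workflow publisher) for runs in v_hist.
q4_wf = {(performed[run], published_wf[run_of[run]]) for _o, run, _i in v_hist}
```

Under this transcription, Q1 traces d_histogram back to d_J062941 through two intermediate artifacts, Q3 yields the single dependency of run3 on run1, and the workflow publishers that contributed to d_histogram are u1, u4, and u5. The run2 branch does not appear in any of these results, because none of its artifacts lead to d_histogram.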

QLP-based Interoperable Query Framework. Fig. 7 shows the design of an end-to-end framework that can be plugged into any scientific infrastructure with the ability to publish data and workflows, to execute workflows using different workflow engines, to collect workflow provenance and to express and evaluate QLP queries. In this architecture, workflows use a shared data space with common data identifiers. To generate data dependency views, using the QLP mapping to OPM, the QLP Querying Engine transforms users' QLP queries into OPM queries. In addition, the same querying engine routes the mapped queries to distinct provenance stores using the developed SQL (RDBMS), XQuery (XML) and SPARQL (RDF) interfaces.

5 Related Work

The ideas presented in this paper depend on previous work in scientific workflow provenance and collaborative scientific platforms. Below we present the related work.

Provenance in Scientific Workflows. Scientific workflow systems are being used in many scientific domains, and many approaches have been proposed recently for representing and storing workflow provenance [5,6]. However, most of the existing provenance approaches store provenance for single runs, and do not capture or maintain associations across runs. The framework described in [7] records associations between multiple related workflow runs. However, our work is based on capturing associations not only across workflow runs, but also across users, where users play an active role by publishing data, publishing workflows, or executing workflow runs. Our work captures these associations and establishes user collaboration views based on provenance.

[Figure 7 diagram. Labels recoverable from the figure: users u1, u2, u3; runs run1, run2, run3; workflows wf1, wf2, wf3 with parameters wf1:param1, wf2:param1, wf3:param1; OPM Graph1, OPM Graph2, OPM Graph3; provenance stores RDBMS, XML, RDF; QLP Querying Engine; QLP Querying Interface; Shared Data Store. Layers: Workflow Sharing Layer, Provenance Storage Layer, Workflow Execution Layer, Provenance Querying Layer, Collaboration Visualization Layer. Annotation: "Needs to demonstrate that: 1. QLP maps to OPM. 2. Workflows use a shared data space with common data identifiers. 3. Querying engine routes the mapped queries to distinct provenance stores using the developed SQL (RDBMS), XQuery (XML) and SPARQL (RDF) interfaces."]

Fig. 7. Architecture for answering collaborative queries

Querying Provenance. Approaches for querying provenance are largely based on physical data representations [14], e.g., relational, XML, or RDF schemas, where users express provenance queries through corresponding query languages, i.e., SQL, XQuery, or SPARQL. Provenance queries often require computing transitive closures over dependency relations, and expressing such queries using standard approaches is typically done using recursion or stored procedures [15,16]. Expressing such queries is both cumbersome and error-prone, and requires considerable user expertise. High-level languages such as QLP provide a separation between the logical provenance model and its underlying physical representation, which allows for the use of different representation schemes and additional optimization techniques. Also, QLP is closed under lineage relations, where answers to lineage queries are sets of lineage dependencies (edges) forming provenance subgraphs, i.e., provenance preserving.
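To make the transitive-closure requirement and the closure property concrete, the following sketch computes the upstream lineage of a node and returns it as a set of edges rather than a set of nodes. It is an assumption-laden illustration in Python, not QLP syntax or its evaluation strategy: the function name `lineage_edges`, the pair encoding of dependencies, and the identifiers d1 to d4 are all hypothetical.

```python
def lineage_edges(edges, node):
    """Return every dependency edge reachable upstream from node.

    edges is a set of (derived, source) pairs, read as
    "derived depends on source".
    """
    by_derived = {}
    for derived, source in edges:
        by_derived.setdefault(derived, set()).add(source)

    result, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for source in by_derived.get(current, ()):
            edge = (current, source)
            if edge not in result:  # also guards against cycles
                result.add(edge)
                frontier.append(source)
    return result


# Hypothetical dependency edges: d3 <- d2 <- d1, and d4 <- d1.
EDGES = {("d3", "d2"), ("d2", "d1"), ("d4", "d1")}

subgraph = lineage_edges(EDGES, "d3")
print(sorted(subgraph))

# Because the answer is itself a set of edges, it can be queried again.
print(sorted(lineage_edges(subgraph, "d2")))
```

Returning edges instead of bare nodes is what allows one query's answer to serve as the input graph of the next, which is the sense in which an answer remains a provenance subgraph.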

Collaborative Applications in E-Science. Since collaborative research studies require substantial infrastructure, we often see infrastructure projects that facilitate conducting a number of these multi-disciplinary scientific studies for a particular domain, e.g., ViroLab [21], VL-e [9] and CAMERA [8]. In the VL-e project, the WFBus focuses on the execution of workflows developed in various workflow management systems. Collaborative views through provenance cover both the execution and provenance aspects of an aggregate workflow. In the context of the CAMERA project, workflow-related scientific products and their provenance are stored in data repositories that are accessible through the project portal, allowing for collaborative views and queries over these runs. In the ViroLab virtual laboratory, scientific applications are executed as scripts and their provenance is recorded by collecting events emitted by the GridSpace engine that executes the experiment scripts. Collaborative views over the provenance of these executing scripts can be captured by explicit reuse of results from previous experiments. An interesting opportunity arises from the support for Scientific Research Objects (SROs) [22] by myExperiment [10]. Applications such as portal environments can deploy the content of SROs in new ways and offer collaborative views over them.

6 Conclusion

In this paper, we introduced the concept of collaborative views and queries over interoperable provenance data in collaborative scientific research. We adopted and extended a high-level query language for provenance, QLP, to express complex collaborative provenance queries. We also established a mapping between QLP and OPM. Finally, we showed the feasibility of our approach on collaborative queries through PC3-inspired usecase workflows and described our planned architecture for its future implementation. The contributions of this paper tie together users' actions with multiple workflow executions that create a chain of custody for data generated by collaborations.

In the future, we plan to publish an implementation of the QLP-based collaborative query engine, built around the PC3 workflow and using workflows in Kepler, Taverna and WSVLAM. We are currently conducting a larger bioinformatics usecase from the CAMERA project where users share data, workflows and runs through shared stores. We also intend to work on aspects of restricted user spaces and optimization of collaborative query evaluation.

Acknowledgements. The authors would like to thank the rest of the Kepler, ViroLab and CAMERA team members for their collaboration. This research was supported in part by the NSF SDCI Award OCI-0722079 for Kepler/CORE, the NSF CEO:P Award No. DBI 0619060 for REAP, the DOE SciDac Award No. DE-FC02-07ER25811 for SDM Center, the Gordon and Betty Moore Foundation award to Calit2 at UCSD for CAMERA, and the European Commission Grant 027446 for ViroLab.

References

1. Ludaescher, B., Goble, C., eds.: Special section on scientific workflows. Volume 34:3. ACM SIGMOD Record (2005)

2. Taylor, I.J., Deelman, E., Gannon, D.B., Shields, M., eds.: Workflows for e-Science. Springer (2007)

3. Gil, Y., Deelman, E., Ellisman, M., Fahringer, T., Fox, G., Gannon, D., Goble, C., Livny, M., Moreau, L., Myers, J.: Examining the challenges of scientific workflows. IEEE Computer 40 (2007) 24–32

4. Deelman, E., Gannon, D., Shields, M., Taylor, I.: Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Computer Systems 25 (2009) 528–540

5. Simmhan, Y.L., Plale, B., Gannon, D.: A survey of data provenance in e-science. SIGMOD Record 34 (2005) 31–36

6. Freire, J., Koop, D., Santos, E., Silva, C.T.: Provenance for computational tasks: A survey. Computing in Science and Engineering 10 (2008) 11–21

7. Bowers, S., McPhillips, T., Wu, M.W., Ludascher, B.: Project Histories: Managing Data Provenance Across Collection-Oriented Scientific Workflow Runs. In: Data Integration in the Life Sciences. Volume 4544 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2007) 122–138

8. Altintas, I., Lin, A.W., Chen, J., Churas, C., Gujral, M., Sun, S., Li, W., Manansala, R., Sedova, M., Grethe, J.S., Ellisman, M.: Camera 2.0: A data-centric metagenomics community infrastructure driven by scientific workflows. In: Proceeding of The IEEE 2010 Fourth International Workshop on Scientific Workflows, Miami, Florida (2010)

9. Zhao, Z., Booms, S., Belloum, A., de Laat, C., Hertzberger, B.: Vle-wfbus: A scientific workflow bus for multi e-science domains. International Conference on e-Science and Grid Computing (2006)

10. Roure, D.D., Goble, C., Stevens, R.: Designing the myexperiment virtual research environment for the social sharing of workflows. In: E-SCIENCE ’07: Proceedings of the Third IEEE International Conference on e-Science and Grid Computing, Washington, DC, USA, IEEE Computer Society (2007) 603–610

11. Anand, M.K., Bowers, S., McPhillips, T., Ludascher, B.: Exploring scientific workflow provenance using hybrid queries over nested data and lineage graphs. In: SSDBM. (2009)

12. Moreau, L., Freire, J., Futrelle, J., McGrath, R.E., Myers, J., Paulson, P. In: The Open Provenance Model: An Overview. Volume 5272 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg (2008) 323–326

13. Anand, M.K., Bowers, S., Ludascher, B.: A navigation model for exploring scientific workflow provenance graphs. In: WORKS. (2009)

14. Cohen, S., Cohen-Boulakia, S., Davidson, S.: Towards a model of provenance and user views in scientific workflows. In Istrail, S., Pevzner, P., Waterman, M., eds.: Data Integration in the Life Sciences. Volume 4075/2006 of Lecture Notes in Computer Science, Berlin, Heidelberg, Springer Berlin / Heidelberg (2006) 264–279

15. Heinis, T., Alonso, G.: Efficient lineage tracking for scientific workflows. In: SIGMOD. (2008)

16. Anand, M.K., Bowers, S., Ludascher, B.: Techniques for efficiently querying scientific workflow provenance graphs. In: EDBT. (2010)

17. Anand, M.K., Bowers, S., Ludascher, B.: Provenance browser: Displaying and querying scientific workflow provenance graphs (Demo). In: ICDE. (2010)

18. Turi, D., Missier, P., Goble, C., De Roure, D., Oinn, T.: Taverna workflows: Syntax and semantics. e-Science and Grid Computing, International Conference on 0 (2007) 441–448

19. Korkhov, V., Vasyunin, D., Wibisono, A., Guevara-Masis, V., Belloum, A., de Laat, C., Adriaans, P., Hertzberger, L.: Ws-vlam: towards a scalable workflow system on the grid. In: WORKS ’07: Proceedings of the 2nd workshop on Workflows in support of large-scale science, New York, NY, USA, ACM (2007) 63–68

20. Ludascher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger-Frank, E., Jones, M., Lee, E., Tao, J., Zhao, Y.: Scientific workflow management and the kepler system. Concurrency and Computation: Practice and Experience, Special Issue on Scientific Workflows (2005)

21. Malawski, M., Bartynski, T., Bubak, M.: Invocation of operations from script-based grid applications. Future Generation Computer Systems 26 (2010) 138–146

22. De Roure, D., Goble, C.: Research objects for data intensive research. In: E-Science. (2009)
