Date post: | 11-May-2015 |
Category: |
Technology |
Upload: | paolo-missier |
Granular workflow provenance in Taverna
Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK
Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008
Outline
• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe granular data derivations
• Provenance service architecture - brief description
IPAW'08 – Salt Lake City, Utah, June 2008
Example (Taverna) dataflow
QTL -> genes -> Kegg pathways
Collections example: from genes to SNPs
Workflow steps:
• gene -> genomic region
• extend region
• retrieve SNPs in the region
• rearrange SNP details
• See myexperiment.org: http://www.myexperiment.org/workflows/166
Input (gene IDs): [ ENSG00000139618 , ENSG00000083093 ]
Output (one list of SNPs per gene): [ [<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>, ...], [<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>, ...] ]
Computational model for collections
Depth mismatch between declared / offered type:
type(P4:X1) = s but type(a) = list(s)
type(P4:X2) = type(c) = list(s)
type(P4:X3) = s but type(b) = list(s)
Execution at P4:
Y = (map P1 <(a ⊗ b) , c>) // cross product
Y = [ (P1 <a1,b1,c>) ... (P1 <an,bm,c>) ]
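The execution rule above can be sketched in Python. This is only an illustration, not Taverna engine code: `proc` is a hypothetical stand-in for the processor, and the sketch assumes that results are nested one level per iterated input, whereas the slide writes them as a flat sequence.

```python
def proc(ai, bj, c):
    """Hypothetical stand-in processor: two single values plus one whole list."""
    return f"{ai}.{bj}|{','.join(c)}"

def cross_product_map(p, a, b, c):
    """Apply p to every pair (a_i, b_j); c already matches its declared depth, so it is passed whole."""
    return [[p(ai, bj, c) for bj in b] for ai in a]

a = ["a1", "a2"]
b = ["b1", "b2", "b3"]
c = ["c1", "c2"]
y = cross_product_map(proc, a, b, c)
# y[i][j] holds the result of proc(a[i], b[j], c)
```

Because each result sits at index [i][j], the position of an output element already tells us which elements of a and b produced it.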
Collections and iterations
Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s
Values annotated on the dataflow:
[139618, 83093]
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809]
Dot product: <16, 23560179, ...>, <13, 31871809, ...>
Results for 139618, 83093: [<1,23553692,16,rs152451>, ...], [<1,31840948,13,rs169546>, ...]
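A dot product iteration pairs list elements by position. The sketch below uses the two lists shown on the slide; `make_region` is a hypothetical stand-in for the processor, and the equal-length check is an assumption of this sketch rather than documented Taverna behaviour.

```python
def dot_product_map(proc, xs, ys):
    """Pair elements by position and apply proc to each pair."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes lists of equal length")
    return [proc(x, y) for x, y in zip(xs, ys)]

def make_region(chromosome, position):
    """Hypothetical processor that combines two single values into one tuple."""
    return (chromosome, position)

chromosomes = [16, 13]
positions = [23560179, 31871809]
regions = dot_product_map(make_region, chromosomes, positions)
# regions == [(16, 23560179), (13, 31871809)]
```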
Tracing granular lineage
• Provenance traces are most useful when they are granular
  – trace individual items in a collection
  – “which geneID is responsible for the presence of SNP rs169546 in the output?”
• Curse of black box processors:
  – M-M (many-many) and M-1 (many-one) processors destroy granularity
Granular lineage I: no loss of precision
Dataflow: P0 (inputs X1, X2; outputs Y1:l(s), Y2:l(s)) feeds P1 (X:s → Y:s) from Y1 and P2 (X:s → Y:s) from Y2; P3 (X1:s, X2:s → Y) combines P1:Y and P2:Y with a cross product.
P1 ≡ $\lambda X .\, X^2$
P2 ≡ $\lambda X .\, 2X$
P3 ≡ $\lambda X_1 .\, \lambda X_2 .\, X_1 + X_2$
Let P0:Y1 = $[a_1 \ldots a_n]$, P0:Y2 = $[b_1 \ldots b_m]$
Then P1:Y = $[a_1^2 \ldots a_n^2]$, P2:Y = $[2b_1 \ldots 2b_m]$, P3:Y = $[a_1^2 + 2b_1 \ldots a_i^2 + 2b_j \ldots a_n^2 + 2b_m]$
And lineage(P3:Y[i,j], {P0}) = { P0:Y1[i], P0:Y2[j] }
Granular lineage II: loss of precision
Dataflow: P0 (inputs X1, X2; outputs Y1, Y2) feeds P1 (X:s → Y:s) from Y1 and P2 (X: l(s) → Y:s) from Y2; P3 (X1:s, X2:s → Y) combines P1:Y and P2:Y.
P1 ≡ $\lambda X .\, X^2$
P2 ≡ $\lambda X .\, \min X$
P3 ≡ $\lambda X_1 .\, \lambda X_2 .\, X_1 + X_2$
Let P0:Y1 = $[a_1 \ldots a_n]$, P0:Y2 = $[b_1 \ldots b_m]$
Then P1:Y = $[a_1^2 \ldots a_n^2]$, P2:Y = $c = \min \{b_1 \ldots b_m\}$, P3:Y = $[a_1^2 + c \ldots a_i^2 + c \ldots a_n^2 + c]$
And lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }
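The loss of precision can be reproduced in a few lines of Python. This is a minimal sketch of the dataflow on this slide, with made-up numeric inputs and 0-based indices; the pair `("P0:Y2", None)` is this sketch's own convention for "the whole collection".

```python
def run_with_lineage(a, b):
    """Compute P3:Y and record which P0 outputs each element depends on."""
    squares = [x * x for x in a]      # P1, applied once per element of P0:Y1
    c = min(b)                        # P2, consumes the whole list P0:Y2
    y = [s + c for s in squares]      # P3, applied once per element of P1:Y
    # P2 is a many-to-one black box, so only the whole of P0:Y2 can be named
    lineage = [{("P0:Y1", i), ("P0:Y2", None)} for i in range(len(a))]
    return y, lineage

y, lineage = run_with_lineage([1, 2, 3], [5, 4, 6])
# y == [5, 8, 13]; every element depends on all of P0:Y2
```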
III: recoverable loss of precision
Dataflow: P0 (inputs X1, X2; outputs Y1, Y2) feeds P1 (X:s → Y:s) from Y1 and P2 (X: l(s) → Y:l(s)) from Y2; P3 (X1:s, X2:s → Y) combines P1:Y and P2:Y.
P1 ≡ $\lambda X .\, X^2$
P2 ≡ $\lambda X .\, f\,X$
P3 ≡ $\lambda X_1 .\, \lambda X_2 .\, X_1 + X_2$
Let P0:Y1 = $[a_1 \ldots a_n]$, P0:Y2 = $[b_1 \ldots b_m]$
Then P1:Y = $[a_1^2 \ldots a_n^2]$, P2:Y = $c = [c_1 \ldots c_i \ldots c_m]$, P3:Y = $[a_1^2 + c_1 \ldots a_i^2 + c_i \ldots a_m^2 + c_m]$
Without annotations: lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }
With the annotation “f is index-preserving” on P2: lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2[i] }
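The effect of the annotation on a lineage answer can be sketched as a lookup. The `annotations` dictionary and the string "index-preserving" are assumptions of this sketch, not the format Taverna uses.

```python
WHOLE = None  # marker meaning "the whole collection", not a single element

def lineage_p3(i, annotations):
    """Lineage of P3:Y[i] back to P0's outputs for the dataflow on this slide."""
    if annotations.get("P2") == "index-preserving":
        # The annotation asserts that P2:Y[i] depends only on P2:X[i]
        y2_index = i
    else:
        # Without it, P2 is a black box from l(s) to l(s)
        y2_index = WHOLE
    return {("P0:Y1", i), ("P0:Y2", y2_index)}

coarse = lineage_p3(2, {})
fine = lineage_p3(2, {"P2": "index-preserving"})
```

The processor itself is unchanged; only the extra knowledge about it narrows the answer from the whole list to a single element.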
Multi-level nesting and lineage precision
Adding annotations to the original workflow
Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s
Values annotated on the dataflow:
geneIdList: [139618, 83093]
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809]
CR:result[0,i]: [<1,23553692,16,rs152451>, ...]
CR:result[1,j]: [<1,31840948,13,rs169546>, ...]
Without annotations:
lineage(CR:result[0,i]) = { geneIdList }
lineage(CR:result[1,j]) = { geneIdList }
With “f is index-preserving” annotations added (two processors annotated):
lineage(CR:result[0,i]) = { geneIdList[0] }
lineage(CR:result[1,j]) = { geneIdList[1] }
Granular lineage: recap
• Lineage query model accounts for granular traces over nested collections
• arbitrary nesting levels:
  – values are trees in general
  – lineage query identifies the correct sub-trees
• Lineage queries are efficient
  – recursion problem “compiled away” by query rewriting
  – (shameless claim - details omitted)
• But:
  – One single M-* processor can destroy granularity
  – in some cases annotations are a remedy
Towards provenance-friendly workflows
1. Define metrics for workflow provenance precision
  – how well is granularity preserved over a lineage trace?
  – what is the impact of M-* processors?
  – use to prioritize remedial actions
2. Make workflows more provenance friendly:
  – Add knowledge (static):
    • “lightweight annotations” [MBZ+08] -- see IPAW08
  – Add knowledge (dynamic):
    • provenance-active workflow processors
  – Redesign processors / workflow
    • general guidelines, provenance friendly patterns
[MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble, Data lineage model for Taverna workflows with lightweight annotation requirements, Procs. International Provenance and Annotation Workshop (IPAW 2008)
Lineage precision: example
Values annotated on the example dataflow (one element is labelled f):
a = [a1, a2]
b = [b1, b2]
c = [c1, c2, c3]
d = [d1, d2]
e = [e1, e2]
lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }
precision = (1 + .5 + .5) / 3 = 2/3
Precision relative to a sub-graph
• Refining the previous idea:
  – precision relative to a set O of output variables and a set I of input variables
• because not all variables are equally interesting...
• weights WI, WO account for relative importance of variables
[Figure: sub-graph with inputs I1, I2 and outputs O1, O2, O3]
$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$
$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j=1}^{|O|} \left( W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)} \right)$$
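The precision measure above can be turned into a short function. This is a sketch under assumptions: it reads len(p) as the length of the index path and nl(X) as the nesting level of the variable, and the paths and nesting levels in the usage data are invented so that the three per-variable scores come out as 1, 0.5 and 0.5, matching the arithmetic of the earlier example.

```python
from fractions import Fraction

def precision(lineage, w_in, w_out, nesting):
    """Weighted sum over outputs of W_I(X) * len(p) / nl(X) for each lineage entry."""
    total = Fraction(0)
    for out_var, entries in lineage.items():
        inner = Fraction(0)
        for var, path in entries:
            inner += w_in[var] * Fraction(len(path), nesting[var])
        total += w_out[out_var] * inner
    return total

# Hypothetical data: one output of interest, three equally weighted inputs
lineage = {"P4:Y1": [("P0:Y", (1,)), ("P2:X", (1,)), ("P3:X", (2,))]}
nesting = {"P0:Y": 1, "P2:X": 2, "P3:X": 2}
w_in = {"P0:Y": Fraction(1, 3), "P2:X": Fraction(1, 3), "P3:X": Fraction(1, 3)}
w_out = {"P4:Y1": Fraction(1)}
result = precision(lineage, w_in, w_out, nesting)
# result == Fraction(2, 3)
```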
Impact of M-* processors on precision
[Figure: sub-graph with inputs I1, I2, outputs O1, O2, O3 and a processor P]
Count the number of variables in O that can be reached from P, as a weighted sum:
$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$
$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$
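The impact measure is a weighted reachability count, which can be sketched with a plain graph traversal. The graph and the weights below are hypothetical; only the reach / impact definitions come from the slide.

```python
def reachable(graph, start):
    """Return every node reachable from start by following downstream edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def impact(graph, processor, weights):
    """Sum the weights of the output variables reachable from the processor."""
    downstream = reachable(graph, processor)
    return sum(w for o, w in weights.items() if o in downstream)

# Hypothetical dataflow: P feeds Q, which feeds outputs O1 and O2; R feeds O3
graph = {"P": ["Q"], "Q": ["O1", "O2"], "R": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}
score_p = impact(graph, "P", weights)
score_r = impact(graph, "R", weights)
```

A processor with a high score affects more of the outputs the user cares about, so it is a better candidate for annotation or refactoring.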
Improving provenance precision
• Impact used to prioritize user actions on processors
• Precision used to assess improvement
• add index-preserving annotations
  ✓ illustrated earlier
• refactor M-* processors
• make processors provenance-active
Refactoring M-* → 1-1
Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s
Refactored processor signature: s → s
Values annotated on the dataflow:
[139618, 83093]
139618 → <16, 23520984>; 83093 → <13, 31786617>
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809]
Dot product: <16, 23560179>, <13, 31871809>
Results: [<1,23553692,16,rs152451>, ...], [<1,31840948,13,rs169546>, ...]
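The refactoring idea can be sketched as replacing a list-in, list-out function with a one-in, one-out function that the engine iterates over. The function names and the `lookup` table are hypothetical; the values are the ones shown on the slide.

```python
def gene_to_region_batch(gene_ids, lookup):
    """List in, list out: a black box as far as element-level lineage is concerned."""
    return [lookup[g] for g in gene_ids]

def gene_to_region(gene_id, lookup):
    """Refactored 1-1 version: one gene ID in, one region out."""
    return lookup[gene_id]

def iterate(proc, items, *args):
    """Implicit iteration: call proc per element and record output index -> input index."""
    outputs = [proc(item, *args) for item in items]
    lineage = {i: i for i in range(len(items))}
    return outputs, lineage

lookup = {139618: (16, 23520984), 83093: (13, 31786617)}
regions, lineage = iterate(gene_to_region, [139618, 83093], lookup)
```

Both versions return the same values, but only the iterated 1-1 version lets the engine itself observe which input element produced which output element.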
Provenance-active processors
– Passive processors do not contribute explicit provenance info
– Provenance-active processors actively feed metadata to the lineage service
Example processors:
• P with X: l(s) = [a1, a2, a3] and Y: s = b
• P with X: l(s) = [a1, a2, a3] and Y: l(s) = [b1, b2]
Static annotations: aggregation f(); P is index-preserving
Dynamic annotations: b = X[i]; sorting: Y = Π(X); b = f(X[1]...X[k])
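A sorting processor is a natural candidate for a dynamic annotation, because the permutation is only known at run time. The sketch below shows one hypothetical way such a processor could report it; the function name and the shape of the dependency pairs are assumptions, not Taverna's API.

```python
def active_sort(xs):
    """Sort xs and report, for each output position, the input position it came from."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ys = [xs[i] for i in order]
    # One (output index, input index) pair per element, to be sent to the lineage service
    dependencies = [(j, i) for j, i in enumerate(order)]
    return ys, dependencies

ys, deps = active_sort([30, 10, 20])
# ys == [10, 20, 30]; deps == [(0, 1), (1, 2), (2, 0)]
```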
Open Provenance Model
• A graph notation to represent process provenance
  – independent of the provenance producers
  – suitable for exchanging provenance across different workflow systems
• State: draft 1.01 (July 2008)
Mapping to OPM - granularity issue
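[Figure, left: Taverna dataflow with P0 (inputs X1, X2; outputs Y1, Y2), P1 (X:s → Y:s) and P2 (X:s → Y:s), carrying values a, b, c, d, e, f]
[Figure, right: the corresponding OPM graph with processes P0, P1, P2, artifacts a, b, c, d, and used, wasGeneratedBy (wgb) and wasDerivedFrom edges; a granular wasDerivedFrom edge links b[p] and d[p’]]
How can this granular dependency be described for all arbitrary paths p?
Currently cannot be expressed using OPM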
Path mapping rules
[Figure: OPM graph with processes P1, P2, P3, artifacts a, b, c, d, and used, wgb and wasDerivedFrom edges; the edge between b[p] and d[p’] is marked “actual lineage”]
• wasDerivedFrom: static graph structure sufficient to provide this (in Taverna)
• actual lineage: but this is only known at query time (extensional enumeration not an option)
Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage
Hint: granularity is only determined by depth of the path
At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
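The slide does not spell out the path mapping rule, so the following is only a guess at what a local rule could look like. It assumes each iterated input contributes a number of leading index positions equal to its depth mismatch (offered depth minus declared depth); the function name, the `deltas` parameter and the strategy names are inventions of this sketch.

```python
def map_path(p, deltas, strategy):
    """Map an output index path p to one input index path per input port.

    deltas[k] is the assumed depth mismatch of input k.
    """
    if strategy == "cross":
        # Each input consumes its own slice of the output path, in port order
        paths, start = [], 0
        for d in deltas:
            paths.append(tuple(p[start:start + d]))
            start += d
        return paths
    if strategy == "dot":
        # All inputs are indexed by the same leading positions
        return [tuple(p[:d]) for d in deltas]
    raise ValueError(f"unknown strategy: {strategy}")

cross_paths = map_path((2, 5), [1, 1], "cross")
dot_paths = map_path((3,), [1, 1], "dot")
whole_input = map_path((2,), [1, 0], "cross")
```

An empty path, as for the second input in the last call, stands for the whole value, which is the coarse case where no granularity is available.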
Architecture provenance-active processors
[Figure: components shown are the Taverna workflow engine (inputs, outputs), external services, the p-active API, provenance events, the provenance manager, the provenance information repository, and the lineage query interface lin( P:Y, , Psel, E(D))]
1. Common content:
  – processor execution details
  – binding of input/output variables to values
  – completion status
2. Optional content for provenance-active processors:
  – explicit output → input dependency assertions:
    let I, O be the sets of input and output variables, respectively
    depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
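The two kinds of event content could be modelled as a small record type. This is a hypothetical data model for illustration only: the class and field names, and the "selection" dependency type, are assumptions, and "processor execution details" is reduced here to a processor name.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Depends:
    """depends(Y, X[p], depType): output Y depends on element p of input X."""
    output_var: str
    input_var: str
    path: tuple
    dep_type: str

@dataclass
class ProvenanceEvent:
    """One event describing a processor execution."""
    processor: str
    bindings: dict                                 # variable name -> value
    status: str                                    # completion status
    depends: list = field(default_factory=list)    # optional, provenance-active only

event = ProvenanceEvent(
    processor="P",
    bindings={"X": ["a1", "a2", "a3"], "Y": "a2"},
    status="completed",
    depends=[Depends("Y", "X", (1,), "selection")],
)
```

A passive processor would simply leave `depends` empty, so both kinds of processor can share one event format.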
Ongoing work
• Experimental evaluation:
  – to what extent is granularity a real practical problem?
  – Quantify provenance friendliness by analysing a large collection of workflows from myExperiment
  – Quantify available improvements (i.e. by refactoring)
• Compare collection management in Taverna with other workflow models
  – can we successfully exchange provenance graphs?
• Integration of the provenance service with the new version of Taverna
  – to be released before end of year