Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collection-based workflow...

Post on 29-Aug-2014

description

Missier, P., Paton, N., & Belhajjame, K. (2010). Fine-grained and efficient lineage querying of collection-based workflow provenance. Procs. EDBT. Lausanne, Switzerland.

transcript

Paolo Missier, Norman Paton, Khalid Belhajjame

Information Management Group, School of Computer Science, University of Manchester, UK

EDBT Conference, Lausanne, Switzerland, March 2010

Fine-grained and efficient lineage querying of collection-based workflow provenance

EDBT, Lausanne, March 2010

Problem statement and outline

• Setting:
  – Black box provenance of workflow data products
• Fine-grained provenance:
  – tracking provenance through collections: motivation
  – functional model of collection-oriented workflow processing
• Efficient query processing:
  – leveraging the functional model to achieve efficient processing for a simple query model
• Experimental evaluation

Computing provenance through a graph

• Provenance graph is an unfolding of the workflow graph structure
  – large: grows with size of input
  – lineage queries involve graph traversal

[Figure from: Z. Bao, S. Cohen-Boulakia, S. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," Procs. ICDE, 2009.]
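To make the cost of "lineage queries involve graph traversal" concrete, here is a minimal sketch of the naive approach: a breadth-first walk over a provenance graph stored as a dictionary. The graph encoding, the function name `naive_lineage` and the node names are assumptions made for illustration; they are not taken from the talk.

```python
from collections import deque


def naive_lineage(derived_from, start):
    """Return every value that `start` transitively depends on.

    `derived_from` maps a value to the values it was directly derived from
    (a hypothetical encoding of the provenance graph).
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Toy provenance graph: y11 was computed from a1 and b1, and so on.
toy_graph = {
    "y11": ["a1", "b1"],
    "a1": ["v1"],
    "b1": ["w"],
}
ancestors = naive_lineage(toy_graph, "y11")
```

The walk visits every edge reachable from the start node, so its cost grows with the size of the provenance graph, which in turn grows with the size of the input collections.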


Main result

• Query the provenance of individual collection elements
• But avoid computing transitive closures on the provenance graph
  – potentially very large
• Traverse the workflow graph instead: much smaller
• This results in substantial performance improvement for typical queries

[Figure: workflow graph (processors Q, R and P; ports X, X1, X2, X3, Y; index annotations [1][n], [], [][1]) next to the provenance graph it unfolds into (values v1 ... vn, w, a1 ... an, b1 ... bm, y11 ... ymn), with y = [ [y11 ... y1n], ... [ym1 ... ymn] ]]


Workflow as data integrator

[Figure: example workflow, with stages labelled "QTL genomic regions", "genes in QTL" and "metabolic pathways (KEGG)"]


Motivation for fine-grained provenance

List-structured KEGG gene ids:

[ [ mmu:26416 ], [ mmu:328788 ] ]

[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]

[Figure: element-level links between geneIDs and pathways]

Problem statement and outline

• Setting:
  – Black box provenance of workflow data products
• Fine-grained provenance:
  – tracking provenance through collection elements
  – motivation
  ➡ functional model of collection-oriented workflow processing


Functional model for collection processing /1

① Simple processing: service expects atomic values, receives atomic values

P has input ports X1, X2, X3 and output ports Y1, Y2; it receives v1, v2, v3 and returns w1, w2.

② Simple iteration: service expects atomic values, receives input list

v = [v1 ... vn] on port X of P ➠ n instances P1 ... Pn, where Pi maps vi to wi

w = [w1 ... wn]

dd = 0, ad = 1, δ = 1

lineage(wi) = vi

③ Extension: service expects atomic values, receives input nested list

v = [[...], ...[...]]

w = [[..] ...[...]]

dd = 0, ad = 2, δ = 2

lineage(wij) = vij


Functional model /2

v = [[...], ...[...]] on input port X of P, with dd = m and ad = n, so δ = n-m ≥ 0

w = [[..] ...[...]] on output port Y, with depth = n-m

The simple iteration model generalises by induction to a generic δ = n-m.

This leads to a recursive functional formulation for simple collection processing. For v = ⟨a1 ... an⟩:

$$
(\mathit{eval}_l\ P\ v) =
\begin{cases}
(P\ v) & \text{if } l = 0 \\
(\mathit{map}\ (\mathit{eval}_{l-1}\ P)\ v) & \text{if } l > 0
\end{cases}
$$
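The recursive definition of eval translates almost literally into code. The following is a minimal sketch, assuming Python lists stand in for workflow collections and that `level` is the depth mismatch δ; the service `to_upper` is a made-up stand-in for a processor that expects one atomic value.

```python
from typing import Any, Callable


def eval_l(level: int, proc: Callable[[Any], Any], value: Any) -> Any:
    """Apply `proc` to the items found `level` nesting levels inside `value`.

    level == 0: call the processor on the value as it is.
    level  > 0: map the (level - 1) evaluation over the list.
    """
    if level == 0:
        return proc(value)
    return [eval_l(level - 1, proc, item) for item in value]


def to_upper(text: str) -> str:
    """Hypothetical service that expects a single atomic value."""
    return text.upper()


# Nested list of depth 2 fed to a port of declared depth 0, so delta = 2.
gene_ids = [["mmu:26416"], ["mmu:328788"]]
result = eval_l(2, to_upper, gene_ids)
```

Because the output keeps the shape of the input, the element at position [i][j] of `result` was produced from the element at [i][j] of `gene_ids`, which is the lineage rule stated for the nested-list case.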


Functional model - multiple inputs /3

P has input ports X1, X2, X3 and output port Y:

v1 = [v11 ... v1n]
v2 = [v21 ... v2k]
v3 = [v31 ... v3m]
w = [ [w11 ... w1n], ... [wm1 ... wmn] ]

dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

Cross product involving v1 and v3 (but not v2, since δ2 = 0):

v1 ⊗ v3 = [ [ <v1i, v3j> | j:1..m ] | i:1..n ]

and including v2: [ [ <v1i, v2, v3j> | j:1..m ] | i:1..n ]

lineage(wij) = <v1i, v2, v3j>

In general, for two lists a and b:

$$ a \otimes b = [\, [\, \langle a_i, b_j \rangle \mid b_j \in b \,] \mid a_i \in a \,] $$

$$ (\mathit{eval}_2\ P\ \langle a, b \rangle) = (\mathit{map}\ (\mathit{eval}_1\ P)\ a \otimes b) $$
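The three-port example can be played through in a few lines. This is a sketch under stated assumptions: Python lists model the collections, indexes are 0-based (the slides count from 1), and the names `cross`, `run_p`, `lineage` and `join_inputs` are invented for illustration.

```python
def cross(a, b):
    """Binary cross product: one row per element of a, one column per element of b."""
    return [[(ai, bj) for bj in b] for ai in a]


def run_p(proc, v1, v2, v3):
    """Iterate over v1 and v3 (delta = 1 each); pass v2 whole (delta = 0)."""
    return [[proc(x1, v2, x3) for x3 in v3] for x1 in v1]


def lineage(i, j, v1, v2, v3):
    """Inputs behind output element w[i][j], read off the index alone."""
    return (v1[i], v2, v3[j])


def join_inputs(x1, x2, x3):
    """Hypothetical processor: X1 and X3 take atoms, X2 takes a whole list."""
    return f"{x1}-{len(x2)}-{x3}"


v1 = ["a", "b"]
v2 = ["k1", "k2"]
v3 = ["x", "y", "z"]
w = run_p(join_inputs, v1, v2, v3)
```

Here `w[1][2]` is `"b-2-z"`, and `lineage(1, 2, v1, v2, v3)` returns `("b", ["k1", "k2"], "z")` without inspecting `w` or any recorded trace.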


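Generalised cross product

Binary product, δ = 1:

$$ a \otimes b = [\, [\, \langle a_i, b_j \rangle \mid b_j \in b \,] \mid a_i \in a \,] $$

$$ (\mathit{eval}_2\ P\ \langle a, b \rangle) = (\mathit{map}\ (\mathit{eval}_1\ P)\ a \otimes b) $$

Generalized to arbitrary depths:

$$
(v, d_1) \otimes (w, d_2) =
\begin{cases}
[\, [\, (v_i, w_j) \mid w_j \in w \,] \mid v_i \in v \,] & \text{if } d_1 > 0,\ d_2 > 0 \\
[\, (v_i, w) \mid v_i \in v \,] & \text{if } d_1 > 0,\ d_2 = 0 \\
[\, (v, w_j) \mid w_j \in w \,] & \text{if } d_1 = 0,\ d_2 > 0 \\
(v, w) & \text{if } d_1 = 0,\ d_2 = 0
\end{cases}
$$

...and to n operands:

$$ \otimes_{i:1 \dots n} (v_i, d_i) $$

Finally: general functional semantics for collection-based processing

$$
(\mathit{eval}_l\ P\ \langle (v_1, d_1), \dots, (v_n, d_n) \rangle) =
\begin{cases}
(P\ \langle v_1, \dots, v_n \rangle) & \text{if } l = 0 \\
(\mathit{map}\ (\mathit{eval}_{l-1}\ P)\ \otimes_{i:1 \dots n} \langle v_i, d_i \rangle) & \text{if } l > 0
\end{cases}
$$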
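A small sketch may help in reading the four-case product. `cross2` follows the case analysis as written. `eval_general` is not the slide's formulation: it is one plausible direct reading of the n-operand semantics, assuming each operand contributes as many nesting levels as its mismatch and that operands are iterated left to right. All names are hypothetical.

```python
def cross2(v, d1, w, d2):
    """Binary product of (v, d1) and (w, d2), following the four cases."""
    if d1 > 0 and d2 > 0:
        return [[(vi, wj) for wj in w] for vi in v]
    if d1 > 0:
        return [(vi, w) for vi in v]
    if d2 > 0:
        return [(v, wj) for wj in w]
    return (v, w)


def eval_general(proc, operands, bound=()):
    """Evaluate proc over operands given as a list of (value, mismatch) pairs.

    Assumption: the leftmost operand with a positive mismatch supplies the
    outermost iteration level; operands with mismatch 0 are passed whole.
    """
    if not operands:
        return proc(*bound)
    (value, depth), rest = operands[0], operands[1:]
    if depth == 0:
        return eval_general(proc, rest, bound + (value,))
    return [eval_general(proc, [(item, depth - 1)] + rest, bound) for item in value]


def as_tuple(x1, x2, x3):
    """Hypothetical processor that just returns its arguments."""
    return (x1, x2, x3)


out = eval_general(as_tuple, [([1, 2], 1), (["k"], 0), ([3, 4], 1)])
```

With mismatches (1, 0, 1) the result has depth 2, and `out[i][j]` equals `(v1[i], v2, v3[j])`, matching the lineage rule of the three-port example.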


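Static mapping of output to input values

The iteration structure can be determined statically:

dd(X1) = 0, ad(v1) = 1 ➠ δ1 = 1
dd(X2) = 1, ad(v2) = 1 ➠ δ2 = 0
dd(X3) = 0, ad(v3) = 1 ➠ δ3 = 1

This leads to a simple mapping rule:

index of an output list value ➔ {index of input values}

Y[i.j] → X1[i], X2[], X3[j]

In general, the output index [i1 . i2 . ... . ik] is a concatenation of segments, one per input port X1, X2, ..., Xk, of lengths δ1, δ2, ..., δk.

[Figure: processor P with ports annotated (0,1), (1,1), (0,1) for the inputs X1, X2, X3 and (0,2) for the output Y]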
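The mapping rule amounts to slicing the output index by the per-port mismatches. Below is a minimal sketch, assuming indexes are tuples of integers and ports are listed in the order in which they take part in the iteration; `split_output_index` and the `ports` dictionary are illustrative names, not part of Taverna.

```python
def split_output_index(index, deltas):
    """Split an output index into one segment per input port.

    `deltas` lists each port's mismatch (actual depth minus declared depth).
    A port with mismatch 0 gets the empty segment: its whole value is used.
    """
    if len(index) != sum(deltas):
        raise ValueError("index length must equal the sum of the mismatches")
    segments = []
    position = 0
    for delta in deltas:
        segments.append(tuple(index[position:position + delta]))
        position += delta
    return segments


# (declared depth, actual depth) per input port, as in the example above.
ports = {"X1": (0, 1), "X2": (1, 1), "X3": (0, 1)}
deltas = [actual - declared for declared, actual in ports.values()]
mapping = dict(zip(ports, split_output_index((4, 7), deltas)))
```

For the output element Y[4.7] this yields X1[4], X2[] and X3[7], and nothing in the computation depends on the data that flowed through the workflow.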


Workflow graph as index for query processing

1. traverse the workflow graph (small) rather than the provenance trace (large)
2. use static prediction of iterations to trace through collection elements
3. “parachute” into the actual trace only at the end

[Figure: workflow graph (processors Q, R and P; ports X, X1, X2, X3, Y; index annotations [1][n], [], [][1]) next to the provenance graph (values v1 ... vn, w, a1 ... an, b1 ... bm, y11 ... ymn), with y = [ [y11 ... y1n], ... [ym1 ... ymn] ]]
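The three steps can be sketched on a toy workflow. Everything concrete here is an assumption made for illustration: the wiring of Q and R into P, the dictionaries `DELTAS`, `LINKS` and `TRACE`, and the simplification that every processor has a single output whose index is the concatenation of its input segments.

```python
# Hypothetical workflow graph: ordered input ports with their mismatch.
DELTAS = {
    "P": [("X1", 1), ("X2", 0), ("X3", 1)],
    "Q": [("X", 1)],
    "R": [("X", 1)],
}
# Hypothetical links: (processor, input port) -> upstream processor.
LINKS = {("P", "X1"): "Q", ("P", "X3"): "R"}
# Hypothetical recorded trace, consulted only at the very end.
TRACE = {("Q", "X", (1,)): "v2", ("R", "X", (2,)): "b3"}


def lineage_query(processor, index, focus):
    """Steps 1 and 2: walk the workflow graph upstream, projecting the index.

    Returns (processor, port, index) triples for processors in `focus`.
    """
    answers = []
    position = 0
    for port, delta in DELTAS[processor]:
        segment = tuple(index[position:position + delta])
        position += delta
        if processor in focus:
            answers.append((processor, port, segment))
        upstream = LINKS.get((processor, port))
        if upstream is not None:
            answers.extend(lineage_query(upstream, segment, focus))
    return answers


def parachute(answers):
    """Step 3: look up the recorded values for the computed bindings only."""
    return [TRACE.get(key) for key in answers]


bindings = lineage_query("P", (1, 2), focus={"Q", "R"})
values = parachute(bindings)
```

The traversal touches three processors regardless of how long the input lists are; the trace is read only for the two bindings that the query actually asks about.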


From granularity to efficient query processing

Summary so far:

• whenever iterations are involved, we can trace the provenance of individual elements of a processor’s output

• iterations are explained in terms of a functional model and based on list depth discrepancies

• The relationships between output and input indexes are derived using the workflow specification graph (statically)

How about expressivity and efficient processing of lineage queries?

Lineage query model /1

I - Focusing: not all processors are interesting:
– report lineage only at specified nodes in the graph

Lineage query model /2

II - Granularity: trace lineage for individual elements within collections, when possible!

List-structured KEGG gene ids:

[ [ mmu:26416 ], [ mmu:328788 ] ]

[ path:mmu04010 MAPK signaling, path:mmu04370 VEGF signaling ]

[ [ path:mmu04210 Apoptosis, path:mmu04010 MAPK signaling, ...], [ path:mmu04010 MAPK signaling, path:mmu04620 Toll-like receptor, ...] ]

Lineage query model and language

<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="paths_per_gene" index="[3,4]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes" />
  </focus>
</pquery>

• <scope>: workflow scope; optionally specifies one or more runs for the target workflow (defaults to latest run)
• <select>: port values for which lineage is sought: global outputs or processor-qualified
• <focus>: processors where lineage is to be reported, possibly workflow-qualified
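To show how the three parts of such a query could be read programmatically, here is a sketch that parses the example with Python's standard `xml.etree.ElementTree`. It is not Taverna's own parser; the function `parse_pquery` and the shape of the dictionary it returns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

QUERY = """
<pquery>
  <scope workflow="keggPathways">
    <run id="ae1e2b6b-3bc5-4c93-a250-c4dd0210c3b3"/>
  </scope>
  <select>
    <outputPort name="paths_per_gene" index="[1,2]"/>
    <outputPort name="paths_per_gene" index="[3,4]"/>
    <outputPort name="commonPathways" index="[1]"/>
    <processor name="getPathwayDescriptions">
      <outputPort name="return"/>
    </processor>
  </select>
  <focus>
    <processor name="get_pathway_by_genes" />
  </focus>
</pquery>
"""


def parse_pquery(text):
    """Extract scope, selected ports and focus processors from a query."""
    root = ET.fromstring(text)
    scope = root.find("scope")
    select = root.find("select")
    selected = []
    # Global workflow outputs carry no processor name.
    for port in select.findall("outputPort"):
        selected.append((None, port.get("name"), port.get("index")))
    # Processor-qualified ports are nested inside a <processor> element.
    for processor in select.findall("processor"):
        for port in processor.findall("outputPort"):
            selected.append((processor.get("name"), port.get("name"), port.get("index")))
    return {
        "workflow": scope.get("workflow"),
        "runs": [run.get("id") for run in scope.findall("run")],
        "select": selected,
        "focus": [p.get("name") for p in root.find("focus").findall("processor")],
    }


parsed = parse_pquery(QUERY)
```

A missing `index` attribute comes back as `None`, which a query processor could treat as a request for the whole port value.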

Fine granularity + efficient processing

• Scalability:
  – query time depends on size of workflow graph, not size of provenance graph
  – workflow graphs are small, fit in memory, can be indexed easily
• Graceful degradation:
  – worst case is a completely unfocused query
  – no worse than other approaches
• Fine-grain answers provided at the same time

Outline

• Assumption:
  – Black box provenance of workflow data products
✓ Fine-grained provenance:
  – tracking provenance through collection elements
  – motivation, functional model of collection-oriented workflow processing
✓ Efficient query processing:
  ✓ leveraging the functional model to achieve efficient processing for a simple query model
➡ Experimental evaluation

Experimental setup - I

• Performance evaluation performed on programmatically generated dataflows
  – the “T-towers”

Parameters:
- size of the lists involved
- length of the paths
- includes one cross product

Experimental results - II

[Figure: workflow pre-processing time by graph size. x axis: path length l (10, 28, 50, 75, 100, 150, 200); y axis: time (ms); data labels: 124, 256, 440, 700, 1000, 2024, 3257]

[Figure: response times for PP on unfocused queries (l=150). x axis: % of processors in target set (1.33% to 46.67%); y axis: time (ms), 0 to 130]

[Figure: performance degradation on fully unfocused queries. Two panels: d=10 (10 elements in input list) and d=150 (150 elements in input list). x axis: path length l (10 to 150); y axis: time (ms), 0 to 225. Series: NI (labelled NR in the d=150 panel), NGQ, PP. Legend annotations: naive traversal of provenance graph; single multi-join query; workflow as index (our approach)]

Summary

• A simple lineage query model for Taverna
  – grounded in the semantics of collection-oriented processing
  – combines fine-grain answers with efficient query processing
• Ongoing work:
  – space compression, indexing
  – QLP?
  – semantic provenance (initial paper submitted)
• Currently part of the Taverna 2.1 release