Semiring Provenance in Relational Databases
Foundations, Representation Systems, Implementation

Pierre Senellart
École normale supérieure, Research University Paris

21 February 2017, Oxford University, Information Systems seminar

Provenance management

• Data management is all about query evaluation
• What if we want something more than the query result?
    • Where does the result come from?
    • Why was this result obtained?
    • How was the result produced?
    • What is the probability of the result?
    • How many times was the result obtained?
    • How would the result change if part of the input data was missing?
    • What is the minimal security clearance I need to see the result?
    • What is the most economical way of obtaining the result?
    • How can a result be explained in layman's terms?
• Provenance management: along with query evaluation, record additional bookkeeping information that makes it possible to answer the questions above


Data model

• Relational data model: data decomposed into relations, with labeled attributes...
• ... with an extra provenance annotation for each tuple (think of it first as a tuple id)

name      position      city      classification  prov
John      Director      New York  unclassified    t1
Paul      Janitor       New York  restricted      t2
Dave      Analyst       Paris     confidential    t3
Ellen     Field agent   Berlin    secret          t4
Magdalen  Double agent  Paris     top secret      t5
Nancy     HR director   Paris     restricted      t6
Susan     Analyst       Berlin    secret          t7


Relations and databases

Formally:

• A relational schema $\mathcal{R}$ is a finite sequence of distinct attribute names; the arity of $\mathcal{R}$ is $|\mathcal{R}|$
• A database schema is a mapping from relation names to relational schemas, with finite support
• A tuple over relational schema $\mathcal{R}$ is a mapping from $\mathcal{R}$ to data values; each tuple comes with a provenance annotation
• A relation instance (or relation) over $\mathcal{R}$ is a finite set of tuples over $\mathcal{R}$
• A database instance (or database) over database schema $\mathcal{D}$ is a mapping from the support of $\mathcal{D}$ that sends each relation name $R$ to a relation instance over $\mathcal{D}(R)$


Queries

• A query is an arbitrary function that maps databases over a fixed database schema $\mathcal{D}$ to relations over some relational schema $\mathcal{R}$
• The query does not consider or produce any provenance annotations; we will give semantics for the provenance annotations of the output, based on those of the input
• A query $q$ is monotone if for any two databases $D_1$, $D_2$ over $\mathcal{D}$ with $D_1 \subseteq D_2$, we have $q(D_1) \subseteq q(D_2)$
• In practice, one often restricts to specific query languages:
    • Monadic second-order logic (MSO)
    • First-order logic (FO) or the relational algebra
    • SQL with aggregate functions
    • etc.


Outline

Provenance
    Preliminaries
    Boolean provenance
    Semiring provenance
    And beyond...

Representation Systems for Provenance

Implementing Provenance Support

Conclusion


Boolean provenance [Imielinski and Lipski, 1984]

• $X = \{x_1, x_2, \ldots, x_n\}$: finite set of Boolean events
• Provenance annotation: Boolean function over $X$, i.e., a function of the form $(X \to \{\bot, \top\}) \to \{\bot, \top\}$
• Interpretation: possible-world semantics
    • every valuation $\nu : X \to \{\bot, \top\}$ denotes a possible world of the database
    • the provenance of a tuple on $\nu$ evaluates to $\bot$ or $\top$ depending on whether this tuple exists in that possible world
    • for example, if every tuple of a database is annotated with the indicator function of a distinct Boolean event, the set of possible worlds is the set of all subdatabases


Example of possible worlds

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

$\nu$:   t1      t2      t3      t4      t5      t6      t7
         $\top$  $\top$  $\top$  $\top$  $\top$  $\top$  $\top$


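$\nu$:   t1      t2      t3      t4      t5      t6      t7
         $\top$  $\bot$  $\top$  $\bot$  $\top$  $\bot$  $\top$

In this possible world, only the tuples t1, t3, t5, and t7 (John, Dave, Magdalen, Susan) are present.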


Boolean provenance of query results

• $\nu(D)$: the subdatabase of $D$ where all tuples whose provenance annotation evaluates to $\bot$ by $\nu$ are removed
• The Boolean provenance $\mathrm{prov}_{q,D}(t)$ of tuple $t \in q(D)$ is the function:

$$\nu \mapsto \begin{cases} \top & \text{if } t \in q(\nu(D)) \\ \bot & \text{otherwise} \end{cases}$$

Example (What cities are in the Personal table?)

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city      prov
New York  $t_1 \lor t_2$
Paris     $t_3 \lor t_5 \lor t_6$
Berlin    $t_4 \lor t_7$
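The definition of Boolean provenance can be checked by brute force on this example. The following is a minimal sketch, not an efficient method: it enumerates all $2^7$ valuations, builds the corresponding subdatabase, and records the valuations under which a city is still returned. The names `PERSONAL`, `cities`, `sub_database`, `boolean_provenance`, and `models_of_disjunction` are illustrative choices, not part of the talk.

```python
from itertools import product

# The Personal table, keyed by tuple id (the Boolean event of each tuple).
PERSONAL = {
    "t1": ("John", "Director", "New York", "unclassified"),
    "t2": ("Paul", "Janitor", "New York", "restricted"),
    "t3": ("Dave", "Analyst", "Paris", "confidential"),
    "t4": ("Ellen", "Field agent", "Berlin", "secret"),
    "t5": ("Magdalen", "Double agent", "Paris", "top secret"),
    "t6": ("Nancy", "HR director", "Paris", "restricted"),
    "t7": ("Susan", "Analyst", "Berlin", "secret"),
}


def cities(db):
    """The query q: which cities appear in the table?"""
    return {row[2] for row in db.values()}


def sub_database(db, valuation):
    """nu(D): drop every tuple whose event is false under the valuation."""
    return {tid: row for tid, row in db.items() if valuation[tid]}


def boolean_provenance(query, db, answer):
    """Truth table of prov_{q,D}(answer): the set of valuations mapped to
    true, each valuation given as the frozenset of events it sets to true."""
    ids = sorted(db)
    models = set()
    for bits in product([False, True], repeat=len(ids)):
        valuation = dict(zip(ids, bits))
        if answer in query(sub_database(db, valuation)):
            models.add(frozenset(t for t in ids if valuation[t]))
    return models


def models_of_disjunction(ids, events):
    """Models of the disjunction of the given events, for comparison."""
    models = set()
    for bits in product([False, True], repeat=len(ids)):
        true_events = frozenset(t for t, bit in zip(ids, bits) if bit)
        if true_events & events:
            models.add(true_events)
    return models


ALL_IDS = sorted(PERSONAL)
new_york = boolean_provenance(cities, PERSONAL, "New York")
paris = boolean_provenance(cities, PERSONAL, "Paris")
```

Comparing `new_york` with `models_of_disjunction(ALL_IDS, {"t1", "t2"})` confirms that the provenance of New York is the function $t_1 \lor t_2$, as in the table above.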


Application: Probabilistic databases [Green and Tannen, 2006, Suciu et al., 2011]

• Tuple-independent database: each tuple $t$ in a database is annotated with an independent probability $\Pr(t)$ of existing
• Probability of a possible world $D' \subseteq D$:

$$\Pr(D') = \prod_{t \in D'} \Pr(t) \times \prod_{t' \in D \setminus D'} (1 - \Pr(t'))$$

• Probability of a tuple for a query $q$ over $D$:

$$\Pr(t \in q(D)) = \sum_{\substack{D' \subseteq D \\ t \in q(D')}} \Pr(D')$$

• If $\Pr(x_i) := \Pr(t_i)$ where $x_i$ is the provenance annotation of tuple $t_i$, then $\Pr(t \in q(D)) = \Pr(\mathrm{prov}_{q,D}(t))$
• Computing the probability of a query in probabilistic databases thus amounts to computing Boolean provenance, and then computing the probability of a Boolean function
• Also works for more complex probabilistic models


Example of probability computation

name      position      city      classification  prov  prob
John      Director      New York  unclassified    t1    0.5
Paul      Janitor       New York  restricted      t2    0.7
Dave      Analyst       Paris     confidential    t3    0.3
Ellen     Field agent   Berlin    secret          t4    0.2
Magdalen  Double agent  Paris     top secret      t5    1.0
Nancy     HR director   Paris     restricted      t6    0.8
Susan     Analyst       Berlin    secret          t7    0.2

city      prov                     prob
New York  $t_1 \lor t_2$           $1 - (1 - 0.5) \times (1 - 0.7) = 0.85$
Paris     $t_3 \lor t_5 \lor t_6$  $1.00$
Berlin    $t_4 \lor t_7$           $1 - (1 - 0.2) \times (1 - 0.2) = 0.36$
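These probabilities can be cross-checked against the possible-world definition. The sketch below, assuming the tuple-independent model and the probabilities of the table above, sums the probabilities of all possible worlds in which a city is returned and compares the result with the closed form for a disjunction of independent events. The names `TUPLES`, `world_probability`, `city_probability`, and `independent_or` are illustrative.

```python
from itertools import product
from math import prod

# Tuple id -> (city, probability of the tuple existing).
TUPLES = {
    "t1": ("New York", 0.5),
    "t2": ("New York", 0.7),
    "t3": ("Paris", 0.3),
    "t4": ("Berlin", 0.2),
    "t5": ("Paris", 1.0),
    "t6": ("Paris", 0.8),
    "t7": ("Berlin", 0.2),
}


def world_probability(present):
    """Probability of the possible world containing exactly `present`."""
    return prod(p if tid in present else 1 - p
                for tid, (_, p) in TUPLES.items())


def city_probability(city):
    """Sum of the probabilities of all worlds where the city is returned."""
    ids = list(TUPLES)
    total = 0.0
    for bits in product([False, True], repeat=len(ids)):
        present = {tid for tid, bit in zip(ids, bits) if bit}
        if any(TUPLES[tid][0] == city for tid in present):
            total += world_probability(present)
    return total


def independent_or(city):
    """Closed form for a disjunction of independent events."""
    return 1 - prod(1 - p for c, p in TUPLES.values() if c == city)
```

The enumeration is exponential in the number of tuples; it is only meant to make the semantics concrete, whereas the closed form relies on the provenance being a disjunction of independent events.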


What now?

• How to compute Boolean provenance for practical query languages? What complexity?
• Can we do more with provenance?
• How should we represent provenance annotations?
• How can we implement support for provenance management in a relational database management system?


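Commutative semiring $(K, 0, 1, \oplus, \otimes)$

• Set $K$ with distinguished elements $0$, $1$
• $\oplus$: associative, commutative operator, with identity $0_K$:
    • $a \oplus (b \oplus c) = (a \oplus b) \oplus c$
    • $a \oplus b = b \oplus a$
    • $a \oplus 0 = 0 \oplus a = a$
• $\otimes$: associative, commutative operator, with identity $1_K$:
    • $a \otimes (b \otimes c) = (a \otimes b) \otimes c$
    • $a \otimes b = b \otimes a$
    • $a \otimes 1 = 1 \otimes a = a$
• $\otimes$ distributes over $\oplus$: $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$
• $0$ is annihilating for $\otimes$: $a \otimes 0 = 0 \otimes a = 0$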


Example semirings

• $(\mathbb{N}, 0, 1, +, \times)$: counting semiring
• $(\{\bot, \top\}, \bot, \top, \lor, \land)$: Boolean semiring
• $(\{\text{unclassified}, \text{restricted}, \text{confidential}, \text{secret}, \text{top secret}\}, \text{top secret}, \text{unclassified}, \min, \max)$: security semiring
• $(\mathbb{N} \cup \{\infty\}, \infty, 0, \min, +)$: tropical semiring
• $(\{\text{Boolean functions over } X\}, \bot, \top, \lor, \land)$: semiring of Boolean functions over $X$
• $(\mathbb{N}[X], 0, 1, +, \times)$: semiring of integer-valued polynomials with variables in $X$ (also called How-semiring or universal semiring, see further)
• $(\mathcal{P}(\mathcal{P}(X)), \emptyset, \{\emptyset\}, \cup, \Cup)$: Why-semiring over $X$ ($A \Cup B := \{a \cup b \mid a \in A, b \in B\}$)
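Several of these semirings are small enough to write down directly. The following sketch encodes five of them as plain Python values and checks the commutative semiring axioms on a handful of sample elements. The `Semiring` class and the `satisfies_axioms` helper are illustrative names; testing on samples is of course not a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

LEVELS = ["unclassified", "restricted", "confidential", "secret", "top secret"]
RANK = {level: i for i, level in enumerate(LEVELS)}


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


COUNTING = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
BOOLEAN = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
SECURITY = Semiring(
    "top secret", "unclassified",
    lambda a, b: min(a, b, key=RANK.get),   # plus is min on clearance levels
    lambda a, b: max(a, b, key=RANK.get),   # times is max on clearance levels
)
TROPICAL = Semiring(float("inf"), 0, min, lambda a, b: a + b)
WHY = Semiring(
    frozenset(), frozenset({frozenset()}),
    lambda a, b: a | b,
    lambda a, b: frozenset(x | y for x in a for y in b),
)


def satisfies_axioms(k, samples):
    """Check the commutative semiring axioms on all triples of samples."""
    for a, b, c in product(samples, repeat=3):
        ok = (
            k.plus(a, k.plus(b, c)) == k.plus(k.plus(a, b), c)
            and k.plus(a, b) == k.plus(b, a)
            and k.plus(a, k.zero) == a
            and k.times(a, k.times(b, c)) == k.times(k.times(a, b), c)
            and k.times(a, b) == k.times(b, a)
            and k.times(a, k.one) == a
            and k.times(a, k.plus(b, c)) == k.plus(k.times(a, b), k.times(a, c))
            and k.times(a, k.zero) == k.zero
        )
        if not ok:
            return False
    return True
```

Passing the operations as functions keeps later code generic: the same query evaluation code can be run with any of these instances.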


Semiring provenance [Green et al., 2007]

• We fix a semiring $(K, 0, 1, \oplus, \otimes)$
• We assume provenance annotations are in $K$
• We consider a query $q$ from the positive relational algebra (selection, projection, renaming, cross product, union; joins can be simulated with renaming, cross product, selection, projection)
• We define a semantics for the provenance of a tuple $t \in q(D)$ inductively on the structure of $q$


Selection, renaming

Provenance annotations of selected tuples are unchanged

Example ($\rho_{\text{name} \to n}(\sigma_{\text{city} = \text{``New York''}}(R))$)

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

n position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2


Projection

Provenance annotations of identical, merged, tuples are $\oplus$-ed

Example ($\Pi_{\text{city}}(R)$)

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city      prov
New York  $t_1 \oplus t_2$
Paris     $t_3 \oplus t_5 \oplus t_6$
Berlin    $t_4 \oplus t_7$


Union

Provenance annotations of identical, merged, tuples are $\oplus$-ed

Example ($\Pi_{\text{city}}(\sigma_{\text{ends-with}(\text{position}, \text{``agent''})}(R)) \cup \Pi_{\text{city}}(\sigma_{\text{position} = \text{``Analyst''}}(R))$)

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city    prov
Paris   $t_3 \oplus t_5$
Berlin  $t_4 \oplus t_7$


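Cross product

Provenance annotations of combined tuples are $\otimes$-ed

Example ($\Pi_{\text{city}}(\sigma_{\text{ends-with}(\text{position}, \text{``agent''})}(R)) \bowtie \Pi_{\text{city}}(\sigma_{\text{position} = \text{``Analyst''}}(R))$)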

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city    prov
Paris   $t_3 \otimes t_5$
Berlin  $t_4 \otimes t_7$
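The inductive rules for selection, projection, union, and cross product fit in a few lines of code. The sketch below represents an annotated relation as a dictionary from tuples of values to annotations and, as an assumption made for readability, instantiates the semiring with the Why-semiring so that results can be read as sets of tuple ids. The classification column is dropped for brevity, and all function names are illustrative.

```python
def leaf(tid):
    """Why-semiring annotation of a base tuple: {{tid}}."""
    return frozenset({frozenset({tid})})


def why_plus(a, b):
    return a | b


def why_times(a, b):
    return frozenset(x | y for x in a for y in b)


# Annotated relation R over (name, position, city).
R = {
    ("John", "Director", "New York"): leaf("t1"),
    ("Paul", "Janitor", "New York"): leaf("t2"),
    ("Dave", "Analyst", "Paris"): leaf("t3"),
    ("Ellen", "Field agent", "Berlin"): leaf("t4"),
    ("Magdalen", "Double agent", "Paris"): leaf("t5"),
    ("Nancy", "HR director", "Paris"): leaf("t6"),
    ("Susan", "Analyst", "Berlin"): leaf("t7"),
}


def select(rel, predicate):
    """Selection: annotations of selected tuples are unchanged."""
    return {row: ann for row, ann in rel.items() if predicate(row)}


def project(rel, columns, plus):
    """Projection: annotations of merged tuples are combined with plus."""
    out = {}
    for row, ann in rel.items():
        key = tuple(row[i] for i in columns)
        out[key] = plus(out[key], ann) if key in out else ann
    return out


def union(left, right, plus):
    """Union: annotations of identical tuples are combined with plus."""
    out = dict(left)
    for row, ann in right.items():
        out[row] = plus(out[row], ann) if row in out else ann
    return out


def cross(left, right, times):
    """Cross product: annotations of combined tuples are combined with times."""
    return {l + r: times(la, ra)
            for l, la in left.items() for r, ra in right.items()}


agents = project(select(R, lambda r: r[1].endswith("agent")), [2], why_plus)
analysts = project(select(R, lambda r: r[1] == "Analyst"), [2], why_plus)
unioned = union(agents, analysts, why_plus)
# Join on city, simulated with cross product, selection, and projection.
joined = project(select(cross(agents, analysts, why_times),
                        lambda r: r[0] == r[1]), [0], why_plus)
```

Here `unioned[("Paris",)]` is `{{t3}, {t5}}`, matching $t_3 \oplus t_5$, and `joined[("Paris",)]` is `{{t3, t5}}`, matching $t_3 \otimes t_5$.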


What can we do with it?

counting semiring: count the number of times a tuple can be derived, multiset semantics

Boolean semiring: determines if a tuple exists when a subdatabase is selected

security semiring: determines the minimum clearance level required to get a tuple as a result

tropical semiring: minimum-weight way of deriving a tuple (think shortest path in a graph)

Boolean functions: Boolean provenance, as previously defined

integer polynomials: universal provenance, see further

Why-semiring: Why-provenance [Buneman et al., 2001], set of combinations of tuples needed for a tuple to exist


Example of security provenance

$\Pi_{\text{city}}(\sigma_{\text{name} < \text{name2}}(\Pi_{\text{name},\text{city}}(R) \bowtie \rho_{\text{name} \to \text{name2}}(\Pi_{\text{name},\text{city}}(R))))$

name      position      city      prov
John      Director      New York  unclassified
Paul      Janitor       New York  restricted
Dave      Analyst       Paris     confidential
Ellen     Field agent   Berlin    secret
Magdalen  Double agent  Paris     top secret
Nancy     HR director   Paris     restricted
Susan     Analyst       Berlin    secret

city      prov
New York  restricted
Paris     confidential
Berlin    secret
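The result table can be recomputed by hand-coding this query over the security semiring, where $\oplus$ is $\min$ and $\otimes$ is $\max$ on clearance levels. This is a minimal sketch specialised to this one query (cities with at least two distinct people), not a general evaluator; `sec_plus`, `sec_times`, and `cities_with_two_people` are illustrative names.

```python
LEVELS = ["unclassified", "restricted", "confidential", "secret", "top secret"]
RANK = {level: i for i, level in enumerate(LEVELS)}


def sec_plus(a, b):
    """Alternative derivations: the least restrictive one suffices."""
    return min(a, b, key=RANK.get)


def sec_times(a, b):
    """Joint use of two tuples: the most restrictive level is needed."""
    return max(a, b, key=RANK.get)


# Projection of R on (name, city), annotated with the security level.
R = {
    ("John", "New York"): "unclassified",
    ("Paul", "New York"): "restricted",
    ("Dave", "Paris"): "confidential",
    ("Ellen", "Berlin"): "secret",
    ("Magdalen", "Paris"): "top secret",
    ("Nancy", "Paris"): "restricted",
    ("Susan", "Berlin"): "secret",
}


def cities_with_two_people(rel):
    """Self-join on city with name < name2, then projection on city."""
    result = {}
    for (name, city), ann in rel.items():
        for (name2, city2), ann2 in rel.items():
            if city == city2 and name < name2:
                combined = sec_times(ann, ann2)
                result[city] = (sec_plus(result[city], combined)
                                if city in result else combined)
    return result


clearances = cities_with_two_people(R)
```

For Paris, the three pairs give top secret, confidential, and top secret, and their $\min$ is confidential: Dave and Nancy together are the cheapest witness.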


Notes [Green et al., 2007]

• Computing provenance has a PTIME data complexity overhead
• Semiring homomorphisms commute with provenance computation: if there is a homomorphism from $K$ to $K'$, then one can compute the provenance in $K$, apply the homomorphism, and obtain the same result as when computing provenance in $K'$
• The integer polynomial semiring is universal: there is a unique homomorphism to any other commutative semiring that respects a given valuation of the variables
• This means all computations can be performed in the universal semiring, and homomorphisms applied next
• Two equivalent queries can have two different provenance annotations on the same database, in some semirings
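Universality can be illustrated with a small polynomial representation. In the sketch below, an element of $\mathbb{N}[X]$ is a `Counter` mapping each monomial (a sorted tuple of variable names) to its coefficient, and `evaluate` applies the homomorphism determined by a valuation of the variables into a target semiring. The representation and all names are assumptions made for this illustration; the polynomial is the $\mathbb{N}[X]$ provenance of Paris for the self-join query of the security example.

```python
from collections import Counter


def var(x):
    """The polynomial consisting of the single variable x."""
    return Counter({(x,): 1})


def p_plus(p, q):
    return p + q


def p_times(p, q):
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


def evaluate(poly, valuation, zero, one, plus, times):
    """Homomorphism from N[X] to a semiring K extending the valuation."""
    total = zero
    for monomial, coeff in poly.items():
        term = one
        for x in monomial:
            term = times(term, valuation[x])
        for _ in range(coeff):  # an integer coefficient is a repeated sum
            total = plus(total, term)
    return total


# Provenance of Paris: pairs of distinct people located in Paris.
t3, t5, t6 = var("t3"), var("t5"), var("t6")
paris = p_plus(p_plus(p_times(t3, t5), p_times(t3, t6)), p_times(t5, t6))

LEVELS = ["unclassified", "restricted", "confidential", "secret", "top secret"]
RANK = {level: i for i, level in enumerate(LEVELS)}

security = evaluate(
    paris,
    {"t3": "confidential", "t5": "top secret", "t6": "restricted"},
    "top secret", "unclassified",
    lambda a, b: min(a, b, key=RANK.get),
    lambda a, b: max(a, b, key=RANK.get),
)
count = evaluate(paris, {"t3": 1, "t5": 1, "t6": 1},
                 0, 1, lambda a, b: a + b, lambda a, b: a * b)
```

The same polynomial yields `confidential` in the security semiring, in agreement with the direct computation, and `3` in the counting semiring, the number of derivations.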


Semirings with monus [Amer, 1984, Geerts and Poggi, 2010]

• Some semirings can be equipped with a monus operator $\ominus$ satisfying:
    • $a \oplus (b \ominus a) = b \oplus (a \ominus b)$
    • $(a \ominus b) \ominus c = a \ominus (b \oplus c)$
    • $a \ominus a = 0 \ominus a = 0$
• Boolean function semiring with $\land\lnot$, Why-semiring with $\setminus$, counting semiring with truncated difference...
• Most natural semirings (but not all semirings [Amarilli and Monet, 2016]!) can be extended into semirings with monus
• Sometimes strange things happen [Amsterdamer et al., 2011]: e.g., $\otimes$ does not always distribute over $\ominus$
• Allows supporting full relational algebra with the $\setminus$ operator, still PTIME
• Semantics for Boolean function semiring coincides with that of Boolean provenance
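The monus axioms can be tested on samples for the three examples just listed. The sketch below checks them for the counting semiring with truncated difference, the Boolean semiring with $a \land \lnot b$, and the Why-semiring with set difference, and then exhibits one case in the Why-semiring where $\otimes$ fails to distribute over $\ominus$. The helper `check_monus` and the sample values are illustrative.

```python
from itertools import product


def check_monus(zero, plus, monus, samples):
    """Check the three monus axioms on all triples of samples."""
    for a, b, c in product(samples, repeat=3):
        if plus(a, monus(b, a)) != plus(b, monus(a, b)):
            return False
        if monus(monus(a, b), c) != monus(a, plus(b, c)):
            return False
        if monus(a, a) != zero or monus(zero, a) != zero:
            return False
    return True


def why_times(a, b):
    return frozenset(x | y for x in a for y in b)


counting_ok = check_monus(0, lambda a, b: a + b,
                          lambda a, b: max(a - b, 0), range(5))
boolean_ok = check_monus(False, lambda a, b: a or b,
                         lambda a, b: a and not b, [False, True])

x, y = frozenset({"x"}), frozenset({"y"})
why_samples = [frozenset(), frozenset({x}), frozenset({y}),
               frozenset({x, y}), frozenset({x | y})]
why_ok = check_monus(frozenset(), lambda a, b: a | b,
                     lambda a, b: a - b, why_samples)

# Times does not distribute over monus in the Why-semiring:
A, B, C = frozenset({x, y}), frozenset({x}), frozenset({y})
lhs = why_times(A, B - C)                  # A times (B monus C)
rhs = why_times(A, B) - why_times(A, C)    # (A times B) monus (A times C)
```

With these values `lhs` is `{{x}, {x, y}}` while `rhs` is `{{x}}`, so the two sides differ.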


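Difference

Provenance annotations of diff-ed tuples are $\ominus$-ed

Example ($\Pi_{\text{city}}(\sigma_{\text{ends-with}(\text{position}, \text{``agent''})}(R)) \setminus \Pi_{\text{city}}(\sigma_{\text{position} = \text{``Analyst''}}(R))$)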

name position city classification prov

John Director New York unclassified t1

Paul Janitor New York restricted t2

Dave Analyst Paris confidential t3

Ellen Field agent Berlin secret t4

Magdalen Double agent Paris top secret t5

Nancy HR director Paris restricted t6

Susan Analyst Berlin secret t7

city    prov
Paris   $t_5 \ominus t_3$
Berlin  $t_4 \ominus t_7$


Provenance for aggregates [Amsterdamer et al., 2011, Fink et al., 2012]

• Trickier to define provenance for queries with aggregation, even in the Boolean case
• One can construct a $K$-semimodule $K \otimes M$ for each monoid aggregate $M$ over a provenance database with a semiring in $K$
• Data values become elements of the semimodule

Example ($\mathrm{count}(\Pi_{\text{name}}(\sigma_{\text{city} = \text{``Paris''}}(R)))$)

$t_3 \otimes 1 + t_5 \otimes 1 + t_6 \otimes 1$


Where-provenance [Buneman et al., 2001]

• Different form of provenance: captures which database values the output values come from
• Bipartite graph of provenance: two attribute values are connected if one can be produced from the other
• Axiomatized in [Buneman et al., 2001, Cheney et al., 2009]
• Cannot be captured by provenance semirings [Cheney et al., 2009], because of renaming (does not keep track of relation attributes), projection (does not remember which attribute values still exist), join (in a join, an output value comes from two different input values)


Representation systems

• In the Boolean semiring, the counting semiring, the security semiring: provenance annotations are elementary
• In the Boolean function semiring, the universal semiring, etc., provenance annotations can become quite complex
• Need for compact representations of provenance annotations
• Lower the provenance computation complexity as much as possible


Provenance formulas

• Quite straightforward

• Formalism used in most of the provenance literature

• PTIME data complexity

• Expanding formulas (e.g., computing the monomials of an $\mathbb{N}[X]$ provenance annotation) can result in an exponential blowup

Example: Is there a city with both an analyst and an agent, and if Paris is such a city, is there a director in the agency?

$((t_3 \otimes t_5) \oplus (t_4 \otimes t_7)) \otimes ((t_3 \otimes t_5) \otimes t_1)$
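The exponential blowup caused by expanding a formula into monomials is easy to reproduce. The sketch below uses an illustrative formula that is not taken from the talk: the product $(x_0 \oplus y_0) \otimes \cdots \otimes (x_{n-1} \oplus y_{n-1})$ has $2n$ variable occurrences but expands into $2^n$ distinct monomials. The names `expand` and `formula_size` are assumptions of this example.

```python
from itertools import product


def expand(factors):
    """Expand a product of sums into its set of monomials.

    `factors` is a list of sums, each sum being a list of variable names;
    a monomial is returned as a sorted tuple of variable names.
    """
    return {tuple(sorted(choice)) for choice in product(*factors)}


def formula_size(factors):
    """Number of variable occurrences in the unexpanded formula."""
    return sum(len(factor) for factor in factors)


n = 10
factors = [[f"x{i}", f"y{i}"] for i in range(n)]
monomials = expand(factors)
```

For `n = 10` the formula has 20 variable occurrences and 1024 monomials; each additional factor doubles the number of monomials while adding only two occurrences to the formula.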


Provenance circuits [Deutch et al., 2014, Amarilli et al., 2015]

• Use arithmetic circuits (Boolean circuits for Boolean provenance) to represent provenance
• Every time an operation reuses a previously computed result, link to the previously created circuit gate
• Allow linear-time data complexity of provenance computation when restricted to bounded-treewidth databases [Amarilli et al., 2015] (MSO queries for Boolean provenance, positive relational algebra queries for arbitrary semirings)
• Formulas can be quadratically larger than provenance circuits for MSO formulas, (log log)-larger for positive relational algebra queries [Wegener, 1987, Amarilli et al., 2016]
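Gate sharing can be sketched with a small circuit class in which requesting an already existing gate returns the existing gate instead of creating a copy. This is an illustrative data structure, not the representation used in the cited papers, and the example circuit is a made-up formula in which the subterm $t_3 \otimes t_5$ is used twice.

```python
from functools import reduce


class Circuit:
    """Arithmetic circuit with shared gates; a gate is (kind, arguments)."""

    def __init__(self):
        self.gates = []   # gate id -> (kind, args), children before parents
        self._ids = {}    # (kind, args) -> gate id, used to share gates

    def _gate(self, kind, args):
        key = (kind, args)
        if key not in self._ids:
            self._ids[key] = len(self.gates)
            self.gates.append(key)
        return self._ids[key]

    def input(self, name):
        return self._gate("input", (name,))

    def plus(self, *children):
        return self._gate("plus", children)

    def times(self, *children):
        return self._gate("times", children)

    def evaluate(self, root, valuation, plus, times):
        """Evaluate every gate once, in creation (topological) order."""
        values = []
        for kind, args in self.gates:
            if kind == "input":
                values.append(valuation[args[0]])
            else:
                op = plus if kind == "plus" else times
                values.append(reduce(op, (values[c] for c in args)))
        return values[root]


c = Circuit()
t3, t4, t5, t7 = (c.input(name) for name in ("t3", "t4", "t5", "t7"))
paris = c.times(t3, t5)
berlin = c.times(t4, t7)
either = c.plus(paris, berlin)
root = c.times(either, c.times(t3, t5))   # reuses the existing paris gate

derivations = c.evaluate(root, {"t3": 1, "t4": 1, "t5": 1, "t7": 1},
                         lambda a, b: a + b, lambda a, b: a * b)
```

The circuit has 8 gates even though the corresponding formula mentions $t_3 \otimes t_5$ twice, and evaluation visits each gate exactly once.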


Example provenance circuit

[Figure: provenance circuit over the input gates t1, t3, t4, t5, t7]


OBDD and d-DNNF

• Various subclasses of Boolean circuits commonly used:
    OBDD: Ordered Binary Decision Diagrams
    d-DNNF: deterministic Decomposable Negation Normal Form
• OBDDs can be obtained in PTIME data complexity on bounded-treewidth databases [Amarilli et al., 2016]
• d-DNNFs can be obtained in linear-time data complexity on bounded-treewidth databases
• Application: probabilistic query evaluation in linear-time data complexity on bounded-treewidth databases (d-DNNF evaluation is in linear time)
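Linear-time probability evaluation on a d-DNNF follows from its two properties: the children of an AND gate share no variable (decomposability), so their probabilities multiply, and the children of an OR gate are mutually exclusive (determinism), so their probabilities add. The sketch below assumes a simple tuple encoding of nodes, which is a choice of this example, and rewrites $t_1 \lor t_2$ as the deterministic $t_1 \lor (\lnot t_1 \land t_2)$.

```python
from math import prod


def probability(node, prob, cache=None):
    """Probability of a d-DNNF given independent variable probabilities.

    Nodes are ("var", x), ("neg", x), ("and", [children]) or
    ("or", [children]); results are cached per node, so each shared
    node is evaluated once.
    """
    if cache is None:
        cache = {}
    key = id(node)
    if key not in cache:
        kind = node[0]
        if kind == "var":
            value = prob[node[1]]
        elif kind == "neg":
            value = 1 - prob[node[1]]
        elif kind == "and":   # decomposable: children are independent
            value = prod(probability(child, prob, cache) for child in node[1])
        elif kind == "or":    # deterministic: children are exclusive
            value = sum(probability(child, prob, cache) for child in node[1])
        else:
            raise ValueError(f"unknown node kind: {kind}")
        cache[key] = value
    return cache[key]


PROB = {"t1": 0.5, "t2": 0.7, "t3": 0.3, "t5": 1.0, "t6": 0.8}

# t1 or t2, written deterministically as t1 or (not t1 and t2).
new_york = ("or", [("var", "t1"),
                   ("and", [("neg", "t1"), ("var", "t2")])])

# t3 or t5 or t6, written as t5 or (not t5 and (t3 or (not t3 and t6))).
paris = ("or", [("var", "t5"),
                ("and", [("neg", "t5"),
                         ("or", [("var", "t3"),
                                 ("and", [("neg", "t3"), ("var", "t6")])])])])
```

Evaluating `new_york` gives $0.5 + 0.5 \times 0.7 = 0.85$, the value obtained earlier for New York; the function does not check that its input really is deterministic and decomposable.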


Provenance cycluits [Amarilli et al., 2017]

• Cycluit (cyclic circuit): arithmetic circuit with cycles

• Well-defined semantics on some semirings where infinite loops do not matter
• Allows computing provenance in linear-time combined complexity for recursive queries of a certain form (ICG-Datalog of bounded body size [Amarilli et al., 2017], capturing $\alpha$-acyclic conjunctive queries, 2RPQs, etc.), on bounded-treewidth databases
• Related to provenance equation systems and formal series introduced in [Green et al., 2007]


Desiderata for a provenance-aware DBMS

• Extends a widely used database management system

• Easy to deploy

• Easy to use, transparent for the user

• Provenance automatically maintained as the user interacts with the database management system
• Provenance computation benefits from query optimization within the DBMS
• Allow probability computation based on provenance
• Any form of provenance can be computed on demand: Boolean provenance, semiring provenance in any semiring (possibly with monus), aggregate provenance, where-provenance


ProvSQL: Semiring provenance within PostgreSQL I

• Lightweight extension/plugin for PostgreSQL $\geq$ 9.4
• Provenance annotations stored as UUIDs, in an extra attribute of each provenance-aware relation
• A provenance circuit relating UUIDs of elementary provenance annotations and arithmetic gates, stored as a table
• All computations done in the universal semiring (more precisely, with monus, in the free semiring with monus)
• Query rewriting to automatically compute output provenance attributes in terms of the query and input provenance attributes:
    • Duplicate elimination (DISTINCT, set union) results in aggregation of provenance values with $\oplus$

ProvSQL: Semiring provenance within PostgreSQL II

• Cross products and joins result in combination of provenance values with ⊗

• Difference is rewritten as a join, with combination of provenance values with ⊖

• Additional circuit gates on projection and join, to support where-provenance

• Probability computation from the provenance circuits
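
The circuit-based design described on these two slides can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ProvSQL's actual code or schema: the `circuit` dictionary stands in for the circuit table, the gate kinds and helper names are hypothetical, and probability is computed by brute-force enumeration of possible worlds, which is exponential and only meant to show what is being computed.

```python
import itertools
import uuid

# Hypothetical stand-in for the circuit table: gate UUID -> (kind, children).
# Kinds: "input" (elementary annotation), "plus", "times", "monus".
circuit = {}

def gate(kind, *children):
    """Add a gate to the circuit table and return its fresh UUID."""
    token = uuid.uuid4()
    circuit[token] = (kind, children)
    return token

def evaluate(token, semiring, valuation):
    """Evaluate gate `token` in `semiring`, given values for the input gates."""
    kind, children = circuit[token]
    if kind == "input":
        return valuation[token]
    values = [evaluate(c, semiring, valuation) for c in children]
    if kind == "monus":
        return semiring["monus"](values[0], values[1])
    if kind == "plus":
        op, result = semiring["plus"], semiring["zero"]
    else:  # "times"
        op, result = semiring["times"], semiring["one"]
    for v in values:
        result = op(result, v)
    return result

BOOLEAN = {"zero": False, "one": True,
           "plus": lambda a, b: a or b,
           "times": lambda a, b: a and b,
           "monus": lambda a, b: a and not b}
COUNTING = {"zero": 0, "one": 1,
            "plus": lambda a, b: a + b,
            "times": lambda a, b: a * b,
            "monus": lambda a, b: max(a - b, 0)}

def probability(token, input_probabilities):
    """Probability that the Boolean provenance is true, assuming independent
    input tuples; enumerates all possible worlds (illustration only)."""
    inputs = list(input_probabilities)
    total = 0.0
    for world in itertools.product([False, True], repeat=len(inputs)):
        valuation = dict(zip(inputs, world))
        if evaluate(token, BOOLEAN, valuation):
            weight = 1.0
            for t, present in valuation.items():
                p = input_probabilities[t]
                weight *= p if present else 1.0 - p
            total += weight
    return total

# Elementary annotations r1, r2, s1 and the output annotation
# (r1 (+) r2) (x) s1: duplicate elimination followed by a join.
r1, r2, s1 = gate("input"), gate("input"), gate("input")
out = gate("times", gate("plus", r1, r2), s1)
# A difference adds a monus gate: out (-) r2.
diff = gate("monus", out, r2)

count = evaluate(out, COUNTING, {r1: 2, r2: 3, s1: 4})
prob = probability(out, {r1: 0.5, r2: 0.5, s1: 0.5})
diff_true = evaluate(diff, BOOLEAN, {r1: True, r2: False, s1: True})
print(count, prob, diff_true)
```

Because the circuit is built once in the free structure, the same gates are reused for every target semiring; only the `semiring` dictionary passed to `evaluate` changes.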

Challenges

• Low-level access to PostgreSQL data structures in extensions

• No simple query rewriting mechanism

• SQL is much less clean than the relational algebra

• Multiset semantics by default in SQL

• SQL is a very rich language, with many different ways of expressing the same thing

• Inherent limitations: e.g., no aggregation within recursive queries

• Implementing provenance computation should not slow down the computation

• User-defined functions, updates, etc.: unclear how provenance should work
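
The point about multiset semantics can be checked with a small experiment. The sketch below is an assumption-laden illustration: it uses Python's built-in SQLite rather than PostgreSQL so that it runs without a server, and the tables `r` and `s` are hypothetical. It shows that a plain join returns one row per derivation, so the row count matches the counting-semiring provenance, while set semantics has to be requested with DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r(a TEXT, b TEXT);
    CREATE TABLE s(b TEXT, c TEXT);
    INSERT INTO r VALUES ('x', 'u'), ('x', 'u'), ('x', 'v');
    INSERT INTO s VALUES ('u', 'z'), ('v', 'z'), ('v', 'z');
""")

# Multiset semantics (the SQL default): one output row per derivation.
bag = conn.execute(
    "SELECT r.a, s.c FROM r JOIN s ON r.b = s.b").fetchall()

# Set semantics must be requested explicitly with DISTINCT.
distinct = conn.execute(
    "SELECT DISTINCT r.a, s.c FROM r JOIN s ON r.b = s.b").fetchall()

# The bag multiplicity of ('x', 'z') is its counting-semiring provenance:
# (2 (x) 1) (+) (1 (x) 2) = 4.
print(len(bag), distinct)
```

A rewriting-based system therefore has to decide, for each query, whether duplicates are kept as separate annotated rows or merged with ⊕.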

PostgreSQL: Current status

• Supported SQL language features:

• Regular SELECT-FROM-WHERE queries (aka conjunctive queries with multiset semantics)

• JOIN queries (regular joins and outer joins; semijoins and antijoins are not currently supported)

• SELECT queries with nested SELECT subqueries in the FROM clause

• GROUP BY queries (without aggregation)

• SELECT DISTINCT queries (i.e., set semantics)

• UNION’s or UNION ALL’s of SELECT queries

• EXCEPT queries

• Probability computation, where-provenance, under development

• Longer term: aggregate computation

• Try it from https://github.com/PierreSenellart/provsql

Demonstration

Conclusion

• Quite rich foundations of provenance management, including:

• Different types of provenance

• Semiring formalism to unify most provenance forms

• (Partial) extensions for difference, recursive queries, aggregation

• Compact provenance representation formalisms

• Some theory still missing:

• Provenance and updates

• Going beyond the relational algebra for full semiring provenance

• Now is the time to work on concrete implementation

• Need good implementation to convince users they should track provenance!

• How to combine provenance computation and efficient query evaluation, e.g., through tree decompositions?

Bibliography I

Antoine Amarilli and Mikaël Monet. Example of a naturally ordered semiring which is not an m-semiring. http://math.stackexchange.com/questions/1966858, 2016.

Antoine Amarilli, Pierre Bourhis, and Pierre Senellart. Provenance circuits for trees and treelike instances. In Proc. ICALP, pages 56–68, Kyoto, Japan, July 2015.

Antoine Amarilli, Pierre Bourhis, and Pierre Senellart. Tractable lineages on treelike instances: Limits and extensions. In Proc. PODS, pages 355–370, San Francisco, USA, June 2016.

Antoine Amarilli, Pierre Bourhis, Mikaël Monet, and Pierre Senellart. Combined tractability of query evaluation via tree automata and cycluits. In ICDT, 2017.

K. Amer. Algebra Universalis, 18, 1984.

Bibliography II

Yael Amsterdamer, Daniel Deutch, and Val Tannen. On the limitations of provenance for queries with difference. In TaPP, 2011.

Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. Why and where: A characterization of data provenance. In Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Proceedings, 2001.

James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4), 2009.

Daniel Deutch, Tova Milo, Sudeepa Roy, and Val Tannen. Circuits for Datalog provenance. In ICDT, 2014.

Bibliography III

Robert Fink, Larisa Han, and Dan Olteanu. Aggregation in probabilistic databases via knowledge compilation. Proceedings of the VLDB Endowment, 5(5):490–501, 2012.

Floris Geerts and Antonella Poggi. On database query languages for k-relations. J. Applied Logic, 8(2), 2010.

Todd J. Green and Val Tannen. Models for incomplete and probabilistic information. IEEE Data Eng. Bull., 29(1), 2006.

Todd J. Green, Grigoris Karvounarakis, and Val Tannen. Provenance semirings. In PODS, 2007.

Tomasz Imielinski and Witold Lipski Jr. Incomplete information in relational databases. J. ACM, 31(4), 1984.

Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic Databases. Morgan & Claypool, 2011.

Ingo Wegener. The Complexity of Boolean Functions. Wiley, 1987.

