Annotated XML: Queries and Provenance
Nate Foster, TJ Green, Val Tannen
University of Pennsylvania
Symposium on Database Provenance, University of Edinburgh
May 21, 2008
Need to Track XML Provenance
• For scientific data processing [Buneman+ 01]
– Tree-structured data, heterogeneous sources
– XML is the natural data model
– Data annotated with source info; annotations need to be propagated during query processing
• For incomplete/probabilistic data [Sen.&Abit. 06]
– Query output annotated with Boolean formulas
– Annotations indicate correlations between source data and output data
• For data warehousing [Cui+ 00]
– Even when data is relational, often have XML views
Provenance for Relational Algebra Views
Source R:
A  B  C
a  b  c
d  b  e
f  g  e

View V:
A  B
a  c
a  e
d  c
d  e
f  e

V := π_AB((π_AC(R) ⋈ π_C(R)) ∪ (π_AB(R) ⋈ π_BC(R)))
Semiring-Annotated Relations [PODS07]
• Associate each tuple in database with an annotation from a commutative semiring (K, +, ·, 0, 1)
• Combine and propagate annotations during (positive) relational query processing
– ⋈, ×, ∩ combine annotations using ·
– π, ∪ combine annotations using +
– σ multiplies annotations by 0 or 1
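The propagation rules above can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `Semiring`, `row`, `accumulate`, `project`, `union`, `join` and `select` are hypothetical, a K-relation is stored as a dict from tuples to annotations, and tuples with annotation 0 are simply not stored.

```python
from collections import namedtuple

# A commutative semiring (K, +, ·, 0, 1) packaged as functions and constants.
Semiring = namedtuple("Semiring", "plus times zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)  # bag semantics

def row(**attrs):
    """Hypothetical helper: a tuple is a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def accumulate(K, rel, t, k):
    """Add annotation k to tuple t using the semiring's +."""
    rel[t] = K.plus(rel.get(t, K.zero), k)

def project(K, rel, attrs):
    """π: tuples that collapse to the same projection have their annotations summed."""
    out = {}
    for t, k in rel.items():
        accumulate(K, out, frozenset((a, v) for a, v in t if a in attrs), k)
    return out

def union(K, r1, r2):
    """∪: annotations of a tuple present in both inputs are summed."""
    out = dict(r1)
    for t, k in r2.items():
        accumulate(K, out, t, k)
    return out

def join(K, r1, r2):
    """⋈: annotations of the two joined tuples are multiplied."""
    out = {}
    for t1, k1 in r1.items():
        for t2, k2 in r2.items():
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                accumulate(K, out, t1 | t2, K.times(k1, k2))
    return out

def select(K, rel, pred):
    """σ: multiply by 1 (keep) or by 0 (drop); zero-annotated tuples are not stored."""
    return {t: k for t, k in rel.items() if pred(dict(t))}

# Bag-semantics example: annotations are multiplicities.
R = {row(A="a", B="b"): 2, row(A="d", B="b"): 3}
S = {row(B="b", C="c"): 5}
P = project(NAT, join(NAT, R, S), {"C"})
```

Here `P` holds the single tuple (C = c) with multiplicity 2·5 + 3·5. Passing a different `Semiring` value changes the meaning of the annotations without touching the operators.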
Annotated Relations Example
R:
A  B  C  | annotation
a  b  c  | p
d  b  e  | r
f  g  e  | s

V:
A  B  | annotation
a  c  | 2p²
a  e  | pr
d  c  | pr
d  e  | 2r² + rs
f  e  | 2s² + rs

V := π_AB((π_AC(R) ⋈ π_C(R)) ∪ (π_AB(R) ⋈ π_BC(R)))
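The annotations in the view can be recomputed mechanically. The sketch below assumes the view has the shape π_AC((π_AB(R) ⋈ π_BC(R)) ∪ (π_AC(R) ⋈ π_BC(R))), which is an assumption about the query and names the output columns A and C rather than A and B; with that shape the polynomials come out as in the table. Polynomials in N[X] are represented as a `Counter` from monomials (sorted tuples of variable names) to coefficients; all helper names are hypothetical.

```python
from collections import Counter

def var(name):
    """The polynomial consisting of a single variable."""
    return Counter({(name,): 1})

def plus(p, q):
    return p + q                      # add coefficients monomial by monomial

def times(p, q):
    out = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def row(**attrs):
    return frozenset(attrs.items())

def project(rel, attrs):
    out = {}
    for t, k in rel.items():
        key = frozenset((a, v) for a, v in t if a in attrs)
        out[key] = plus(out.get(key, Counter()), k)
    return out

def union(r1, r2):
    out = dict(r1)
    for t, k in r2.items():
        out[t] = plus(out.get(t, Counter()), k)
    return out

def join(r1, r2):
    out = {}
    for t1, k1 in r1.items():
        for t2, k2 in r2.items():
            d1, d2 = dict(t1), dict(t2)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                key = t1 | t2
                out[key] = plus(out.get(key, Counter()), times(k1, k2))
    return out

# Source relation R with provenance tokens p, r, s.
R = {row(A="a", B="b", C="c"): var("p"),
     row(A="d", B="b", C="e"): var("r"),
     row(A="f", B="g", C="e"): var("s")}

AB, BC, AC = (project(R, set(cols)) for cols in ("AB", "BC", "AC"))
V = project(union(join(AB, BC), join(AC, BC)), {"A", "C"})
```

For instance, `V[row(A="d", C="e")]` is the polynomial with monomials r·r (coefficient 2) and r·s (coefficient 1), i.e. 2r² + rs.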
Semiring Bestiary
• (B, ∨, ∧, ⊥, ⊤) Set semantics
• (N, +, ·, 0, 1) Bag semantics
• (PosBool(B), ∨, ∧, ⊥, ⊤) Incomplete dbs
• (P(Ω), ∪, ∩, ∅, Ω) Probabilistic dbs
• (P(P(X)), ∪, ⋓, ∅, {∅}) Why-provenance, where A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}
• (C, min, max, absent, public) Security clearances
• (N[X], +, ·, 0, 1) Prov. polynomials
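A few of these structures are small enough to write down and check directly. The sketch below is illustrative only: `BESTIARY`, `pairwise_union` and `check_semiring` are hypothetical names, the clearance levels are encoded as integers 0 (public) to 4 (absent), and the laws are only tested on a handful of sample elements rather than proved.

```python
from itertools import product

def pairwise_union(A, B):
    """A ⋓ B := {a ∪ b : a ∈ A, b ∈ B}, on frozensets of frozensets."""
    return frozenset(a | b for a in A for b in B)

fs = frozenset

# name -> (plus, times, zero, one, sample elements)
BESTIARY = {
    "set semantics": (lambda a, b: a or b, lambda a, b: a and b,
                      False, True, [False, True]),
    "bag semantics": (lambda a, b: a + b, lambda a, b: a * b,
                      0, 1, [0, 1, 2, 3]),
    "why-provenance": (lambda A, B: A | B, pairwise_union, fs(), fs([fs()]),
                       [fs(), fs([fs()]), fs([fs("x")]), fs([fs("x"), fs("y")])]),
    # Clearances as ranks: 0 = public (the semiring's 1), 4 = absent (its 0).
    "clearances": (min, max, 4, 0, [0, 1, 2, 3, 4]),
}

def check_semiring(plus, times, zero, one, samples):
    """Check the commutative-semiring laws on all triples of sample elements."""
    for a, b, c in product(samples, repeat=3):
        assert plus(a, b) == plus(b, a) and times(a, b) == times(b, a)
        assert plus(plus(a, b), c) == plus(a, plus(b, c))
        assert times(times(a, b), c) == times(a, times(b, c))
        assert times(a, plus(b, c)) == plus(times(a, b), times(a, c))
        assert plus(a, zero) == a and times(a, one) == a
        assert times(a, zero) == zero
    return True

all_ok = all(check_semiring(*entry) for entry in BESTIARY.values())
```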
Our Contribution: Annotated XML
• We show how to decorate unordered XML data with semiring annotations: K-UXML
• We propagate the annotations for K-UXQuery (based on a large fragment of positive XQuery)
• We do this by generalizing the semantics of Nested Relational Calculus (NRC) to handle annotated values and to incorporate a recursive tree type and structural recursion on trees
• We prove a commutation-with-homomorphisms theorem, and show that it enables applications in security and incomplete databases
K-UXML
• No attributes, no text values, no repeated children (inessential); no order (essential!)
• Each node decorated with a value k from semiring K (1 “neutral,” 0 “not present”)
• K-collection: a finite set of elements annotated with values from K
• Formally, the children of a node form a K-collection of subtrees (to annotate root, also have a top-level K-collection)
Example: XPath on K-UXML
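Query: element r { $T//c }
[Figure: the source tree $T, rooted at a, with b nodes annotated x1 and x2 and c nodes annotated y1, y2 and y3; and the answer tree rooted at r, whose children are a leaf c annotated x1·y3 + y1·y2 and the c subtree annotated y1 (containing d, a, c annotated y2 and b annotated x2).]
Omitted annotations are 1 (and omitted subtrees have annotation 0)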
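A small sketch of how such an answer can be computed, taking K = N so that annotations are plain integers. Two assumptions are made loudly here: the tree `T` below is one hypothetical tree consistent with the answer on the slide (the exact shape of the figure is not recoverable from the transcript), and the helpers `tree`, `descendants_or_self` and `select_label` are illustrative names. Annotations are multiplied along a path and added across different paths that reach the same subtree.

```python
from collections import Counter

def tree(label, *children):
    """Build a tree as (label, frozen K-collection of (subtree, annotation) pairs)."""
    kids = Counter()
    for child, k in children:
        kids[child] += k
    return (label, frozenset(kids.items()))

def descendants_or_self(forest):
    """Annotated descendant-or-self step; forest is an iterable of (tree, k) pairs."""
    out = Counter()
    for t, k in forest:
        out[t] += k                                    # the tree itself
        for d, kd in descendants_or_self(t[1]).items():
            out[d] += k * kd                           # multiply along the path
    return out

def select_label(coll, label):
    """Keep only the trees whose root carries the given label."""
    return Counter({t: k for t, k in coll.items() if t[0] == label})

# Numeric stand-ins for the variables x1, x2, y1, y2, y3 (K = N).
x1, x2, y1, y2, y3 = 2, 11, 3, 5, 7
leaf_c, leaf_d = tree("c"), tree("d")
c_subtree = tree("c", (leaf_d, 1), (tree("a", (leaf_c, y2), (tree("b"), x2)), 1))
T = tree("a",
         (tree("b", (tree("a", (leaf_c, y3), (leaf_d, 1)), 1)), x1),
         (c_subtree, y1))

answer = select_label(descendants_or_self([(T, 1)]), "c")
```

With these stand-in values the leaf c ends up with annotation x1·y3 + y1·y2 and the larger c subtree keeps annotation y1.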
Example: For-Loops in K-UXQuery
Source, $S (notation: label^annotation[children]):
a^z[ b^x1[ d^y1 ], c^x2[ d^y2, e^y3 ] ]

Query: element p { for $t in $S return for $x in ($t)/* return ($x)/* }
(i.e., element p { $S/*/* })

Answer:
p[ d^(z·x1·y1 + z·x2·y2), e^(z·x2·y3) ]
Outline of Technical Approach
• Extend NRC with a recursive tree type
– satisfies: tree = label × { tree }
and an operation for structural recursion on trees (srt) [Robertson+ 07]
– apply to each child subtree, collect results using NRC big union
• Generalize NRC + srt to handle semiring-annotated complex values ⇒ NRC_K + srt
• Define semantics of K-UXQuery by translation to NRC_K + srt
Semantics of Small Union
• Sums annotations
⟦e1 ∪ e2⟧_K (x) := ⟦e1⟧_K (x) + ⟦e2⟧_K (x)
• Example:
Query: return ($S, $T) (in NRC: $S ∪ $T)
Source (the forests $S and $T, three trees in total): a^x[ b^y ], a^x[ b^y ], a^x[ b^z ]
Answer: a^(2x)[ b^y ], a^x[ b^z ]
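The small-union example can be replayed with K = N. In this sketch the split of the three source trees between `S` and `T` is an assumption (either split gives the same answer), the integers stand in for the variables x, y, z, and `leaf`, `node` and `small_union` are hypothetical helper names.

```python
from collections import Counter

def leaf(label):
    return (label, frozenset())

def node(label, *children):
    """children are (subtree, annotation) pairs."""
    return (label, frozenset(children))

def small_union(e1, e2):
    """Pointwise sum of annotations: (e1 ∪ e2)(x) = e1(x) + e2(x)."""
    out = Counter(e1)
    for value, k in e2.items():
        out[value] += k
    return out

x, y, z = 2, 3, 5                      # numeric stand-ins for the variables
a_by = node("a", (leaf("b"), y))       # the tree a[ b^y ]
a_bz = node("a", (leaf("b"), z))       # the tree a[ b^z ]

S = Counter({a_by: x})
T = Counter({a_by: x, a_bz: x})
answer = small_union(S, T)
```

The tree a[ b^y ] occurs in both inputs, so its annotations add up to 2x, while a[ b^z ] keeps annotation x.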
Semantics of Big Union
• Sums and multiplies annotations
⟦∪(x ∈ e1) e2⟧_K (y) := Σ_{i=1}^{n} ⟦e1⟧_K (a_i) · ⟦e2⟧_K[x := a_i] (y)
where the support (the set of elements with non-zero annotations) of ⟦e1⟧_K is {a_1, ..., a_n}
Big Union Example With K = N
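Query: return $T/*/* (in NRC: ∪(x ∈ $T) ∪(y ∈ x) { y })
[Figure: source $T with multiplicity annotations (b annotated 2, c annotated 3), shown next to its expansion into duplicate subtrees; the answer, c annotated 7, is equivalent (≡) to the bag c, c, c, c, c, c, c.]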
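The big-union formula translates almost literally into code. The sketch below is parameterized by the semiring operations and then run with K = N. The source collection `T` is a hypothetical one chosen to be consistent with the answer c^7 (a subtree b[ c^3 ] with multiplicity 2 and a subtree b[ c ] with multiplicity 1); `big_union` and `children` are illustrative names.

```python
import operator

def big_union(e1, body, plus, times, zero):
    """Sum over the support of e1 of e1(a_i) · body(a_i)(y).

    e1 is a K-collection {element: annotation}; body maps each element a_i
    to another K-collection, playing the role of e2 with x bound to a_i.
    """
    out = {}
    for a, k in e1.items():
        for y, ky in body(a).items():
            out[y] = plus(out.get(y, zero), times(k, ky))
    return out

def children(t):
    """The K-collection of children of a tree (label, frozenset of pairs)."""
    return dict(t[1])

leaf_c = ("c", frozenset())
T = {("b", frozenset({(leaf_c, 3)})): 2,    # b[ c^3 ] with multiplicity 2
     ("b", frozenset({(leaf_c, 1)})): 1}    # b[ c ]   with multiplicity 1

def nat_big_union(e1, body):
    return big_union(e1, body, operator.add, operator.mul, 0)

# ∪(x ∈ T) ∪(y ∈ x) { y }
answer = nat_big_union(T, lambda x: nat_big_union(children(x), lambda y: {y: 1}))
```

The result maps the leaf c to 2·3 + 1·1, matching the seven copies of c in the answer.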
XPath Descendant Operator Uses srt
• //* applied to forest $T translates to
∪(x ∈ $T) π_1((srt(b, s) . f) x)
where
f := let self = Tree(b, ∪(x ∈ s) {π_2(x)}) in
     let matches = ∪(x ∈ s) π_1(x) in
     (matches ∪ {self}, self)
• //a, similar to above
Application: Security Clearances
• Data annotated with clearance levels from total order C : P < C < S < T < 0
• Joint use of data (·) requires access to both (max of clearances); alternative use of data (+) requires access to either (min of clearances)
• (C, min, max, 0, P) is a commutative semiring

Source, $S: a^P[ b^C[ d^C ], c^C[ d^S, e^T ] ]
Query: element p { $S/*/* }
Answer: p[ d^min(max(P,C,C), max(P,C,S)), e^max(P,C,T) ] = p[ d^C, e^T ]
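The clearance computation can be checked with a short sketch. The encoding is an assumption made for illustration: levels are the characters of the string "PCST0" in increasing order, a K-collection is a dict from trees to levels, and `node` and `child_step` are hypothetical helpers. The code returns the collection of children of p rather than the element p itself.

```python
LEVELS = "PCST0"                       # total order P < C < S < T < 0
rank = LEVELS.index

def c_plus(a, b):
    return min(a, b, key=rank)         # alternative use: either clearance suffices

def c_times(a, b):
    return max(a, b, key=rank)         # joint use: both clearances are needed

ZERO, ONE = "0", "P"                   # the semiring's 0 and 1

def node(label, *children):
    """Tree = (label, frozenset of (subtree, clearance) pairs)."""
    return (label, frozenset(children))

def child_step(coll):
    """One XPath /* step on a K-collection {tree: clearance}."""
    out = {}
    for t, k in coll.items():
        for child, kc in t[1]:
            out[child] = c_plus(out.get(child, ZERO), c_times(k, kc))
    return out

d, e = node("d"), node("e")
S = {node("a",
          (node("b", (d, "C")), "C"),
          (node("c", (d, "S"), (e, "T")), "C")): "P"}

answer = child_step(child_step(S))     # $S/*/*
```

The element d is reachable along two paths, so its clearance is the min of the two path maxima, which is C; e is reachable along one path only and ends up at T.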
Security Condition: Non-Interference
• For any given clearance level (e.g., C), want the following diagram to commute:

a^P[ b^C[ d^C ], c^C[ d^S, e^T ] ]   --query-->   p^P[ d^C, e^T ]
        | erase > C                                    | erase > C
        v                                              v
a^P[ b^C[ d^C ], c^C ]               --query-->   p^P[ d^C ]
Application: Incomplete XML
• Data annotated with Boolean expressions; tree T represents set of possible worlds Mod(T)
[Figure: the tree T, whose c nodes carry the Boolean annotations y1, y2 and y3, and the set Mod(T) of its 7 possible worlds.]
Correctness: Possible Worlds
• For every incomplete tree T, and every UXQuery query q, want this diagram to commute:

T       --Mod-->   Mod(T)
| q                 | q
v                   v
q(T)    --Mod-->   q(Mod(T)) = Mod(q(T))
Commutation with Homomorphisms
• Theorem: Let h : K1 → K2 be a semiring homomorphism. Then for any UXQuery query q, and for any K1-UXML document D, we have h(q(D)) = q(h(D)).
• Ex: security clearances
h_c : C → C   h_c(k) := if k ≤ c then k else 0
• Ex: incomplete dbs
ν : B → B   Eval_ν : PosBool(B) → B
• Ex: duplicate elimination
δ : N → B   δ(k) := if k = 0 then ⊥ else ⊤
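The theorem can be spot-checked on small documents. The sketch below is not a proof and makes several assumptions: the only query exercised is the child step /* (once or twice), documents are dicts from trees to annotations, and `Semiring`, `node`, `child_step`, `query` and `apply_hom` are hypothetical names. Applying a homomorphism to a document rewrites every annotation, merges subtrees that become equal and drops those mapped to 0.

```python
from collections import namedtuple

Semiring = namedtuple("Semiring", "plus times zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
LEVELS = "PCST0"                                   # P < C < S < T < 0
CLEAR = Semiring(lambda a, b: min(a, b, key=LEVELS.index),
                 lambda a, b: max(a, b, key=LEVELS.index), "0", "P")

def node(label, *children):
    """Tree = (label, frozenset of (subtree, annotation) pairs)."""
    return (label, frozenset(children))

def child_step(K, coll):
    """XPath /* on a K-collection {tree: annotation}."""
    out = {}
    for t, k in coll.items():
        for child, kc in t[1]:
            out[child] = K.plus(out.get(child, K.zero), K.times(k, kc))
    return out

def query(K, coll):
    return child_step(K, child_step(K, coll))      # $S/*/*

def apply_hom(h, K2, coll):
    """Apply h to every annotation, recursively; merge equal images, drop zeros."""
    out = {}
    for (label, kids), k in coll.items():
        image = (label, frozenset(apply_hom(h, K2, dict(kids)).items()))
        if h(k) != K2.zero:
            out[image] = K2.plus(out.get(image, K2.zero), h(k))
    return out

c, d, e = node("c"), node("d"), node("e")

# Duplicate elimination δ : N → B on a bag-annotated forest.
delta = lambda k: k != 0
bags = {node("b", (c, 3)): 2, node("b", (c, 1)): 1}
lhs_bags = apply_hom(delta, BOOL, child_step(NAT, bags))
rhs_bags = child_step(BOOL, apply_hom(delta, BOOL, bags))

# Erasing everything above clearance C: h_C(k) = k if k ≤ C else 0.
h_C = lambda k: k if LEVELS.index(k) <= LEVELS.index("C") else "0"
S = {node("a", (node("b", (d, "C")), "C"),
          (node("c", (d, "S"), (e, "T")), "C")): "P"}
lhs_sec = apply_hom(h_C, CLEAR, query(CLEAR, S))
rhs_sec = query(CLEAR, apply_hom(h_C, CLEAR, S))
```

In both cases the two sides agree: querying and then applying the homomorphism gives the same collection as applying the homomorphism first and querying afterwards.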
Related Work
• Bag semantics for NRC [Libkin&Wong 97]
• Incomplete XML [Kanza+ 99, Abiteboul+ 06]
• Probabilistic XML [Nierman&Jagadish 02, van Keulen+ 05, Abit.&Senellart 06, Sen.&Abit. 07, Hung+ 07]
• XML provenance [Buneman+ 01]
• NRC provenance [Hidders+ 07]
• Semiring-annotated XPath [Grahne+ 07]
• Negation, expressiveness of RA_K [Geerts&Poggi 08]
Conclusion
• We showed how to annotate unordered XML trees (complex values) with values from a commutative semiring K, and propagate those annotations in queries for a large, positive fragment of XQuery (NRC + srt)
• We saw novel applications in security and incomplete dbs, made possible by a fundamental property of our framework, commutation with homomorphisms
Future Work
• Practical applications based on framework
– Security clearances
– Jointly recording provenance, security, multiplicities, uncertainty, etc. (product of semirings is also a semiring!)
• Query optimization: containment/equivalence wrt annotated semantics depends on K
– In paper, we show K-equivalence for UXQuery is the same as B-equivalence when K is a distributive lattice
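The remark that a product of semirings is again a semiring can be illustrated with a componentwise construction. This is a sketch only; `Semiring` and `product_semiring` are hypothetical names, and the example pairs a multiplicity (N) with a clearance level.

```python
from collections import namedtuple

Semiring = namedtuple("Semiring", "plus times zero one")
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
LEVELS = "PCST0"                                   # P < C < S < T < 0
CLEAR = Semiring(lambda a, b: min(a, b, key=LEVELS.index),
                 lambda a, b: max(a, b, key=LEVELS.index), "0", "P")

def product_semiring(K1, K2):
    """Componentwise product: annotations are pairs (k1, k2)."""
    return Semiring(
        lambda a, b: (K1.plus(a[0], b[0]), K2.plus(a[1], b[1])),
        lambda a, b: (K1.times(a[0], b[0]), K2.times(a[1], b[1])),
        (K1.zero, K2.zero),
        (K1.one, K2.one))

K = product_semiring(NAT, CLEAR)
joint = K.times((2, "C"), (3, "S"))    # multiplicities multiply, clearances take max
either = K.plus((2, "C"), (3, "S"))    # multiplicities add, clearances take min
```

A single query evaluation over `K` then tracks both kinds of annotation at once.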
K-UXQuery Syntax