Post on 31-Dec-2015
1
Queries with Difference on Probabilistic Databases
Sanjeev Khanna, Sudeepa Roy, Val Tannen
University of Pennsylvania
2
Probabilistic Databases
• To model and query uncertain data (sensor networks, information extraction, …)
• Possible worlds model
– Each possible world W is a standard database instance and has a probability P[W]
– Compact representation D assuming independence
D (tuple: probability)
R:  a1: 0.3   a2: 0.4   a3: 0.6
S:  (a1, b1): 0.1   (a2, b1): 0.5   (a3, b2): 0.2   (a3, b3): 0.1
T:  b1: 0.7   b2: 0.8   b3: 0.4
3
Query Semantics
• Query semantics on probabilistic databases:
– Apply the query q on each possible world W
– Add up the probabilities of the worlds that give the same query answer A:
  P[q(D) = A] = ∑_{W : q(W) = A} P[W]
• Goal: efficiently evaluate P[q(D) = A]
– Data complexity; want time polynomial in n = |D|
• Can we always efficiently compute P[q(D)]?
– No, in general it is #P-hard
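The possible-worlds semantics can be sketched directly as a brute-force enumeration. This is only an illustration of the definition, not an efficient method: it takes time exponential in |D|. The tuple probabilities are those of the example instance D on the previous slide, the Boolean query q() :- R(x), S(x, y), T(y) is the one the talk uses on the next slide, and the function names are hypothetical.

```python
from itertools import product

# Tuple-independent instance D: tuple -> marginal probability.
R = {("a1",): 0.3, ("a2",): 0.4, ("a3",): 0.6}
S = {("a1", "b1"): 0.1, ("a2", "b1"): 0.5, ("a3", "b2"): 0.2, ("a3", "b3"): 0.1}
T = {("b1",): 0.7, ("b2",): 0.8, ("b3",): 0.4}


def q(r, s, t):
    """Boolean query q() :- R(x), S(x, y), T(y) on one ordinary instance."""
    return any((x,) in r and (y,) in t for (x, y) in s)


def answer_distribution(query, relations):
    """Return {answer A: P[q(D) = A]} by summing P[W] over all possible worlds W."""
    tuples = [(i, t, p) for i, rel in enumerate(relations) for t, p in rel.items()]
    dist = {}
    for present in product([False, True], repeat=len(tuples)):
        world = [set() for _ in relations]
        weight = 1.0  # P[W] is a product because tuples are independent
        for keep, (i, t, p) in zip(present, tuples):
            weight *= p if keep else 1.0 - p
            if keep:
                world[i].add(t)
        answer = query(*world)
        dist[answer] = dist.get(answer, 0.0) + weight
    return dist


dist = answer_distribution(q, [R, S, T])
print(dist[True])  # about 0.2547
```

With 10 tuples this visits 2^10 = 1024 worlds, which is why the goal above asks for time polynomial in n.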
4
Query Answering in Two Steps
Boolean query q() :- R(x), S(x, y), T(y)
D (tuple: event variable, probability)
R:  a1: w1, 0.3   a2: w2, 0.4   a3: w3, 0.6
S:  (a1, b1): v1, 0.1   (a2, b1): v2, 0.5   (a3, b2): v3, 0.2   (a3, b3): v4, 0.1
T:  b1: u1, 0.7   b2: u2, 0.8   b3: u3, 0.4
Introduce event variables for tuples (P[w1] = 0.3, …)
Step 1 (easy): Boolean provenance for q(D) [FR ’97, Z ’97]
  f = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
Step 2 (hard): Compute P[q(D)] = P[f] given P[w1] = 0.3, P[v1] = 0.1, …
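The two steps can be sketched as follows. Step 1 builds the provenance as a DNF with one minterm per joining triple of tuples. Step 2 is shown with Shannon expansion, which is exact but exponential in the worst case; it is a stand-in chosen for illustration, not the method of the talk. Probabilities are those in the tables for R, S and T, and the function names are hypothetical.

```python
# Tuples mapped to their event variables, and the variable probabilities.
R = {"a1": "w1", "a2": "w2", "a3": "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2", ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {"b1": "u1", "b2": "u2", "b3": "u3"}
P = {"w1": 0.3, "w2": 0.4, "w3": 0.6,
     "v1": 0.1, "v2": 0.5, "v3": 0.2, "v4": 0.1,
     "u1": 0.7, "u2": 0.8, "u3": 0.4}


def provenance():
    """Step 1: one minterm per satisfying join of R(x), S(x, y), T(y)."""
    return [frozenset({R[x], v, T[y]})
            for (x, y), v in S.items() if x in R and y in T]


def probability(dnf, prob):
    """Step 2: P[f] = P[x] * P[f | x = 1] + (1 - P[x]) * P[f | x = 0]."""
    if not dnf:
        return 0.0                       # empty disjunction is false
    if any(len(term) == 0 for term in dnf):
        return 1.0                       # an empty minterm is true
    x = next(iter(dnf[0]))
    x_true = [term - {x} for term in dnf]            # x = 1: drop x everywhere
    x_false = [term for term in dnf if x not in term]  # x = 0: terms with x vanish
    return (prob[x] * probability(x_true, prob)
            + (1.0 - prob[x]) * probability(x_false, prob))


f = provenance()
print(probability(f, P))  # about 0.2547
```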
5
Probability Computation for Positive Queries
• Dichotomy result [DS ’04, ’07; DSS ’10]: given q as input, we can efficiently decide if q is
– Safe: safe plans run in poly-time on all instances, or
– Unsafe: #P-hard, e.g. q() :- R(x), S(x, y), T(y)
• Instance-by-instance approach [SDG ’10, RPT ’11]
– Both q and D are given as input
– Poly-time algorithm to compute P[q(D)] for special cases even if q is unsafe
What about queries with difference?
6
Boolean Provenances for Difference
R:  (b1, c1): u1   (b2, c2): u2   (b1, c3): u3
S:  (c1, a1): v1   (c1, a2): v2   (c2, a3): v3   (c3, a2): v4
T:  a1: w1   a2: w2   a3: w3
q1(x) :- R(x, y), S(y, z)
  b1: u1(v1 + v2) + u3v4
  b2: u2v3
q2(x) :- R(x, y), S(y, z), T(z)
  b1: u1v1w1 + u1v2w2 + u3v4w2
  b2: u2v3w3
q = q1 – q2
  b1: (u1(v1 + v2) + u3v4) . ¬(u1v1w1 + u1v2w2 + u3v4w2)
  b2: (u2v3) . ¬(u2v3w3)
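As a small check of the difference provenance, the sketch below builds the DNF provenances of q1 and q2 for an answer x and evaluates f1 AND NOT f2 in a given world. The relation contents are as read from this slide's tables; the function names and the representation of a world as the set of true event variables are assumptions made for the example.

```python
from itertools import product

# Tuples mapped to their event variables.
R = {("b1", "c1"): "u1", ("b2", "c2"): "u2", ("b1", "c3"): "u3"}
S = {("c1", "a1"): "v1", ("c1", "a2"): "v2", ("c2", "a3"): "v3", ("c3", "a2"): "v4"}
T = {"a1": "w1", "a2": "w2", "a3": "w3"}


def q1_prov(x):
    """DNF provenance of x in q1(x) :- R(x, y), S(y, z)."""
    return [{u, v} for (rx, y), u in R.items() if rx == x
            for (sy, z), v in S.items() if sy == y]


def q2_prov(x):
    """DNF provenance of x in q2(x) :- R(x, y), S(y, z), T(z)."""
    return [{u, v, T[z]} for (rx, y), u in R.items() if rx == x
            for (sy, z), v in S.items() if sy == y and z in T]


def holds(dnf, true_vars):
    """A DNF holds if all variables of some minterm are true."""
    return any(term <= true_vars for term in dnf)


def diff_holds(x, true_vars):
    """Provenance of x in q1 - q2: f1 AND NOT f2."""
    return holds(q1_prov(x), true_vars) and not holds(q2_prov(x), true_vars)


def diff_direct(x, true_vars):
    """Evaluate q1 - q2 directly on the world selected by true_vars."""
    r = {t for t, e in R.items() if e in true_vars}
    s = {t for t, e in S.items() if e in true_vars}
    t_rel = {t for t, e in T.items() if e in true_vars}
    in_q1 = any(rx == x and y == sy for (rx, y) in r for (sy, z) in s)
    in_q2 = any(rx == x and y == sy and z in t_rel
                for (rx, y) in r for (sy, z) in s)
    return in_q1 and not in_q2


print(diff_holds("b1", {"u1", "v1"}))        # True: b1 is in q1 but not in q2
print(diff_holds("b1", {"u1", "v1", "w1"}))  # False: b1 is now also in q2
```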
7
Previous Work on Difference
FOR ’11
– Framework for exact and approximate probability computation
– But, no guarantee of polynomial running time
In fact, we show in this paper that with difference, in some cases no approximation exists (unless NP = RP)
How far can we go with difference in poly-time?
8
A Quick Comparison
With difference
• DNF of boolean provenance may be exponential in n
• P[q(D)] may not be approximable
Without difference
• DNF of boolean provenance is poly-size (n^|q|)
• P[q(D)] is always approximable (FPRAS)
FPRAS: Fully Polynomial Randomized Approximation Scheme. Compute, with probability ≥ ¾ and in time polynomial in n and 1/ε, a value p ∈ [(1-ε) P[q(D)], (1+ε) P[q(D)]]
9
Our Results
• We study queries of the form q1 – q2 and their generalization
– FPRAS: If q1 is any UCQ, q2 is any safe CQ-
– #P-hardness: Even if both q1 and q2 are safe CQ-
– Inapproximability: Even if q1 is the trivial TRUE query and q2 is a UCQ
• Our FPRAS result extends to a larger class of queries of which q1 – q2 is a special case
[CQ- : Conjunctive queries without self-joins]
10
Difference Rank
• Define difference rank δ(q) of query q recursively
– δ(R) = 0
– δ(q1 - q2) = δ(q1) + δ(q2) + 1
• R – S : rank 1
– δ(q1 ⋈ q2) = δ(q1) + δ(q2)
• (R – S1) ⋈ (R - S2) : rank 2
• (R - T1) ⋈ T2 : rank 1
– δ(q1 ∪ q2) = max(δ(q1), δ(q2))
• (R – S1) ⋈ (R - S2) ∪ (R - T1) ⋈ T2 : rank 2
– Select, project: rank remains the same
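The recursive definition translates directly into code. The sketch below uses a small, hypothetical query-tree representation (the class names are not from the talk) and reproduces the ranks of the slide's examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rel:           # base relation
    name: str


@dataclass(frozen=True)
class Diff:          # q1 - q2
    left: object
    right: object


@dataclass(frozen=True)
class Join:          # q1 join q2
    left: object
    right: object


@dataclass(frozen=True)
class UnionQ:        # q1 union q2
    left: object
    right: object


@dataclass(frozen=True)
class Unary:         # select or project
    child: object


def rank(q):
    """Difference rank of a query tree, following the recursive definition."""
    if isinstance(q, Rel):
        return 0
    if isinstance(q, Diff):
        return rank(q.left) + rank(q.right) + 1
    if isinstance(q, Join):
        return rank(q.left) + rank(q.right)
    if isinstance(q, UnionQ):
        return max(rank(q.left), rank(q.right))
    if isinstance(q, Unary):
        return rank(q.child)
    raise TypeError(f"unknown query node: {q!r}")


R, S1, S2, T1, T2 = (Rel(n) for n in ("R", "S1", "S2", "T1", "T2"))
two_diffs = Join(Diff(R, S1), Diff(R, S2))
one_diff = Join(Diff(R, T1), T2)
print(rank(two_diffs), rank(one_diff), rank(UnionQ(two_diffs, one_diff)))  # 2 1 2
```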
11
FPRAS for queries q with difference rank δ(q) = 1, given some conditions hold
(inapproximable for δ(q) = 1 in general)
12
Steps in FPRAS
• Step 1: Compute boolean provenance of q(D) for any query q with difference rank δ(q) = 1
• Step 2: Write the boolean provenance in a “Probability Friendly Form” (if possible)
• Step 3: FPRAS inspired by Karp-Luby framework
13
Boolean Provenance for Queries q with Difference Rank δ(q) = 1
Lemma: For any q with δ(q) = 1, on any D, the provenance f of q(D) has the form
f = (DNF1) . ¬(DNF2) + (DNF3) . ¬(DNF4) + …
f is poly-size in n = |D| and poly-time computable
14
Probability Friendly Form (PFF)
If f is in PFF, we can approximate P[f] using the Karp-Luby framework
f = (DNF1) . ¬(DNF2) + (DNF3) . ¬(DNF4) + …
f = abc . (d-DNNF2) + bcd . (d-DNNF4) + …
f is in PFF if the negated DNFs can be written as poly-size d-DNNFs (next slide)
15
d-DNNF [Darwiche ’01, ’02; DM ’02]
deterministic Decomposable Negation Normal Form
No internal node can have negation (negation only at the leaves)
Deterministic: under any assignment, at most one child of a +-node is satisfied
Decomposable: children of a .-node do not share variables
[Figure: example d-DNNF over the variables u1, v1, v2]
In general, can be a DAG
Probability can be computed in time linear in the size of the d-DNNF
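The linear-time computation is a single bottom-up pass: a +-node adds the probabilities of its children (they are mutually exclusive) and a .-node multiplies them (they are independent). The sketch below assumes a hypothetical tuple encoding of nodes and uses as its example a d-DNNF for ¬(u1 (v1 + v2)), namely ¬u1 + u1 . ¬v1 . ¬v2; the example and the probabilities are illustrative, not taken from the slide's figure.

```python
def probability(node, p, memo=None):
    """Probability of a d-DNNF node; memoised so shared DAG nodes are visited once.

    Nodes are ("lit", var, is_positive), ("and", [children]) or ("or", [children]).
    """
    memo = {} if memo is None else memo
    key = id(node)
    if key not in memo:
        kind = node[0]
        if kind == "lit":
            _, var, positive = node
            memo[key] = p[var] if positive else 1.0 - p[var]
        elif kind == "and":
            # Decomposable: children share no variables, so they are independent.
            result = 1.0
            for child in node[1]:
                result *= probability(child, p, memo)
            memo[key] = result
        elif kind == "or":
            # Deterministic: children are mutually exclusive, so probabilities add.
            memo[key] = sum(probability(child, p, memo) for child in node[1])
        else:
            raise ValueError(f"unknown node kind: {kind!r}")
    return memo[key]


# d-DNNF for NOT(u1 (v1 + v2)):  NOT u1  +  u1 . NOT v1 . NOT v2
NOT_F = ("or", [
    ("lit", "u1", False),
    ("and", [("lit", "u1", True), ("lit", "v1", False), ("lit", "v2", False)]),
])
P = {"u1": 0.7, "v1": 0.1, "v2": 0.5}
print(probability(NOT_F, P))  # about 0.615 = 0.3 + 0.7 * 0.9 * 0.5
```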
16
Karp-Luby Framework
[KL ’83] Given boolean expression DAGs F1, …, Fm
f = F1 + F2 + ... + Fm
P[f] can be approximated in poly-time (in m, n) if, for every i, in poly-time
(1) P[Fi] can be computed
(2) it can be checked if a given assignment satisfies Fi
(3) a random satisfying assignment of Fi can be sampled
Well-studied special case: DNF counting, where F1, …, Fm are DNF minterms: f = xyz + xyw + wuv
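For the DNF special case, the framework can be sketched as the estimator below, where each Fi is a positive minterm. All three conditions are easy here: P[Fi] is a product, checking satisfaction is a subset test, and sampling a world conditioned on Fi just fixes the variables of Fi to true. The function name, the fixed sample count and the uniform probability 1/2 are assumptions for the example; a real FPRAS would derive the number of samples from ε and the failure probability.

```python
import random


def karp_luby(terms, prob, samples, rng):
    """Estimate P[F1 + ... + Fm] for positive minterms Fi (frozensets of variables)."""
    variables = sorted(set().union(*terms))
    # Condition (1): P[Fi] is a product because the variables are independent.
    weights = []
    for term in terms:
        w = 1.0
        for v in term:
            w *= prob[v]
        weights.append(w)
    total = sum(weights)
    hits = 0
    for _ in range(samples):
        # Pick a minterm Fi with probability P[Fi] / total.
        i = rng.choices(range(len(terms)), weights=weights)[0]
        # Condition (3): sample a world conditioned on Fi being true.
        world = {v for v in variables if v in terms[i] or rng.random() < prob[v]}
        # Condition (2): count the world only for the first minterm it satisfies,
        # so that each satisfying world is counted exactly once overall.
        first = next(j for j, t in enumerate(terms) if t <= world)
        if first == i:
            hits += 1
    return total * hits / samples


# f = xyz + xyw + wuv with every variable true with probability 1/2.
F = [frozenset("xyz"), frozenset("xyw"), frozenset("wuv")]
HALF = {v: 0.5 for v in "xyzwuv"}
estimate = karp_luby(F, HALF, samples=20000, rng=random.Random(0))
print(estimate)  # close to the exact value 18/64 = 0.28125
```

The estimator is unbiased because each satisfying world is attributed to exactly one minterm, and the ratio being estimated is at least 1/m, which is what keeps the number of samples polynomial.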
17
Conditions (1) and (2) hold for PFF
Product of minterm and d-DNNF is another d-DNNF
[Figure: example showing the product of the minterm u1 w2 z1 with a d-DNNF over v1, v2 (annotation: w2 = 1, z1 = 1)]
18
Condition (3) also holds
Lemma: Generating a random satisfying assignment on a d-DNNF can be done in poly-time
1. Process in reverse topological order
2. Generate a random satisfying assignment bottom up
[Figure: a d-DNNF over v1, v2 annotated with the partial assignments built at each node (for example v1 = 0, v2 = 0 and v1 = 1, v2 = 0); a +-node picks one of its children's assignments at random]
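One way to realise the bottom-up sampling is sketched below: every node returns its probability together with a random partial assignment that satisfies it. A .-node merges the assignments of its children, and a +-node picks one child's assignment with probability proportional to that child's probability. The node encoding, function names, example formula and probabilities are assumptions for illustration; variables left unassigned are drawn from their own probabilities afterwards. The sketch uses recursion on a tree, without the sharing a DAG would allow, and assumes the formula has positive probability.

```python
import random


def prob_and_sample(node, p, rng):
    """Return (P[node], random partial assignment satisfying node), bottom up.

    Nodes are ("lit", var, is_positive), ("and", [children]) or ("or", [children]).
    """
    kind = node[0]
    if kind == "lit":
        _, var, positive = node
        return (p[var] if positive else 1.0 - p[var]), {var: positive}
    results = [prob_and_sample(child, p, rng) for child in node[1]]
    if kind == "and":
        # Decomposable: child assignments use disjoint variables, so merge them.
        total, assignment = 1.0, {}
        for child_prob, child_assignment in results:
            total *= child_prob
            assignment.update(child_assignment)
        return total, assignment
    # "or" node, deterministic: choose one child in proportion to its probability.
    weights = [child_prob for child_prob, _ in results]
    _, chosen = rng.choices(results, weights=weights)[0]
    return sum(weights), chosen


def sample_world(node, p, rng):
    """Complete the partial assignment; unconstrained variables keep their prior."""
    _, world = prob_and_sample(node, p, rng)
    for var in p:
        if var not in world:
            world[var] = rng.random() < p[var]
    return world


# d-DNNF for NOT(u1 (v1 + v2)):  NOT u1  +  u1 . NOT v1 . NOT v2
NOT_F = ("or", [
    ("lit", "u1", False),
    ("and", [("lit", "u1", True), ("lit", "v1", False), ("lit", "v2", False)]),
])
P = {"u1": 0.7, "v1": 0.1, "v2": 0.5}
print(sample_world(NOT_F, P, random.Random(0)))
```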
19
Expressibility in PFF
So, if f is in PFF, we can approximate P[q(D)]
But, can we decide in poly-time if some sub-expressions of a boolean expression have poly-size d-DNNFs?
• Not known
• But, there are natural sufficient conditions that can be verified in poly-time
– If certain sub-queries are safe and hence generate read-once expressions [OH ’08]
– If sub-queries generate poly-size OBDDs [JS ’11]
– Extends to instance-by-instance approach (both q, D given)
21
#P-hardness: Steps in the proof
“Hard” query q = q1 – q2
– q1() :- R1(x, y1), R2(x, y2), R3(x, y3), R4(x, y4)
– q2() :- R1(x1, y), R2(x2, y), R3(x3, y), R4(x4, y)
Counting independent sets in 3-regular bipartite graphs (XZ ’06)
Counting edge covers in bipartite graphs of degree ≤ 4, where the edge set can be partitioned into 4 disjoint matchings
22
Other Related Work
– Semantics of probabilistic query answering
• Fuhr-Rollecke ’97, Zimanyi ’97
– Dichotomy of CQ-, CQ and UCQ queries
• Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
– Knowledge compilation techniques
• Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11
– Instance-by-instance approach
• Sen-Deshpande-Getoor ’10, Roy-Perduca-Tannen ’11
23
Conclusions and Future work
A step towards understanding the complexity of exact and approximate computation for queries with difference operations
Future work
– Dichotomy results that syntactically classify difference queries (similar to positive UCQ)?
– Extending FPRAS to queries with difference rank > 1?
– Experimental evaluation of our algorithms