Date posted: 14-Dec-2015
EVALUATING PROBABILISTIC QUERIES OVER UNCERTAIN MATCHING
IEEE INTL. CONFERENCE ON DATA ENGINEERING 2012
Reynold Cheng, Jian Gong, David Cheung, and Jiefeng Cheng
BACKGROUND: HIDDEN DATABASES
Source DB: DB instances, hidden from users
Query interface (e.g., web form)
Target query: Location = Central, Price < 5M, Size > 700 ft
BACKGROUND: SCHEMA MATCHING
S: (pname, email-addr, permanent-addr, current-addr)
T: (name, email, mailing-addr, home-addr, office-addr)
S is the source schema; T is the target schema
A correspondence links a source attribute to a target attribute
Schema matching (e.g., from COMA++)
Target query
BACKGROUND: SCHEMA MAPPING
S: (pname, email-addr, permanent-addr, current-addr)
T: (name, email, mailing-addr, home-addr, office-addr)
Mapping: a subset of matching
Target query
Source query
Many different mappings
Better if we can know their confidence!
PROBABILISTIC MAPPINGS
A set of h pairs (Mi, Pr(Mi)), where Pr(Mi) is the probability that mapping Mi exists [Gal06, DHY07, CGC10]
Querying on these mappings produces answers with confidence
Similarity scores
Bipartite matching on similarity scores
BASIC QUERY SOLUTION
Example
Target query: SELECT phone FROM Person WHERE addr='aaa'
m1: Source query: SELECT ophone FROM Person WHERE oaddr='aaa'
    Answers: ("123", 0.3), ("456", 0.3)
m2: Answers: ("123", 0.2), ("456", 0.2)
m1, m2 combined: ("123", 0.5), ("456", 0.5)
VARIANTS OF BASIC SOLUTIONS
Enhanced basic (or e-basic): groups identical source queries and evaluates only the distinct ones
  Much better than basic!
e-MQO: attempts to improve e-basic by applying multi-query optimization [ZLFL07] to the distinct source queries
  Experimentally worse than e-basic, since generating a good multi-query plan for many mappings is expensive
We use e-basic as the baseline for our new algorithms
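The grouping step of e-basic can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a target query of the simple form (select attribute, where attribute, constant), and the function names `rewrite` and `distinct_source_queries`, the attribute names, and the probabilities are all hypothetical.

```python
from collections import defaultdict

def rewrite(target_query, mapping):
    """Rewrite (select_attr, where_attr, constant) from target to source attributes."""
    select_attr, where_attr, constant = target_query
    return (mapping[select_attr], mapping[where_attr], constant)

def distinct_source_queries(target_query, mappings):
    """Group mappings whose rewritten source query is identical.

    mappings: list of (correspondences dict, probability).
    Returns {source_query: total probability of the mappings that produce it}.
    """
    pooled = defaultdict(float)
    for correspondences, prob in mappings:
        pooled[rewrite(target_query, correspondences)] += prob
    return dict(pooled)

# Hypothetical mappings: the first two differ only on an attribute the query never touches.
mappings = [
    ({"phone": "ophone", "addr": "oaddr", "name": "pname"}, 0.3),
    ({"phone": "ophone", "addr": "oaddr", "name": "nickname"}, 0.2),
    ({"phone": "hphone", "addr": "oaddr", "name": "pname"}, 0.5),
]
queries = distinct_source_queries(("phone", "addr", "aaa"), mappings)
# Three mappings, but only two distinct source queries need to be evaluated.
```

Each distinct source query is then evaluated once, and its answers carry the pooled probability of the group.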
CORRESPONDENCE OVERLAP
Probabilistic mappings can have many common correspondences
Q-sharing and O-sharing use this to improve query efficiency
QUERY-LEVEL SHARING (Q-SHARING)
If the source queries for mappings m1 and m2 are identical, only one query needs to be issued.
Target query: SELECT addr FROM Person WHERE phone='123'
Source query (m1 and m2): SELECT oaddr FROM Customer WHERE ophone='123'
Q-SHARING
Example
Target query: SELECT pname FROM Person WHERE addr='abc'
Partition the mappings
P1: {m1, m2}
P2: {m3, m4}
P3: {m5}
Only 3 out of 5 mappings are used for query reformulation.
Representative mappings: {m1, m3, m5}
Partition tree
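One way to obtain such partitions is to group mappings by the source attributes they assign to the target attributes the query actually uses. The sketch below assumes that reading; the correspondences, probabilities, and the helper name `q_sharing_partitions` are hypothetical, chosen only so that the result matches the partitions P1 to P3 above.

```python
from collections import defaultdict

def q_sharing_partitions(query_attrs, mappings):
    """Partition mappings by the source attributes used for the query's target attributes.

    mappings: {name: (correspondences dict, probability)}.
    Returns a list of (representative, pooled probability, member names).
    """
    groups = defaultdict(list)
    for name, (correspondences, prob) in mappings.items():
        signature = tuple(correspondences.get(a) for a in query_attrs)
        groups[signature].append((name, prob))
    # The first mapping of each partition acts as its representative.
    return [(members[0][0], sum(p for _, p in members), [n for n, _ in members])
            for members in groups.values()]

# Hypothetical correspondences for five mappings.
mappings = {
    "m1": ({"pname": "cname", "addr": "oaddr", "phone": "ophone"}, 0.3),
    "m2": ({"pname": "cname", "addr": "oaddr", "phone": "hphone"}, 0.2),
    "m3": ({"pname": "cname", "addr": "haddr", "phone": "ophone"}, 0.2),
    "m4": ({"pname": "cname", "addr": "haddr", "phone": "hphone"}, 0.2),
    "m5": ({"pname": "nickname", "addr": "haddr", "phone": "ophone"}, 0.1),
}
partitions = q_sharing_partitions(["pname", "addr"], mappings)
representatives = [rep for rep, _, _ in partitions]
```

Only the representatives are rewritten and evaluated; the mappings in a partition differ only on attributes the query does not mention.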
PROBLEM OF Q-SHARING
Given a target query, two mappings may share only some query operators, but not all.
Target query: SELECT addr FROM Person WHERE phone='123'
Q-sharing does not work!
O-SHARING
Share query operator evaluation for two mappings with the same correspondence
Target query: SELECT addr FROM Person WHERE phone='123'
• m2 and m3 share the selection condition
  1. Obtain tuples with ophone='123' for m2 and m3
  2. For m2, retrieve oaddr; for m3, retrieve haddr
O-SHARING: EXAMPLE
[Figure: the target query and the probabilistic mappings]
O-SHARING: EXAMPLE
An execution unit (e-unit) u1 captures the current status of a target query
1) Query plan
2) Mapping set
3) next-op
O-SHARING: EXAMPLE
Execution of an e-unit u1
For m1 and m2, addr → oaddr: process m1 and m2 in a batch
For m3, m4, and m5, addr → haddr: process m3-m5 in a batch
Select the next operator (details later)
O-SHARING: EXAMPLE
New e-units u2 and u3 are generated
The process goes on until no more e-units are produced
Mapping set of u1 is partitioned
Intermediate results are generated
OPERATOR SELECTION
Method 1: Randomly select the next operator
OPERATOR SELECTION
Method 2: SNF (or Smallest Number of Partition First) chooses a target operator that leads to the fewest mapping partitions
Mapped to 3 source attributes, i.e., 3 mapping partitions
4 mapping partitions
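A minimal sketch of SNF, under the simplifying assumption that each candidate operator references a single target attribute, so the number of mapping partitions equals the number of distinct source attributes that attribute is mapped to. The operator names, attribute names, and mappings are hypothetical.

```python
def count_partitions(target_attr, mappings):
    """Number of distinct source attributes that target_attr is mapped to."""
    return len({m.get(target_attr) for m in mappings})

def snf_next_operator(operators, mappings):
    """operators: {operator name: target attribute it references}."""
    return min(operators, key=lambda op: count_partitions(operators[op], mappings))

# Hypothetical mappings: addr has 3 distinct source attributes, phone has 4.
mappings = [
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "oaddr", "phone": "hphone"},
    {"addr": "haddr", "phone": "mphone"},
    {"addr": "haddr", "phone": "fphone"},
    {"addr": "maddr", "phone": "ophone"},
]
operators = {"project_addr": "addr", "select_phone": "phone"}
chosen = snf_next_operator(operators, mappings)
```

With these inputs the operator on addr is chosen, because it splits the mapping set into fewer partitions.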
OPERATOR SELECTION
Method 3: SEF (or Smallest Entropy First) chooses a target operator that leads to the lowest entropy
[Figure: mapping partitions for addr and phone]
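A sketch of how SEF could rank operators. The slide does not give the entropy formula, so this assumes the Shannon entropy of the partition-size distribution (the share of mappings that falls into each partition); the operator names and mappings are hypothetical.

```python
import math
from collections import Counter

def partition_entropy(target_attr, mappings):
    """Shannon entropy (base 2) of the partition sizes induced by target_attr."""
    sizes = Counter(m.get(target_attr) for m in mappings)
    n = len(mappings)
    return -sum((c / n) * math.log2(c / n) for c in sizes.values())

def sef_next_operator(operators, mappings):
    """operators: {operator name: target attribute it references}."""
    return min(operators, key=lambda op: partition_entropy(operators[op], mappings))

# Hypothetical mappings: both attributes yield 2 partitions (a tie under SNF),
# but phone splits the mappings 4/1 while addr splits them 2/3.
mappings = [
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "oaddr", "phone": "ophone"},
    {"addr": "haddr", "phone": "ophone"},
    {"addr": "haddr", "phone": "ophone"},
    {"addr": "haddr", "phone": "hphone"},
]
operators = {"project_addr": "addr", "select_phone": "phone"}
chosen = sef_next_operator(operators, mappings)
```

Under this assumption a skewed split has lower entropy, so SEF prefers the operator whose execution is shared by the largest group of mappings.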
ADVANTAGES OF O-SHARING
Interleaves query rewriting and operator execution
May not have to consider the whole target query for every mapping, due to empty intermediate result
The current o-sharing solution supports selection, projection, join, MIN, MAX, and SUM operators
PROBABILISTIC TOP-K QUERIES
Query semantics
  Returns the k tuples whose probabilities are the highest, among those with non-zero probabilities
Our new algorithm can prune non-answer tuples
  Avoids evaluating the actual probabilities of all answer tuples
  This is done by partially expanding the e-units
EXPERIMENTAL SETUP
Schemas and data are about purchase orders
Source schema: TPC-H
  100MB database, with 1M tuples
  46 attributes, 8 relations
3 target schemas provided by COMA++
  Excel, Noris, Paragon
  48, 66, and 69 attributes
Schema matcher: COMA++
10 target queries: selection, projection, join, COUNT, and SUM
100 probabilistic mappings
SEF is used for o-sharing
QUERY PERFORMANCE
EFFECT OF QUERY SIZE
OPERATOR SELECTION STRATEGIES
SNF is much better than Random, and SEF further improves SNF.
TOP-K QUERY PERFORMANCE
Top-k evaluation can improve query performance, especially when the query returns a large set of results.
RELATED WORK
Schema matching
  Uncertainty is not considered in most existing work
Probabilistic schema mapping [Gal06, DHY07]
Uncertain XML schema matching [CGC10, GCC11]
  Computing and storing probabilistic XML mappings
  Evaluating probabilistic XML queries
CONCLUSIONS
Probabilistic mappings can be used to handle the uncertainty of schema matching
To efficiently handle by-table semantics, we examine q-sharing and o-sharing
  They exploit the correspondences of mappings that share a query or its query operators
We plan to study the use of o-sharing on other queries (e.g., set difference and recursive queries)
REFERENCES
[CGC10] R. Cheng, J. Gong, and D. Cheung, “Managing uncertainty in XML schema matching”, in ICDE, 2010.
[GCC11] J. Gong, R. Cheng, and D. Cheung, “Efficient Management of Uncertainty in XML Schema Matching”, in VLDBJ, 2011.
[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002.
[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004.
[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002.
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006.
[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007.
[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007.
[Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol 16, 1986.
[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001.
[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008.
[ZLFL07] J. Zhou, P. Larson, J. Freytag, and W. Lehner, “Efficient exploitation of similar subexpressions for query processing”, in SIGMOD, 2007.
PROBABILISTIC MAPPINGS
We assume that the schema matching is represented by h probabilistic mappings.
The probability of each mapping is obtained by using a bipartite matching algorithm on the similarity scores of correspondences [CGC10]
GENERATING THE TOP-H MAPPINGS
Use an h-maximum bipartite matching algorithm to find the h mappings with the highest scores
See [CGC10]
• Image elements are inserted to model the absence of correspondence
We use approach 3
PROBABILISTIC MAPPINGS
Find the h mappings with the highest scores, using a bipartite matching algorithm [CGC10]
For each Mi, obtain Pr(Mi) by normalizing Mi's score with the sum of the scores of the h mappings
Score / total
30/100
20/100
20/100
20/100
10/100
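The normalization above amounts to dividing each mapping score by the sum of the h scores. A small sketch using the five scores from the slide; the function name `normalize_scores` is hypothetical.

```python
def normalize_scores(scores):
    """Turn the scores of the h best mappings into probabilities that sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# Scores of the five mappings shown on the slide (total = 100).
probabilities = normalize_scores([30, 20, 20, 20, 10])
```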
TARGET QUERIES
BASIC SOLUTIONS
e-basic is the best among the simple solutions. We thus compare it with q-sharing and o-sharing.
OVERLAP OF MAPPINGS
Overlap: fraction of the no. of common correspondences over the no. of distinct correspondences
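The slide does not say over which mappings the fraction is taken. As an assumption, the sketch below computes it for a pair of mappings, treating each mapping as a set of (target attribute, source attribute) correspondences; the mappings themselves are hypothetical.

```python
def overlap_ratio(m1, m2):
    """Common correspondences divided by distinct correspondences of two mappings."""
    c1, c2 = set(m1.items()), set(m2.items())
    return len(c1 & c2) / len(c1 | c2)

# Hypothetical mappings that agree on phone and name but differ on addr.
m1 = {"phone": "ophone", "addr": "oaddr", "name": "pname"}
m2 = {"phone": "ophone", "addr": "haddr", "name": "pname"}
ratio = overlap_ratio(m1, m2)  # 2 common / 4 distinct correspondences
```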
OPERATOR SELECTION STRATEGIES
PROBABILISTIC QUERY EVALUATION
2 ways to reformulate and evaluate a target query:
By-table semantics
  All tuples in the source tables use the same possible mapping
By-tuple semantics
  Each tuple in the source tables may use a different possible mapping
Details in Appendix B
BY-TABLE SEMANTICS
All tuples in the source tables use the same possible mapping
The query answers from mapping Mi have probability Pr(Mi)
If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability Pr(t) = Pr(M1) + Pr(M2)
BY-TABLE SEMANTICS
Example
Target query: SELECT mailing-addr FROM T
When m1 is considered, the query answer: (Sunnyvale, 0.5)
When m2 is considered, the query answer: (Sunnyvale, 0.4), (Mountain View, 0.4)
When m3 is considered, the query answer: (alice@, 0.1), (bob@, 0.1)
Final query answer (with duplicates removed): (Sunnyvale, 0.9), (Mountain View, 0.4), (alice@, 0.1), (bob@, 0.1)
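The final answer in this example can be reproduced by summing, for each tuple, the probabilities of the mappings that return it. A minimal sketch using the per-mapping answers from the slide; the function name `by_table_answers` is hypothetical.

```python
from collections import defaultdict

def by_table_answers(answers_per_mapping):
    """answers_per_mapping: list of (mapping probability, answer tuples of that mapping)."""
    result = defaultdict(float)
    for prob, tuples in answers_per_mapping:
        for t in set(tuples):  # duplicate removal within one mapping
            result[t] += prob
    return dict(result)

final = by_table_answers([
    (0.5, ["Sunnyvale"]),                   # m1
    (0.4, ["Sunnyvale", "Mountain View"]),  # m2
    (0.1, ["alice@", "bob@"]),              # m3
])
```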
BASIC SOLUTION
Evaluate the target query for every possible mapping Mi
The query answers from the mapping Mi have the probability Pr(Mi)
If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability Pr(t) = Pr(M1) + Pr(M2)
Very expensive if the no. of mappings, |M|, is huge
5 ALGORITHMS FOR COMPARISON
Basic: consider each possible mapping separately
e-basic: first clusters the identical source queries, then evaluates this set of distinct source queries
e-MQO: improves e-basic by applying multi-query optimization to the set of distinct source queries
Our solutions: q-sharing and o-sharing
TARGET QUERY EVALUATION
5 algorithms for querying probabilistic mappings: basic, e-basic, e-MQO, q-sharing, o-sharing
Q-SHARING
Source query: SELECT oaddr FROM Customer WHERE ophone='123'
Query answer for m1 and m2: (aaa, 0.5)
Q-SHARING: ALGORITHM
Partition the possible mappings, and find representative mappings
Evaluate the basic solution on the representative mappings
Probability of a query answer evaluated by a representative mapping
EFFICIENT MAPPING PARTITIONING
Partitioning is needed for every possible mapping
A partition tree supports q-sharing by efficiently classifying possible mappings
  A non-leaf node is a target attribute
  An edge is a source attribute
  A leaf node is a partition of mappings
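A partition tree of this shape can be sketched with nested dictionaries: one level per target attribute of the query, edges keyed by source attribute, and lists of mapping names as leaves. This is an illustrative data structure under those assumptions, not the authors' implementation; the helper names and correspondences are hypothetical.

```python
def insert_mapping(tree, query_attrs, name, correspondences):
    """Follow one edge per target attribute; the last edge leads to a leaf partition."""
    node = tree
    for attr in query_attrs[:-1]:
        node = node.setdefault(correspondences.get(attr), {})
    node.setdefault(correspondences.get(query_attrs[-1]), []).append(name)

def leaves(tree):
    """Yield the partitions (lists of mapping names) stored at the leaves."""
    for child in tree.values():
        if isinstance(child, list):
            yield child
        else:
            yield from leaves(child)

# Hypothetical correspondences for the target attributes pname and addr.
mappings = {
    "m1": {"pname": "cname", "addr": "oaddr"},
    "m2": {"pname": "cname", "addr": "oaddr"},
    "m3": {"pname": "cname", "addr": "haddr"},
    "m4": {"pname": "cname", "addr": "haddr"},
    "m5": {"pname": "nickname", "addr": "haddr"},
}
tree = {}
for name, correspondences in mappings.items():
    insert_mapping(tree, ["pname", "addr"], name, correspondences)
partitions = list(leaves(tree))
```

Each mapping is classified in one root-to-leaf walk, so its cost depends on the number of target attributes in the query rather than on the number of mappings already inserted.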
PARTITION TREE
Example
Target query: SELECT pname FROM Person WHERE addr='abc'
[Figures: the partition tree in its initial state, after m1 is processed, after m2 is processed, after m3 and m4 are processed, and in its final state]
O-SHARING ALGORITHM
Repeat
  Select one query operator from the target query
    A target operator is chosen under some operator selection strategy
  The operator is reformulated to a source operator and executed
Until all target query operators are consumed
Our current solution supports selection, projection, join, MIN, MAX, and SUM operators
O-SHARING: FRAMEWORK
An e-unit (or execution unit) captures the current status of a target query. It contains:
  Query plan, which organizes the query operators not yet executed and the intermediate query results
  Mapping set, the mappings that are used to answer the query
  next-op, the query operator in the e-unit that will be executed in the next step
O-SHARING: FRAMEWORK
U-trace: a tree of e-units that have not yet been considered
Initial e-unit u1
After executing the next-op of u1 with m1-m2, an empty result is returned
Another e-unit u3 is generated with intermediate answer R3
After executing the next-op of u3 with m3-m4, u4 is generated
u4 contains only one operator. After execution, two sets of results, R6 and R7, are returned
u3's next-op is executed over m5, which leads to e-unit u5
u5's next-op is executed and returns an empty result
All e-units are executed. The query evaluation is complete.
O-SHARING: DETAIL
The operator selection strategy
  Correctness: not all operators are allowed to be chosen, e.g., a selection operator with one attribute
  Effectiveness: reduce the overall query evaluation cost by maximizing the sharing of operator computation
PROBABILISTIC TOP-K QUERY
Top-k query evaluation example
Assume the following probabilities, k = 1
Node  Prob.
u2    0.5
u6    0.2
u7    0.2
u5    0.1
Heap status during the query evaluation
Node  Prob.  Heap                                    LB   UB
u2    0.5    -                                       0    0.5
u6    0.2    ta(0.2,0.5)                             0.2  0.3
u7    0.2    ta(0.4,0.5), tb(0.2,0.3), tc(0.2,0.3)   0.4  0.1
u5    0.1    -                                       -    -
LB: the lower bound probability of the tuple with the k-th highest probability in the heap
UB: the maximal probability of any tuple not in the heap
Each tuple has a lower and an upper bound on its probability
ta can be returned as the top-1 answer without visiting u5, since:
1) the upper bounds of tb and tc are lower than ta's lower bound, and
2) UB < ta's lower bound
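The bound bookkeeping in this example can be sketched as follows. The sketch assumes that the node probabilities sum to 1 and that each visited node reports the set of tuples it returns; the per-node tuple sets below are inferred from the heap column and are therefore an assumption, as is the function name `top_k`.

```python
def top_k(nodes, k):
    """nodes: list of (probability, set of answer tuples), visited in the given order.

    Returns (top-k tuples, number of nodes visited before stopping).
    """
    lower = {}        # tuple -> probability mass confirmed so far (its lower bound)
    remaining = 1.0   # probability mass of the nodes not yet visited (UB)
    visited = 0
    for prob, tuples in nodes:
        visited += 1
        remaining -= prob
        for t in tuples:
            lower[t] = lower.get(t, 0.0) + prob
        ranked = sorted(lower, key=lower.get, reverse=True)
        if len(ranked) >= k:
            lb = lower[ranked[k - 1]]  # LB: k-th highest lower bound
            # Upper bound of a seen tuple = its lower bound + unvisited mass.
            others_ub = max((lower[t] + remaining for t in ranked[k:]), default=0.0)
            # Stop once neither a seen tuple nor an unseen one can overtake the top k.
            if others_ub < lb and remaining < lb:
                return ranked[:k], visited
    ranked = sorted(lower, key=lower.get, reverse=True)
    return ranked[:k], visited

nodes = [
    (0.5, set()),               # u2: empty result
    (0.2, {"ta"}),              # u6
    (0.2, {"ta", "tb", "tc"}),  # u7
    (0.1, set()),               # u5: never visited for k = 1
]
answer, visited = top_k(nodes, k=1)
```

With these inputs the loop stops after the third node, matching the slide's observation that u5 need not be visited.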
O-SHARING: DETAIL
The o-sharing algorithm
1) find representative mappings, and initialize the u-trace
2) evaluate the query with the u-trace
3) aggregate the query results and return
Processing an e-unit:
Case 1: no more operators: return the query answers
Case 2: an empty intermediate result is found: return empty query answers
Case 3: no early stop
  a. find next-op
  b. partition the mapping set
  c. for each subset of mappings:
     - compute next-op
     - generate a new e-unit
     - recursively process the e-unit
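The three cases can be illustrated with a small recursive sketch. It is a simplification under several assumptions: a single in-memory source table, only selection and projection operators, operators taken in the given order instead of by SNF or SEF, and no explicit u-trace. The function names, the table contents, and the mappings are hypothetical.

```python
from collections import defaultdict

def apply_op(op, source_attr, rows):
    """Execute one target operator after rewriting its attribute to source_attr."""
    if op[0] == "select":    # ("select", target_attr, constant)
        return [r for r in rows if r.get(source_attr) == op[2]]
    if op[0] == "project":   # ("project", target_attr)
        return [{op[1]: r.get(source_attr)} for r in rows]
    raise ValueError(f"unsupported operator: {op[0]}")

def o_sharing(ops, mappings, rows, answers, stats):
    """ops: remaining target operators; mappings: list of (name, correspondences, prob)."""
    if not rows:             # Case 2: empty intermediate result, stop early
        return
    if not ops:              # Case 1: no more operators, report the answers
        prob = sum(p for _, _, p in mappings)
        for t in {tuple(sorted(r.items())) for r in rows}:
            answers[t] += prob
        return
    op, rest = ops[0], ops[1:]           # Case 3a: next-op (first operator here)
    groups = defaultdict(list)           # Case 3b: partition the mapping set
    for m in mappings:
        groups[m[1].get(op[1])].append(m)
    for source_attr, subset in groups.items():   # Case 3c
        stats["executed"] += 1                   # one execution shared by the subset
        o_sharing(rest, subset, apply_op(op, source_attr, rows), answers, stats)

# Hypothetical source table and mappings.
rows = [{"ophone": "123", "hphone": "999", "oaddr": "aaa", "haddr": "bbb"}]
mappings = [
    ("m1", {"phone": "hphone", "addr": "oaddr"}, 0.5),
    ("m2", {"phone": "ophone", "addr": "oaddr"}, 0.3),
    ("m3", {"phone": "ophone", "addr": "haddr"}, 0.2),
]
# Target query: SELECT addr FROM Person WHERE phone='123'
ops = [("select", "phone", "123"), ("project", "addr")]
answers, stats = defaultdict(float), {"executed": 0}
o_sharing(ops, mappings, rows, answers, stats)
```

Here m2 and m3 share one execution of the selection, and m1 stops after its selection returns nothing, so four operator executions are needed instead of the six that evaluating every operator for every mapping would take.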
FUTURE WORK
How to handle complex and aggregate queries in o-sharing?
  e.g., set difference, recursive queries, subqueries
Can we do better if we also consider the selectivity information of operators?
How about other kinds of schemas?
  e.g., XML, XMARK