EVALUATING PROBABILISTIC QUERIES OVER UNCERTAIN MATCHING IEEE INTL. CONFERENCE ON DATA ENGINEERING 2012 Reynold Cheng, Jian Gong, David Cheung, and Jiefeng Cheng
Transcript
Page 1:

EVALUATING PROBABILISTIC QUERIES OVER UNCERTAIN MATCHING

IEEE INTL. CONFERENCE ON DATA ENGINEERING 2012

Reynold Cheng, Jian Gong, David Cheung, and Jiefeng Cheng

Page 2:

BACKGROUND: HIDDEN DATABASES

Source DB: DB instances, hidden from users

Query interface (e.g., web form)

Target query: Location = Central, Price < 5M, Size > 700 ft

Page 3:

BACKGROUND: SCHEMA MATCHING

Source schema S: (pname, email-addr, permanent-addr, current-addr)

Target schema T: (name, email, mailing-addr, home-addr, office-addr)

Correspondence: a link between a source attribute and a target attribute

Schema matching (e.g., from COMA++)

Target Query

Page 4:

BACKGROUND: SCHEMA MAPPING

S: (pname, email-addr, permanent-addr, current-addr)

T: (name, email, mailing-addr, home-addr, office-addr)

Mapping: a subset of matching

Target Query

Source Query

Many different mappings

Better if we can know their confidence!

Page 5:

PROBABILISTIC MAPPINGS

A set of h pairs (Mi, Pr(Mi)), where Pr(Mi) is the probability that mapping Mi exists [Gal06, DHY07, CGC10]

Querying on these mappings produces answers with confidence

Similarity score

Bipartite matching on similarity scores

Page 6:

BASIC QUERY SOLUTION

Example

Target query: SELECT phone FROM Person WHERE addr='aaa'

m1: Source query: SELECT ophone FROM Person WHERE oaddr='aaa'

"123", 0.3
"456", 0.3

Page 7:

BASIC QUERY SOLUTION

Example

Target query: SELECT phone FROM Person WHERE addr='aaa'

m1: Source query: SELECT ophone FROM Person WHERE oaddr='aaa'

"123", 0.3
"456", 0.3

m2:

"123", 0.2
"456", 0.2

Page 8:

BASIC QUERY SOLUTION

Example

Target query: SELECT phone FROM Person WHERE addr='aaa'

m1, m2:

"123", 0.5
"456", 0.5

Page 9:

VARIANTS OF BASIC SOLUTIONS

Enhanced basic (or e-basic): groups identical source queries, and evaluates the distinct ones
• Much better than basic!

e-MQO: attempts to improve e-basic by applying multi-query optimization [ZLFL07] on distinct source queries
• Experimentally worse than e-basic, since generating a good multi-query plan for lots of mappings is expensive

We use e-basic to compare with our new algorithms

Page 10:

CORRESPONDENCE OVERLAP

Probabilistic mappings can have many common correspondences

Q-sharing and O-sharing use this overlap to improve query efficiency

Page 11:

QUERY-LEVEL SHARING (Q-SHARING)

If the source queries for mappings m1 and m2 are identical, only one query needs to be issued.

Target query: SELECT addr FROM Person WHERE phone='123'

Source query (m1 and m2): SELECT oaddr FROM Customer WHERE ophone='123'

Page 12:

Q-SHARING

Example

Target query: SELECT pname FROM Person WHERE addr='abc'

Partition the mappings

P1: {m1, m2}
P2: {m3, m4}
P3: {m5}

Representative mappings: {m1, m3, m5}

Only 3 out of 5 mappings are used for query reformulation.

Partition tree
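
The partitioning step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each mapping is a dict from target attributes to source attributes, and the concrete mappings and probabilities are hypothetical, since the slide's figure is not part of this transcript. Mappings that agree on every attribute the target query touches would be rewritten to the same source query, so they fall into one partition and their probabilities are summed.

```python
from collections import defaultdict

def partition_mappings(mappings, probs, query_attrs):
    """Group mappings that agree on every target attribute used by the query."""
    partitions = defaultdict(list)
    for name, mapping in mappings.items():
        # Key: the source attributes the query would be rewritten to.
        key = tuple(mapping.get(attr) for attr in query_attrs)
        partitions[key].append(name)
    # (representative, members, total probability) for each partition.
    return [(members[0], members, sum(probs[m] for m in members))
            for members in partitions.values()]

# Hypothetical mappings for: SELECT pname FROM Person WHERE addr='abc'
mappings = {
    "m1": {"pname": "cname", "addr": "oaddr", "phone": "ophone"},
    "m2": {"pname": "cname", "addr": "oaddr", "phone": "hphone"},
    "m3": {"pname": "cname", "addr": "haddr", "phone": "ophone"},
    "m4": {"pname": "cname", "addr": "haddr", "phone": "hphone"},
    "m5": {"pname": "oname", "addr": "haddr", "phone": "ophone"},
}
probs = {"m1": 0.3, "m2": 0.2, "m3": 0.2, "m4": 0.2, "m5": 0.1}  # assumed values

reps = partition_mappings(mappings, probs, ["pname", "addr"])
for representative, members, prob in reps:
    print(representative, members, round(prob, 2))
```

Only the representative of each partition needs to be reformulated and executed; the attribute phone is ignored because the query does not use it.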

Page 13:

PROBLEM OF Q-SHARING

Given a target query, two mappings may share only some query operators, but not all.

Target query: SELECT addr FROM Person WHERE phone='123'

Q-sharing does not work!

Page 14:

O-SHARING

Share query operator evaluation for two mappings with the same correspondence

Target query: SELECT addr FROM Person WHERE phone='123'

• m2 and m3 share the selection condition

• 1. Obtain tuples with ophone='123' for m2 and m3

• 2. For m2, retrieve oaddr; for m3, retrieve haddr
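
The two steps above can be sketched in Python. The source rows are hypothetical; only the attribute names ophone, oaddr, and haddr come from the slide. The point of the sketch is that the selection runs once and both mappings reuse its result.

```python
# Hypothetical source rows; attribute names follow the slide.
customer = [
    {"ophone": "123", "oaddr": "aaa", "haddr": "bbb"},
    {"ophone": "456", "oaddr": "ccc", "haddr": "ddd"},
]

# m2 and m3 both map phone to ophone, but map addr to different attributes.
addr_of = {"m2": "oaddr", "m3": "haddr"}

# Step 1: evaluate the shared selection once for both mappings.
selected = [row for row in customer if row["ophone"] == "123"]

# Step 2: apply each mapping's own projection to the shared intermediate result.
answers = {m: [row[attr] for row in selected] for m, attr in addr_of.items()}
print(answers)
```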

Page 15:

O-SHARING: EXAMPLE

Target query

Probabilistic mappings

Page 16:

O-SHARING: EXAMPLE

An execution unit (e-unit) u1 captures the current status of a target query:

1) Query plan

2) Mapping set

3) next-op

Page 17:

O-SHARING: EXAMPLE

Execution of an e-unit u1

Select next operator (details later)

For m1 and m2, addr maps to oaddr: process m1 and m2 in a batch

For m3, m4, and m5, addr maps to haddr: process m3-m5 in a batch

Page 18:

O-SHARING: EXAMPLE

New e-units u2 and u3 are generated

The process goes on until no more e-units are produced

Mapping set of u1 is partitioned

Intermediate results are generated

Page 19:

OPERATOR SELECTION

Method 1: Randomly select the next operator

Page 20:

OPERATOR SELECTION

Method 2: SNF (or Smallest Number of Partition First) chooses a target operator that leads to the fewest mapping partitions

One operator's attribute is mapped to 3 source attributes, i.e., 3 mapping partitions

The other operator leads to 4 mapping partitions
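
A minimal Python sketch of SNF, assuming each candidate operator uses a single target attribute and each mapping is a dict from target to source attributes. The mappings and operator labels below are hypothetical; they are chosen so that one attribute yields 3 partitions and the other 4, as in the slide.

```python
def count_partitions(mappings, target_attr):
    """Number of distinct source attributes that target_attr is mapped to."""
    return len({m.get(target_attr) for m in mappings.values()})

def select_snf(mappings, candidate_ops):
    """candidate_ops maps an operator label to the target attribute it uses."""
    return min(candidate_ops,
               key=lambda op: count_partitions(mappings, candidate_ops[op]))

# Hypothetical mappings: addr has 3 distinct images, phone has 4.
mappings = {
    "m1": {"addr": "oaddr", "phone": "ophone"},
    "m2": {"addr": "oaddr", "phone": "hphone"},
    "m3": {"addr": "haddr", "phone": "wphone"},
    "m4": {"addr": "haddr", "phone": "mphone"},
    "m5": {"addr": "waddr", "phone": "ophone"},
}
candidate_ops = {"project addr": "addr", "select phone": "phone"}
next_op = select_snf(mappings, candidate_ops)
print(next_op)
```

Fewer partitions mean that each execution of the chosen operator is shared by more mappings.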

Page 21:

OPERATOR SELECTION

Method 3: SEF (or Smallest Entropy First) chooses a target operator that leads to the lowest entropy

(Figure labels: addr, phone)
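
The transcript does not give the entropy formula, so the following Python sketch rests on an assumption: the entropy of an operator is taken to be the Shannon entropy of the fraction of mappings that fall into each partition it induces. The mappings are hypothetical. Under this assumption, SEF can separate two operators that SNF would rank equally.

```python
import math
from collections import Counter

def partition_entropy(mappings, target_attr):
    """Shannon entropy (bits) of the partition sizes induced by target_attr.

    Assumed definition; the slide does not spell out the formula.
    """
    counts = Counter(m.get(target_attr) for m in mappings.values())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_sef(mappings, candidate_attrs):
    """Pick the target attribute whose partitioning has the lowest entropy."""
    return min(candidate_attrs, key=lambda a: partition_entropy(mappings, a))

# Hypothetical mappings: both attributes give 2 partitions,
# but addr splits 2/2 while phone splits 3/1.
mappings = {
    "m1": {"addr": "oaddr", "phone": "ophone"},
    "m2": {"addr": "oaddr", "phone": "ophone"},
    "m3": {"addr": "haddr", "phone": "ophone"},
    "m4": {"addr": "haddr", "phone": "hphone"},
}
chosen = select_sef(mappings, ["addr", "phone"])
print(chosen, partition_entropy(mappings, "addr"), partition_entropy(mappings, "phone"))
```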

Page 22:

ADVANTAGES OF O-SHARING

Interleaves query rewriting and operator execution

May not have to consider the whole target query for every mapping, because evaluation can stop as soon as an intermediate result is empty

The current o-sharing solution supports selection, projection, join, MIN, MAX, and SUM operators

Page 23:

PROBABILISTIC TOP-K QUERIES

Query semantics
• Returns the k tuples whose probabilities are the highest, among those with non-zero probabilities

Our new algorithm can prune non-answer tuples
• Avoids evaluating the actual probabilities of all answer tuples
• This is done by partially expanding the e-units

Page 24:

EXPERIMENTAL SETUP

Schemas and data are about purchase orders

Source schema: TPC-H
• 100MB database, with 1M tuples
• 46 attributes, 8 relations

3 target schemas provided by COMA++
• Excel, Noris, Paragon
• 48, 66, and 69 attributes

Schema matcher: COMA++

10 target queries: selection, projection, join, COUNT, and SUM

100 probabilistic mappings

SEF is used for o-sharing

Page 25:

QUERY PERFORMANCE

Page 26:

EFFECT OF QUERY SIZE

Page 27:

OPERATOR SELECTION STRATEGIES

SNF is much better than Random, and SEF further improves on SNF.

Page 28:

TOP-K QUERY PERFORMANCE

Top-k query evaluation can improve query performance, especially when the query returns a large set of results.

Page 29:

RELATED WORK

Schema matching
• Uncertainty is not considered in most existing work
• Probabilistic schema mapping [Gal06, DHY07]

Uncertain XML schema matching [CGC10, GCC11]
• Computing and storing probabilistic XML mappings
• Evaluating probabilistic XML queries

Page 30:

CONCLUSIONS

Probabilistic mappings can be used to handle uncertainty of schema matching

To efficiently handle by-table semantics, we examine q-sharing and o-sharing
• They exploit the correspondences of mappings that share a query or its query operators

We plan to study the use of o-sharing on other queries (e.g., set difference and recursive queries)

Page 31:

Reynold Cheng (HKU)

URL: http://www.cs.hku.hk/~ckcheng

Email: [email protected]

THANK YOU!

Page 32:

REFERENCES

[CGC10] R. Cheng, J. Gong, and D. Cheung, “Managing uncertainty in XML schema matching”, in ICDE, 2010

[GCC11] J. Gong, R. Cheng, and D. Cheung, “Efficient Management of Uncertainty in XML Schema Matching”, in VLDBJ, 2011

[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002

[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004

[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002

[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006

[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007

[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007

[Murty86] Murty, “An algorithm for ranking all the assignment in increasing order of cost”, Operations Research, vol 16, 1986

[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001

[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008

[ZLFL07] J. Zhou, P. Larson, J. Freytag, and W. Lehner, “Efficient exploitation of similar subexpressions for query processing”, in SIGMOD, 2007

Page 33:

PROBABILISTIC MAPPINGS

We assume that the schema matching is represented by h probabilistic mappings.

The probability of each mapping is obtained by using a bipartite matching algorithm on the similarity scores of correspondences [CGC10]

Page 34:

GENERATING THE TOP-H MAPPINGS

Use an h-maximum bipartite matching algorithm to find the h mappings with the highest scores
• See [CGC10]

• Image elements are inserted to model the absence of correspondence

We use approach 3

Page 35:

PROBABILISTIC MAPPINGS

Find the h mappings with the highest scores, using a bipartite matching algorithm [CGC10]

For each Mi, obtain Pr(Mi) by normalizing Mi’s score with the sum of scores of the h mappings

Score / total
30 / 100
20 / 100
20 / 100
20 / 100
10 / 100
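
The normalization can be written out as a short Python sketch. The scores are the ones listed on the slide; the mapping names m1 to m5 are assumed labels, since the transcript does not say which score belongs to which mapping.

```python
# Scores from the slide; the names m1..m5 are assumed labels.
scores = {"m1": 30, "m2": 20, "m3": 20, "m4": 20, "m5": 10}

total = sum(scores.values())                          # 100
probs = {m: s / total for m, s in scores.items()}     # Pr(Mi) = score(Mi) / total
print(probs)
```

By construction the h probabilities sum to 1, so the retained top-h mappings are treated as the only possible ones.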

Page 36:

TARGET QUERIES

Page 37:

BASIC SOLUTIONS

e-basic is the best among the simple solutions. We thus compare it with q-sharing and o-sharing.

Page 38:

OVERLAP OF MAPPINGS

Overlap: the number of common correspondences divided by the number of distinct correspondences
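
One way to read this ratio, sketched in Python: treat each mapping as a set of (target attribute, source attribute) correspondences, and divide the size of their intersection by the size of their union. Whether the slide's figure computes the ratio pairwise or over a whole set of mappings is not stated; the function below simply uses whatever non-empty list of mappings it is given, and the two mappings are hypothetical.

```python
def overlap_ratio(mappings):
    """Common correspondences divided by distinct correspondences.

    `mappings` is a non-empty list of dicts (target attr -> source attr).
    """
    sets = [set(m.items()) for m in mappings]
    common = set.intersection(*sets)
    distinct = set.union(*sets)
    return len(common) / len(distinct) if distinct else 0.0

# Hypothetical mappings built from the attribute names on the schema slides.
ma = {"name": "pname", "email": "email-addr", "mailing-addr": "permanent-addr"}
mb = {"name": "pname", "email": "email-addr", "mailing-addr": "current-addr"}
ratio = overlap_ratio([ma, mb])
print(ratio)
```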

Page 39:

OPERATOR SELECTION STRATEGIES

Page 41:

PROBABILISTIC QUERY EVALUATION

2 ways to reformulate and evaluate a target query.

By-table semantic
• All tuples in source tables use the same possible mapping

By-tuple semantic
• Each tuple in source tables may use a different possible mapping

Details in Appendix B

Page 42:

BY-TABLE SEMANTIC

All tuples in source tables use the same possible mapping

The query answers from the mapping Mi have the probability Pr(Mi)

If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability Pr(t) = Pr(M1) + Pr(M2)

Page 43:

BY-TABLE SEMANTIC

Example

Target query: SELECT mailing-addr FROM T

When m1 is considered, the query answer:
Sunnyvale, 0.5

When m2 is considered, the query answer:
Sunnyvale, 0.4
Mountain View, 0.4

When m3 is considered, the query answer:
alice@, 0.1
bob@, 0.1

Final query answer (with duplicates removed):
Sunnyvale, 0.9
Mountain View, 0.4
alice@, 0.1
bob@, 0.1
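
The aggregation in this example can be reproduced with a short Python sketch. The per-mapping answers and probabilities are taken from the slide; the function name is illustrative. Each mapping contributes its probability to every distinct tuple it returns, and contributions for the same tuple are added.

```python
from collections import defaultdict

def by_table_answers(per_mapping_answers, probs):
    """Sum Pr(Mi) over all mappings Mi that return a given tuple."""
    result = defaultdict(float)
    for mapping, tuples in per_mapping_answers.items():
        for t in set(tuples):  # a mapping counts at most once per tuple
            result[t] += probs[mapping]
    return dict(result)

per_mapping_answers = {
    "m1": ["Sunnyvale"],
    "m2": ["Sunnyvale", "Mountain View"],
    "m3": ["alice@", "bob@"],
}
probs = {"m1": 0.5, "m2": 0.4, "m3": 0.1}

final = by_table_answers(per_mapping_answers, probs)
for answer, prob in final.items():
    print(answer, round(prob, 2))
```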

Page 44:

BASIC SOLUTION

Evaluate the target query for every possible mapping Mi

The query answers from the mapping Mi have the probability Pr(Mi)

If duplicate removal is enforced, then a tuple t returned by both M1 and M2 has probability Pr(t) = Pr(M1) + Pr(M2)

Very expensive if the no. of mappings, |M|, is huge


Page 48:

5 ALGORITHMS FOR COMPARISON

Basic: considers each possible mapping separately

e-basic: first clusters the identical source queries, then evaluates this set of distinct source queries

e-MQO: improves e-basic by applying multi-query optimization to the set of distinct source queries

Our solutions: q-sharing and o-sharing

Page 49:

TARGET QUERY EVALUATION

5 algorithms for querying probabilistic mappings:
• basic
• e-basic
• e-MQO
• Q-sharing
• O-sharing

Page 50:

Q-SHARING

Source query: SELECT oaddr FROM Customer WHERE ophone='123'

Query answer for m1 and m2: aaa, 0.5

Page 51:

Q-SHARING: ALGORITHM

Partition the possible mappings, and find representative mappings

Evaluate the basic solution on the representative mappings

Probability of a query answer evaluated by a representative mapping

Page 52:

EFFICIENT MAPPING PARTITIONING

Partitioning is needed for every possible mapping

A partition tree supports Q-sharing by efficiently classifying possible mappings
• A non-leaf node is a target attribute
• An edge is a source attribute
• A leaf node is a partition of mappings
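
A partition tree of this shape can be sketched with nested Python dicts. This is an illustration under stated assumptions, not the authors' data structure: each tree level stands for one target attribute of the query, each dict key is an edge (a source attribute), and each leaf is a list of mapping names. The mappings are hypothetical, and the query is assumed to use at least one attribute.

```python
def build_partition_tree(mappings, query_attrs):
    """Insert every mapping into a tree keyed by its source attributes.

    Level i of the tree corresponds to query_attrs[i] (a target attribute);
    edges are source attributes; leaves are lists of mapping names.
    """
    root = {}
    for name, mapping in mappings.items():
        node = root
        for attr in query_attrs[:-1]:
            node = node.setdefault(mapping.get(attr), {})
        node.setdefault(mapping.get(query_attrs[-1]), []).append(name)
    return root

# Hypothetical mappings for: SELECT pname FROM Person WHERE addr='abc'
mappings = {
    "m1": {"pname": "cname", "addr": "oaddr"},
    "m2": {"pname": "cname", "addr": "oaddr"},
    "m3": {"pname": "cname", "addr": "haddr"},
    "m4": {"pname": "cname", "addr": "haddr"},
    "m5": {"pname": "oname", "addr": "haddr"},
}
tree = build_partition_tree(mappings, ["pname", "addr"])
print(tree)
```

Each mapping is classified with one root-to-leaf walk, so it is never compared against the other mappings directly.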

Page 53:

PARTITION TREE (1)

Example

Target query: SELECT pname FROM Person WHERE addr='abc'

Initial state

Page 54:

PARTITION TREE

Example

Target query: SELECT pname FROM Person WHERE addr='abc'

After m1 is processed

Page 55:

PARTITION TREE

Example

Target query: SELECT pname FROM Person WHERE addr='abc'

After m2 is processed

Page 56:

PARTITION TREE

Example

Target query: SELECT pname FROM Person WHERE addr='abc'

After m3 and m4 are processed

Page 57:

PARTITION TREE

Example

Target query: SELECT pname FROM Person WHERE addr='abc'

Final state

Page 58:

O-SHARING ALGORITHM

Repeat
• Select one query operator from the target query
• A target operator is chosen under some operator selection strategy
• The operator is reformulated to a source operator and executed
Until all target query operators are consumed

Our current solution supports selection, projection, join, MIN, MAX, and SUM operators

Page 59:

O-SHARING: FRAMEWORK

An e-unit (or execution unit) captures the current status of a target query, which contains:
• Query plan, which organizes the query operators not yet executed and the intermediate query results
• Mapping set, the mappings that are used to answer the query, and
• The next-op, a query operator in the e-unit that will be executed in the next step

Page 60:

O-SHARING: FRAMEWORK

U-trace: a tree of e-units that have not yet been considered

Initial e-unit u1

After executing next-op of u1 with m1-m2, an empty result is returned

Another e-unit u3 is generated with intermediate answer R3

Page 61:

O-SHARING: FRAMEWORK

U-trace: a tree of e-units that have not yet been considered

After executing next-op of u3 with m3-m4, u4 is generated

Page 62:

O-SHARING: FRAMEWORK

U-trace: a tree of e-units that have not yet been considered

u4 contains only one operator. After execution, two sets of results R6 and R7 are returned

u3’s next-op is executed over m5, which leads to e-unit u5

Page 63:

O-SHARING: FRAMEWORK

U-trace: a tree of e-units that have not yet been considered

u5’s next-op is executed and returns an empty result

All e-units are executed. The query evaluation is complete.

Page 64:

O-SHARING: DETAIL

The operator selection strategy
• Correctness: not all operators are allowed to be chosen, e.g., a selection operator with one attribute
• Effectiveness: reduce the overall query evaluation cost by maximizing the sharing of computation of operators

Page 65:

PROBABILISTIC TOP-K QUERY

Top-k query evaluation example
• Assume the following probabilities, k = 1

Node  Prob.
u2    0.5
u6    0.2
u7    0.2
u5    0.1

Page 66:

PROBABILISTIC TOP-K QUERY

Top-k query evaluation example
• Heap status during the query evaluation

Node  Prob.  Heap  LB  UB
u2    0.5    -     0   0.5

LB: the lower bound probability of the tuple with the k-th highest probability in the heap

UB: the maximal probability of any tuple not in the heap

Page 67:

PROBABILISTIC TOP-K QUERY

Top-k query evaluation example
• Heap status during the query evaluation

Node  Prob.  Heap                                   LB   UB
u2    0.5    -                                      0    0.5
u6    0.2    ta(0.2,0.5)                            0.2  0.3
u7    0.2    ta(0.4,0.5), tb(0.2,0.3), tc(0.2,0.3)  0.4  0.1

Each tuple in the heap has a lower and an upper bound on its probability, written t(lower, upper)

Page 68:

PROBABILISTIC TOP-K QUERY

Top-k query evaluation example
• Heap status during the query evaluation

Node  Prob.  Heap                                   LB   UB
u2    0.5    -                                      0    0.5
u6    0.2    ta(0.2,0.5)                            0.2  0.3
u7    0.2    ta(0.4,0.5), tb(0.2,0.3), tc(0.2,0.3)  0.4  0.1
u5    0.1    -                                      -    -

ta can be returned as the top-1 answer without visiting u5, since:
1) the upper probabilities of tb and tc are lower than ta’s lower probability, and
2) UB < ta’s lower probability
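
The stopping test in this example can be sketched in Python. The function name and the dict layout are illustrative assumptions; the bounds and UB values are the ones shown in the heap table. The test accepts the current top-k tuples once no other tuple, seen or unseen, can overtake the k-th lower bound.

```python
def can_stop(bounds, unseen_ub, k):
    """Return (stop, top_k) for tuples with known probability bounds.

    bounds: dict tuple_id -> (lower, upper) probability bounds in the heap
    unseen_ub: UB, the maximal probability of any tuple not in the heap
    """
    ranked = sorted(bounds, key=lambda t: bounds[t][0], reverse=True)
    if len(ranked) < k:
        return False, []
    top, rest = ranked[:k], ranked[k:]
    kth_lower = bounds[top[-1]][0]          # LB in the slide's table
    others_beaten = all(bounds[t][1] < kth_lower for t in rest)
    return (others_beaten and unseen_ub < kth_lower), top

# State after u6: only ta is known, and unseen tuples may still reach 0.3.
after_u6 = can_stop({"ta": (0.2, 0.5)}, 0.3, k=1)

# State after u7: matches the last complete row of the table.
after_u7 = can_stop({"ta": (0.4, 0.5), "tb": (0.2, 0.3), "tc": (0.2, 0.3)}, 0.1, k=1)
print(after_u6, after_u7)
```

After u6 the test fails because UB (0.3) is not below LB (0.2); after u7 it succeeds, so u5 never has to be expanded.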

Page 69:

O-SHARING: DETAIL

The o-sharing algorithm

1) find representative mappings, and initialize u-trace

2) query evaluation with u-trace

3) aggregate query results and return

Page 70:

O-SHARING: DETAIL

The o-sharing algorithm

Case 1: no more operators, return query answers

Case 2: empty intermediate result is found, return empty query answers

Case 3: no early stop
a. find next-op
b. partition the mapping set
c. for each subset of mappings:
 - compute next-op
 - generate a new e-unit
 - recursively process the e-unit
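
The three cases can be sketched as a recursive Python function. This is a simplified illustration under several assumptions, not the authors' algorithm: only selection and projection are handled, the plan is a list of selections followed by one projection, next-op is simply the first remaining operator (a stand-in for Random, SNF, or SEF), and all data, mappings, and probabilities are hypothetical.

```python
def run_e_unit(rows, ops, names, mappings, probs, answers):
    """Process one e-unit: intermediate rows, remaining ops, mapping set."""
    if not rows:                       # Case 2: empty intermediate result
        return
    if not ops:                        # Case 1: no more operators
        weight = sum(probs[n] for n in names)
        for value in set(rows):        # rows are projected values here
            answers[value] = answers.get(value, 0.0) + weight
        return
    (kind, attr, const), rest = ops[0], ops[1:]     # Case 3a: next-op
    groups = {}
    for n in names:                    # Case 3b: partition the mapping set
        groups.setdefault(mappings[n][attr], []).append(n)
    for src, members in groups.items():             # Case 3c
        if kind == "select":
            out = [r for r in rows if r[src] == const]
        else:                          # "project"
            out = [r[src] for r in rows]
        run_e_unit(out, rest, members, mappings, probs, answers)

# Hypothetical source data and mappings.
rows = [
    {"ophone": "123", "hphone": "999", "oaddr": "aaa", "haddr": "bbb"},
    {"ophone": "456", "hphone": "888", "oaddr": "ccc", "haddr": "ddd"},
]
mappings = {
    "m1": {"phone": "hphone", "addr": "oaddr"},
    "m2": {"phone": "ophone", "addr": "oaddr"},
    "m3": {"phone": "ophone", "addr": "haddr"},
}
probs = {"m1": 0.5, "m2": 0.3, "m3": 0.2}

# Target query: SELECT addr FROM Person WHERE phone='123'
ops = [("select", "phone", "123"), ("project", "addr", None)]
answers = {}
run_e_unit(rows, ops, list(mappings), mappings, probs, answers)
print(answers)
```

In this run m1 stops after the selection because its intermediate result is empty, while m2 and m3 share one execution of the selection before their projections diverge.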

Page 71:

FUTURE WORK

How to handle complex and aggregate queries in o-sharing?
• e.g., set difference, recursive queries, subqueries

Can we do better if we also consider the selectivity information of operators?

How about other kinds of schemas?
• e.g., XML, XMARK