Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs

Date post: 10-May-2015
Upload: wagner-andreas
Description:
Many databases today are text-rich, comprising not only structured, but also textual data. Querying such databases involves predicates matching structured data combined with string predicates featuring textual constraints. Based on selectivity estimates for these predicates, query processing as well as other tasks that can be solved through such queries can be optimized. Existing work on selectivity estimation focuses either on string or on structured query predicates alone. Further, probabilistic models proposed to incorporate dependencies between predicates are focused on the relational setting. In this work, we propose a template-based probabilistic model, which enables selectivity estimation for general graph-structured data. Our probabilistic model allows dependencies between structured data and its text-rich parts to be captured. With this general probabilistic solution, BN+, selectivity estimations can be obtained for queries over text-rich graph-structured data, which may contain structured and string predicates (hybrid queries). In our experiments on real-world data, we show that capturing dependencies between structured and textual data in this way greatly improves the accuracy of selectivity estimates without compromising the efficiency.
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Institute of Applied Informatics and Formal Description Methods (AIFB) Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs Andreas Wagner, Veli Bicer, and Duc Thanh Tran EDBT/ICDT’13
Transcript
Page 1: Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu

Institute of Applied Informatics and Formal Description Methods (AIFB)

Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs

Andreas Wagner, Veli Bicer, and Duc Thanh Tran

EDBT/ICDT’13

Introduction and Motivation

Selectivity Estimation for Text-Rich Data Graphs

Evaluation Results

INTRODUCTION & MOTIVATION

Text-Rich Data-Graphs and Hybrid Queries

Increasing amount of semi-structured, text-rich data:

Structured data with unstructured texts (e.g., [1]).

Unstructured data annotated with structured information (e.g., [2]).

[Figure: text-rich data spanning structure and text.]

[1] DBpedia – A Crystallization Point for the Web of Data.

[2] http://webdatacommons.org.

Text-Rich Data-Graphs and Hybrid Queries (2)

Focus of our work: conjunctive, hybrid queries.

[Figure: example query over ?x, ?y, and “keyword”, connected by a relation and an attribute. The relation is a structured query predicate; the attribute with its keyword is an unstructured, “string” (query) predicate.]

Problem Definition

Problem: Efficiently and effectively estimate the result set size for a conjunctive, hybrid query Q.

Decompose the problem: sel(Q) = R(Q) * P(Q) [5].

R(Q): upper-bound cardinality for the result set.

P(Q): probability of Q having a non-empty result.

Correlations between query predicates (data elements) make approximation of P(Q) hard.

[Figure: example query with relation, attribute, and “keyword” predicates; correlations exist among them.]

Correlations make estimations relying on “independence assumptions” error-prone!

[5] Selectivity estimation using probabilistic models.
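
The effect of the independence assumption on sel(Q) = R(Q) * P(Q) can be illustrated with a small worked example. The sketch below uses a made-up six-entity dataset and treats P(Q) as the fraction of entities that satisfy all predicates; the data, the helper names (`prob`, `is_movie`, `has_war`), and this simplified reading of P(Q) are assumptions for illustration only, not taken from the slides.

```python
# Hypothetical toy data: one (class, title keyword) pair per entity.
entities = [
    ("movie", "war"), ("movie", "war"), ("movie", "love"),
    ("person", "smith"), ("person", "jones"), ("person", "war"),
]

def prob(pred):
    """Fraction of entities that satisfy a predicate."""
    return sum(1 for e in entities if pred(e)) / len(entities)

def is_movie(e):
    return e[0] == "movie"

def has_war(e):
    return e[1] == "war"

r_q = len(entities)  # R(Q): upper bound on the result size

# P(Q) taken from the joint distribution of both predicates.
p_joint = prob(lambda e: is_movie(e) and has_war(e))
# P(Q) under the independence assumption: product of the marginals.
p_indep = prob(is_movie) * prob(has_war)

sel_joint = r_q * p_joint  # agrees with the true result size
sel_indep = r_q * p_indep  # too low here, because the predicates correlate

print(sel_joint, sel_indep)
```

In this toy case two entities actually match, the joint estimate recovers that number, and the independence-based estimate comes out at 1.5.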

Contributions

Previous work focuses either on structured or on unstructured query constraints.

Structured query constraints:

- Graph synopses [4]
- Join samples [3]
- PRMs [5,6]
- …

Unstructured query constraints:

- Fuzzy string matching [7,8]
- Extraction operators [9,10]
- …

We introduce a uniform model (BN+) for hybrid queries:

Instance of a template-based BN, well-suited for graph-structured data.

Extend the BN with string synopses for estimation of string predicates.

[Figure: example query with relation, attribute, and “keyword” predicates and the correlations among them.]

SELECTIVITY ESTIMATION FOR TEXT-RICH DATA GRAPHS

Preliminaries (1) – Data and Query Model

[Figure: example data graph and example query graph.]

Data: entity nodes, class nodes, attribute value nodes (each a bag of n-grams), attribute edges, relation edges.

Query: relation predicates, string predicates (“contains”), keyword nodes.

Preliminaries (2) – Bayesian Networks (1)

A Bayesian network (BN) provides a means for capturing joint probability distributions (e.g., P(Q)).

A BN comprises a network structure and parameters.

Nodes = random variables.

Edges = dependencies.

Recall: sel(Q) = R(Q) * P(Q)
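
To make "structure plus parameters" concrete, here is a minimal sketch of how a BN factorizes a joint distribution into one conditional probability per node given its parents. The three binary variables, the structure, and all probabilities are invented for illustration; they are not the network from the slides.

```python
from itertools import product

# Structure: each variable maps to its list of parents (hypothetical example).
parents = {"movie": [], "title_war": ["movie"], "directed_by": ["movie"]}

# Parameters: P(var = True | parent values), keyed by the tuple of parent values.
cpt = {
    "movie": {(): 0.5},
    "title_war": {(True,): 0.6, (False,): 0.1},
    "directed_by": {(True,): 0.9, (False,): 0.0},
}

def joint(assignment):
    """Joint probability as the product of P(node | parents) over all nodes."""
    p = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[x] for x in pa)
        p_true = cpt[var][key]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

p_all_true = joint({"movie": True, "title_war": True, "directed_by": True})

# Sanity check: the joint probabilities of all assignments sum to one.
names = list(parents)
total = sum(joint(dict(zip(names, vals)))
            for vals in product([True, False], repeat=len(names)))
print(p_all_true, total)
```

Storing three small tables instead of one table over all 2^3 assignments is what keeps the representation compact as the number of variables grows.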

Preliminaries (3) – Bayesian Networks (2)

A BN comprises a network structure and parameters.

Preliminaries (4) – Bayesian Networks (3)

Template-based BNs: templates and template factors [16].

A template is a function X(α1,…,αk), where each argument αi is a placeholder to be instantiated to obtain random variables.

Entity skeleton for Xperson = {p1, p2, p3}.

Xperson = {Xperson(p1), Xperson(p2), Xperson(p3)}.

Template factors define probability distributions shared by all instantiated random variables of a given template.

[Figure: template factor shared by all instantiations of XdirectedBy.]
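
The instantiation step can be sketched in a few lines. The `Template` class below is a hypothetical helper, not an API from the paper, and the factor values are made up; the point is only that every instantiated random variable refers to the same template factor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Template:
    name: str   # e.g. "person"
    arity: int  # number of placeholders α1..αk

    def instantiate(self, skeleton):
        """Return one random-variable name per entry of the entity skeleton."""
        for entry in skeleton:
            if len(entry) != self.arity:
                raise ValueError(f"expected {self.arity} entities, got {entry!r}")
        return [f"X{self.name}({', '.join(entry)})" for entry in skeleton]

x_person = Template("person", 1)
skeleton = [("p1",), ("p2",), ("p3",)]
variables = x_person.instantiate(skeleton)

# A single template factor (made-up numbers) is shared by all instantiations.
template_factor = {True: 0.3, False: 0.7}
factors = {v: template_factor for v in variables}

print(variables)
```

Because the factor object is shared rather than copied, the number of parameters depends on the number of templates, not on the number of entities.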

Template-Based BN for Graph-structured Data

We define a template for each …

Attribute a: Xa(α1). Entity skeleton: all entities having attribute a.

Class c: Xc(α1). Entity skeleton: all entities belonging to class c.

Relation r: Xr(α1,α2). Entity skeleton: all pairs of “source” and “target” entities having relation r.

Advantages:

Template representation is compact.

Dynamic partitioning based on entity skeletons.

Related work: PRMs [5,6], …

[Figure: template for attribute title, template for class person, template for relation spouse.]
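
The three kinds of entity skeletons can be derived in one pass over the data graph. The sketch below assumes the graph is given as (subject, label, object) triples with a `type` label for class membership; the triples, the `ATTRIBUTES` and `RELATIONS` sets, and the function name are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical data graph as (subject, edge label, object) triples.
triples = [
    ("m1", "type", "movie"),
    ("p1", "type", "person"),
    ("p2", "type", "person"),
    ("m1", "title", "war and peace"),
    ("p1", "spouse", "p2"),
]
ATTRIBUTES = {"title"}   # labels treated as attribute edges
RELATIONS = {"spouse"}   # labels treated as relation edges

def entity_skeletons(triples):
    """Map each (kind, name) template to its entity skeleton."""
    skeletons = defaultdict(set)
    for s, label, o in triples:
        if label == "type":
            skeletons[("class", o)].add((s,))          # entities of class o
        elif label in ATTRIBUTES:
            skeletons[("attribute", label)].add((s,))  # entities having the attribute
        elif label in RELATIONS:
            skeletons[("relation", label)].add((s, o))  # (source, target) pairs
    return dict(skeletons)

sk = entity_skeletons(triples)
for template, skeleton in sorted(sk.items()):
    print(template, sorted(skeleton))
```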

Integration of String Synopses (1)

Problem: Large sample space for attribute-based templates.

In order to compactly represent Ω, which is a large set of strings, we use string synopses (e.g., [7,8,9,10]):

- Fuzzy string matching [7,8]
- Extraction operators [9,10]
- …

Intuitively, for an attribute-based template, a string synopsis does the following:

a) Decide how to “compactly represent” Ω.

b) Compute probabilities for strings given its compact space.

Some synopses even allow “guessing” probabilities for unknown strings.

[Figure: entire n-gram space as Ω.]

Integration of String Synopses (2)

In this work, we use n-gram-based synopses [10].

Consider, e.g., the top-k n-gram synopsis [10].

Compute n-gram counts and store only the top-k n-grams.

Probabilities for known n-grams are exact.

Probabilities for omitted n-grams are estimated via heuristics based on the known n-grams.

[10] Selectivity estimation for extraction operators over text data.
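
A top-k n-gram synopsis along these lines might look as follows. This is a sketch under several assumptions: n-grams are word-level, counts are per attribute value (document frequency), and the fallback for omitted n-grams is simply the average count of the dropped n-grams. That fallback is a stand-in heuristic chosen for brevity, not the one used in [10]; the class and function names are likewise invented.

```python
from collections import Counter

def ngrams(text, n=1):
    """Word-level n-grams of a string (an assumption; other tokenizations work too)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class TopKNGramSynopsis:
    def __init__(self, values, k, n=1):
        self.total = len(values)
        counts = Counter()
        for value in values:
            counts.update(set(ngrams(value, n)))  # count each value once per n-gram
        self.top = dict(counts.most_common(k))    # exact counts for the top-k n-grams
        omitted = [c for g, c in counts.items() if g not in self.top]
        # Assumed heuristic: every omitted or unseen n-gram gets the average omitted count.
        self.fallback = sum(omitted) / len(omitted) if omitted else 0.0

    def probability(self, gram):
        """Estimated probability that an attribute value contains the n-gram."""
        if self.total == 0:
            return 0.0
        return self.top.get(gram, self.fallback) / self.total

titles = ["war and peace", "war games", "love and war", "peace talks"]
synopsis = TopKNGramSynopsis(titles, k=3)
print(synopsis.probability("war"), synopsis.probability("games"))
```

With k=3 the synopsis keeps "war", "and", and "peace" exactly, while "games", "love", and "talks" share one fallback estimate. Shrinking k trades memory for accuracy on the tail.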

Learning of BN+ (1): Structure (1)

Simplify structure via product approximation using trees [11,12].

Fixed Structure Assumption:

a) Two templates X1 and X2 are conditionally independent given their parents if they do not share a common entity in their skeletons.

b) Each class template Xc has no parent.

c) Each relation template Xr is independent of any class template Xc, given its parents.

A similar technique has recently been applied for “Lightweight PRMs” [6].

[11] Approximating discrete probability distributions with dependence trees.

Learning of BN+ (2): Structure (2)

Using a fixed structure allows structure learning to be decomposed:

“Local” correlations between attribute/class templates (e.g., Xmovie → Xtitle).

Reduce network structure to capture only the “most important” correlations via a maximal spanning forest.

Relation templates connect different trees.

Overall, the network structure is determined by “overlapping” entity skeletons and the fixed structure assumption.

[Figure: template model.]
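
The spanning-forest step follows the dependence-tree idea of [11]: weight each candidate edge by the mutual information of its two variables and keep a maximum-weight forest. The sketch below is a generic version of that technique on made-up binary observations, using Kruskal's algorithm with a union-find. Edge orientation and the fixed structure assumption are not modeled here, and all names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, a, b):
    """Empirical mutual information (in nats) between columns a and b."""
    n = len(rows)
    pa = Counter(r[a] for r in rows)
    pb = Counter(r[b] for r in rows)
    pab = Counter((r[a], r[b]) for r in rows)
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def maximum_spanning_forest(nodes, weighted_edges, min_weight=1e-9):
    """Kruskal's algorithm on descending weights; edges at or below min_weight are dropped."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    forest = []
    for w, a, b in sorted(weighted_edges, reverse=True):
        if w <= min_weight:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            forest.append((a, b))
    return forest

# Hypothetical observations: "movie" and "title_war" co-occur, "noise" is unrelated.
rows = [
    {"movie": 1, "title_war": 1, "noise": 0},
    {"movie": 1, "title_war": 1, "noise": 1},
    {"movie": 0, "title_war": 0, "noise": 0},
    {"movie": 0, "title_war": 0, "noise": 1},
]
nodes = ["movie", "title_war", "noise"]
edges = [(mutual_information(rows, a, b), a, b) for a, b in combinations(nodes, 2)]
forest = maximum_spanning_forest(nodes, edges)
print(forest)
```

Dropping zero-weight edges is what turns the result into a forest rather than a single tree: variables without measurable correlation stay disconnected.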

Learning of BN+ (3): Parameters

Based on the learned structure, parameters are learned via collecting sufficient statistics (i.e., frequency counts).

Speed up parameter learning via:

Using queries to obtain sufficient statistics.

Using caching during structure / parameter learning.
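
Parameter learning from frequency counts, with counts obtained through (cached) queries, can be sketched as below. The in-memory `ROWS` tuple stands in for the data graph and `count` stands in for a count query against it; both are assumptions for illustration, as is the use of `functools.lru_cache` for the caching.

```python
from functools import lru_cache

# Hypothetical (class, title n-gram) observations standing in for the data graph.
ROWS = (("movie", "war"), ("movie", "war"), ("movie", "love"), ("person", "war"))

@lru_cache(maxsize=None)
def count(cls=None, gram=None):
    """Stand-in for a count query; repeated queries are served from the cache."""
    return sum(
        1 for c, g in ROWS
        if (cls is None or c == cls) and (gram is None or g == gram)
    )

def conditional(gram, cls):
    """Maximum-likelihood estimate of P(gram | cls) from frequency counts."""
    denom = count(cls=cls)
    return count(cls=cls, gram=gram) / denom if denom else 0.0

p_war = conditional("war", "movie")
p_love = conditional("love", "movie")  # reuses the cached count for cls="movie"
print(p_war, p_love, count.cache_info())
```

The shared denominator is computed once, which is the kind of saving caching gives when many parameters condition on the same parent values.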

Estimating P(Q) using BN+ (1)

At runtime, templates are instantiated to construct a query-specific ground BN.

[Figure: the query and the template model yield a query-specific ground BN.]

Assignment is a string synopsis element.

Estimating P(Q) using BN+ (2)

Given a query-specific ground BN, we use inference to obtain the joint probability P(Q).

“Correction” using string synopsis.

Recall: sel(Q) = R(Q) * P(Q)

[Figure: query-specific ground BN.]
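
The inference step can be illustrated with brute-force enumeration over a tiny ground BN: variables fixed by the query act as evidence, all other variables are summed out, and the result is scaled by R(Q). The variables are simplified to binary here, and the network, its probabilities, and the value of R(Q) are invented; the synopsis-based "correction" is not modeled.

```python
from itertools import product

# Hypothetical ground BN for one query: variable -> parents.
parents = {
    "Xmovie(x)": [],
    "Xtitle(x)": ["Xmovie(x)"],
    "Xstarring(x,y)": ["Xmovie(x)"],
}
# P(var = True | parent values); all numbers are made up.
cpt = {
    "Xmovie(x)": {(): 0.5},
    "Xtitle(x)": {(True,): 0.6, (False,): 0.1},
    "Xstarring(x,y)": {(True,): 0.9, (False,): 0.0},
}

def joint(assignment):
    """Product of P(node | parents) for a full assignment."""
    p = 1.0
    for var, pa in parents.items():
        p_true = cpt[var][tuple(assignment[x] for x in pa)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

def probability(evidence):
    """Inference by enumeration: sum the joint over all unobserved variables."""
    hidden = [v for v in parents if v not in evidence]
    return sum(
        joint({**evidence, **dict(zip(hidden, values))})
        for values in product([True, False], repeat=len(hidden))
    )

r_q = 1000  # assumed upper-bound cardinality R(Q)
p_q = probability({"Xmovie(x)": True, "Xtitle(x)": True})
sel_q = r_q * p_q
print(p_q, sel_q)
```

Enumeration is exponential in the number of unobserved variables, so it only serves to show what is being computed; tree-shaped structures allow much cheaper exact inference.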

EVALUATION

Evaluation (1) – Setting

Data: IMDB [14] and DBLP [15].

IMDB featured more correlations than DBLP.

Different results between DBLP and IMDB show the “relative benefit”.

Queries: recent keyword search benchmarks [13,14]. We employed 54 DBLP queries and 46 IMDB queries.

Systems: We used n-gram-based string synopses [10]:

random samples of 1-grams,

top-k 1-grams,

stratified bloom filters on 1-grams.

String predicates were integrated via (1) independence (ind) or (2) conditional independence (bn) assumption.

[13] Spark2: Top-k keyword query in relational databases.

[14] A framework for evaluating database keyword search strategies.

Evaluation (2) – Setting (2)

Synopsis size: Overall synopsis size depends mainly on string synopsis size.

Synopsis sizes ∈ {2, 4, 20, 40} MByte of memory.

Metrics:

Efficiency: selectivity estimation time.

Effectiveness: multiplicative error [17].

[17] Independence is good: Dependency-based histogram synopses for high-dimensional data.
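
The slides do not spell out the effectiveness metric. Assuming the usual definition of multiplicative error as the ratio of the larger to the smaller of true and estimated selectivity, it could be computed as follows; the `floor` parameter, which guards against division by zero for empty results, is an added assumption.

```python
def multiplicative_error(true_sel, est_sel, floor=1.0):
    """Ratio max/min of true and estimated selectivity; 1.0 means a perfect estimate."""
    t = max(true_sel, floor)  # clamp to avoid dividing by zero (assumed convention)
    e = max(est_sel, floor)
    return max(t, e) / min(t, e)

print(multiplicative_error(100, 25))  # underestimate by a factor of 4
print(multiplicative_error(25, 100))  # overestimate by a factor of 4
```

The metric is symmetric, so over- and underestimation by the same factor are penalized equally.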

Evaluation (3) – Effectiveness – IMDB

Evaluation (4) – Effectiveness – DBLP

Evaluation (5) – Efficiency

CONCLUSION

Conclusion

We tackled the problem of selectivity estimation for conjunctive, hybrid queries.

We propose a template-based BN, which is well-suited for graph-structured data.

For string predicates, we further propose the integration of string synopses into this model.

Experiments showed that:

If there are correlations between un-/structured data elements, the accuracy of selectivity estimation can be greatly improved via BN+.

BN caused no overhead in terms of efficiency.

QUESTIONS

Slides @ Slideshare …

Paper @ www.aifb.kit.edu …

REFERENCES

References

[1] C. Bizer et al. DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7:154–165, 2009.

[2] http://webdatacommons.org/

[3] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, pages 275–286, 1999.

[4] J. Spiegel and N. Polyzotis. Graph-based synopses for relational selectivity estimation. In SIGMOD, pages 205–216, 2006.

[5] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In SIGMOD, pages 461–472, 2001.

[6] K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852–863, 2011.

[7] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In ICDE, pages 227–238, 2004.

[8] L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In VLDB, pages 397–408, 2005.

References (2)

[9] W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB, pages 1033–1044, 2007.

[10] D. Z. Wang, L. Wei, Y. Li, F. Reiss, and S. Vaithyanathan. Selectivity estimation for extraction operators over text data. In ICDE, pages 685–696, 2011.

[11] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[12] M. Meila and M. Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001.

[13] Y. Luo, W. Wang, X. Lin, X. Zhou, J. Wang, and K. Li. Spark2: Top-k keyword query in relational databases. IEEE Transactions on Knowledge and Data Engineering, 23(12):1763–1780, 2011.

[14] J. Coffman and A. C. Weaver. A framework for evaluating database keyword search strategies. In CIKM, pages 729–738, 2010.

[15] http://knoesis.org/swetodblp/

[16] D. Koller and N. Friedman. Probabilistic graphical models. MIT Press, 2009.

[17] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, pages 199–210, 2001.

