Query Processing in Data Integration (Gathering and Using Source Statistics)
Page 1: Query Processing in Data Integration (Gathering and Using Source Statistics)

Query Processing in Data Integration(Gathering and Using Source Statistics)

Page 2: Query Processing in Data Integration (Gathering and Using Source Statistics)

Kambhampati & Knoblock

Query Optimization Challenges

[Mediator architecture diagram: user queries are posed against the mediated schema; the mediator (reformulation engine, optimizer, execution engine, plus a data source catalog) answers them by calling each data source through its wrapper]

-- Deciding what to optimize

--Getting the statistics on sources

--Doing the optimization

Kambhampati & Knoblock Information Integration on the Web (MA-1) 60

Query Optimization

Declarative SQL query --> Imperative query execution plan

Goal: Ideally, find the best plan. Practically, avoid the worst plans!

Example plan:

π buyer
  |
  ⋈ Buyer=name              (Simple Nested Loops)
 /            \
Purchase       σ City='seattle' ∧ phone>'5430000'
(Table scan)   Person (Index scan)

SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND Q.city = 'seattle' AND Q.phone > '5430000'

Inputs:
• the query
• statistics about the data (indexes, cardinalities, selectivity factors)
• available memory

Page 3: Query Processing in Data Integration (Gathering and Using Source Statistics)


Information Integration

[Taxonomy: Information Integration covers Data Integration, Text/Data Integration, and Service Integration; Data Integration splits into data aggregation (vertical integration), data linking (horizontal integration), and collection selection]

Page 4: Query Processing in Data Integration (Gathering and Using Source Statistics)


Extremes of automation in Information Integration

• Fully automated II (blue sky for now)
  – Get a query from the user on the mediator schema
  – Go "discover" relevant data sources
  – Figure out their "schemas"
  – Map the schemas on to the mediator schema
  – Reformulate the user query into data source queries
  – Optimize and execute the queries
  – Return the answers

• Fully hand-coded II
  – Decide on the only query you want to support
  – Write a (java)script that supports the query by accessing specific (pre-determined) sources, piping results (through known APIs) to specific other sources
  – Examples include Google Map Mashups

(The most interesting action is "in between": e.g., we may start with known sources and their known schemas, do hand-mapping, and support automated reformulation and optimization.)

Page 5: Query Processing in Data Integration (Gathering and Using Source Statistics)


What to Optimize

• Traditional DB optimizers compare candidate plans purely in terms of the time they take to produce all answers to a query.

• In integration scenarios, the optimization is "multi-objective":
  – Total time of execution
  – Cost to first few tuples
    • Often, users are happier with plans that give the first tuples faster
  – Coverage of the plan
    • Full coverage is no longer an iron-clad requirement: there are too many relevant sources, with uncontrolled overlap between them, and we can't call them all!
  – (Robustness, access premiums, …)

Page 6: Query Processing in Data Integration (Gathering and Using Source Statistics)


Roadmap

• We will first focus on optimization issues in vertical integration ("data aggregation") scenarios
  – Learning source statistics
  – Using them to do source selection

• Then move to optimization issues in horizontal integration ("data linking") scenarios
  – Join optimization issues in data integration scenarios

Page 7: Query Processing in Data Integration (Gathering and Using Source Statistics)

Query Processing Issues in Data Aggregation

• Recall that in DA, all sources are exporting fragments of the same relation R– E.g. Employment opps; bibliography records; item/price records

etc– The fragment of R exported by a source may have fewer

columns and/or fewer rows• The main issue in DA is “Source Selection”

– Given a query q, which source(s) should be selected and in what order

• Objective: Call the least number of sources that will give most number of high-quality tuples in the least amount of time– Decision version: Call k sources that ….– Quality of tuples– may be domain specific (e.g. give lowest

price records) or domain independent (e.g. give tuples with fewest null values)

Page 8: Query Processing in Data Integration (Gathering and Using Source Statistics)


Issues affecting Source Selection in DA

• Source overlap
  – In most cases you want to avoid calling overlapping sources…
  – …but in some cases you want to call overlapping sources
    • E.g. to get as much information about a tuple as possible, or to get the lowest-priced tuple

• Source latency
  – You want to call sources that are likely to respond fast

• Source quality
  – You want to call sources that have high-quality data
    • Domain independent: e.g. high density (fewer null values)
    • Domain specific: e.g. sources having lower-cost books

• Source "consistency"?
  – Exports data that is error-free

Page 9: Query Processing in Data Integration (Gathering and Using Source Statistics)


Learning Source Statistics

• Coverage, overlap, latency, density and quality statistics about sources are not likely to be exported by the sources themselves!
  – Need to learn them

• Most of the statistics are source and query specific
  – Coverage and overlap of a source may depend on the query
  – Latency may depend on the query
  – Density may depend on the query

• Statistics can be learned in a qualitative or a quantitative way
  – LCW vs. coverage/overlap statistics
  – Feasible access patterns vs. binding-pattern-specific latency statistics
  – Quantitative is more general and amenable to learning

• Too costly to learn statistics w.r.t. each specific query
  – Challenge: find the right type of query classes with respect to which statistics are learned
    • Query class definition may depend on the type of statistics

• Since sources, user population and network are all changing, statistics need to be maintained (through incremental changes)

Page 10: Query Processing in Data Integration (Gathering and Using Source Statistics)


Managing Source Overlap

• Often, sources on the Internet have overlapping contents
  – The overlap is not centrally managed (unlike DDBMS data replication, etc.)

• Reasoning about overlap is important for plan optimality
  – We cannot possibly call all potentially relevant sources!

• Qns: How do we characterize, get and exploit source overlap?
  – Qualitative approaches (LCW statements)
  – Quantitative approaches (coverage/overlap statistics)

Page 11: Query Processing in Data Integration (Gathering and Using Source Statistics)


Local Completeness Information

• If sources are incomplete, we need to look at each one of them.

• Often, sources are locally complete.
  – E.g. Movie(title, director, year) complete for years after 1960, or for American directors.

• Question: given a set of local completeness statements, is a query Q' a complete answer to Q?

True source contents vs. advertised description; guarantees (LCW; inter-source comparisons)

Problems:
1. Sources may not be interested in giving these! So we need to learn them, and they are hard to learn!
2. Even if sources are willing to give them, there may not be any "big enough" LCWs. Saying "I definitely have the car with vehicle ID XXX" is useless.

Page 12: Query Processing in Data Integration (Gathering and Using Source Statistics)


Quantitative ways of modeling inter-source overlap

• Coverage & overlap statistics [Koller et al., 97]
  – S1 has 80% of the movies made after 1960, while S2 has 60% of the movies
  – S1 has 98% of the movies stored in S2

• Computing cardinalities of unions given intersections

Who gives these statistics?
  – Third party
  – Probing

[Venn diagram: sources S1, S2, S3 as overlapping sets within the extension of R]

Page 13: Query Processing in Data Integration (Gathering and Using Source Statistics)

Case Study: BibFinder

• BibFinder: a popular CS bibliographic mediator
  – Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer
  – More than 58,000 real user queries collected

• Mediated schema relation in BibFinder: paper(title, author, conference/journal, year)
  – Primary key: title + author + year

• Focus on selection queries
  – Q(title, author, year) :- paper(title, author, conference/journal, year), conference=SIGMOD

Page 14: Query Processing in Data Integration (Gathering and Using Source Statistics)
Page 15: Query Processing in Data Integration (Gathering and Using Source Statistics)
Page 16: Query Processing in Data Integration (Gathering and Using Source Statistics)

Selecting top-K sources for a given query

• Given a query Q and sources S1…Sn, we need the coverage and overlap statistics of the sources Si w.r.t. Q
  – P(S|Q) is the coverage of S (the probability that a random tuple belonging to Q is exported by source S)
  – P({S1..Sj}|Q) is the overlap between S1..Sj w.r.t. Q (the probability that a random tuple belonging to Q is exported by all the sources S1..Sj)
  – Given the coverage and overlap statistics, it is possible to pick the top-K sources that will give the maximal number of tuples for Q

Page 17: Query Processing in Data Integration (Gathering and Using Source Statistics)

Computing Effective Coverage provided by a set of sources

Suppose we are calling 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1 ∪ S2 ∪ S3 | Q). To compute this union, we need the intersection (overlap) statistics in addition to the coverage statistics:

P(S1 ∪ S2 ∪ S3 | Q) = P(S1|Q) + P(S2|Q) + P(S3|Q)
                      − P(S1 ∩ S2|Q) − P(S1 ∩ S3|Q) − P(S2 ∩ S3|Q)
                      + P(S1 ∩ S2 ∩ S3|Q)

Given the above, we can pick the optimal 3 sources for answering Q by considering all 3-sized subsets of the source set S1…Sn and picking the subset with the highest effective coverage.
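A minimal sketch of this computation in Python, assuming the coverage/overlap statistics for Q are stored in a dictionary keyed by source set (all numbers below are hypothetical):

```python
from itertools import combinations

# Hypothetical statistics for a query Q: each key is a source set,
# the value is P(intersection of those sources | Q).
stats = {
    frozenset({"S1"}): 0.6, frozenset({"S2"}): 0.5, frozenset({"S3"}): 0.4,
    frozenset({"S1", "S2"}): 0.3, frozenset({"S1", "S3"}): 0.1,
    frozenset({"S2", "S3"}): 0.2, frozenset({"S1", "S2", "S3"}): 0.05,
}

def union_coverage(sources, stats):
    """P(S1 u ... u Sk | Q) by inclusion-exclusion over all subsets."""
    total = 0.0
    for r in range(1, len(sources) + 1):
        for subset in combinations(sources, r):
            sign = 1.0 if r % 2 == 1 else -1.0
            total += sign * stats.get(frozenset(subset), 0.0)
    return total

def best_k_sources(all_sources, k, stats):
    """Exhaustive top-K selection: try every K-sized subset."""
    return max(combinations(sorted(all_sources), k),
               key=lambda subset: union_coverage(subset, stats))

# P(S1 u S2 u S3 | Q) = 0.6+0.5+0.4 - 0.3-0.1-0.2 + 0.05 = 0.95
print(union_coverage(["S1", "S2", "S3"], stats))
```

Exhaustive selection examines O(n choose k) subsets, which motivates the greedy alternative on the next slide.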

Page 18: Query Processing in Data Integration (Gathering and Using Source Statistics)

Selecting top-K sources: the greedy way

Selecting the optimal K sources is hard in general. One way to reduce cost is to select sources greedily, one after another. For example, to select 3 sources, we pick the first source Si as the one with the highest P(Si|Q) value. To pick the j-th source, we compute the residual coverage of each remaining source, given the j−1 sources already picked (the residual coverage computation requires overlap statistics). For example, picking a third source in the context of sources S1 and S2 requires us to calculate:

P(S3 | Q, S1, S2) = P(S3|Q) − P(S1 ∩ S3|Q) − P(S2 ∩ S3|Q) + P(S1 ∩ S2 ∩ S3|Q)

Page 19: Query Processing in Data Integration (Gathering and Using Source Statistics)

Challenges in gathering overlap statistics

• Sources are incomplete and partially overlapping

• Calling every possible source is inefficient and impolite

• Need coverage and overlap statistics to figure out what sources are most relevant for every possible query!

Coverage: the probability that a random answer tuple for query Q belongs to source S; noted P(S|Q).

Overlap: the degree to which sources contain the same answer tuples for query Q; noted P(S1 ∩ S2 ∩ … ∩ Sk | Q).

[Venn diagram: overlapping sources DBLP, CSB, ACM DL]

• We introduce a frequency-based approach for mining these statistics

Page 20: Query Processing in Data Integration (Gathering and Using Source Statistics)

Outline

• Motivation

• BibFinder/StatMiner Architecture

• StatMiner Approach– Automatically learning AV Hierarchies

– Discovering frequent query classes

– Learning coverage and overlap Statistics

• Using Coverage and Overlap Statistics
• StatMiner evaluation with BibFinder
• Related Work
• Conclusion

Page 21: Query Processing in Data Integration (Gathering and Using Source Statistics)

Motivation

• We introduce StatMiner
  – A threshold-based hierarchical mining approach
  – Stores statistics w.r.t. query classes
  – Keeps more accurate statistics for more frequently asked queries
  – Handles the efficiency/accuracy tradeoff by adjusting the thresholds

• Challenges of gathering coverage and overlap statistics
  – It's impractical to assume that the sources will export such statistics, because the sources are autonomous.
  – It's impractical to learn and store all the statistics for every query:
    • That would necessitate N_Q · 2^N_S different statistics, where N_Q is the number of possible queries and N_S is the number of sources
    • It is also impractical to assume knowledge of the entire query population a priori

Page 22: Query Processing in Data Integration (Gathering and Using Source Statistics)

BibFinder/StatMiner

[Architecture diagram: BibFinder answers a user query by calling the sources (CSB, DBLP, ACM DL, Netbib, ScienceDirect, CiteSeer) and returning answer tuples; the query list it logs feeds the StatMiner modules: Learn AV Hierarchies, Discover Frequent Query Classes, and Learn Coverage and Overlap Statistics]

Page 23: Query Processing in Data Integration (Gathering and Using Source Statistics)

Query List

Query                                Frequency  Distinct  Overlap (Coverage)
                                                Answers
Author="andy king"                   106        46        DBLP: 35; CSB: 23;
                                                          CSB,DBLP: 12;
                                                          DBLP,Science: 3;
                                                          Science: 3;
                                                          CSB,DBLP,Science: 1;
                                                          CSB,Science: 1
Author="fayyad" Title="data mining"    1        27        CSB: 16; DBLP: 16;
                                                          CSB,DBLP: 7; ACMdl: 5;
                                                          ACMdl,CSB: 3;
                                                          ACMdl,DBLP: 3;
                                                          ACMdl,CSB,DBLP: 2;
                                                          Science: 1

Query list: the mediator maintains an XML log of all user queries, along with their access frequency, the number of total distinct answers obtained, and the number of answers from each source set that has answers for the query.

Each query q corresponds to a vector of coverage/overlap statistics. If there are 3 sources S1, S2, S3, we have: [P(S1|q), P(S2|q), P(S3|q), P(S1∩S2|q), P(S2∩S3|q), P(S1∩S3|q), P(S1∩S2∩S3|q)], a sparse vector with exponentially many dimensions.

By keeping thresholds on minimum overlap, we can avoid remembering small values. The larger the thresholds, the sparser the vectors.
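A sketch of turning one logged query's per-source-set answer counts into a thresholded sparse vector, assuming the log stores counts as in the query list above:

```python
def overlap_vector(source_set_counts, total_answers, minoverlap=0.0):
    """Build the sparse coverage/overlap vector for one logged query:
    P(source set | q) = answers from that source set / total distinct
    answers, keeping only entries at or above the minoverlap threshold."""
    vec = {}
    for source_set, count in source_set_counts.items():
        p = count / total_answers
        if p >= minoverlap:
            vec[frozenset(source_set)] = p
    return vec

# Counts for the query Author="andy king" from the query list above.
counts = {
    ("DBLP",): 35, ("CSB",): 23, ("CSB", "DBLP"): 12,
    ("DBLP", "Science"): 3, ("Science",): 3,
    ("CSB", "DBLP", "Science"): 1, ("CSB", "Science"): 1,
}
vec = overlap_vector(counts, total_answers=46, minoverlap=0.05)
# The two 1-answer source sets (1/46, about 0.02) fall below the
# threshold and are dropped, leaving a 5-entry sparse vector.
```

Raising minoverlap trades a little accuracy for a much sparser (cheaper to store) vector, which is exactly the threshold effect described above.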

Page 24: Query Processing in Data Integration (Gathering and Using Source Statistics)

Issues in Storing & Using Statistics

• Storing statistics for each query is disadvantageous
  – Too many queries
  – Stored statistics can only be useful if the same query comes up again

• Idea 1: Focus on only frequently asked queries
• Idea 2: Store statistics w.r.t. query classes
  – Generate query classes by clustering
  – When a new query comes, we can map it to some existing query classes
  – But clustering directly on queries won't work, because we won't know how to map a new query into existing query classes
  – Idea: first do "subspace" clustering: cluster attribute values. A query class is then defined as a cross product of attribute-value clusters.

Page 25: Query Processing in Data Integration (Gathering and Using Source Statistics)

AV Hierarchies and Query Classes

[Figure: AV hierarchy for the Year attribute (RT → 2001, 2002); AV hierarchy for the Conference attribute (RT → DB → SIGMOD, ICDE and RT → AI → AAAI, ECP); and the query class hierarchy formed from their cross products, from (RT,RT) down to classes such as (SIGMOD,01)]

Attribute-Value Hierarchy: an AV hierarchy is a classification of the values of a particular attribute of the mediator relation. Leaf nodes in the hierarchy correspond to concrete values bound in a query.

Query Class: queries are grouped into classes by computing Cartesian products over the AV hierarchies. A query class is a set of queries that all share a set of assignments of particular attributes to specific values.

Page 26: Query Processing in Data Integration (Gathering and Using Source Statistics)

StatMiner

Learning AV Hierarchies: attribute values are extracted from the query list. Clustering similar attribute values leads to finding similar selection queries, based on the similarity of their answer distributions over the sources. The AV hierarchies are generated using an agglomerative hierarchical clustering algorithm, and are then flattened according to their tightness. A query is a vector of overlap statistics, and the distance between two queries is

d(Q1, Q2) = Σ_i [P(Ŝi|Q1) − P(Ŝi|Q2)]²

Clusters C1, C2 are flattened together when D(C1, C2) <= 1/tightness(C1), where

tightness(C) = 1 / Σ_{Q∈C} (P(Q)/P(C)) · d(Q, C)

Discovering Frequent Query Classes: candidate frequent query classes are identified using the anti-monotone property. Classes which are infrequently mapped to are then removed.

Learning Coverage and Overlap: coverage and overlap statistics are computed for each frequent query class using a modified Apriori algorithm; the class statistics are the frequency-weighted averages of the member queries' statistics:

P(Ŝ|C) = Σ_{Q∈C} P(Ŝ|Q) · P(Q)/P(C)
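A minimal sketch of the query distance and the frequency-weighted class statistics, with hypothetical two-source query vectors:

```python
def distance(v1, v2):
    """d(Q1,Q2): squared Euclidean distance between two sparse
    coverage/overlap vectors (missing entries count as 0)."""
    keys = set(v1) | set(v2)
    return sum((v1.get(k, 0.0) - v2.get(k, 0.0)) ** 2 for k in keys)

def class_statistics(members):
    """P(S_hat|C) as the frequency-weighted average of the member
    queries' statistics; members is a list of (frequency, vector)."""
    total = sum(freq for freq, _ in members)
    stats = {}
    for freq, vec in members:
        for key, p in vec.items():
            stats[key] = stats.get(key, 0.0) + (freq / total) * p
    return stats

q1 = {"DBLP": 0.8, "CSB": 0.4}
q2 = {"DBLP": 0.6}
distance(q1, q2)                      # 0.04 + 0.16 = 0.2
class_statistics([(3, q1), (1, q2)])  # DBLP -> 0.75, CSB -> 0.3
```

Weighting by frequency means the class statistics stay most accurate for the queries users actually ask, matching the StatMiner design goal.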

Page 27: Query Processing in Data Integration (Gathering and Using Source Statistics)

Learned Conference Hierarchy

Page 28: Query Processing in Data Integration (Gathering and Using Source Statistics)

Using Coverage and Overlap Statistics to Rank Sources

1. A new user query is mapped to a set of least general query classes.

2. The mediator estimates the statistics for the query using a weighted sum of the statistics of the mapped classes.

3. Data sources are ranked and called in order of relevance using the estimated statistics. In particular:
   - The most relevant source has the highest coverage
   - Each next-best source has the highest residual coverage

As a result, the maximum number of tuples is obtained while the fewest sources are called.

[Venn diagram: overlapping sources DBLP, CSB, ACM DL]

Example: here, CSB has the highest coverage, followed by DBLP. However, since ACM DL has higher residual coverage than DBLP, the top 2 sources called would be CSB and ACM DL.

Page 29: Query Processing in Data Integration (Gathering and Using Source Statistics)

Outline

• Motivation

• BibFinder/StatMiner Architecture

• StatMiner Approach– Automatically learning AV Hierarchies

– Discovering frequent query classes

– Learning coverage and overlap Statistics

• Using Coverage and Overlap Statistics
• StatMiner evaluation with BibFinder
• Related Work
• Conclusion

Page 30: Query Processing in Data Integration (Gathering and Using Source Statistics)

BibFinder/StatMiner Evaluation

Purpose of the experiments:
- Analysis of space consumption
- Estimation of the accuracy of the learned statistics
- Evaluation of the effectiveness of those statistics in BibFinder

Query planning algorithms used in the experiments:
- Random Select (RS): without any statistics
- Simple Greedy (SG): only coverage statistics
- Greedy Select (GS): coverage and overlap statistics

Precision of a plan: the fraction of sources in the estimated plan which are among the actual top sources.

Experimental setup with BibFinder:
• Mediator relation: Paper(title, author, conference/journal, year)
• 25,000 real user queries are used; among them, 4,500 queries are randomly chosen as test queries
• AV hierarchies for all four attributes are learned automatically
• 8,000 distinct values in author, 1,200 frequently asked keyword itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year

Page 31: Query Processing in Data Integration (Gathering and Using Source Statistics)

Learned Conference Hierarchy

Page 32: Query Processing in Data Integration (Gathering and Using Source Statistics)

Space Consumption for Different minfreq and minoverlap

• We use a threshold on the support of a class, called minfreq, to identify frequent classes

• We use a minimum support threshold minoverlap to prune overlap statistics for uncorrelated source sets.

• As we increase either of these two thresholds, the memory consumption drops, especially in the beginning.

[Chart: memory consumption (bytes, 0 to 1,600,000) vs. minfreq (0.03%–0.73%), one curve per minoverlap value (0, 0.1, 0.2, 0.3)]

Page 33: Query Processing in Data Integration (Gathering and Using Source Statistics)

Accuracy of the Learned Statistics

[Chart: average error (0 to 0.5) vs. minfreq (0.03%–0.73%), one curve per minoverlap value (0, 0.1, 0.2, 0.3)]

• Absolute error shows no dramatic increases as the thresholds grow
• Keeping very detailed overlap statistics would not necessarily increase the accuracy, while requiring much more space. For example: minfreq=0.13 and minoverlap=0.1 versus minfreq=0.33 and minoverlap=0

Error = Σ_{Q ∈ TestQuerySet} Σ_i [P'(Ŝi|Q) − P(Ŝi|Q)]² / |TestQuerySet|
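The error measure above can be sketched as follows (the toy true/learned statistics are hypothetical):

```python
def average_error(learned, true, test_queries):
    """Average of sum_i [P'(S_hat_i|Q) - P(S_hat_i|Q)]^2 over the test
    query set, where each query's statistics form a sparse vector."""
    total = 0.0
    for q in test_queries:
        keys = set(learned[q]) | set(true[q])
        total += sum((learned[q].get(k, 0.0) - true[q].get(k, 0.0)) ** 2
                     for k in keys)
    return total / len(test_queries)

# Hypothetical per-query statistics for two test queries.
true_stats    = {"q1": {"DBLP": 0.8, "CSB": 0.4}, "q2": {"DBLP": 0.5}}
learned_stats = {"q1": {"DBLP": 0.7, "CSB": 0.4}, "q2": {"DBLP": 0.6}}
average_error(learned_stats, true_stats, ["q1", "q2"])  # ~0.01
```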

Page 34: Query Processing in Data Integration (Gathering and Using Source Statistics)

Plan Precision

• Here we observe the average precision of the top-2 source plans

• The plans using our learned statistics have high precision compared to Random Select, and it decreases only slowly as we change the minfreq and minoverlap thresholds.

[Chart: precision (0.4 to 1) vs. minfreq (0.03%–0.73%) for RS, SG0, GS0, SG0.3 and GS0.3]

Page 35: Query Processing in Data Integration (Gathering and Using Source Statistics)

Plan Precision on Controlled Sources

[Chart: precision (0 to 1) vs. threshold (0%–2.25%) for Greedy Select, Simple Greedy and Random Select]

We observe the plan precision of top-5 source plans (25 simulated sources in total). Using Greedy Select does produce better plans. See Section 3.8 and Section 3.9 for detailed information.

Page 36: Query Processing in Data Integration (Gathering and Using Source Statistics)

Number of Distinct Results

• Here we observe the average number of distinct results of top-2 source plans.

• Our method gets on average 50 distinct answers, while Random Select gets only about 30 answers.

[Chart: number of distinct answers (28 to 53) vs. minfreq (0.03%–0.73%) for RS, SG0, GS0, SG0.3 and GS0.3]

Page 37: Query Processing in Data Integration (Gathering and Using Source Statistics)

Latency statistics
(Or: what good is coverage without good response time?)

• Sources vary significantly in terms of their response times
  – The response time depends both on the source itself and on the query that is asked of it
    • Specifically, which fields are bound in the selection query can make a difference

• …So, learn latency statistics w.r.t. binding patterns

Page 38: Query Processing in Data Integration (Gathering and Using Source Statistics)

Query Binding Patterns

• A binding pattern refers to which arguments of a relational query are "bound"
  – Given a relation S(X,Y,Z):
    • The query S("Rao", Y, "Tom") has binding pattern bfb
    • The query S(X, Y, "TOM") has binding pattern ffb

• Binding patterns can be generalized to take the "types" of bindings into account
  – E.g. S(X, Y, 1) may be ffn (n being a numeric binding), and
  – S(X, Y, "TOM") may be ffs (s being a string binding)

• Sources tend to have different latencies based on the binding pattern
  – In extreme cases, certain binding patterns may have infinite latency (i.e., you are not allowed to ask that query)
    • These are called "infeasible" binding patterns
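The notation above can be sketched as a small helper (representing a free variable as None is an assumption of this sketch):

```python
def binding_pattern(*args, typed=False):
    """Binding pattern of a selection query: 'f' for a free variable
    (represented as None), 'b' for a bound argument; with typed=True,
    bound arguments become 'n' (numeric) or 's' (string) instead."""
    out = []
    for a in args:
        if a is None:
            out.append("f")
        elif typed and isinstance(a, (int, float)):
            out.append("n")
        elif typed:
            out.append("s")
        else:
            out.append("b")
    return "".join(out)

binding_pattern("Rao", None, "Tom")         # 'bfb'
binding_pattern(None, None, "TOM")          # 'ffb'
binding_pattern(None, None, 1, typed=True)  # 'ffn'
```

Latency statistics would then be keyed by these pattern strings rather than by individual queries.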

Page 39: Query Processing in Data Integration (Gathering and Using Source Statistics)

(Digression)

• LCWs are the “qualitative” versions of quantitative coverage/overlap statistics

• Feasible binding patterns are “qualitative” versions of quantitative latency statistics

Page 40: Query Processing in Data Integration (Gathering and Using Source Statistics)

Binding-specific latency stats are more effective

Page 41: Query Processing in Data Integration (Gathering and Using Source Statistics)

Combining coverage and response time

• Qn: How do we define an optimal plan in the context of both coverage/overlap and response-time requirements?
  – An instance of "multi-objective" optimization

• The general solution involves presenting a set of "pareto-optimal" plans to the user and letting her decide
  – A pareto-optimal set is a set of solutions in which no solution is dominated by another in all optimization dimensions (i.e., none has both better coverage and lower response time)

• Another idea is to combine both objectives into a single weighted objective
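Computing the pareto-optimal set can be sketched as follows, with plans as hypothetical (coverage, response_time) pairs:

```python
def pareto_optimal(plans):
    """Keep the plans not dominated by any other plan. A plan (coverage,
    response_time) is dominated if some other plan has coverage at least
    as high AND response time at least as low (and is not the plan itself)."""
    def dominates(q, p):
        return q != p and q[0] >= p[0] and q[1] <= p[1]
    return [p for p in plans if not any(dominates(q, p) for q in plans)]

plans = [(0.9, 8.0), (0.7, 3.0), (0.6, 5.0), (0.8, 8.0)]
pareto_optimal(plans)  # [(0.9, 8.0), (0.7, 3.0)]
```

The single-weighted-objective alternative would instead rank plans by something like w·coverage − (1−w)·normalized_time for a user-chosen weight w.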

Page 42: Query Processing in Data Integration (Gathering and Using Source Statistics)

It is possible to optimize for first tuples

Page 43: Query Processing in Data Integration (Gathering and Using Source Statistics)

Different “kinds” of plans

