
Optimizing source-call ordering in Information Gathering Plans Subbarao Kambhampati Senthil...

Date post: 18-Dec-2015
Optimizing source-call ordering in Information Gathering Plans Subbarao Kambhampati Senthil Gnanaprakasam Arizona State University http://rakaposhi.eas.asu.edu/yochan.html
Transcript
Page 1: Optimizing source-call ordering in Information Gathering Plans Subbarao Kambhampati Senthil Gnanaprakasam Arizona State University .

Optimizing source-call ordering in Information Gathering Plans

Subbarao Kambhampati

Senthil Gnanaprakasam

Arizona State University

http://rakaposhi.eas.asu.edu/yochan.html

Page 2

Planning for Information Gathering

[Figure: architecture diagram; labels: user, Gatherer, wrapper, wrapper, db, cgi, <html>]

Page 3

EMERAC Query Planning System

EMERAC Information Gatherer

Build query plan using source inversion [Duschka, Genesereth 97]

Logical Optimizations: Redundancy removal

Execution Optimizations: Source call ordering

Execute query plan

Today’s talk

Talk on Friday AM (Search & Info Gathering)

Page 4

The Problem

• Plans for gathering information on the internet can be modeled as datalog programs whose EDB predicates correspond to calls to internet sources.
• Optimizing the execution of these programs involves optimizing the ordering of source calls
  – In the special case where the IG plans are conjunctive queries, this problem reduces to the join ordering problem

Page 5

Overview

• Inadequacy of the traditional optimization methods
• Source access limitations: Representation
  – % and $ annotations
• Issues in ordering source calls
• Our approach: Assumptions & Overview
  – HTBP table
  – Algorithm
  – Example
• Implementation in EMERAC
• Related work & Conclusion

Page 6

Inadequacy of Traditional methods

Traditional:
• All sources are assumed to be fully relational

Information Gathering:
• Sources are rarely fully relational
  – Only limited types of queries allowed
    • Wrapped web-pages
    • Form-interfaced databases
• Certain forms of join computation may be precluded
  – Need to model query capabilities

Page 7

Inadequacy of Traditional methods [Continued]

Traditional:
• Tuple-transfer costs are assumed to dominate the query-execution costs
  – Use of “Bound-is-easier” assumption
• Assume availability of full source-statistics
  – Selectivity indices, histograms etc.

Information Gathering:
• Access cost & source latencies tend to equal or dominate the transfer cost
  – Need to consider number of source calls
  – Need for considering bushy joins (instead of just left-linear join trees)
• Full statistics are rarely available about internet sources
  – Sources are decentralized and autonomous
  – Difficult to do systematic optimization

Page 8

Source Access Limitations

• Sources can have a variety of access limitations
  – Form-interfaced databases may require certain attributes to be bound
    • Whitepages may require the name of the person
      – To get the numbers of a set of n people, we will have to access the source n times
  – and may be unable to handle bindings of other attributes
    • A Whitepages database may not take the address of a person as a bound attribute
      – To get the number of John Doe, who lives on Lemon St, we will have to get the numbers of all John Does, and locally filter the ones not living on Lemon Street
  – Wrapped web-pages cannot select over any attributes

Page 9

Representing Source Access Limitations

• Use annotations on the attributes of the source relation
  – “$” annotation identifies attributes that must be bound
  – “%” annotation identifies un-selectable attributes
    • S($X,%Y,Z)
      – A form-interfaced web-page that requires bindings for X and is able to do selections only on Z.
• $ and % annotations help identify feasible binding patterns for sources
  – S^b-- are feasible; S^f-- are infeasible;
  – S^bbf must be modeled as S^bff filtered locally with binding on Y
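The annotation scheme above can be sketched in a few lines of Python. This is only an illustration, not code from EMERAC: the function names `parse_source`, `is_feasible` and `split_call` are hypothetical, and binding patterns are assumed to be strings of `b` (bound) and `f` (free), one character per attribute.

```python
def parse_source(spec):
    """Split e.g. 'S($X,%Y,Z)' into (name, attrs, must_bind, unselectable).

    must_bind and unselectable are sets of attribute positions.
    """
    name, rest = spec.split("(", 1)
    attrs, must_bind, unselectable = [], set(), set()
    for i, tok in enumerate(rest.rstrip(")").split(",")):
        tok = tok.strip()
        if tok.startswith("$"):
            must_bind.add(i)        # '$': attribute must be bound
        elif tok.startswith("%"):
            unselectable.add(i)     # '%': source cannot select on it
        attrs.append(tok.lstrip("$%"))
    return name, attrs, must_bind, unselectable


def is_feasible(pattern, must_bind):
    """A pattern is feasible if every $-annotated position is bound."""
    return all(pattern[i] == "b" for i in must_bind)


def split_call(pattern, unselectable):
    """Return (pattern actually sent to the source, positions filtered locally).

    Bindings on %-annotated positions cannot be pushed to the source,
    so they are dropped from the remote call and applied as a local filter.
    """
    remote = "".join("f" if i in unselectable else c
                     for i, c in enumerate(pattern))
    local = [i for i, c in enumerate(pattern)
             if c == "b" and i in unselectable]
    return remote, local


name, attrs, must_bind, unselectable = parse_source("S($X,%Y,Z)")
print(is_feasible("bff", must_bind))       # X is bound
print(is_feasible("fbb", must_bind))       # X is free
print(split_call("bbf", unselectable))     # remote pattern plus local filter on Y
```

For S($X,%Y,Z), the requested pattern `bbf` is sent to the source as `bff`, with position 1 (Y) left for local filtering, which matches the last bullet of the slide.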

Page 10

Relating binding patterns

• Generality of binding patterns
  – S^p is more general than S^q if every non-%-annotated attribute that is free in q is also free in p (but not vice versa)
    • Call to S with binding pattern p will subsume the results of call to S with binding pattern q
    • For S($X,%Y,Z), S^bbf is more general than S^bfb
      – Holds only because of % annotations
  – #(B) is the number of bound variables in the binding pattern B that are not %-annotated
    • #(.) is used to relate binding patterns of different sources (as in “bound-is-easier” assumption)
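The two definitions on this slide can be checked mechanically. The sketch below is an assumption-laden illustration: `more_general` and `bound_count` are hypothetical names, patterns are `b`/`f` strings, and %-annotated positions are passed as a set of indices.

```python
def more_general(p, q, unselectable=frozenset()):
    """True if p is strictly more general than q.

    Only positions that are not %-annotated are compared: every such
    position free in q must be free in p, but not vice versa.
    """
    selectable = [i for i in range(len(p)) if i not in unselectable]
    q_free_in_p = all(p[i] == "f" for i in selectable if q[i] == "f")
    p_free_in_q = all(q[i] == "f" for i in selectable if p[i] == "f")
    return q_free_in_p and not p_free_in_q


def bound_count(pattern, unselectable=frozenset()):
    """#(B): bound positions of B that are not %-annotated."""
    return sum(1 for i, c in enumerate(pattern)
               if c == "b" and i not in unselectable)


y_unselectable = frozenset({1})                     # S($X,%Y,Z): Y is position 1
print(more_general("bbf", "bfb", y_unselectable))   # with the % annotation
print(more_general("bbf", "bfb"))                   # without it
print(bound_count("bbf", y_unselectable))
```

With the % annotation on Y the first call reports that S^bbf is more general than S^bfb; without it the relation fails because Y is free in `bfb` but bound in `bbf`, which is the point of the "holds only because of % annotations" bullet.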

Page 11

Issues in ordering source calls

• Execution cost is a function of both access cost and the tuple-transfer cost (ignoring local processing costs…)
• Tension between access costs & traffic costs
  – E.g. Execute “S1(W,X) & S2(X,Y)” where the query binds W
  – Tuple-transfer cost reduction motivates calling sources with the least general binding patterns possible
    • Bound-is-easier (S1 first, and then feed X bindings to S2)
  – Access cost reduction motivates calling sources with the most general binding patterns possible
    • Feeding X bindings for S2 will generate many separate accesses, increasing the access cost

Minimize $\sum_{s} \Big( \underbrace{C^{a}_{s} \cdot n_{s}}_{\text{access cost}} + \underbrace{C^{t}_{s} \cdot D_{s}}_{\text{transfer cost}} \Big)$
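A small worked calculation makes the tension concrete. It assumes the objective is read as, per source, (access cost per call × number of calls) + (transfer cost per tuple × tuples transferred). All numbers below are invented for illustration, and `plan_cost` is a hypothetical helper, not part of any system described here.

```python
def plan_cost(calls, access_cost, transfer_cost):
    """Total cost of a plan.

    calls maps each source to (number of calls, tuples transferred).
    """
    return sum(access_cost[s] * n + transfer_cost[s] * d
               for s, (n, d) in calls.items())


# Hypothetical per-call and per-tuple costs for S1(W,X) & S2(X,Y).
access_cost = {"S1": 100, "S2": 100}
transfer_cost = {"S1": 1, "S2": 1}

# Bound-is-easier: one call to S1 returns 50 X values, then 50 calls to S2
# that together return 150 tuples.
feed_bindings = {"S1": (1, 50), "S2": (50, 150)}

# Most general patterns: one call to each source; S2 ships 2000 tuples
# and the join is done locally.
fetch_all = {"S1": (1, 50), "S2": (1, 2000)}

print(plan_cost(feed_bindings, access_cost, transfer_cost))  # 5300
print(plan_cost(fetch_all, access_cost, transfer_cost))      # 2250
```

With these made-up figures the single unbound call to S2 is cheaper despite transferring far more tuples; if S2 instead returned millions of tuples when unbound, the comparison would flip.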

Page 12

Our Approach: Assumptions

• Exact optimization is not worth it…
  – Lack of full source statistics
  – NP-hardness of the optimization problem
    • Join-ordering, which is a special case, is already NP-Complete
• Source access costs dominate tuple-transfer costs by default
  – Reasonable given the large setup and latency costs for internet sources

Page 13

Our Approach: Overview

• A greedy approach (along the lines of “bound-is-easier” type procedures)
• By default, attempts to access each source with the most general feasible binding pattern
  – Reasonable given the assumption that access costs dominate transfer costs
• The default is overridden if a binding pattern is known to produce too much traffic
  – Binding patterns producing high traffic are stored in a table called HTBP
• Implicitly produces bushy join trees

Page 14

The HTBP Table

• The HTBP table contains, for every source S, the least general binding patterns of S which are known to produce “high” traffic
  – A call to source S with binding pattern B is considered high-traffic producing if HTBP contains S^B’ and B is either equal to or more general than B’
  – E.g. Book(Author,Title,ISBN,Subj,Price,Pages)
    • HTBP may contain all binding patterns that do not bind at least one of the first four attributes
      – Book^ffffbb listed explicitly in HTBP
      – Book^fffffb, Book^ffffbf and Book^ffffff would be considered to be implicitly in HTBP
• Advantage: HTBP should be easy to specify even if full source statistics are not available
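The implicit-membership rule can be sketched as a lookup. The code below is a minimal illustration under stated assumptions: the HTBP table is modeled as a dictionary from source name to a set of explicitly listed `b`/`f` patterns, %-annotations are ignored, and the function names are hypothetical.

```python
def equal_or_more_general(b, b_prime):
    """True if every position that is free in b_prime is also free in b."""
    return all(bc == "f" for bc, pc in zip(b, b_prime) if pc == "f")


def is_high_traffic(source, pattern, htbp):
    """A call is high-traffic if it equals, or is more general than,
    some pattern listed explicitly for that source."""
    return any(equal_or_more_general(pattern, listed)
               for listed in htbp.get(source, ()))


# Book(Author,Title,ISBN,Subj,Price,Pages): only the least general
# high-traffic pattern is stored.
htbp = {"Book": {"ffffbb"}}

for pattern in ("ffffbb", "fffffb", "ffffbf", "ffffff", "bfffbb", "fffbff"):
    print(pattern, is_high_traffic("Book", pattern, htbp))
```

The first four patterns are reported as high-traffic (one explicitly, three implicitly), while `bfffbb` and `fffbff` are not, since each binds one of the first four attributes.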

Page 15

The Algorithm--I

• Input:
  – FBP: Table of forbidden binding patterns
    • Constructed from $ annotations
  – HTBP: High traffic binding patterns
  – A conjunction of m subgoals making up the query plan
• Data Structures:
  – C[1…m], where C[i] lists source calls (source, BP) to be done at stage i
    • More than one source call possible at each stage
      – Implicit bushy joins
  – P[1…m], where P[i] lists the sources postponed at stage i
    • Postponement is done if there is no binding pattern for a source that is both feasible, and not contained in HTBP
  – V is the list of variables for which bindings exist
    • initialized to the variables bound in the query

Page 16

The Algorithm--2

For each stage i from 1 to m do
    For each unchosen subgoal S
        Pick the most general & feasible BP B of S w.r.t. V & FBP such that B is not in HTBP.
        If such a B exists,
            Push S^B into C[i]. Mark S chosen. Add all variables of S to V
        If no such B exists, but there is a feasible binding pattern for S
            Pick the BP B’ with most bound variables (in terms of #(.))
            Push S^B’ into P[i]
    If no subgoal has been chosen at this level (C[i] is empty), and there are some postponed sources (P[i] is non-empty)
        Choose S_k^B in P[i] with the maximum #(B) value
        Push S_k^B into C[i]
        Add all variables of S_k to V
Return the array C[1…m]

Default case: Reduce accesses

HTBP case: Reduce transfer costs
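The pseudocode can be turned into a short runnable sketch. This is one possible reading, not the EMERAC implementation, and it makes several simplifying assumptions: patterns are `b`/`f` strings, %-annotations are not modeled, FBP and HTBP are dictionaries from source name to sets of patterns, "most general" is taken to mean "fewest bound positions", and V is updated as soon as a subgoal is chosen, as the pseudocode literally states. All function names are hypothetical.

```python
from itertools import product


def bound_count(pattern):
    """#(B): number of bound positions (no %-annotations modeled here)."""
    return pattern.count("b")


def equal_or_more_general(p, q):
    """True if every position that is free in q is also free in p."""
    return all(pc == "f" for pc, qc in zip(p, q) if qc == "f")


def feasible_patterns(args, bound_vars, forbidden):
    """Patterns that bind only already-bound variables and are not in FBP."""
    patterns = []
    for bits in product("fb", repeat=len(args)):
        pattern = "".join(bits)
        usable = all(c == "f" or a in bound_vars
                     for c, a in zip(pattern, args))
        if usable and pattern not in forbidden:
            patterns.append(pattern)
    return patterns


def order_source_calls(subgoals, query_bound, htbp, fbp=None):
    """subgoals: list of (source, args). Returns C as a list of m stages."""
    fbp = fbp or {}
    V = set(query_bound)
    C = [[] for _ in subgoals]
    chosen = set()
    for i in range(len(subgoals)):
        postponed = []  # plays the role of P[i]
        for k, (source, args) in enumerate(subgoals):
            if k in chosen:
                continue
            feasible = feasible_patterns(args, V, fbp.get(source, ()))
            low_traffic = [p for p in feasible
                           if not any(equal_or_more_general(p, q)
                                      for q in htbp.get(source, ()))]
            if low_traffic:
                # Default case: most general pattern outside HTBP.
                C[i].append((source, min(low_traffic, key=bound_count)))
                chosen.add(k)
                V.update(args)
            elif feasible:
                # HTBP case: postpone with the most-bound feasible pattern.
                postponed.append((k, source, max(feasible, key=bound_count)))
        if not C[i] and postponed:
            k, source, pattern = max(postponed,
                                     key=lambda e: bound_count(e[2]))
            C[i].append((source, pattern))
            chosen.add(k)
            V.update(subgoals[k][1])
    return C


# Plan Q(A,T,U,1998) :- DP(A,T,1998) & SM98(T,U), with Y bound by the query.
plan = [("DP", ("A", "T", "Y")), ("SM98", ("T", "U"))]
case1 = order_source_calls(plan, {"Y"}, {"DP": {"bbb"}, "SM98": {"bb"}})
case2 = order_source_calls(plan, {"Y"}, {"DP": {"ffb"}})
case3 = order_source_calls(plan, {"Y"}, {})
print(case1, case2, case3, sep="\n")
```

On the DP/SM98 plan this sketch yields DP^ffb then SM98^bf when every pattern is high-traffic, SM98^ff then DP^fbf when only DP^ffb is listed, and both sources called with all-free patterns in a single stage when HTBP is empty.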

Page 17

Example

• Sources: DP(A:Author,T:Title,Y:Year), SM98(T:Title,U:URL)
• Query: Q(A,T,U,1998)
• Plan: Q(A,T,U,1998) :- DP(A,T,1998) & SM98(T,U)

HTBP: {DP^bbb, SM98^bb} (Bound-is-easier)
Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff (all three crossed out)
P[1] = {DP^ffb, SM98^ff}
C[1] = DP^ffb
Step 2. V={A,T,Y}
Cand: SM98^ff, SM98^bf (both crossed out)
P[2] = {SM98^bf}
C[2] = SM98^bf

HTBP: {DP^ffb}
Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff (two crossed out)
C[1] = SM98^ff
Step 2. V={Y,U,T}
Cand: DP^fff, DP^ffb, DP^fbf, DP^fbb (three crossed out)
C[2] = DP^fbf

HTBP: {}
Step 1. V={Y}
Cand: DP^fff, DP^ffb, SM98^ff
C[1] = SM98^ff, DP^fff

Page 18

Implementation

• Implemented the technique in the EMERAC Information Gatherer
  – A prototype Information Gatherer written in Java
  – Incorporates recursive plan minimization & execution ordering
  – Threaded execution
  – Partial results returned asynchronously
• Experimented with simulated sources derived from DBLP data
  – Our approach tended to reduce the total cost over the bound-is-easier approach whenever there were a significant number of binding patterns that are not subsumed by HTBP

Page 19

LCW vs. Naïve [DBLP Sources]

[Figure: time to plan & execute (in msec, log scale, 1.00E+03 to 1.00E+08) against # redundant constrained sources (1 to 8); series: Naive 256 (1), LCW 256 (1), Naive 256 (3), LCW 256 (3)]

Page 20

Related Work

• Other ordering approaches
  – Access cost minimization (approximating with bounds)
    • [Yerneni & Li, 1999]
    • Does not consider transfer costs
• Other work handling source access capabilities
  – Rule-based representation of source access capabilities
    • GARLIC [Haas et al., 1997]
  – Representing constraints such as “select over A, B or C”
    • [Garcia-Molina et al., 1999]
• Learning source statistics through probe queries
  – [Zhu & Larson, 1996]

Page 21

Summary

• Argued that source call ordering for information gathering plans is significantly different from the traditional join ordering
• Developed an approach for source call ordering that takes both access and transfer costs into account
  – Use of $ and % annotations to represent access limitations
  – Use of HTBP table to indicate binding patterns that produce high traffic
    • Does not require elaborate statistics
    • Subsumes bound-is-easier as a special case
  – Supports bushy joins, thus allowing parallelism and reducing the impact of connection delays

Page 22

Current directions

• HTBP representations with better granularity
• Integrate the plan minimization and execution ordering stages in EMERAC
• More realistic evaluation of the gains of the ordering strategy

