Optimizing Multiple Continuous Queries

Dissertation Defense

Chun Jin

Thesis Committee
• Jaime Carbonell (Chair)
• Christopher Olston, on leave at Yahoo! Research
• Jamie Callan
• Phil Hayes, Vivisimo, Inc.

October 31, 2006, Carnegie Mellon


Emerging Stream Applications

• Intelligence monitoring
• Fraud detection
• Onset epidemic patterns
• Network intrusion detection
• GeoSpatial change detection

• Transactions
• Sensor network readings
• Network traffic data

ARGUS: Toward Collaborative Intelligence Analysis

[Figure labels: Analyst A, Analyst B; Data Streams; Stream Matching; Continuous Queries; Terrorism Alerts; Fraud Alerts; Novelty Detection (New Connections, New Patterns); Ad hoc Query Matching; Ad hoc exploring; New Continuous Queries.]


Challenges
• Large-scale (~10^3) continuous queries
• On fast (10^4-10^5 tuples/day) continuous streams
• With large (~10^6 tuples) historical DBs
  … but computation-sharable and highly selective queries
• Support stream processing for a broad range of queries on existing DB applications
  … but DBMS technologies.


Problems
• Efficiency and scalability
  • Continuous query evaluation
  • Multiple/large-scale queries
• Practicality
  • Utilize DBMS legacy systems to support stream processing on a broad range of queries.


Approaches
• Efficiency and scalability
  • Incremental query evaluation
  • Incremental multiple query optimization (IMQO)
  • Query optimization
• Practicality
  • Built atop DBMSs
  • Use SQL as the query language
• Selection/join queries
• Shows up to hundreds-fold improvement (details coming up)


Challenges to Multiple Query Optimization (MQO)
• MQO is NP-hard! [Sellis90]
• Incremental MQO (IMQO)

[Figure: queries Q1, Q2, …, QK shown as a batch and as arrivals at times t1, t2, …, tK on a time axis starting at 0.]


Performing IMQO

[Figure: query network R, built from Q1, Q2, …, QK, receives a new query QN (SELECT … FROM … WHERE …).]

1. Index R
2. Identify common computations between QN and R
3. Select optimal sharing paths
4. Expand R with new computations
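The four steps above can be sketched in a few lines. This is only an illustration under strong simplifying assumptions, not the ARGUS implementation: the query network is modeled as a dictionary from a canonical computation string to a node name, the class `QueryNetwork` and method `add_query` are hypothetical, and "select optimal sharing paths" is reduced to reusing any exact match.

```python
class QueryNetwork:
    """Toy query network R: an index from canonical computation to node name."""

    def __init__(self):
        self.index = {}      # step 1: index over the computations already in R
        self.next_id = 1

    def add_query(self, computations):
        """Add a new query QN, given as a list of canonical computation keys.

        Returns (shared, created): node names reused from R and node names
        newly added to R.
        """
        shared, created = [], []
        for key in computations:
            node = self.index.get(key)        # step 2: look for a common computation
            if node is not None:
                shared.append(node)           # step 3: share the existing node
            else:
                node = f"n{self.next_id}"     # step 4: expand R with a new node
                self.next_id += 1
                self.index[key] = node
                created.append(node)
        return shared, created


r = QueryNetwork()
first = r.add_query(["Fed.amount > 1000000", "T1.rbank_aba = T2.sbank_aba"])
second = r.add_query(["Fed.amount > 1000000", "T2.amount - T3.amount = 0"])
print(first)
print(second)
```

The second query reuses the selection node created for the first one and adds only the computation that R did not already contain.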


Related Work
• Efficiency and scalability
  • Incremental evaluation: stream operators
    • Join (Rete) [Forgy82] [Urhan et al,00] [Viglas et al,03]
    • Aggregate [Haas et al,99]
  • IMQO: stream processing projects
    • NiagaraCQ, TelegraphCQ [Chen et al,00] [Chandrasekaran et al,03]
    • STREAM, Aurora, Gigascope [Motwani et al,03] [Abadi et al,03] [Cranor et al,03]
• ARGUS [Jin et al,05] [Jin et al,06]
  • Practicality
  • Comprehensive IMQO framework
    • Richer query syntax and semantics
    • Canonicalization
    • More flexible plan structures
    • More general sharing strategies


Thesis Statement

The thesis demonstrates constructively that incremental multiple query optimization, incremental evaluation, and other query optimization techniques provide very significant performance improvements for large-scale continuous queries.

The methods can function atop existing DBMS systems for maximal modularity and direct practical utility.

The methods work well across diverse applications.


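ARGUS Stream Processing

[Figure: system architecture in two parts. ARGUS Query Network Generator: system catalog, IMQO module, single-query optimizer, code assembler, plan instantiator. ARGUS Execution Engine: query network over input streams and data tables. Labeled flows: the analyst registers queries; the generator registers and initializes the query network; result streams return to the analyst.]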


Query Network Generator

[Figure: components of the ARGUS Query Network Generator: ARGUS manager, parser, canonicalizer, index & search interface, query rewriter, incremental multi-query optimizer, single-query optimizer, code assembler, plan instantiator, and system catalog. Input: SQL query. Output: initiation and execution code.]


Query Example

Suppose for every big transaction of type code 1000 or 2000, the analyst wants to check if the money stayed in the bank or left within twenty days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within twenty days of this transaction using an intermediate bank.


The Query in CNF

    SELECT *
    FROM Fed r1, Fed r2, Fed r3
    WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
      AND r1.amount > 1000000
      AND (r2.type_code = 1000 OR r2.type_code = 2000)
      AND r2.amount > 500000
      AND (r3.type_code = 1000 OR r3.type_code = 2000)
      AND r3.amount > 500000
      AND r1.rbank_aba = r2.sbank_aba
      AND r1.benef_account = r2.orig_account
      AND r2.amount > r1.amount / 2
      AND r1.tran_date <= r2.tran_date
      AND r2.tran_date <= r1.tran_date + 20
      AND r2.rbank_aba = r3.sbank_aba
      AND r2.benef_account = r3.orig_account
      AND r2.amount = r3.amount
      AND r2.tran_date <= r3.tran_date
      AND r3.tran_date <= r2.tran_date + 20;

[Figure: the predicates are annotated with the query network nodes that evaluate them: stream F, selection nodes S1 and S2, join nodes J1 and J2.]


Identify Sharable Computations

    SELECT *
    FROM Fed r1, Fed r2, Fed r3
    WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
      AND r1.amount > 1000000
      AND (r2.type_code = 1000 OR r2.type_code = 2000)
      AND r2.amount > 500000
      AND (r3.type_code = 1000 OR r3.type_code = 2000)
      AND r3.amount > 500000
      AND r1.rbank_aba = r2.sbank_aba
      AND r1.benef_account = r2.orig_account
      AND r2.amount * 2 > r1.amount
      AND r1.tran_date <= r2.tran_date
      AND r2.tran_date - 10 <= r1.tran_date
      AND r2.rbank_aba = r3.sbank_aba
      AND r2.benef_account = r3.orig_account
      AND r2.amount = r3.amount
      AND r2.tran_date <= r3.tran_date
      AND r3.tran_date - 10 <= r2.tran_date;

1. Literal predicates
   1. Equivalency
   2. Subsumption
2. OR predicates
3. Predicate sets
4. Topology

Also: sharing strategies, self-join.

[Figure: the new query is compared with the existing network F, S1, S2, J1, J2. Existing predicates shown for comparison: r2.amount > r1.amount/2 and r3.tran_date <= r2.tran_date + 20. Other labels: PJ1; ORp1, ORp2, ORp3, ORp4; J3, J4.]


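Computation Hierarchy

[Figure: hierarchy of nodes S1 and S2, predicate sets PS1 and PS2, OR predicates ORp1, ORp2, ORp4, and literal predicates p11, p12, p2, p4, with edges labeled "sharable" and "subsumption". Example predicates: (Fed.type_code = 1000 OR Fed.type_code = 2000), Fed.amount > 1000000, Fed.amount > 500000.]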

ER Model for Hierarchy

[Figure: ER diagram with entities Literal Pred, OR Pred, PredSet, and Node; relationships Associates, BelongsTo (twice), and IsAChild; attributes pid, ORpid, psetid, type, name, text.]


Problems in Index/Search
• Rich syntax: canonicalization
• Subsumption
  • Literal predicate: subsumption + canonicalization, triple-string canonical form
  • ORPred/PredSet: algorithms
• Self-join + canonicalization: Standard Table Alias (STA) assignment
• Topology: multiple topology indexing

(Details coming up)


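Canonicalization
• Equivalency: both of
    r2.amount > r1.amount / 2
    r2.amount * 2 > r1.amount
  canonicalize to
    r2.amount * 2 - r1.amount > 0
• Subsumption:
    r2.tran_date <= r1.tran_date + 20   canonicalizes to   r2.tran_date - r1.tran_date <= 20
    r2.tran_date - 10 <= r1.tran_date   canonicalizes to   r2.tran_date - r1.tran_date <= 10
  With the same attribute expression and operator, comparing the constants shows that the second predicate (<= 10) is subsumed by the first (<= 20).
• Triple-string canonical form: attribute-expression op constant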
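A minimal sketch of the triple form (attribute-expression, op, constant) for linear predicates follows. The function names `canonicalize` and `implies` are hypothetical, and the normalization rule is an assumption chosen for illustration: all attributes move to the left side, and the alphabetically first attribute is scaled to coefficient +1. The slides do not state their sign convention, so the printed form differs cosmetically from the one shown above, but equivalent predicates still map to the same triple.

```python
from fractions import Fraction

FLIP = {">": "<", "<": ">", ">=": "<=", "<=": ">=", "=": "="}


def canonicalize(lhs, op, rhs, const=0):
    """Canonicalize  sum(lhs) op sum(rhs) + const  into (expr, op, constant).

    lhs and rhs map attribute names to exact numeric coefficients.
    """
    coeffs = {}
    for name, c in lhs.items():
        coeffs[name] = coeffs.get(name, Fraction(0)) + Fraction(c)
    for name, c in rhs.items():
        coeffs[name] = coeffs.get(name, Fraction(0)) - Fraction(c)
    coeffs = {n: c for n, c in coeffs.items() if c != 0}
    if not coeffs:
        raise ValueError("predicate has no attributes")
    # Assumed rule: scale so the alphabetically first attribute has coefficient +1.
    lead = coeffs[min(coeffs)]
    if lead < 0:
        op = FLIP[op]                      # dividing by a negative flips the operator
    expr = tuple(sorted((n, c / lead) for n, c in coeffs.items()))
    return expr, op, Fraction(const) / lead


def implies(p, q):
    """True if canonical predicate p implies q (q subsumes p).

    Conservative: only compares predicates with identical expression and operator.
    """
    (ep, op_p, cp), (eq, op_q, cq) = p, q
    if ep != eq or op_p != op_q:
        return False
    if op_p in (">", ">="):
        return cp >= cq
    if op_p in ("<", "<="):
        return cp <= cq
    return cp == cq


# Equivalency: r2.amount > r1.amount / 2  and  r2.amount * 2 > r1.amount
a = canonicalize({"r2.amount": 1}, ">", {"r1.amount": Fraction(1, 2)})
b = canonicalize({"r2.amount": 2}, ">", {"r1.amount": 1})
print(a == b)

# Subsumption: within 10 days implies within 20 days
d20 = canonicalize({"r2.tran_date": 1}, "<=", {"r1.tran_date": 1}, 20)
d10 = canonicalize({"r2.tran_date": 1}, "<=", {"r1.tran_date": 1}, 10)
print(implies(d10, d20), implies(d20, d10))
```

Exact fractions are used instead of floats so that equivalent predicates compare equal without rounding issues.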


Self-Join
• Canonical forms refer to true table names, which is not good for self-join predicates:
    r1.benef_account = r2.orig_account
    Fed.benef_account = Fed.orig_account
• Use Standard Table Alias (STA):
    T1.benef_account = T2.orig_account
• Enumerate STA assignments to find matches
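Enumerating STA assignments can be sketched as follows. This is an illustrative assumption about how the search might look, not the ARGUS code: predicates are plain strings, the index is a set of canonical strings, and the helper name `sta_candidates` is hypothetical.

```python
import re
from itertools import permutations


def sta_candidates(predicate, aliases):
    """Yield the predicate rewritten under every assignment of the
    standard aliases T1..Tn to the query's own table aliases."""
    stas = [f"T{i + 1}" for i in range(len(aliases))]
    for perm in permutations(stas):
        mapping = dict(zip(aliases, perm))
        # Replace each "alias." prefix; unknown prefixes are left unchanged.
        yield re.sub(
            r"\b(\w+)\.",
            lambda m: mapping.get(m.group(1), m.group(1)) + ".",
            predicate,
        )


# Hypothetical index of canonical self-join predicates already in the network.
index = {"T1.benef_account = T2.orig_account"}

candidates = list(sta_candidates("r2.benef_account = r3.orig_account", ["r2", "r3"]))
matches = [c for c in candidates if c in index]
print(candidates)
print(matches)
```

The number of assignments grows factorially with the number of aliases, which stays manageable for the two- and three-way self-joins in the example query.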


Self-Join in ORPred/PredSet Layers
• OR predicate: (r1.c = 1000 OR r1.a = r2.b)
    (Fed.c = 1000 OR T1.a = T2.b) ?
    (T1.c = 1000 OR T1.a = T2.b) ?
• Add STA when indexing OR predicates
• Similar on predicate sets


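Subsumption at ORPred Layer
• Input: ORPred p ∈ P
• Output: all ORPreds r ∈ R such that p ⇒ r (r subsumes p)
• Algorithm:
  • For each literal ρ ∈ p, find γ ∈ r such that ρ ⇒ γ (γ subsumes ρ).
  • For each r found, count the ρ ∈ p that are subsumed by some γ ∈ r; call this count |I(r)|.
  • If |I(r)| = |p|, then p ⇒ r.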
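The counting algorithm above can be sketched as follows. It is a simplified illustration: literals are `(attribute, op, constant)` tuples, the index is scanned linearly instead of being looked up through a literal predicate index, and the names `subsuming_orpreds` and `literal_implies` are hypothetical. The sketch counts the literals of p that are covered by some literal of r and reports r when every literal of p is covered.

```python
from collections import defaultdict


def literal_implies(rho, gamma):
    """True if literal rho implies literal gamma (only '>' and '=' are handled)."""
    (a1, op1, c1), (a2, op2, c2) = rho, gamma
    if a1 != a2 or op1 != op2:
        return False
    return c1 >= c2 if op1 == ">" else c1 == c2


def subsuming_orpreds(p, index):
    """Return the ids of indexed OR predicates r that subsume p.

    p is a set of literals; index maps an id to a set of literals.
    """
    covered = defaultdict(set)   # r id -> literals of p subsumed by a literal of r
    for rho in p:
        for rid, r in index.items():
            if any(literal_implies(rho, gamma) for gamma in r):
                covered[rid].add(rho)
    # r subsumes p only if every literal of p is covered.
    return sorted(rid for rid, hit in covered.items() if len(hit) == len(p))


index = {
    "ORp1": {("amount", ">", 500000)},
    "ORp2": {("amount", ">", 2000000)},
    "ORp3": {("type_code", "=", 1000), ("type_code", "=", 2000), ("type_code", "=", 3000)},
}
big = subsuming_orpreds({("amount", ">", 1000000)}, index)
types = subsuming_orpreds({("type_code", "=", 1000), ("type_code", "=", 2000)}, index)
print(big)
print(types)
```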


Topological Connections

[Figure: query network fragment with base stream B1, selection nodes S1 to S6, and join nodes J1, J4, J7.]


System Catalog

[Figure: system catalog tables with the column names shown next to them, grouped as extracted.]
• JoinTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent1, DParent2, DPSetID, Distinct
• PredicateIndex: ORPredID, LPredID, LExpr, Op, RExpr, Node1, Node2, STA, UseSTA
• PredicateSetIndex: PSetID, PredID, STA
• SelectionTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent, DPSetID, SVOA, SVOAPSetID, Distinct


Indexing & Searching

[Figure: the predicates of the example query pass through canonicalization, inference & classification. The output list adds r2.amount > 500000 and r3.amount > 500000, and includes canonical forms such as T2.amount * 2 - T1.amount > 0 and T2.tran_date - T1.tran_date <= 10. Common computation searching and computation indexing then use the system catalog: PredicateIndex (PredID, CanonicalForm, …), PredicateSetIndex (PredSetID, PredID, …), and TopologyIndex (Node, PredSetID, …).]


Sharing Strategies

[Figure, five panels over base streams B1, B2, B3 and join nodes J1, J2, J3: (a) query network R; (b-1) joins in Q; (b-2) optimal plan for Q; (c-1) sharing-selection; (c-2) match-plan.]


Evaluation
• Databases:
  • Synthesized FedWire money transfers (Fed, 500000 records)
  • Anonymized medical patient admission records (Med, 835890 records)
• Queries:
  • Seed queries
  • Generate sharable queries from seeds
  • A wide range of queries
• Simulation:
  • Historical data (300000 on Fed, 600000 on Med)
  • Chunks of new data (4000 per chunk, etc.)


Improvement Factors
• DBMS: 1x
• ARGUS: 1-500x
• Incremental evaluation: 1-100x
• Conditional materialization: 1.2-1.8x
• Join order optimization: 1-10x
• Transitivity inference: 1-20x
• Canonicalization: 1-10x
• IMQO: 1-50x


Fed IMQO & Canonicalization

[Figure: two plots against # of queries (0-800) for AllSharing, NonCanon, and NonJoinS: WQNS (0-25000) and execution time in seconds (0-250).]

WQNS: weighted query network size

Test machine: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0


Fed Sharing Strategies

[Figure: two plots against # of queries (0-800) for SharingSel, MatchPlan, and MatchPlan+NCanon: WQNS (0-6000) and execution time in seconds (0-100).]


Summary of Contributions
• Efficiency and scalability
  • Continuous queries: incremental query evaluation
  • Multiple/large-scale queries: incremental multiple query optimization (IMQO)
  • Query optimization
• Practicality
  • Existing DB applications: built atop DBMSs
  • Support a broad range of query syntax and semantics
• Evaluation
  • Shows up to hundreds-fold improvement
  • Works across various domains


Future Work
• Generalization of current work
  • Support multi-way joins
  • More sophisticated sharing strategies: rerouting, restructuring
• Adaptive query processing
  • Adaptive re-optimization: rerouting and restructuring
  • Adaptive rescheduling
• New infrastructure
  • Parallel/distributed processing
  • Automatic tuning: index selection
• Support new data types
  • Text
  • Multimedia


Acknowledgement
• Advisor: Jaime Carbonell
• Committee: Chris Olston, Jamie Callan, and Phil Hayes
• CMU and Dynamix ARGUS team: Jaime Carbonell, Phil Hayes, Santosh Ananthraman, Cenk Gazen, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew, and Aaron Goldstein
• CMU faculty and friends: many …


Thank you!

Questions and comments?


Outline
• Motivation
• System and methods:
  • System architecture
    • Execution engine
    • Query network structures
  • IMQO framework
    • Query network generator
    • Query examples
    • Hierarchy/ER model
    • Problems and solutions
    • System catalog
    • Sharing strategies
• Evaluation
• Conclusion and future work


Adapted Rete Algorithm (Join)
• Join on N and M:
    (N + ΔN) ⋈ (M + ΔM) = N ⋈ M + ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM
• Old results: N ⋈ M
• New incremental results: ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM
• When ΔN and ΔM are very small compared to N and M, the time complexity of the incremental join is O(N+M)
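The join identity can be checked with a small sketch. The functions `join` and `incremental_join` are hypothetical names, relations are plain Python lists, and a nested-loop join is used for clarity; the sketch does not try to reproduce the O(N+M) cost quoted on the slide.

```python
def join(left, right, pred):
    """Nested-loop join: all pairs (l, r) that satisfy pred."""
    return [(l, r) for l in left for r in right if pred(l, r)]


def incremental_join(n_hist, m_hist, dn, dm, pred):
    """New results only: dN join M, plus N join dM, plus dN join dM."""
    return join(dn, m_hist, pred) + join(n_hist, dm, pred) + join(dn, dm, pred)


def same_key(l, r):
    return l == r


n_hist, m_hist = [1, 2], [2, 3]      # historical data
dn, dm = [3], [1, 3]                 # newly arrived tuples

old = join(n_hist, m_hist, same_key)
delta = incremental_join(n_hist, m_hist, dn, dm, same_key)
full = join(n_hist + dn, m_hist + dm, same_key)

# Old results plus incremental results equal the join recomputed from scratch.
print(sorted(old + delta) == sorted(full))
```

Only the three delta terms touch new data, so the old result N ⋈ M never has to be recomputed when a new chunk arrives.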


Incremental Evaluation

[Figure: join node J over inputs N and M, each split into hist and new parts, with deltas ΔN, ΔM, and ΔJ. ΔJ is computed by ΔN ⋈ M, N ⋈ ΔM, and ΔN ⋈ ΔM.]

Join predicates:
    N.rbank_aba = M.sbank_aba
    N.benef_account = M.orig_account
    M.amount > N.amount*0.5
    N.tran_date <= M.tran_date
    M.tran_date <= N.tran_date+20


Incremental Evaluation

[Figure: query network F, S1, S2, J1, J2, where F, S1, S2, and J1 each have hist and temp parts. Selection predicates shown: type_code=1000, amount>500000.]

• Compute S1_temp by selecting from F_temp.
• Compute J1_temp by joining S1_temp and S2_hist, joining S1_hist and S2_temp, and joining S1_temp and S2_temp.

Join predicates:
    r1.rbank_aba = r2.sbank_aba
    r1.benef_account = r2.orig_account
    r2.amount > r1.amount*0.5
    r1.tran_date <= r2.tran_date
    r2.tran_date <= r1.tran_date+20


Code Generation
• Code template for each operator
• Code block for each node
• Sort the code blocks
• Wrap up code blocks in Oracle stored procedures
• Registration and periodic execution


Projection Management

[Figure: query network with base streams B1 and B2, selection nodes S1 and S2, and join node J1.]


Transitivity Inference
• Example: given
    r1.amount > 1000000 and
    r2.amount > r1.amount * 0.5 and
    r3.amount = r2.amount
• We can infer highly selective predicates:
    r2.amount > 500000
    r3.amount > 500000
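The inference in this example can be sketched as bound propagation. This is an illustrative simplification, not the ARGUS inference module: it handles only strict lower bounds, constraints of the form a > k * b with k > 0, and equalities, and it assumes the constraints form no cycles. The function name `infer_lower_bounds` is hypothetical.

```python
from fractions import Fraction


def infer_lower_bounds(bounds, scaled_gt, equalities):
    """Propagate strict lower bounds through the given constraints.

    bounds:     {attr: c}      meaning attr > c
    scaled_gt:  [(a, k, b)]    meaning a > k * b, with k > 0
    equalities: [(a, b)]       meaning a = b
    Returns only the bounds that were newly inferred.
    """
    inferred = dict(bounds)
    # Assumes acyclic constraints, so one pass per constraint (plus one) suffices.
    for _ in range(len(scaled_gt) + len(equalities) + 1):
        candidates = []
        for a, k, b in scaled_gt:
            if b in inferred:
                candidates.append((a, k * inferred[b]))   # a > k*b and b > c give a > k*c
        for a, b in equalities:
            if b in inferred:
                candidates.append((a, inferred[b]))
            if a in inferred:
                candidates.append((b, inferred[a]))
        for attr, c in candidates:
            if attr not in inferred or c > inferred[attr]:
                inferred[attr] = c                        # keep the tightest bound
    return {a: c for a, c in inferred.items() if bounds.get(a) != c}


result = infer_lower_bounds(
    bounds={"r1.amount": 1000000},
    scaled_gt=[("r2.amount", Fraction(1, 2), "r1.amount")],
    equalities=[("r3.amount", "r2.amount")],
)
print(result)
```

The inferred bounds are selection predicates on single tables, so they can be evaluated before the joins and reduce the join input sizes.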


Query Optimizer
• Similar to a traditional enumeration-based query optimizer
• Optimizes:
  • Join order
  • Conditional materialization

[Figure: history-based query optimizer. Components: structure builder, join enumerator, history-based cost estimator. Data structures: active list, join graph. Other labels: SQL query, plan, DB, update system catalog.]


Conditional Materialization

[Figure: two plans over r1 and r2, one labeled unconditional materialization and one labeled conditional materialization.]

• Conditional materialization: choose whether to materialize based on cost estimates


Selection/Join Incremental Evaluation (Fed)

[Figure: execution time in seconds (0-50) for queries Q1 to Q7; series: Rete Data1, DBMS Data1, Rete Data2, DBMS Data2.]

Test machine: HP PC, single-core Pentium(R) 4 CPU, 1.7GHz, 512M RAM, Windows XP, Oracle 10.1.0


Fed Comparing All

[Figure: execution time in seconds (0-250) against # of queries (0-800); series: AllSharing, NonCanon, NonJoinS, MatchPlan, MatchPlan+NCanon.]



Med Comparing All

[Figure: execution time in seconds (0-120) against # of queries (0-600); series: AllSharing, NonCanon, NonJoinS, MatchPlan, MatchPlan+NCanon.]



Med IMQO & Canonicalization

[Figure: two plots against # of queries (0-600) for AllSharing, NonCanon, and NonJoinS: WQNS (0-15000) and execution time in seconds (0-120).]



Med Sharing Strategies

[Figure: two plots against # of queries (0-600) for SharingSel, MatchPlan, and MatchPlan+NCanon: execution time in seconds (0-80) and WQNS (0-8000).]

