Page 1: Optimizing  Multiple Continuous Queries

Optimizing Multiple Continuous Queries

Dissertation Defense

Chun Jin

Thesis Committee
Jaime Carbonell (Chair)
Christopher Olston, on leave at Yahoo! Research
Jamie Callan
Phil Hayes, Vivisimo, Inc.

October 31, 2006, Carnegie Mellon

Page 2: Optimizing  Multiple Continuous Queries

Chun Jin Carnegie Mellon 2

Emerging Stream Applications

• Intelligence monitoring
• Fraud detection
• Onset epidemic patterns
• Network intrusion detection
• GeoSpatial change detection

• Transactions
• Sensor network readings
• Network traffic data

Page 3: Optimizing  Multiple Continuous Queries

ARGUS: Toward Collaborative Intelligence Analysis

[Figure: overview diagram. Labels: Analyst A, Analyst B, Data Streams, Stream Matching, Continuous Queries, Terrorism Alerts, Fraud Alerts, Novelty Detection, New Connections, New Patterns, Ad hoc Query Matching, New Continuous Queries, Ad hoc exploring.]

Page 4: Optimizing  Multiple Continuous Queries

Challenges

• Large-scale (~$10^3$) continuous queries
• On FAST ($10^4$-$10^5$ tuples/day) continuous streams
• With LARGE (~$10^6$ tuples) historical DBs
  … but computation-sharable and highly-selective queries
• Support stream processing for a broad range of queries on existing DB applications
  … but DBMS technologies.

Page 5: Optimizing  Multiple Continuous Queries

Problems

• Efficiency and scalability
  • Continuous query evaluation
  • Multiple/large-scale queries
• Practicality
  • Utilize DBMS legacy systems to support stream processing on a broad range of queries.

Page 6: Optimizing  Multiple Continuous Queries

Approaches

• Efficiency and scalability
  • Incremental query evaluation
  • Incremental multiple query optimization (IMQO)
  • Query optimization
• Practicality
  • Built atop DBMSs
  • Use SQL as the query language
• Shows up to hundreds-fold improvement (details coming up)
• Selection/join queries

Page 7: Optimizing  Multiple Continuous Queries

Challenges to Multiple Query Optimization (MQO)

• MQO is NP-hard! [Sellis90]
• Incremental MQO (IMQO)

[Figure: queries Q1, Q2, …, QK shown first as a batch, then arriving one at a time at t1, t2, …, tK on a time axis starting at 0.]

Page 8: Optimizing  Multiple Continuous Queries

Performing IMQO

[Figure: a new query QN (SELECT … FROM … WHERE …) is added to Query Network R, which already holds Q1, Q2, …, QK.]

1. Index R
2. Identify common computations between QN and R
3. Select optimal sharing paths
4. Expand R with new computations
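The four steps can be sketched as a toy incremental loop. This is only an illustration under strong assumptions: each query is reduced to a list of canonical predicate strings, "sharing" means reusing a node that already evaluates the identical predicate, and the class and method names (`QueryNetwork`, `add_query`) are hypothetical rather than taken from ARGUS.

```python
class QueryNetwork:
    """Toy query network R: one node per distinct canonical predicate."""

    def __init__(self):
        self.index = {}    # canonical predicate -> node id (step 1: index R)
        self.queries = {}  # query name -> node ids the query uses

    def add_query(self, name, predicates):
        shared, created = [], []
        for pred in predicates:
            if pred in self.index:
                # Step 2: a common computation already exists in R.
                shared.append(pred)
            else:
                # Step 4: expand R with a new computation.
                self.index[pred] = len(self.index)
                created.append(pred)
        # Step 3 (choosing among alternative sharing paths) is trivial here,
        # because each predicate matches at most one existing node.
        self.queries[name] = [self.index[p] for p in predicates]
        return shared, created


network = QueryNetwork()
network.add_query("Q1", ["Fed.amount > 1000000", "Fed.type_code = 1000"])
shared, created = network.add_query(
    "Q2", ["Fed.amount > 1000000", "Fed.type_code = 2000"])
print("reused:", shared)
print("added:", created)
```

Each new query only touches the index entries for its own predicates, which is the point of doing the optimization incrementally instead of re-optimizing all K queries together.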

Page 9: Optimizing  Multiple Continuous Queries

Related Work

• Efficiency and scalability:
  • Incremental evaluation: stream operators
    • Join (Rete) [Forgy82] [Urhan et al,00] [Viglas et al,03]
    • Aggregate [Haas et al,99]
  • IMQO: stream processing projects
    • NiagaraCQ, TelegraphCQ [Chen et al,00] [Chandrasekaran et al,03]
    • STREAM, Aurora, Gigascope [Motwani et al,03] [Abadi et al,03] [Cranor et al,03]
• ARGUS [Jin et al,05] [Jin et al,06]
  • Practicality
  • Comprehensive IMQO framework
    • Richer query syntax and semantics
    • Canonicalization
    • More flexible plan structures
    • More general sharing strategies

Page 10: Optimizing  Multiple Continuous Queries

Thesis Statement

• The thesis demonstrates constructively that incremental multiple query optimization, incremental evaluation, and other query optimization techniques provide very significant performance improvements for large-scale continuous queries.
• The methods can function atop existing DBMS systems for maximal modularity and direct practical utility.
• The methods work well across diverse applications.

Page 11: Optimizing  Multiple Continuous Queries

ARGUS Stream Processing

[Figure: system architecture with two parts, the ARGUS Query Network Generator and the ARGUS Execution Engine. Component labels: System Catalog, IMQO Module, Single Query Optimizer, Code Assembler, Plan Instantiator, Query Network, Data Tables. Flow labels: Analyst, Register queries, Register & initialize query network, Input Streams, Result streams.]

Page 12: Optimizing  Multiple Continuous Queries

Query Network Generator

[Figure: components of the ARGUS Query Network Generator. Labels: ARGUS Manager, Parser, Canonicalizer, Index & Search Interface, Query Rewriter, Incremental Multi-Query Optimizer, Single-Query Optimizer, Code Assembler, Plan Instantiator, System Catalog. Input: SQL Query. Output: Initiation and execution code.]

Page 13: Optimizing  Multiple Continuous Queries

Query Example

Suppose for every big transaction of type code 1000 or 2000, the analyst wants to check if the money stayed in the bank or left within twenty days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within twenty days of this transaction using an intermediate bank.

Page 14: Optimizing  Multiple Continuous Queries

The Query in CNF

SELECT *
FROM Fed r1, Fed r2, Fed r3
WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
  AND r1.amount > 1000000
  AND (r2.type_code = 1000 OR r2.type_code = 2000)
  AND r2.amount > 500000
  AND (r3.type_code = 1000 OR r3.type_code = 2000)
  AND r3.amount > 500000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > r1.amount / 2
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 20
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 20;

[Figure: query network with nodes F, S1, S2, J1, J2; groups of predicates in the WHERE clause are labeled S1, S2, J1, and J2.]

Page 15: Optimizing  Multiple Continuous Queries

Identify Sharable Computations

SELECT *
FROM Fed r1, Fed r2, Fed r3
WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
  AND r1.amount > 1000000
  AND (r2.type_code = 1000 OR r2.type_code = 2000)
  AND r2.amount > 500000
  AND (r3.type_code = 1000 OR r3.type_code = 2000)
  AND r3.amount > 500000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount * 2 > r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date - 10 <= r1.tran_date
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date - 10 <= r2.tran_date;

[Figure: query network F, S1, S2, J1, J2 with additional nodes J3 and J4. Callouts show the earlier predicate forms r2.amount > r1.amount/2 and r3.tran_date <= r2.tran_date + 20. Other labels: PJ1, ORp1, ORp2, ORp3, ORp4.]

1. Literal predicates
   1. Equivalency
   2. Subsumption
2. OR predicates
3. Predicate sets
4. Topology

• Sharing strategies
• Self-join

Page 16: Optimizing  Multiple Continuous Queries

Computation Hierarchy

[Figure: hierarchy of computations. Nodes: S1, S2. Predicate sets: PS1, PS2. OR predicates: ORp1, ORp2, ORp4. Literal predicates: p11, p12, p2, p4. Example predicates: Fed.type_code = 1000 OR Fed.type_code = 2000; Fed.amount > 1000000; Fed.amount > 500000. Edges are labeled "subsumption" and "sharable".]

Page 17: Optimizing  Multiple Continuous Queries

ER Model for Hierarchy

[Figure: ER diagram. Entities: Literal Pred, OR Pred, PredSet, Node. Relationships: Associates, BelongsTo (twice), IsAChild. Attributes: pid, ORpid, psetid, type, name, text.]

Page 18: Optimizing  Multiple Continuous Queries

Problems in Index/Search

• Rich syntax: canonicalization
• Subsumption
  • Literal predicate: subsumption + canonicalization, triple-string canonical form
  • ORPred/PredSet: algorithms
• Self-join + canonicalization: Standard Table Alias (STA) assignment
• Topology: multiple topology indexing

(Details coming up)

Page 19: Optimizing  Multiple Continuous Queries

Canonicalization

• Equivalency:
    r2.amount > r1.amount / 2
    r2.amount * 2 > r1.amount
    r2.amount * 2 - r1.amount > 0
• Subsumption:
    r2.tran_date <= r1.tran_date + 20
    r2.tran_date - r1.tran_date <= 20
    r2.tran_date - 10 <= r1.tran_date
    r2.tran_date - r1.tran_date <= 10
• Triple-string canonical form: attribute-expression op constant
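A minimal sketch of how a triple (attribute-expression, op, constant) might be derived for linear predicates, and how it supports both equivalence and subsumption checks. The input representation (dictionaries of coefficients) and the function names `canonicalize` and `subsumes` are assumptions made for this example; the normalization rule (scale so the first attribute in sorted order has coefficient ±1) is one possible choice, not necessarily the one ARGUS uses.

```python
from fractions import Fraction


def canonicalize(lhs, op, rhs, const=0):
    """Rewrite  sum(lhs) op sum(rhs) + const  as (expression, op, constant).

    lhs and rhs map attribute names to numeric coefficients.
    """
    coeffs = {}
    for attr, c in lhs.items():
        coeffs[attr] = coeffs.get(attr, 0) + Fraction(c)
    for attr, c in rhs.items():
        coeffs[attr] = coeffs.get(attr, 0) - Fraction(c)
    coeffs = {a: c for a, c in coeffs.items() if c != 0}
    # Divide by a positive factor so the first attribute (sorted order) has
    # coefficient +1 or -1; a positive factor leaves the operator unchanged.
    scale = abs(coeffs[min(coeffs)])
    expr = tuple(sorted((a, c / scale) for a, c in coeffs.items()))
    return expr, op, Fraction(const) / scale


def subsumes(general, specific):
    """True if every row satisfying `specific` also satisfies `general`."""
    (e1, op1, c1), (e2, op2, c2) = general, specific
    if e1 != e2 or op1 != op2:
        return False
    if op1 in ("<", "<="):
        return c2 <= c1
    if op1 in (">", ">="):
        return c2 >= c1
    return c1 == c2


# Equivalency: three spellings of the same predicate.
a = canonicalize({"r2.amount": 1}, ">", {"r1.amount": Fraction(1, 2)})
b = canonicalize({"r2.amount": 2}, ">", {"r1.amount": 1})
c = canonicalize({"r2.amount": 2, "r1.amount": -1}, ">", {})

# Subsumption: the 20-day window subsumes the 10-day window.
within20 = canonicalize({"r2.tran_date": 1}, "<=", {"r1.tran_date": 1}, 20)
within10 = canonicalize({"r2.tran_date": 1}, "<=", {"r1.tran_date": 1}, 10)
print(a == b == c, subsumes(within20, within10))
```

Once the expression part is identical, subsumption reduces to comparing the two constants, which is why a canonical form with the constant isolated on one side is convenient for indexing.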

Page 20: Optimizing  Multiple Continuous Queries

Self-Join

• Canonical forms refer to true table names. Not good for self-join predicates:
    r1.benef_account = r2.orig_account
    Fed.benef_account = Fed.orig_account
• Use Standard Table Alias (STA):
    T1.benef_account = T2.orig_account
• Enumerate STA assignments to find matches
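Enumerating STA assignments can be sketched as trying every mapping from a query's own aliases to T1, T2, … and looking each rewritten predicate up in the index. The helper `sta_forms`, the regular expression, and the set standing in for the predicate index are assumptions made for this example.

```python
import re
from itertools import permutations


def sta_forms(predicate, aliases):
    """Yield the predicate rewritten under every assignment of the query's
    table aliases to the standard aliases T1, T2, ..."""
    standard = [f"T{i + 1}" for i in range(len(aliases))]
    for perm in permutations(standard):
        mapping = dict(zip(aliases, perm))
        yield re.sub(
            r"\b(\w+)\.",
            lambda m: mapping.get(m.group(1), m.group(1)) + ".",
            predicate,
        )


# A stand-in for the predicate index, holding one self-join predicate.
indexed = {"T1.benef_account = T2.orig_account"}

# A new query uses aliases r2 and r3 for the same self-join condition.
candidates = list(sta_forms("r2.benef_account = r3.orig_account", ["r2", "r3"]))
matches = [form for form in candidates if form in indexed]
print(matches)
```

The number of assignments grows factorially with the number of aliases, which is tolerable for the two- and three-way self-joins shown in the slides.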

Page 21: Optimizing  Multiple Continuous Queries

Self-Join in ORPred/PredSet Layers

• OR Predicate: (r1.c=1000 OR r1.a=r2.b)
    (Fed.c=1000 OR T1.a=T2.b) ?
    (T1.c=1000 OR T1.a=T2.b) ?
• Add STA when indexing OR Predicates
• Similar on Predicate Sets

Page 22: Optimizing  Multiple Continuous Queries

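Subsumption at ORPred Layer

• Input: ORPred $p$
• Output: all ORPreds $r \in R$ such that $p \Rightarrow r$ (that is, $r$ subsumes $p$)
• Algorithm:
  • For each $\rho \in p$, find $\gamma \in r$ such that $\rho \Rightarrow \gamma$
  • For each $r$ found, count the # of $\gamma$ that subsume $\rho$, $|I(r)|$
  • If $|I(r)| = |p|$, then $p \Rightarrow r$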
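The algorithm can be sketched as follows, assuming an ORPred is a list of literal predicates, each given as an (attribute, operator, constant) triple. The sketch scans every ORPred in R instead of using an index, and it counts the literals of p that are covered by some literal of r, which is the condition that |I(r)| = |p| expresses. The function names are hypothetical.

```python
def literal_subsumes(gamma, rho):
    """gamma subsumes rho: same attribute and operator, looser bound."""
    (attr_g, op_g, c_g), (attr_r, op_r, c_r) = gamma, rho
    if attr_g != attr_r or op_g != op_r:
        return False
    if op_g == ">":
        return c_r >= c_g
    if op_g == "<":
        return c_r <= c_g
    return c_r == c_g  # "=" and any other operator: exact match only


def subsuming_orpreds(p, R):
    """Return the names of the ORPreds r in R that subsume ORPred p."""
    result = []
    for name, r in R.items():
        covered = sum(
            1 for rho in p if any(literal_subsumes(gamma, rho) for gamma in r)
        )
        if covered == len(p):  # every disjunct of p implies a disjunct of r
            result.append(name)
    return result


R = {
    "ORp1": [("Fed.type_code", "=", 1000),
             ("Fed.type_code", "=", 2000),
             ("Fed.amount", ">", 1000000)],
    "ORp2": [("Fed.type_code", "=", 1000)],
}
p = [("Fed.type_code", "=", 1000), ("Fed.amount", ">", 2000000)]
found = subsuming_orpreds(p, R)
print(found)
```

ORp2 is rejected because it covers only one of the two disjuncts of p, so a row could satisfy p through the amount condition without satisfying ORp2.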

Page 23: Optimizing  Multiple Continuous Queries

Topological Connections

[Figure: query network fragment with nodes B1, S1, S2, S3, S4, S5, S6, J1, J4, J7.]

Page 24: Optimizing  Multiple Continuous Queries

System Catalog

• JoinTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent1, DParent2, DPSetID, Distinct
• PredicateIndex: ORPredID, LPredID, LExpr, Op, RExpr, Node1, Node2, STA, UseSTA
• PredicateSetIndex: PSetID, PredID, STA
• SelectionTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent, DPSetID, SVOA, SVOAPSetID, Distinct

Page 25: Optimizing  Multiple Continuous Queries

Indexing & Searching

[Figure: processing flow over the System Catalog. Stage labels: Canonicalization, Inference & Classification, Common Computation Searching, Computation Indexing.]

First predicate list:
    r1.type_code = 1000
    r1.amount > 1000000
    r2.type_code = 1000
    r3.type_code = 1000
    r1.rbank_aba = r2.sbank_aba
    r1.benef_account = r2.orig_account
    r2.amount * 2 > r1.amount
    r1.tran_date <= r2.tran_date
    r2.tran_date - 10 <= r1.tran_date
    r2.rbank_aba = r3.sbank_aba
    r2.benef_account = r3.orig_account
    r2.amount = r3.amount
    r2.tran_date <= r3.tran_date
    r3.tran_date - 10 <= r2.tran_date

Second predicate list: the same predicates plus r2.amount > 500000 and r3.amount > 500000.

Canonical forms shown:
    T2.amount * 2 - T1.amount > 0
    T2.tran_date - T1.tran_date <= 10

System Catalog:
• PredicateIndex: PredID, CanonicalForm, …
• PredicateSetIndex: PredSetID, PredID, …
• TopologyIndex: Node, PredSetID, …

Page 26: Optimizing  Multiple Continuous Queries

Sharing Strategies

[Figure panels: (a) Query network R; (b-1) Joins in Q; (b-2) Optimal plan for Q; (c-1) Sharing-selection; (c-2) Match-plan. Nodes shown: B1, B2, B3, J1, J2, J3.]

Page 27: Optimizing  Multiple Continuous Queries

Evaluation

• Databases:
  • Synthesized FedWire money transfers (Fed, 500000 records)
  • Anonymized medical patient admission records (Med, 835890 records)
• Queries:
  • Seed queries
  • Generate sharable queries from seeds
  • A wide range of queries
• Simulation:
  • Historical data (300000 on Fed, 600000 on Med)
  • Chunks of new data (4000 per chunk, etc.)

Page 28: Optimizing  Multiple Continuous Queries

Improvement Factors

• DBMS: 1x
• ARGUS: 1-500x
• Incremental Evaluation: 1-100x
• Conditional Materialization: 1.2-1.8x
• Join Order Optimization: 1-10x
• Transitivity Inference: 1-20x
• Canonicalization: 1-10x
• IMQO: 1-50x

Page 29: Optimizing  Multiple Continuous Queries

Fed IMQO & Canonicalization

[Figure: two plots against # of queries (0 to 800). One shows WQNS (0 to 25000), the other execution time in seconds (0 to 250). Series: AllSharing, NonCanon, NonJoinS.]

WQNS: weighted query network size

HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Page 30: Optimizing  Multiple Continuous Queries

Fed Sharing Strategies

[Figure: two plots against # of queries (0 to 800). One shows WQNS (0 to 6000), the other execution time in seconds (0 to 100). Series: SharingSel, MatchPlan, MatchPlan+NCanon.]

HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Page 31: Optimizing  Multiple Continuous Queries

Summary of Contributions

• Efficiency and scalability
  • Continuous queries: incremental query evaluation
  • Multiple/large-scale queries: incremental multiple query optimization (IMQO)
  • Query optimization
• Practicality
  • Existing DB applications: built atop DBMSs
  • Support a broad range of query syntax and semantics
• Evaluation
  • Shows up to hundreds-fold improvement
  • Works across various domains

Page 32: Optimizing  Multiple Continuous Queries

Future Work

• Generalization of current work
  • Support multi-way joins
  • More sophisticated sharing strategies: rerouting, restructuring
• Adaptive query processing
  • Adaptive re-optimization: rerouting and restructuring
  • Adaptive rescheduling
• New infrastructure
  • Parallel/distributed processing
  • Automatic tuning: index selection
• Support new data types
  • Text
  • Multimedia

Page 33: Optimizing  Multiple Continuous Queries

Acknowledgement

• Advisor: Jaime Carbonell.
• Committee: Chris Olston, Jamie Callan, and Phil Hayes
• CMU and Dynamix ARGUS team: Jaime Carbonell, Phil Hayes, Santosh Ananthraman, Cenk Gazen, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew, and Aaron Goldstein.
• CMU faculty and friends: many …

Page 34: Optimizing  Multiple Continuous Queries

Thank you!

Questions and comments?

Page 35: Optimizing  Multiple Continuous Queries

Outline

• Motivation
• System and methods:
  • System architecture
    • Execution engine
    • Query network structures
  • IMQO framework
    • Query network generator
    • Query examples
    • Hierarchy/ER Model
    • Problems and solutions
    • System catalog
    • Sharing strategies
• Evaluation
• Conclusion and future work

Page 36: Optimizing  Multiple Continuous Queries

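Adapted Rete Algorithm (Join)

• Join on N and M:
  $(N + \Delta N) \bowtie (M + \Delta M) = N \bowtie M + \Delta N \bowtie M + N \bowtie \Delta M + \Delta N \bowtie \Delta M$
• $N \bowtie M$ is the old results; the remaining three terms are the new incremental results.
• When $\Delta N$ and $\Delta M$ are very small compared to N and M, the time complexity of the incremental join is O(N+M).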
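The join identity can be checked on a small example. The sketch below uses a plain nested-loop equi-join over Python lists; the row layout and the `join` helper are assumptions made for illustration and say nothing about how the DBMS evaluates the join.

```python
def join(left, right):
    """Equi-join two lists of (key, payload) rows on key."""
    return [(k1, a, b) for k1, a in left for k2, b in right if k1 == k2]


# Historical rows and newly arrived rows for each input.
N, dN = [(1, "n1"), (2, "n2")], [(3, "n3")]
M, dM = [(1, "m1"), (3, "m3")], [(2, "m2"), (3, "m4")]

old = join(N, M)                                  # already materialized
delta = join(dN, M) + join(N, dM) + join(dN, dM)  # new incremental results
full = join(N + dN, M + dM)                       # recomputed from scratch

print(sorted(old + delta) == sorted(full))
```

Only the three delta terms are evaluated when new data arrives; the historical join is never recomputed.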

Page 37: Optimizing  Multiple Continuous Queries

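Incremental Evaluation

[Figure: inputs N and M, each split into a hist part and a new part ($\Delta N$, $\Delta M$), feed join J, which is also split into hist and new ($\Delta J$).]

Compute $\Delta J$ by $\Delta N \bowtie M$, $N \bowtie \Delta M$, and $\Delta N \bowtie \Delta M$.

Join predicates:
    N.rbank_aba = M.sbank_aba
    N.benef_account = M.orig_account
    M.amount > N.amount*0.5
    N.tran_date <= M.tran_date
    M.tran_date <= N.tran_date+20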

Page 38: Optimizing  Multiple Continuous Queries

Incremental Evaluation

[Figure: query network F, S1, S2, J1, J2; F, S1, S2, and J1 each keep a hist part and a temp part.]

• Compute S1_temp by selecting from F_temp
• Compute J1_temp by joining S1_temp and S2_hist, joining S1_hist and S2_temp, and joining S1_temp and S2_temp

Selection predicates:
    type_code=1000
    amount>500000

Join predicates:
    r1.rbank_aba = r2.sbank_aba
    r1.benef_account = r2.orig_account
    r2.amount > r1.amount*0.5
    r1.tran_date <= r2.tran_date
    r2.tran_date <= r1.tran_date+20
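The hist/temp scheme for J1 can be sketched with three INSERT … SELECT statements. This example swaps in SQLite (through Python's standard `sqlite3` module) for the Oracle setup used in the slides, and the table layout, column names, and sample rows are all assumptions made for illustration; only the join step is shown.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE S1_hist(acct TEXT, amount INT);
CREATE TABLE S1_temp(acct TEXT, amount INT);
CREATE TABLE S2_hist(acct TEXT, amount INT);
CREATE TABLE S2_temp(acct TEXT, amount INT);
CREATE TABLE J1_temp(acct TEXT, a1 INT, a2 INT);
INSERT INTO S1_hist VALUES ('A', 1000), ('B', 800);
INSERT INTO S2_hist VALUES ('A', 600);
INSERT INTO S1_temp VALUES ('C', 900);
INSERT INTO S2_temp VALUES ('B', 500), ('C', 700);
""")

# One template, instantiated for each of the three delta joins.
JOIN = """
INSERT INTO J1_temp
SELECT l.acct, l.amount, r.amount
FROM {left} l JOIN {right} r
  ON l.acct = r.acct AND r.amount > l.amount * 0.5
"""
for left, right in [("S1_temp", "S2_hist"),
                    ("S1_hist", "S2_temp"),
                    ("S1_temp", "S2_temp")]:
    db.execute(JOIN.format(left=left, right=right))

rows = sorted(db.execute("SELECT * FROM J1_temp"))
print(rows)
```

The pair S1_hist and S2_hist is deliberately never joined: its result (account A here) belongs to J1_hist from an earlier cycle.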

Page 39: Optimizing  Multiple Continuous Queries

Code Generation

• Code template for each operator
• Code block for each node
• Sort the code blocks
• Wrap up code blocks in Oracle stored procedures
• Register and execute periodically
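A rough sketch of the first three steps: fill a per-operator template for each node, then order the blocks so that every node is computed after its parents. The templates, node names, and simplified SQL text are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), which is a choice made for this example and not something the slides specify.

```python
from graphlib import TopologicalSorter

# Hypothetical, heavily simplified code templates, one per operator.
TEMPLATES = {
    "select": "INSERT INTO {node}_temp SELECT * FROM {parent}_temp WHERE {pred};",
    "join": ("INSERT INTO {node}_temp SELECT * FROM {left}_temp l, {right}_temp r "
             "WHERE {pred};"),
}

# Query network nodes, listed in arbitrary order. F is the input stream.
nodes = {
    "J1": {"op": "join", "parents": ["S1", "S2"], "pred": "l.acct = r.acct"},
    "S1": {"op": "select", "parents": ["F"], "pred": "amount > 1000000"},
    "S2": {"op": "select", "parents": ["F"], "pred": "amount > 500000"},
}


def code_block(name, spec):
    """Instantiate the operator's template for one node."""
    if spec["op"] == "select":
        return TEMPLATES["select"].format(
            node=name, parent=spec["parents"][0], pred=spec["pred"])
    left, right = spec["parents"]
    return TEMPLATES["join"].format(
        node=name, left=left, right=right, pred=spec["pred"])


# Map each node to its predecessors and emit blocks parents-first.
graph = {name: spec["parents"] for name, spec in nodes.items()}
order = [n for n in TopologicalSorter(graph).static_order() if n in nodes]
procedure = "\n".join(code_block(n, nodes[n]) for n in order)
print(procedure)
```

The resulting text is what would be wrapped in a stored procedure and scheduled; sorting matters because a join block must read temp tables that its parent blocks have already filled.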

Page 40: Optimizing  Multiple Continuous Queries

Projection Management

[Figure: query network fragment with nodes B1, B2, S1, S2, J1.]

Page 41: Optimizing  Multiple Continuous Queries

Transitivity Inference Example

• Given:
    r1.amount > 1000000 and
    r2.amount > r1.amount * 0.5 and
    r3.amount = r2.amount
• We can infer highly-selective predicates:
    r2.amount > 500000
    r3.amount > 500000
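The inference in this example can be sketched as propagating strict lower bounds until nothing changes. The function name and the three input shapes are assumptions for illustration; the sketch handles only predicates of the forms a > value, a > b * k with k > 0, and a = b, and it assumes the scaled predicates contain no cycles that would raise a bound indefinitely.

```python
def infer_lower_bounds(bounds, scaled, equal):
    """Propagate strict lower bounds.

    bounds: attr -> value, meaning attr > value
    scaled: list of (a, b, k), meaning a > b * k with k > 0
    equal:  list of (a, b), meaning a = b
    Returns all lower bounds, including the inferred ones.
    """
    result = dict(bounds)
    changed = True
    while changed:
        changed = False
        candidates = [(a, result[b] * k) for a, b, k in scaled if b in result]
        for a, b in equal:
            if b in result:
                candidates.append((a, result[b]))
            if a in result:
                candidates.append((b, result[a]))
        for attr, value in candidates:
            if attr not in result or value > result[attr]:
                result[attr] = value
                changed = True
    return result


inferred = infer_lower_bounds(
    bounds={"r1.amount": 1000000},
    scaled=[("r2.amount", "r1.amount", 0.5)],
    equal=[("r3.amount", "r2.amount")],
)
print(inferred)
```

The inferred bounds let the selections on r2 and r3 run before the joins, which shrinks the join inputs.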

Page 42: Optimizing  Multiple Continuous Queries

Query Optimizer

• Similar to traditional enumeration-based query optimizer
• Optimize
  • Join order
  • Conditional materialization

[Figure: History-based Query Optimizer. Labels: SQL Query, Structure Builder, Join Graph, Active List, Join Enumerator, History-based Cost Estimator, DB, Plan, Update System Catalog.]

Page 43: Optimizing  Multiple Continuous Queries

Conditional Materialization

[Figure: two plan diagrams over r1 and r2, one labeled Unconditional Materialization and one labeled Conditional Materialization.]

Conditional Materialization: Choose materialization or not based on cost estimates

Page 44: Optimizing  Multiple Continuous Queries

Selection/Join Incremental Evaluation (Fed)

[Figure: execution time in seconds (0 to 50) for queries Q1 to Q7. Series: Rete Data1, DBMS Data1, Rete Data2, DBMS Data2.]

HP PC, Single core Pentium(R) 4 CPU, 1.7GHz, 512M RAM, Windows XP, Oracle 10.1.0

Page 45: Optimizing  Multiple Continuous Queries

Fed Comparing All

[Figure: execution time in seconds (0 to 250) against # of queries (0 to 800). Series: AllSharing, NonCanon, NonJoinS, MatchPlan, MatchPlan+NCanon.]

HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Page 46: Optimizing  Multiple Continuous Queries

Med Comparing All

[Figure: execution time in seconds (0 to 120) against # of queries (0 to 600). Series: AllSharing, NonCanon, NonJoinS, MatchPlan, MatchPlan+NCanon.]

HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Page 47: Optimizing  Multiple Continuous Queries

Med IMQO & Canonicalization

[Figure: two plots against # of queries (0 to 600). One shows WQNS (0 to 15000), the other execution time in seconds (0 to 120). Series: AllSharing, NonCanon, NonJoinS.]

HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

Page 48: Optimizing  Multiple Continuous Queries

Med Sharing Strategies

[Figure: two plots against # of queries (0 to 600). One shows execution time in seconds (0 to 80), the other WQNS (0 to 8000). Series: SharingSel, MatchPlan, MatchPlan+NCanon.]

HP PC, Single core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0

