Optimizing Multiple Continuous Queries
Dissertation Defense
Chun Jin
Thesis Committee
Jaime Carbonell (Chair)
Christopher Olston, on leave at Yahoo! Research
Jamie Callan
Phil Hayes, Vivisimo, Inc.
October 31, 2006, Carnegie Mellon
Chun Jin Carnegie Mellon 2
Emerging Stream Applications
• Intelligence monitoring
• Fraud detection
• Onset of epidemic patterns
• Network intrusion detection
• Geospatial change detection

• Transactions
• Sensor network readings
• Network traffic data
ARGUS: Toward Collaborative Intelligence Analysis
[Figure: ARGUS overview. Labels: Analyst A, Analyst B; Data Streams; Stream Matching; Continuous Queries; Terrorism Alerts; Fraud Alerts; Novelty Detection; New Connections; New Patterns; Ad hoc Query Matching; Ad hoc exploring; New Continuous Queries.]
Challenges
• Large-scale (~10^3) continuous queries
• On fast (10^4-10^5 tuples/day) continuous streams
• With large (~10^6 tuples) historical DBs
  … but computation-sharable and highly selective queries
• Support stream processing for a broad range of queries on existing DB applications
  … but DBMS technologies
Problems
• Efficiency and scalability
  • Continuous query evaluation
  • Multiple/large-scale queries
• Practicality
  • Utilize DBMS legacy systems to support stream processing on a broad range of queries
Approaches
• Efficiency and scalability
  • Incremental query evaluation
  • Incremental multiple query optimization (IMQO)
  • Query optimization
• Practicality
  • Built atop DBMSs
  • Use SQL as the query language
• Shows up to hundreds-fold improvement (details coming up)
• Selection/join queries
Challenges to Multiple Query Optimization (MQO)
• MQO is NP-hard! [Sellis90]
• Incremental MQO (IMQO)
[Figure: timeline starting at 0, with queries Q1, Q2, …, QK arriving at times t1, t2, …, tK.]
Performing IMQO
[Figure: a new query QN (SELECT … FROM … WHERE …) is added to the query network R, which holds Q1, Q2, …, QK.]
1. Index R
2. Identify common computations between QN and R
3. Select optimal sharing paths
4. Expand R with new computations
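The four steps can be illustrated with a toy query network. This is a minimal sketch under strong assumptions, not the ARGUS implementation: the class `QueryNetwork`, its `add_query` method, and the rule that a query is an ordered chain of predicate strings shared by longest common prefix are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class QueryNetwork:
    """Toy query network R: each node evaluates one predicate on its parent."""
    # Step 1 (index R): (parent node id, predicate text) -> node id.
    index: dict = field(default_factory=dict)
    next_id: int = 1  # node 0 stands for the input stream

    def add_query(self, predicates):
        """Add a query given as an ordered list of predicate strings.

        Returns (result_node, number_of_nodes_created).
        """
        node, created = 0, 0
        for pred in predicates:
            key = (node, pred)
            if key in self.index:
                # Steps 2-3: an existing node already computes this predicate
                # on the same input, so the new query shares it.
                node = self.index[key]
            else:
                # Step 4: expand R with a node for the unshared computation.
                self.index[key] = self.next_id
                node = self.next_id
                self.next_id += 1
                created += 1
        return node, created


network = QueryNetwork()
print(network.add_query(["amount > 1000000", "type_code = 1000"]))  # (2, 2)
print(network.add_query(["amount > 1000000", "type_code = 2000"]))  # (3, 1)
```

The second query creates only one node because its first predicate is already evaluated in R; the greedy prefix match stands in for the "optimal sharing path" selection, which in general has to compare alternatives.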
Related Work
• Efficiency and scalability:
  • Incremental evaluation: stream operators
    • Join (Rete) [Forgy82] [Urhan et al,00] [Viglas et al,03]
    • Aggregate [Haas et al,99]
  • IMQO: stream processing projects
    • NiagaraCQ, TelegraphCQ [Chen et al,00] [Chandrasekaran et al,03]
    • STREAM, Aurora, Gigascope [Motwani et al,03] [Abadi et al,03] [Cranor et al,03]
• ARGUS [Jin et al,05] [Jin et al,06]
  • Practicality
  • Comprehensive IMQO framework
    • Richer query syntax and semantics
    • Canonicalization
    • More flexible plan structures
    • More general sharing strategies
Thesis Statement
• The thesis demonstrates constructively that incremental multiple query optimization, incremental evaluation, and other query optimization techniques provide very significant performance improvements for large-scale continuous queries.
• The methods can function atop existing DBMS systems for maximal modularity and direct practical utility.
• The methods work well across diverse applications.
ARGUS Stream Processing
[Figure: system architecture. The analyst registers queries with the ARGUS Query Network Generator (System Catalog, IMQO Module, Single Query Optimizer, Code Assembler, Plan Instantiator), which registers and initializes the query network. The ARGUS Execution Engine holds the query network, reads the input streams and data tables, and returns result streams.]
Query Network Generator
[Figure: components of the ARGUS Query Network Generator. Labels: SQL Query; ARGUS Manager; Parser; Canonicalizer; Index & Search Interface; Query Rewriter; Incremental Multi-Query Optimizer; Single-Query Optimizer; Code Assembler; Plan Instantiator; System Catalog; Initiation and execution code.]
Query Example
Suppose that for every big transaction of type code 1000 or 2000, the analyst wants to check whether the money stayed in the bank or left within twenty days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further within twenty days of this transaction using an intermediate bank.
The Query in CNF
SELECT *
FROM Fed r1, Fed r2, Fed r3
WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
  AND r1.amount > 1000000
  AND (r2.type_code = 1000 OR r2.type_code = 2000)
  AND r2.amount > 500000
  AND (r3.type_code = 1000 OR r3.type_code = 2000)
  AND r3.amount > 500000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount > r1.amount / 2
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date <= r1.tran_date + 20
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date <= r2.tran_date + 20;
[Figure: query network with nodes F, S1, S2, J1, J2. Labels S1, S1, S2, J1, J2 annotate groups of predicates in the WHERE clause.]
Identify Sharable Computations
SELECT *
FROM Fed r1, Fed r2, Fed r3
WHERE (r1.type_code = 1000 OR r1.type_code = 2000)
  AND r1.amount > 1000000
  AND (r2.type_code = 1000 OR r2.type_code = 2000)
  AND r2.amount > 500000
  AND (r3.type_code = 1000 OR r3.type_code = 2000)
  AND r3.amount > 500000
  AND r1.rbank_aba = r2.sbank_aba
  AND r1.benef_account = r2.orig_account
  AND r2.amount * 2 > r1.amount
  AND r1.tran_date <= r2.tran_date
  AND r2.tran_date - 10 <= r1.tran_date
  AND r2.rbank_aba = r3.sbank_aba
  AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date
  AND r3.tran_date - 10 <= r2.tran_date;
1. Literal predicates
   1. Equivalency
   2. Subsumption
2. OR predicates
3. Predicate sets
4. Topology
Sharing strategies
Self-join
[Figure: the query annotated against the existing network F, S1, S2, J1, J2. Labels: ORp1, ORp2, ORp3, ORp4, PJ1, J3, J4. Callouts: r2.amount > r1.amount/2; r3.tran_date <= r2.tran_date + 20.]
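Computation Hierarchy
[Figure: computation hierarchy with nodes S1 and S2, predicate sets PS1 and PS2, OR predicates ORp1, ORp2, ORp4, and literal predicates p11, p12, p2, p4. Edges are labeled "sharable" and "subsumption". Predicates shown: Fed.type_code = 1000 OR Fed.type_code = 2000; Fed.amount > 1000000; Fed.amount > 500000.]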
ER Model for Hierarchy
[Figure: ER diagram. Entities: Literal Pred, OR Pred, PredSet, Node. Relationships: Associates, BelongsTo (twice), IsAChild. Attributes shown: pid, ORpid, psetid, type, name, text.]
Problems in Index/Search
• Rich syntax: canonicalization, subsumption
• Literal predicate: subsumption + canonicalization (triple-string canonical form)
• ORPred/PredSet: algorithms
• Self-join + canonicalization: Standard Table Alias (STA) assignment
• Topology: multiple topology indexing
(Details coming up)
Canonicalization
• Equivalency: the following are the same predicate
    r2.amount > r1.amount / 2
    r2.amount * 2 > r1.amount
    r2.amount * 2 - r1.amount > 0
• Subsumption: the 20-day predicate subsumes the 10-day predicate
    r2.tran_date <= r1.tran_date + 20  becomes  r2.tran_date - r1.tran_date <= 20
    r2.tran_date - 10 <= r1.tran_date  becomes  r2.tran_date - r1.tran_date <= 10
• Triple-string canonical form: attribute-expression op constant
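A minimal sketch of reducing linear predicates to an (attribute-expression, op, constant) triple follows. The function `canonicalize`, the dictionary encoding of each side, and the normalization rule (attributes in sorted order, coprime integer coefficients, positive leading coefficient) are assumptions for illustration; the canonical strings therefore differ cosmetically from the forms shown above, and the sketch assumes at least one attribute term.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

FLIP = {"<": ">", "<=": ">=", ">": "<", ">=": "<=", "=": "="}


def canonicalize(lhs, op, rhs):
    """Return (attribute-expression, op, constant) for `lhs op rhs`.

    lhs and rhs map attribute names to coefficients; the key "" holds the
    constant term of that side.
    """
    # Move attribute terms to the left side and constants to the right side.
    coeffs = {}
    for side, sign in ((lhs, 1), (rhs, -1)):
        for attr, c in side.items():
            coeffs[attr] = coeffs.get(attr, Fraction(0)) + sign * Fraction(c)
    const = -coeffs.pop("", Fraction(0))
    coeffs = {a: c for a, c in coeffs.items() if c != 0}
    attrs = sorted(coeffs)
    # Scale to coprime integer coefficients.
    denom_lcm = reduce(lambda a, b: a * b // gcd(a, b),
                       (coeffs[a].denominator for a in attrs), 1)
    num_gcd = reduce(gcd, (int(coeffs[a] * denom_lcm) for a in attrs), 0)
    scale = Fraction(denom_lcm, num_gcd)
    # Make the leading coefficient positive; a negative scale flips the operator.
    if coeffs[attrs[0]] < 0:
        scale, op = -scale, FLIP[op]
    expr = " ".join(f"{int(coeffs[a] * scale):+d}*{a}" for a in attrs)
    return expr, op, const * scale


a = canonicalize({"r2.amount": 1}, ">", {"r1.amount": Fraction(1, 2)})
b = canonicalize({"r2.amount": 2}, ">", {"r1.amount": 1})
c = canonicalize({"r2.amount": 2, "r1.amount": -1}, ">", {"": 0})
print(a == b == c)  # True
```

Once two predicates share the same expression and operator, subsumption reduces to comparing their constants, which is the point of the triple form.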
Self-Join
• Canonical forms refer to true table names.
• Not good for self-join predicates:
    r1.benef_account = r2.orig_account
    Fed.benef_account = Fed.orig_account
• Use Standard Table Alias (STA):
    T1.benef_account = T2.orig_account
• Enumerate STA assignments to find matches
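Enumerating STA assignments can be sketched as trying every mapping from the query's table aliases to T1, T2, … and probing an index with each rewritten predicate. The helper `sta_forms`, the regular-expression rewrite, and the contents of `indexed` are hypothetical; the sketch only shows the enumeration idea.

```python
import re
from itertools import permutations


def sta_forms(predicate, aliases):
    """Yield (assignment, rewritten predicate) for every STA assignment."""
    stas = [f"T{i + 1}" for i in range(len(aliases))]
    for perm in permutations(stas):
        mapping = dict(zip(aliases, perm))
        # Rewrite each alias that appears as a qualifier before a dot.
        rewritten = re.sub(
            r"\b(\w+)\.",
            lambda m: mapping.get(m.group(1), m.group(1)) + ".",
            predicate,
        )
        yield mapping, rewritten


# Hypothetical index content: one self-join predicate already in the network.
indexed = {"T1.benef_account = T2.orig_account"}
new_pred = "r2.benef_account = r3.orig_account"
matches = [(m, p) for m, p in sta_forms(new_pred, ["r2", "r3"]) if p in indexed]
print(matches)
```

With two aliases there are only two assignments to try; the number grows factorially with the number of aliases, so this brute-force form is only reasonable for small self-joins.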
Self-Join in ORPred/PredSet Layers
• OR predicate: (r1.c = 1000 OR r1.a = r2.b)
    (Fed.c = 1000 OR T1.a = T2.b) ?
    (T1.c = 1000 OR T1.a = T2.b) ?
• Add STA when indexing OR predicates
• Similar on predicate sets
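Subsumption at ORPred Layer
• Input: ORPred p ∈ P
• Output: all ORPreds r ∈ R such that p ⇒ r (r subsumes p)
• Algorithm:
  • For each literal ρ ∈ p, find every γ ∈ r such that ρ ⇒ γ
  • For each r found, count the literals ρ of p that are subsumed by some γ ∈ r, |I(r)|
  • If |I(r)| = |p|, then p ⇒ r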
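The counting test can be sketched as follows. This is an illustrative simplification, not the indexed search described in the slides: it scans every ORPred in R instead of probing a predicate index, and it assumes literals are already in (expression, op, constant) form with op restricted to "=", ">", and "<". The names `literal_subsumes` and `subsuming_orpreds` are hypothetical.

```python
def literal_subsumes(gamma, rho):
    """True if literal rho implies literal gamma (gamma subsumes rho)."""
    (g_expr, g_op, g_c), (r_expr, r_op, r_c) = gamma, rho
    if g_expr != r_expr:
        return False
    if g_op == "=":
        return r_op == "=" and r_c == g_c
    if g_op == ">":
        return (r_op == ">" and r_c >= g_c) or (r_op == "=" and r_c > g_c)
    if g_op == "<":
        return (r_op == "<" and r_c <= g_c) or (r_op == "=" and r_c < g_c)
    return False


def subsuming_orpreds(p, R):
    """Return the names of ORPreds r in R such that p implies r."""
    found = []
    for name, r in R.items():
        # I(r): literals of p subsumed by at least one literal of r.
        covered = [rho for rho in p
                   if any(literal_subsumes(g, rho) for g in r)]
        if len(covered) == len(p):  # |I(r)| = |p|
            found.append(name)
    return found


R = {
    "ORp1": [("type_code", "=", 1000), ("type_code", "=", 2000)],
    "ORp2": [("amount", ">", 500000)],
}
print(subsuming_orpreds([("amount", ">", 1000000)], R))  # ['ORp2']
print(subsuming_orpreds([("type_code", "=", 1000)], R))  # ['ORp1']
```

A disjunction p implies r exactly when every disjunct of p implies some disjunct of r, which is why the count has to reach |p|.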
Topological Connections
[Figure: query network fragment with nodes B1, S1 to S6, and joins J1, J4, J7.]
System Catalog
[Figure: system catalog tables and their columns, as recoverable from the slide.]
• JoinTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent1, DParent2, DPSetID, Distinct
• PredicateIndex: ORPredID, LPredID, LExpr, Op, RExpr, Node1, Node2, STA, UseSTA
• PredicateSetIndex: PSetID, PredID, STA
• SelectionTopologyIndex: Node, JVOA1, JVOA2, JVOAPSetID, DParent, DPSetID, SVOA, SVOAPSetID, Distinct
Indexing & Searching
[Figure: processing pipeline for the example query's predicates. Stages: Canonicalization; Inference & Classification; Common Computation Searching; Computation Indexing. The predicate list is shown before and after inference, which adds r2.amount > 500000 and r3.amount > 500000. Example canonical forms: T2.amount * 2 - T1.amount > 0; T2.tran_date - T1.tran_date <= 10. System Catalog tables: PredicateIndex (PredID, CanonicalForm, …), PredicateSetIndex (PredSetID, PredID, …), TopologyIndex (Node, PredSetID, …).]
Sharing Strategies
[Figure: five panels over nodes B1, B2, B3 and joins J1, J2, J3. (a) Query network R; (b-1) Joins in Q; (b-2) Optimal plan for Q; (c-1) Sharing-selection; (c-2) Match-plan.]
Evaluation
• Databases:
  • Synthesized FedWire money transfers (Fed, 500000 records)
  • Anonymized medical patient admission records (Med, 835890 records)
• Queries:
  • Seed queries
  • Generate sharable queries from seeds
  • A wide range of queries
• Simulation:
  • Historical data (300000 on Fed, 600000 on Med)
  • Chunks of new data (4000 per chunk, etc.)
Improvement Factors
• DBMS: 1x
• ARGUS: 1-500x
  • Incremental Evaluation: 1-100x
  • Conditional Materialization: 1.2-1.8x
  • Join Order Optimization: 1-10x
  • Transitivity Inference: 1-20x
  • Canonicalization: 1-10x
  • IMQO: 1-50x
Fed IMQO & Canonicalization
[Figure: two plots against # of queries (0-800) for AllSharing, NonCanon, and NonJoinS: WQNS (0-25000) and execution time in seconds (0-250). WQNS: weighted query network size. Setup: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0.]
Fed Sharing Strategies
[Figure: two plots against # of queries (0-800) for SharingSel, MatchPlan, and MatchPlan+NCanon: WQNS (0-6000) and execution time in seconds (0-100). Setup: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0.]
Summary of Contributions
• Efficiency and scalability
  • Continuous queries: incremental query evaluation
  • Multiple/large-scale queries: incremental multiple query optimization (IMQO)
  • Query optimization
• Practicality
  • Existing DB applications: built atop DBMSs
  • Support for a broad range of query syntax and semantics
• Evaluation
  • Shows up to hundreds-fold improvement
  • Works across various domains
Future Work
• Generalization of current work
  • Support multi-way joins
  • More sophisticated sharing strategies
    • Rerouting
    • Restructuring
• Adaptive query processing
  • Adaptive re-optimization: rerouting and restructuring
  • Adaptive rescheduling
• New infrastructure
  • Parallel/distributed processing
  • Automatic tuning: index selection
• Support new data types
  • Text
  • Multimedia
Acknowledgement
• Advisor: Jaime Carbonell
• Committee: Chris Olston, Jamie Callan, and Phil Hayes
• CMU and Dynamix ARGUS team: Jaime Carbonell, Phil Hayes, Santosh Ananthraman, Cenk Gazen, Bob Frederking, Eugene Fink, Dwight Dietrich, Ganesh Mani, Johny Mathew, and Aaron Goldstein
• CMU faculty and friends: many …
Thank you!
Questions and comments?
Outline
• Motivation
• System and methods:
  • System architecture
    • Execution engine
    • Query network structures
  • IMQO framework
    • Query network generator
    • Query examples
    • Hierarchy/ER model
    • Problems and solutions
    • System catalog
    • Sharing strategies
• Evaluation
• Conclusion and future work
Adapted Rete Algorithm (Join)
• Join on N and M:
    (N + ΔN) ⋈ (M + ΔM) = N ⋈ M + ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM
  Old results: N ⋈ M. New incremental results: ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM.
• When ΔN and ΔM are very small compared to N and M, the time complexity of the incremental join is O(N+M).
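The decomposition can be checked on a small example. This is a sketch of the identity only: it uses a nested-loop `join` helper (a hypothetical name) and so does not reflect the stated O(N+M) cost, which depends on how the DBMS evaluates each partial join.

```python
def join(left, right, pred):
    """Nested-loop join: all (l, r) pairs that satisfy pred."""
    return [(l, r) for l in left for r in right if pred(l, r)]


def incremental_join(n_hist, m_hist, delta_n, delta_m, pred):
    """Return only the new results: ΔN ⋈ M + N ⋈ ΔM + ΔN ⋈ ΔM."""
    return (join(delta_n, m_hist, pred)
            + join(n_hist, delta_m, pred)
            + join(delta_n, delta_m, pred))


same = lambda l, r: l == r
n_hist, m_hist = [1, 2, 3], [2, 3, 4]
delta_n, delta_m = [4], [1, 4]

old = join(n_hist, m_hist, same)                      # already materialized
new = incremental_join(n_hist, m_hist, delta_n, delta_m, same)
full = join(n_hist + delta_n, m_hist + delta_m, same)  # recomputed from scratch
print(sorted(old + new) == sorted(full))  # True
```

The old results never need to be recomputed; only the three partial joins touching ΔN or ΔM are evaluated when new data arrives.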
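Incremental Evaluation
[Figure: tables N and M, each split into hist and new parts, are joined into J (hist and new). With new tuples ΔN and ΔM, compute ΔJ by ΔN ⋈ M, N ⋈ ΔM, and ΔN ⋈ ΔM. Join predicates: N.rbank_aba = M.sbank_aba; N.benef_account = M.orig_account; M.amount > N.amount*0.5; N.tran_date <= M.tran_date; M.tran_date >= N.tran_date+20.]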
Incremental Evaluation
[Figure: query network F, S1, S2, J1, J2, where F, S1, S2, and J1 each keep hist and temp tables. Selection predicates: type_code=1000; amount>500000. Join predicates: r1.rbank_aba = r2.sbank_aba; r1.benef_account = r2.orig_account; r2.amount > r1.amount*0.5; r1.tran_date <= r2.tran_date; r2.tran_date >= r1.tran_date+20.]
• Compute S1_temp by selecting from F_temp
• Compute J1_temp by joining S1_temp and S2_hist, joining S1_hist and S2_temp, and joining S1_temp and S2_temp
Code Generation
• Code template for each operator
• Code block for each node
• Sort the code blocks
• Wrap up code blocks in Oracle stored procedures
• Register and execute periodically
Projection Management
[Figure: query network fragment with nodes B1, B2, S1, S2, J1.]
Transitivity Inference Example
• Given:
    r1.amount > 1000000 and r2.amount > r1.amount * 0.5 and r3.amount = r2.amount
• We can infer highly selective predicates:
    r2.amount > 500000
    r3.amount > 500000
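The inference in this example amounts to propagating lower bounds through the comparison predicates. The sketch below is a simplified illustration, not the ARGUS inference module: the function `infer_lower_bounds` and its tuple encoding are hypothetical, it handles only relations of the form y > k*x or y = k*x with k > 0, and it propagates bounds in the written direction only.

```python
def infer_lower_bounds(bounds, relations):
    """Propagate strict lower bounds through the given relations.

    bounds: {attribute: constant}, meaning attribute > constant.
    relations: list of (y, op, factor, x), meaning y op factor * x,
    with op in {">", "="} and factor > 0.
    """
    inferred = dict(bounds)
    # One pass per relation is enough to follow any chain of relations.
    for _ in range(len(relations)):
        for y, op, factor, x in relations:
            if op not in (">", "="):
                raise ValueError(f"unsupported operator: {op}")
            if x in inferred:
                # x > c and y op factor*x imply y > factor*c.
                candidate = factor * inferred[x]
                if candidate > inferred.get(y, float("-inf")):
                    inferred[y] = candidate
    return inferred


result = infer_lower_bounds(
    {"r1.amount": 1000000},
    [("r2.amount", ">", 0.5, "r1.amount"),
     ("r3.amount", "=", 1, "r2.amount")],
)
print(result["r2.amount"], result["r3.amount"])  # 500000.0 500000.0
```

The inferred bounds are redundant logically but useful in practice: they let the selections on r2 and r3 filter tuples before the joins are evaluated.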
Query Optimizer
• Similar to a traditional enumeration-based query optimizer
• Optimizes:
  • Join order
  • Conditional materialization
[Figure: history-based query optimizer. Labels: SQL Query; Structure Builder; Active List; Join Graph; Join Enumerator; History-based Cost Estimator; DB; Plan; Update System Catalog.]
Conditional Materialization
[Figure: two plans over r1 and r2, one labeled Unconditional Materialization and one labeled Conditional Materialization.]
• Conditional materialization: choose whether to materialize based on cost estimates
Selection/Join Incremental Evaluation (Fed)
[Figure: execution time in seconds (0-50) for queries Q1 to Q7; series: Rete Data1, DBMS Data1, Rete Data2, DBMS Data2. Setup: HP PC, single-core Pentium(R) 4 CPU, 1.7GHz, 512M RAM, Windows XP, Oracle 10.1.0.]
Fed Comparing All
[Figure: execution time in seconds (0-250) against # of queries (0-800); series: AllSharing, NonCanon, NonJoinS, MatchPlan, MatchPlan+NCanon. Setup: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0.]
Med Comparing All
[Figure: execution time in seconds (0-120) against # of queries (0-600); series: AllSharing, NonCanon, NonJoinS, MatchPlan, MatchPlan+NCanon. Setup: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0.]
Med IMQO & Canonicalization
[Figure: two plots against # of queries (0-600) for AllSharing, NonCanon, and NonJoinS: WQNS (0-15000) and execution time in seconds (0-120). Setup: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0.]
Med Sharing Strategies
[Figure: two plots against # of queries (0-600) for SharingSel, MatchPlan, and MatchPlan+NCanon: execution time in seconds (0-80) and WQNS (0-8000). Setup: HP PC, single-core Pentium(R) 4 CPU, 3.00GHz, 1G RAM, Windows XP, Oracle 10.1.0.]