ObjectGlobe Open, Secure, and QoS-enhanced
Distributed Query Processing
Donald Kossmann
Technical University of Munich
http://www3.in.tum.de
Joint work with Alfons Kemper (Passau) and others
Outline
• Background
• The ObjectGlobe Lookup Service
• (Security Aspects)
• QoS Management
• Summary
Query Processing on the Internet
• Web servers, relational databases on the Web: centralized or limited query capabilities
• Middleware Systems:a great deal of data shipping
• Goals of ObjectGlobe: – integrate any kind of data– integrate any kind of query processing capabilities– bring query processing capabilities to the data
Middleware for Query Processing
R
... ... ...
... ... ...
... ... ...S
Data-Provider A
T... ... ...... ... ...... ... ...
Data-Provider B
wrap_S
thumbnail
thumbnail
wrap_S
Use
r-d
efin
ed o
per
ato
rs
R... ... ...... ... ...
... ... ...
T... ... ...... ... ...... ... ...
Heavy data shipping
wrap_S
thumbnail
wrap_S
thumbnail
Fct-Provider
R... ... ...... ... ...
... ... ...S
Data-Provider AT
... ... ...
... ... ...
... ... ...
Data-Provider B
Open Query Processing (Step 1)
Load
functions
T... ... ...... ... ...
... ... ...
wrap_S
thumbnail
wrap_S
thumbnail
Fct-Provider
R... ... ...... ... ...
... ... ...S
Data-Provider AData-Provider B
Open Query Processing (Step 2)
Load
functions
Traveling from M to UCB
flights rental cars
Selection SelectionRoutenplaner
Route
Top N
Function Provider
Data Provider Data Provider
Cycle Provider
Open QP with ObjectGlobe
• Create an open marketplace for– data providers– cycle providers– function providers
• Requirements– wrappers exist for all data of data providers– JVM runs on all cycle providers– fixed interface for operators of function providers
Scenarios
• Free Internet: everything is free and available for everybody
• Restricted Internet: charge according to usage, quality, and timeliness; restrictions (e.g., age)
• Intranet: everything is free and available for „insiders“
• Outsourcing: charge for certain services (e.g., backup, business analyses)
Challenges
• Lookup Service– Find the relevant services
• Security – Protect data and cycle providers from bad code
• Quality of Service– What you pay is what you get
ObjectGlobe Lookup-Service
Lookup-Service
Parser OptimizerExecution
Engine
Application /User
Browse, Search
Authorisation,... Statistics, Cost Information, ...
Provider
Register
Description of Services
• Providers register RDF or XML documents
• There is a pre-defined schema to describe services
• Data Providers:– Theme (e.g., Hotel)– Attributes (e.g., rate, location, category)– Access paths and wrappers– Characteristics of the server (e.g., availability)– Information for authorization – Statistics– ...
• Function Provider:– Signature (e.g., foo(int, int) -> int)– Information for authorization– Hardware requirements (e.g., 30 MB main memory)– Size of Java byte code– ...
• Cycle Provider:– Hardware (e.g., 1 GB main memory)– Location and network connections / bandwidth– Information for authorization– ...
XML Description of a Data Provider
<DataProvider> <id> 4711 </id> <theme> <name> Hotel </name> <desc> All hotels you ever want </desc> </theme> <Attribute>
<topic> city </topic><type> string </type>
</Attribute>...
Lookup Query
• Data Providers for Hotels that return the City and Rate of each hotel
search DataProvider dselect d.uniqueId, d.attr.*where d.theme.name = „hotel“ and d.attr.?.topic = „city“ and d.attr.?.topic = „rate“
Three-tier Architecture
• Local Lookup-Servers– Keep copies of meta-data of services that are relevant
for a particular organization or subsidary– Evaluate Lookup requests for that organization– Relevance is determined by subscription rules (queries)
• Public Lookup-Servers (Backbone)– Store all (public) meta-data– Store subscription rules of local Lookup-Servers– Notify local Lookup-Servers of changes – Users can browse in the public info of the backbone
Three-tier Architecture
PublicLookup-Server
PublicLookup-Server
Local LS Local LS Local LS
New Rules Answers
Client Client Client ClientClient
QueriesAnswersNew Rules
Updates, Inserts
• Processing Lookup Requests– Local Lookup-Servers store meta-data in RDBMS– Translate Lookup request into SQL
• Registering new services– Public Lookup-Servers store meta-data in RDBMS– Public Lookup-Servers store rules in RDBMS– Apply filter algorithm using RDBMS in order to find
relevant local Lookup-Servers
• Deletes and updates of services– Apply filter algorithm to find affected local Lookup-
Servers (more complicated, however)
• Principle: Map everything to RDBMS
<person, id = 4711>
<name> Lilly Potter </name>
<child> <person, id = 314>
<name> Harry Potter </name>
</child>
</person>
<person, id = 666>
<name> James Potter </name>
<child> 314 </child>
</person>
person person
Harry Potter
name
name name
person
Lilly Potter James Potter
child
314
0
4711 666
i314
Storing XML Data in an RDBMS
Edge Approach
Source Label Target
0 person 4711
0 person 666
4711 name v1
4711 child i314
666 name v2
666 child i314
Id Value
v1 Lilly Potter
v2 James Potter
v3 Harry Potter
Id Value
v4 12
Edge Table Value Table (String)
Value Table (Integer)
XML Queries
• Find the name of all persons that like to play Quidditch and are younger than 18 years
select $nwhere <person>
<name> $n </name><age> $a </age><hobby> Quidditch </hobby>
</person>, $a < 18
• Carry out pattern matching with „document graph“
Translation to SQL
SELECT nv.value FROM Edge p, Edge n, Edge h, Value nv, Value hvWHERE p.label = „person“ AND p.target = n.source AND
n.label = „name“ AND n.target = nv.id AND
p.target = h.source AND h.label = „hobby“ AND h.target = hv.id AND hv.value = „Quidditch“;
Works essentially in the same way for the query language of ourLookup service.
Publish & Subscribe Algorithm
• Decompose subscription rules and store them in RDMBS of Public Lookup-Servers
• SQL Join-Queries in order to match sub-rules with meta-data objects(Recall: meta-data is decomposed, too)
• SQL Join-Queries in order to re-construct matching subscription rules from sub-rules
Decomposition of Subscription Rules
• Data Providers for Stock Market Information that cost less than 500 Dollars:search DataProvider dwhere d.theme.name = „Stock Market“ and d.cost < 500
• Decomposition into three atomic rules:R1: search Theme t where t.name = „Börse“R2: search DataProvider d where d.cost < 500R3: search R1 a, R2 b where b.theme = a
• Store these rules in RDBMSRule Class Operator Attribute Value
R1 Theme = name Stock Mkt
R2 DataProv. < cost 500
MatchingRule Class Operator Attribute Value
R1 Theme = name Stock Mkt
R2 DataProv. < cost 500
Object Type Attribute Value
O1 Theme name Stock Mkt
O1 Theme description SE InfoSys
O2 DataProv. theme O1
O2 DataProv. attr O3 (kurs)
O2 DataProv. attr O4 (wkn)
O2 DataProv. cost 70
Result of Join: (R1, O1); (R2, O2)
Re-constructing Subscription Rulesfrom matching atomic sub-rules
• Store decomposition graph in RDMBS– higher-level and atomic rules are vertices– Top-level rules are so-called triggering rules;
if they are affected, notify LLS
• Walk „bottom up“ through decomposition graph– SQL-Join Query: for each pair of matching rules, find
out whether they have a common parent– N.B. the decomposition graph is a binary directed,
acyclic graph
Preliminary Experiments
• Synthetic benchmark database with 100.000 (different) subscription rules
• Oracle 8i used in the Public Lookup Server
# new providers Proc. Time (PLS)
1 250 msecs
100 (batch) 5000 msecs
Batch updates are crucial
Summary
• Basic Principle: decompose rules and data
• Advantages:– Generic, independent of schema– Very easy to implement, no administration needed– Exploit query capabilities of RDBMS– Need not worry about document boundaries– Finding common sub-rules is trivial
• Disadvantage:– Sub-optimal query performance (many Joins)
but probably sufficient, if updates are batched
Related Work
• Lookup Services: Jini, UDDI, Plug & Play
• Publish & Subscribe:– IR world– SIFT (Stanford)– XFilter (Berkeley)– LeSelect (INRIA)– Continuous Queries (Niagra, ...)
• Storing and Indexing XML Data: ...
Outline
• Background
• The ObjectGlobe Lookup Service
• (Security Aspects)
• QoS Management
• Summary
Security Requirements in ObjectGlobe
• Protection of Data and Cycle Providers• Secure Communication
– use SSL connections (authenticated and encrypted)• Authentication of Clients
– passwords / certificates – digitally signed requests (query subplans)
• Authorization control– data/cycle providers are autonomous– but register user privileges in lookup service
Security of Data/Cycle Providers
ObjectGlobe
runtime
system
Class
loader
Class
loader
Class
loader
Internal
class
loader
Secure
sandbox
Internet
Query 1
Query 2
Query 3
Privileged Built-inOperatorsfor Disk or Network Access
sandbox
externaloperator
Internaloperator
tmpfile
QoS Management
• State of the Art: best-effort
• Goal: users should be able to constrain– Cost of execution– Running time – Quality of the results
• Initial approach (to get a feeling)– extended query optimization– Admission control– Monitoring and plan adaptions at execution time
• Real solution: ???
Quality Parameters
• Cost of execution– $
• Running time– First tuple, last tuple, Nth tuple
• Quality of the results– Number of results– Coverage: Number (or %) of data sources queried– Staleness of data
• Cost as a function of coverage (-> Mariposa)
• Cost as a function of #wheels (Mercedes)
Quality of Service-Parameters
Completeness
Cost
(€
)
min
max
max
Respo
nse
time
Desired
space for
query plans
Extended Query Optimization
Bottom-up dynamic programming query optimizer,standard costing etc., and the following extensions1. Generate alternatives for each operator
• Consider classes of equivalent providers
2. Extended Pruning, Heuristics for choosing a Winner
3. Enumerate „incomplete“ UNIONs4. Initialize QoS-Accounts
QoS-Annotated Query Plan
scan scan
thumbnail
display
host=A.com
host=client
host=client
host=A.com
host=B.com
host=B.com
wrap_Shost=A.com
host=B.comcost timeOtimeN
cost timeOtimeN
cost timeOtimeN
QoS Accounts
Optimization: Open Questions
• Revisit heuristics to choose winning plan– Dynamic heuristics depending on workload
and/or feedback
• Reverse engineering a plan– How much data should a plan read if the cost
should be $5.00?
• Does query optimization matter?
Admission Control & Monitoring
• Admission Control:– Check assumptions of optimizer – Carried out at plan instantiation time for each
plan fragment (set of operators at one site)
• Monitoring:– Predict quality of results at the end of execution– Carried out by special Monitoring operators
• Take actions if violations are detected– ECA rules specify actions
Monitoring Operators
• at the end of pipelines• are non-blocking / low cost • above „receive“ ops• keep statistics for predictions• differentiate between „open“
and „next“ phase• Communicate with each
other for liveliness monitor
A
send
monitor
B
send
monitor
receive
Join
Plan Adaptions• General: Abort, Restart / Reoptimize
• Response Time Violation:– compressConnection – movePlan (w/wo state)– increasePriority– removeTempResults, ...
• Coverage / Result Quality Violation:– addSubPlan
• Cost Violation:– movePlan, decreasePriority, ...
ECA Rules for Adaptions
if cost is high and coverage is low then abort
if cost is high and coverage is high then delResults
if rt is high and cost is low and network is critical then compress
Plan Adaptions: Open Questions
• What is the right mix of actions?
• What are the right thresholds for the rules?
• How to avoid the „Schweinezyklus“?
• How to draw the right conclusions from the statistics produced by Monitoring?
• What is the right granularity of actions?Plan vs. Operator vs. Tuple
Project Status
• First demo presented at SIGMOD 99– Travel information– Four Web data sources (hotels, sights, train conns)– One function provider (travel routes, top N)– Three cycle providers (two in Europe, one in US)
• Online-Demo: http://db.fmi.uni-passau.de/projects/OG
• Current work: more experiments– Problem: getting data from Web sources is sloooow