Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources
Alasdair J G Gray
Information Management Group
University of Manchester
Seminar at University of Cardiff, 26 April 2010
Overview of the Talk
• Motivation: SemSorGrid4Env
• Forms of heterogeneity
– Data source
– Data semantics
• Query processing
– SNEEql
– SNEE-DQP
– SPARQLSTR
SemSorGrid4Env
Semantic Sensor Grids for Rapid Application Development for Environmental Monitoring
• Coastal and estuarine flood warning
• Fire monitoring and warning
Estuarine Flood Warning
• Sensors deployed along UK South coast
– On and off shore
– Bespoke hardware
– Fixed functionality
– Fixed data rate
• No bursts
– Central distribution centre
• Multitude of related data sources
– Shipping
– Flood defenses
– Flooding models
– …
Fire Detection
Deployment: Castilla y León, Northwest Spain
– Forested region
• Wireless sensor network
– Off-the-shelf sensor nodes
• TMoteSky
• TinyOS
– Configure dynamically: ad hoc queries
– In-network query processing
– Controlled rate variability
• Satellite image data for the region
Abstract Problem
[Figure: an Integrator over one stored data service (backed by stored data) and two streaming data services (each backed by a sensor network)]
Types of Heterogeneity
[Figure: an Integrator over a stored data service and two streaming data services backed by sensor networks, annotated with the kinds of heterogeneity: data source, data stream, query capabilities, data access, data semantics]
DATA SOURCE HETEROGENEITY
Data Source Characteristics
• Traditional stored data
– Data stored in a database
– User observes a static data set
– One-off query execution
• Streaming data
– Data processed on-the-fly
• Maybe stored for later access
– User observes changes in data set
– Continuous or snap-shot query execution
Types of Data Stream
Acquire-Stream
• Query controls data rate
• Informs query planning
[Figure: Stream Processor issues Acquire() to the Source, which returns Data]
Receive-Stream
• Source controls data rate
Potentially:
– Unknown rate
– Bursty data
[Figure: Source sends Data to the Stream Processor]
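The difference between the two stream types can be sketched in Python. This is only an illustration under stated assumptions: `acquire_stream`, `receive_stream` and `fake_sensor` are hypothetical names, and the sensor is simulated. In the acquire case the query's interval drives sampling; in the receive case the consumer takes whatever the source emits, bursts included.

```python
import itertools
from typing import Callable, Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp, value)


def acquire_stream(read_sensor: Callable[[int], float],
                   interval: int, start: int = 0) -> Iterator[Reading]:
    """Acquire-stream: the query chooses the rate by calling the source."""
    for t in itertools.count(start, interval):
        yield (t, read_sensor(t))


def receive_stream(source: Iterable[Reading]) -> Iterator[Reading]:
    """Receive-stream: the source chooses the rate; arrivals may be bursty."""
    for reading in source:
        yield reading


def fake_sensor(t: int) -> float:
    """Hypothetical sensor whose value is a simple function of time."""
    return t / 10.0


# The query asks for one reading every 15 time units.
acquired = list(itertools.islice(acquire_stream(fake_sensor, interval=15), 3))

# The source emits on its own schedule, including a burst at t=7 and t=8.
received = list(receive_stream([(2, 0.2), (7, 0.7), (8, 0.8), (30, 3.0)]))
```

Because the acquire rate is known in advance, a planner can reason about it; the receive timestamps are only known once the data arrives.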
Streaming Source Access
Pull Access
Consumer periodically polls for new data.
• Introduces processing delay
Push Access
Publisher sends data as it is produced.
• Minimises processing delays
Note:
• Orthogonal to data stream type
• Affects physical operator selection
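A small Python sketch of the two access modes follows. The class names `PullSource` and `PushSource` are hypothetical and stand in for whatever interface a real streaming service exposes; the point is only where the delay comes from.

```python
from collections import deque
from typing import Callable, Deque, List


class PullSource:
    """Pull access: data waits in a buffer until the consumer polls."""

    def __init__(self) -> None:
        self._buffer: Deque[int] = deque()

    def produce(self, item: int) -> None:
        self._buffer.append(item)

    def poll(self) -> List[int]:
        items = list(self._buffer)
        self._buffer.clear()
        return items


class PushSource:
    """Push access: each item is handed to subscribers as it is produced."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[int], None]] = []

    def subscribe(self, callback: Callable[[int], None]) -> None:
        self._subscribers.append(callback)

    def produce(self, item: int) -> None:
        for callback in self._subscribers:
            callback(item)


pull = PullSource()
pull.produce(1)
pull.produce(2)
polled = pull.poll()   # items arrive only when the consumer asks

pushed: List[int] = []
push = PushSource()
push.subscribe(pushed.append)
push.produce(1)        # delivered immediately
push.produce(2)
```

With pull access the delay is bounded by the polling period; with push access the consumer sees each item as soon as `produce` is called.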
Query Processing Challenges
• Variety of data sources
– Stored data
– Receive-stream
– Acquire-stream
• No common query semantics
– Streaming data languages
– Stored data languages
• Distributed data sources
Query Languages
• Stored (relational) data
– SQL
• Streaming (relational) data
– SQL extensions
– Continuous Query Language (CQL)
– Sensor NEtwork Engine query language (SNEEql)
Query Language: SQL
• Designed for stored data
• Contains blocking operators
– Join
– Aggregates
– …
• Example system: GSN (Aberer et al, 2007)
– Key concept: Virtual sensor
• Wraps SQL execution
• Controls periodic evaluation
– Limits expressiveness
Query Language: CQL (Arasu et al, 2006)
• Designed for receive-streams
• Windows used for blocking operators
• Contains type conversion operators
– Stream -> Window
– Window -> Stream
• Semantics defined by implementation
• Example system: STREAM (Arasu et al, 2010?)
– Data Stream Management System
– No support for stored data
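To see why windows are needed for blocking operators, consider an average: over an unbounded stream it can never finish, but over a window it has a bounded input. The Python sketch below is an illustration only; `windowed_average` and its half-open window `(t - width, t]` are assumptions, not CQL's definition.

```python
from typing import Iterable, Iterator, List, Tuple

Reading = Tuple[int, float]  # (timestamp, value)


def windowed_average(stream: Iterable[Reading],
                     width: int) -> Iterator[Tuple[int, float]]:
    """For each arriving tuple at time t, average the values in (t - width, t]."""
    window: List[Reading] = []
    for t, value in stream:
        window.append((t, value))
        # Evict tuples that have fallen out of the window.
        window = [(ts, v) for ts, v in window if ts > t - width]
        yield (t, sum(v for _, v in window) / len(window))


readings = [(0, 2.0), (5, 4.0), (10, 6.0), (20, 8.0)]
averages = list(windowed_average(readings, width=10))
```

Each output depends only on the tuples inside the window, so the operator can emit results continuously instead of waiting for the end of the stream.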
Query Language: SNEEql (Brenninkmeijer et al, 2008)
• Designed for acquire-streams, receive-streams, and stored data
– Based on CQL ideas
• Well-defined semantics
– Independent of system
• Example system: SNEE (Galpin et al, 2009)
– In-network query evaluation
– Acquire-streams
– Reactive/periodic operators
– Controls network behaviour
SNEEql Query Syntax
SELECT {RSTREAM | DSTREAM | ISTREAM}+ attribute list
FROM extent list
WHERE expression
• *STREAM optional
– Converts a window to a stream
• Extent list:
– Streams with windows of the form [FROM t1 TO t2 SLIDE s unit]
– Relations with windows of the form [SCAN EVERY t1 unit]
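The three window-to-stream operators can be sketched in Python over two successive window states. The definitions below follow the CQL reading (RSTREAM emits the whole window, ISTREAM the insertions, DSTREAM the deletions); that SNEEql uses exactly these definitions is an assumption based on its being built on CQL ideas. The function names and sample tuples are illustrative.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def rstream(prev: Sequence[Hashable], curr: Sequence[Hashable]) -> List[Hashable]:
    """RSTREAM: emit every tuple in the current window."""
    return list(curr)


def istream(prev: Sequence[Hashable], curr: Sequence[Hashable]) -> List[Hashable]:
    """ISTREAM: emit tuples in the current window but not the previous one."""
    return list((Counter(curr) - Counter(prev)).elements())


def dstream(prev: Sequence[Hashable], curr: Sequence[Hashable]) -> List[Hashable]:
    """DSTREAM: emit tuples in the previous window but not the current one."""
    return list((Counter(prev) - Counter(curr)).elements())


# Two successive window states of (site, rain) tuples.
previous = [("site5", 3), ("site6", 4)]
current = [("site6", 4), ("site7", 9)]

inserted = istream(previous, current)
deleted = dstream(previous, current)
everything = rstream(previous, current)
```

`Counter` subtraction gives bag (multiset) difference, so duplicate tuples are handled correctly.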
An Example from Hydrology
• Investigating water drainage in the Peak District
– Hilly terrain
– Peat bogs
– River in valley bottom
• WSN measuring
– Rainfall and
– River depth
• WSN schema:
– river (rain: int, depth: int)
• Sites (5, 6, 7, 9)
– hilltop (rain: int)
• Sites (4)
[Figure: sensor network topology with nodes numbered 0 to 9]
Example Multi-source Query
Every 15 minutes, and within 24 hours of the readings being taken, we wish to obtain time-correlated measurements of the current river depth and the rainfall at the top of the hill 15 minutes earlier, provided that it is now raining less at the river than it was on the hilltop, and that the hilltop rainfall was above 5mm and at least the average rainfall.
SELECT RSTREAM r.time, h.rain, r.depth
FROM River[NOW] r, Hilltop[AT NOW-15 MINUTES] h
WHERE h.rain > 5
AND r.rain < h.rain
AND h.rain >= (SELECT AVG(weather.rain) FROM Weather [rescan every day] WHERE weather.region = 'Peak District');
Acquisition rate = 15 min; Max delivery time = 24 hours;
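What the example query computes at each 15-minute tick can be sketched in plain Python. This is a simplification under stated assumptions: one reading per extent per tick (the real deployment has several river sites), weather values already filtered to the 'Peak District' region, and hypothetical sample data. It is not how SNEE evaluates the query.

```python
from statistics import mean
from typing import Dict, List, NamedTuple, Optional, Tuple


class RiverReading(NamedTuple):
    time: int   # minutes
    rain: int
    depth: int


class HilltopReading(NamedTuple):
    time: int   # minutes
    rain: int


def evaluate(now: int,
             river: Dict[int, RiverReading],
             hilltop: Dict[int, HilltopReading],
             weather_rain: List[int]) -> Optional[Tuple[int, int, int]]:
    """Return (r.time, h.rain, r.depth) for one tick, or None if filtered out."""
    r = river.get(now)           # River[NOW]
    h = hilltop.get(now - 15)    # Hilltop[AT NOW-15 MINUTES]
    if r is None or h is None:
        return None
    avg_rain = mean(weather_rain)  # AVG over the stored Weather relation
    if h.rain > 5 and r.rain < h.rain and h.rain >= avg_rain:
        return (r.time, h.rain, r.depth)
    return None


# Hypothetical sample data, keyed by timestamp in minutes.
river = {15: RiverReading(15, 4, 120), 30: RiverReading(30, 9, 130)}
hilltop = {0: HilltopReading(0, 8), 15: HilltopReading(15, 6)}
weather = [5, 7, 6]  # rain values for the region

result_at_15 = evaluate(15, river, hilltop, weather)
result_at_30 = evaluate(30, river, hilltop, weather)
```

At minute 30 the river rainfall (9) is not below the hilltop rainfall from 15 minutes earlier (6), so no tuple is produced for that tick.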
SNEE-DQP Query Stack
• Metadata
– Logical schema
– Physical schema
• Source Allocation
– Splitting the query into parts for each data source
• Source Planning
– Physical operator selection
– Generate plan for source
[Figure: query stack. Inputs: SNEEql query + QoS, and Metadata. Stages: Parsing, Logical Planning, Source Allocation, Source Planning. Output: Query Execution Plan]
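The idea behind source allocation can be sketched as grouping the extents a query reads by the source that hosts them. The sketch is deliberately crude: the real step splits an operator tree rather than a list of names, and `PHYSICAL_SCHEMA`, the source labels, and `allocate_sources` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical physical schema: which source hosts each logical extent.
PHYSICAL_SCHEMA: Dict[str, str] = {
    "river": "sensor_network",
    "hilltop": "sensor_network",
    "weather": "stored_data_service",
}


def allocate_sources(extents: List[str],
                     schema: Dict[str, str]) -> Dict[str, List[str]]:
    """Group the extents a query reads by the source that hosts them."""
    parts: Dict[str, List[str]] = defaultdict(list)
    for extent in extents:
        parts[schema[extent]].append(extent)
    return dict(parts)


allocation = allocate_sources(["river", "hilltop", "weather"], PHYSICAL_SCHEMA)
```

Each group would then go to source planning, which picks physical operators suited to that source.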
Example: Query Plan
SELECT RSTREAM r.time, h.rain, r.depth
FROM River [NOW] r, Hilltop [AT NOW - 15 MINUTES] h
WHERE h.rain > 5
AND r.rain < h.rain
AND h.rain >= (SELECT AVG(weather.rain) FROM Weather [RESCAN every day] WHERE weather.region = 'Peak District');
[Figure: query execution plan. Operators shown:]
• DELIVER
• JOIN h.rain >= AVG(weather.rain)
• EXCHANGE (two instances)
• JOIN river.rain < hilltop.rain
• ACQUIRE [time, rain, depth], predicate true, on river, EVERY 15 min
• ACQUIRE [time, rain], predicate rain > 5, on hilltop, EVERY 15 min
• TIME_WINDOW [t-15, t-15, 15]
• AVERAGE(rain)
• SCAN [rain], predicate region = 'Peak District', on weather, EVERY DAY
In-Network SNEE
• Two-phase DQP
– Single-site
• Push /
– Multi-site
• Steiner tree routing to reduce energy
• Operator placement
– Location sensitive
– Reduce transmission
• Data buffering
• Time division agenda
• nesC code generated
[Figure: SNEE compilation stack]
Input: <query, QoS-expectations>, <cost parameters, schemas, description(node, network)>
Single-site phase
1. parsing/type checking → abstract-syntactic tree
2. translation/rewriting → logical-algebraic form
3. algorithm assignment → physical-algebraic form (PAF)
Multi-site phase
4. routing → PAF, routing tree (RT)
5. partitioning → RT, fragmented-algebraic form
6. where-scheduling → RT, distributed-algebraic form (DAF)
7. when-scheduling → RT, DAF, agenda
8. code generation → <N1, …, Nm> nesC code
SNEE now and future
SNEE Now
• In-network SNEE
– Acquire-streams
– Quality of service aware
• Expected lifetime
• Total energy consumption
• Delivery time
• Acquisition rate
– Runs on:
• Simulators
• TMoteSkys
• TinyNodes
• Out-of-network SNEE
– Receive-streams
– Pull-based data sources
SNEE Future
• SNEE-DQP (within 2010)
– Combine in-network and out-of-network versions
– Stored relations
– Push-based data sources
• Beyond 2010
– Model building inside queries
– Greater resilience for in-network execution
SEMANTIC HETEROGENEITY
A Data Integration Approach
• Heterogeneous sources
– Autonomous
– Local schemas
• Homogeneous view
– Mediated global schema
• Mapping
– Local-as-View
– Global-as-View
Relies on agreement of a common global schema
[Figure: Query1 … Queryn posed over a Global Schema, which is linked by Mappings to the wrapped sources DB1/Wrapper1 … DBi/Wrapperi … DBk/Wrapperk]
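A Global-as-View mapping can be sketched as a global relation defined directly in terms of the wrapped sources, so that a query over the global schema unfolds into calls on the wrappers. Everything named here (`wrapper_db1`, `wrapper_db2`, `GLOBAL_SCHEMA`, `query_global`, and the sample rows) is hypothetical and only illustrates the unfolding step.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, int]  # (site, rain)


def wrapper_db1() -> List[Row]:
    """Hypothetical wrapper exposing river readings from the first source."""
    return [("river-5", 4)]


def wrapper_db2() -> List[Row]:
    """Hypothetical wrapper exposing hilltop readings from the second source."""
    return [("hilltop-4", 8)]


# Global-as-View: each global relation is defined as a union over the sources.
GLOBAL_SCHEMA: Dict[str, List[Callable[[], List[Row]]]] = {
    "rainfall": [wrapper_db1, wrapper_db2],
}


def query_global(relation: str,
                 predicate: Callable[[Row], bool]) -> List[Row]:
    """Answer a query over the global schema by unfolding it onto the sources."""
    rows: List[Row] = []
    for wrapper in GLOBAL_SCHEMA[relation]:
        rows.extend(row for row in wrapper() if predicate(row))
    return rows


heavy_rain = query_global("rainfall", lambda row: row[1] > 5)
```

Query answering is simple in this style, but adding a source means editing the definition of every global relation it contributes to; Local-as-View trades that the other way round.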
P2P Integration Approach
• Heterogeneous sources
– Autonomous
– Local schemas
• Heterogeneous views
– Multiple schemas
• Mappings
– From sources to common schema
– Between pairs of schema
• Require common integration data model
[Figure: Query1 … Queryn posed over Schema1 … Schemaj, which are linked by Mappings to each other and to the wrapped sources DB1/Wrapper1 … DBi/Wrapperi … DBk/Wrapperk]
Semantic Integrator
[Figure: Semantic Integrator architecture. Components: Query Translator, Data Translator, Query Resolver, Data Resolver, SNEE-DQP. Mappings: O2O Mappings, S2O Mappings. Sources: streaming sources and stored data, which return tuples. Query labels: Q, Q’, q. Result labels: [[q]], [[Q’]], [[Q]]]
P2P Integration Approach
• Require common integration data model
Can RDF do this?
CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?
A word of warning …
RDB2RDF: Two Approaches
Extract-Transform-Load
• Data replicated as RDF
– Data can become stale
• Native SPARQL query support
– Limited optimisation mechanisms
Existing RDF stores
• Jena
• Sesame
Query-driven Conversion
• Data stored as relations
• Native SQL query support
– Highly optimised access methods
• SPARQL queries must be translated
Existing translation systems
• D2RQ
• SquirrelRDF
Experiment
• Time query evaluation
– Astronomy data set
• ~500MB
– 5 real queries
– No joins
• Systems compared:
– Relational DB
• MySQL v5.1.25
– RDB2RDF tools
• D2RQ v0.5.2
• SquirrelRDF v0.1
– RDF Triple stores
• Jena v2.5.6 (SDB)
• Sesame v2.1.3 (Native)
[Figure: three access paths: SQL query to a relational DB; SPARQL query to an RDB2RDF tool over a relational DB; SPARQL query to a triple store]
Performance Results
[Figure: bar chart of query evaluation time for Query 1, Query 2, Query 3, Query 5 and Query 6 on MySQL, D2RQ, SquirrelRDF (SqRDF), Jena and Sesame; y-axis from 0 to 1000. Several bars exceed the axis and carry value labels that are not recoverable here.]
The Show Stopper: Query Translation
• Each bound variable resulted in a self-join
– RDBMS cannot optimize for this
– RDBMS perform badly with self-joins
• Each row retrieved with a separate query
– 1 query becomes n queries, where n is cardinality of the relation
• Predicate selection in RDB2RDF tool
– No optimization possible
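The self-join problem can be illustrated with a toy translator in Python and SQLite. This is a sketch under loud assumptions: `naive_translate` is a hypothetical translator that gives each bound variable its own alias of the same table, and the `star` table with columns `ra` and `decl` is made-up data. It is not the algorithm used by D2RQ or SquirrelRDF.

```python
import sqlite3
from typing import List


def naive_translate(table: str, key: str, columns: List[str]) -> str:
    """Hypothetical naive translation: one table alias (self-join) per variable."""
    selects = [f"t{i}.{col}" for i, col in enumerate(columns)]
    joins = [f"JOIN {table} t{i} ON t{i}.{key} = t0.{key}"
             for i in range(1, len(columns))]
    return f"SELECT {', '.join(selects)} FROM {table} t0 " + " ".join(joins)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE star (id INTEGER PRIMARY KEY, ra REAL, decl REAL)")
conn.executemany("INSERT INTO star VALUES (?, ?, ?)",
                 [(1, 10.5, -3.0), (2, 200.25, 45.0)])

# SPARQL-style pattern { ?s :ra ?ra . ?s :decl ?decl } over the star table.
translated_sql = naive_translate("star", "id", ["ra", "decl"])
direct_sql = "SELECT ra, decl FROM star"

translated_rows = sorted(conn.execute(translated_sql).fetchall())
direct_rows = sorted(conn.execute(direct_sql).fetchall())
```

Both statements return the same rows, but the translated one joins the table to itself once per extra variable, whereas the hand-written SQL is a single scan.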
CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?
NOT CURRENTLY!
More work needed on query translation… (Gray et al, 2009)
Conclusions
• Query-based access to distributed data sources, both streaming and stored
• SNEEql and SNEE-DQP overcome data source heterogeneity
• SPARQLSTR and Semantic Integrator overcome semantic heterogeneity
Manchester Team
RAs
• Ixent Galpin
• Alasdair J. G. Gray
PhDs
• Christian Y. A. Brenninkmeijer
• Farhana Jabeen
Academics
• Alvaro A. A. Fernandes
• Norman W. Paton
MSc Students
• Jamil Naja
• Varadarajan Rajagopalan
Acknowledgements
• This work was/is funded by
– UK EPSRC through
• DIAS-MC project
• Explicator project (University of Glasgow)
– European Commission as part of the SemSorGrid4Env project
• SNEE is released under a permissive open source license; please visit: http://code.google.com/p/snee/
References
1. K. Aberer, M. Hauswirth, and A. Salehi. Infrastructure for data processing in large-scale interconnected sensor networks. In MDM 2007, pp198–205, 2007.
2. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, (to appear).
3. A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal 15(2):121–142, 2006.
4. C. Y. A. Brenninkmeijer, I. Galpin, A. A. A. Fernandes, and N. W. Paton. A semantics for a query language over sensors, streams and relations. In BNCOD 25, pp87–99, 2008.
5. I. Galpin, C. Y. A. Brenninkmeijer, F. Jabeen, A. A. A. Fernandes, and N. W. Paton. Comprehensive optimization of declarative sensor network queries. In SSDBM 2009, pp339–360, 2009.
6. A. J. G. Gray, N. Gray, and I. Ounis. Can RDB2RDF tools feasibily expose large science archives for data integration? In ESWC 2009, pp491–505, 2009.