
Distributed Query Processing over Streaming and Stored Heterogeneous Data Sources

Alasdair J G Gray
Information Management Group
University of Manchester

Seminar at University of Cardiff
26 April 2010

Overview of the Talk
• Motivation: SemSorGrid4Env
• Forms of heterogeneity
  – Data source
  – Data semantics
• Query processing
  – SNEEql
  – SNEE-DQP
  – SPARQLSTR

SemSorGrid4Env
Semantic sensor Grids for rapid application development for environmental monitoring
• Coastal and estuarine flood warning
• Fire monitoring and warning

Estuarine Flood Warning
• Sensors deployed along UK South coast
  – On and off shore
  – Bespoke hardware
  – Fixed functionality
  – Fixed data rate
    • No bursts
  – Central distribution centre
• Multitude of related data sources
  – Shipping
  – Flood defenses
  – Flooding models
  – …

Fire Detection
Deployment: Castilla y León, Northwest Spain
  – Forested region
• Wireless sensor network
  – Off-the-shelf sensor nodes
    • TMoteSky
    • TinyOS
  – Configure dynamically: ad hoc queries
  – In-network query processing
  – Controlled rate variability
• Satellite image data for the region

Abstract Problem

[Figure: architecture diagram. Labels: Integrator; Stored data service (Stored data); two Streaming data services; two Sensor Networks.]

Types of Heterogeneity
• Data source
  – Data stream
  – Query capabilities
  – Data access
• Data semantics

[Figure: the Abstract Problem diagram (Integrator, Stored data service, Streaming data services, Sensor Networks, Stored data), annotated with the types of heterogeneity listed above.]

DATA SOURCE HETEROGENEITY

Data Source Characteristics
• Traditional stored data
  – Data stored in a database
  – User observes a static data set
  – One-off query execution
• Streaming data
  – Data processed on-the-fly
    • Maybe stored for later access
  – User observes changes in data set
  – Continuous or snap-shot query execution

Types of Data Stream
Acquire-Stream
• Query controls data rate
• Informs query planning
Receive-Stream
• Source controls data rate
Potentially:
  – Unknown rate
  – Bursty data

[Figure: for an acquire-stream, the Stream Processor issues Acquire() to the Source and receives Data; for a receive-stream, the Source sends Data to the Stream Processor.]
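The difference between the two stream types can be sketched in a few lines of Python. This is only an illustration under assumed names (`acquire_stream`, `receive_stream`, `inter_arrival_gaps` and the toy readings are all hypothetical, not part of SNEE): in the acquire case the consumer fixes the sampling period, in the receive case the arrival times are whatever the source chose.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, sensed value)


def acquire_stream(sense: Callable[[int], float], period_s: int, n: int) -> Iterator[Reading]:
    """Acquire-stream: the query sets the rate by sampling the source every period_s seconds."""
    for k in range(n):
        t = k * period_s
        yield (t, sense(t))


def receive_stream(arrivals: List[Reading]) -> Iterator[Reading]:
    """Receive-stream: the source decides when tuples arrive; the processor only consumes them."""
    for reading in arrivals:
        yield reading


def inter_arrival_gaps(stream: Iterable[Reading]) -> List[int]:
    """Gaps between consecutive timestamps, used here to expose the data rate."""
    times = [t for t, _ in stream]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# The acquire-stream has a fixed, known rate that a planner could rely on.
acquired_gaps = inter_arrival_gaps(acquire_stream(lambda t: 1.0, 900, 4))
# The receive-stream may be bursty: three tuples in quick succession, then a long pause.
received_gaps = inter_arrival_gaps(receive_stream([(0, 1.0), (1, 2.0), (2, 1.5), (600, 0.3)]))
```

The fixed gaps of the acquire-stream are what "informs query planning" refers to; the receive-stream gaps are only known after the fact.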

Streaming Source Access

Pull Access
Consumer periodically polls for new data.
• Introduces processing delay

Push Access
Publisher sends data as it is produced.
• Minimises processing delays

Note:
• Orthogonal to data stream type
• Affects physical operator selection
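A small simulation makes the delay difference between pull and push access concrete. The `Source` class, the `simulate` function and the integer clock are assumptions made for this sketch; they do not describe any SNEE interface.

```python
from collections import deque
from typing import Callable, Deque, List, Set, Tuple


class Source:
    """Hypothetical source that offers both access styles."""

    def __init__(self) -> None:
        self._buffer: Deque[int] = deque()
        self._subscribers: List[Callable[[int], None]] = []

    def subscribe(self, callback: Callable[[int], None]) -> None:
        # Push access: the consumer registers once and is called on every new item.
        self._subscribers.append(callback)

    def produce(self, item: int) -> None:
        self._buffer.append(item)
        for callback in self._subscribers:
            callback(item)

    def poll(self) -> List[int]:
        # Pull access: the consumer asks for everything produced since the last poll.
        items = list(self._buffer)
        self._buffer.clear()
        return items


def simulate(production_times: Set[int], poll_period: int, horizon: int) -> Tuple[List[int], List[int]]:
    """Return (push delays, pull delays) for items produced at the given clock ticks."""
    source = Source()
    clock = {"now": 0}
    push_delays: List[int] = []
    pull_delays: List[int] = []
    # Each item carries its own production time, so delay = now - item.
    source.subscribe(lambda produced_at: push_delays.append(clock["now"] - produced_at))
    for now in range(horizon + 1):
        clock["now"] = now
        if now in production_times:
            source.produce(now)
        if now % poll_period == 0:
            pull_delays.extend(now - produced_at for produced_at in source.poll())
    return push_delays, pull_delays


push_delays, pull_delays = simulate({1, 2, 7}, poll_period=5, horizon=10)
```

With a polling period of 5 ticks, every pulled item waits until the next poll, while pushed items are seen in the tick they are produced.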

Query Processing Challenges
• Variety of data sources
  – Stored data
  – Receive-stream
  – Acquire-stream
• No common query semantics
  – Streaming data languages
  – Stored data languages
• Distributed data sources

Query Languages
• Stored (relational) data
  – SQL
• Streaming (relational) data
  – SQL extensions
  – Continuous Query Language (CQL)
  – Sensor NEtwork Engine query language (SNEEql)

Query Language: SQL
• Designed for stored data
• Contains blocking operators
  – Join
  – Aggregates
  – …
• Example system: GSN (Aberer et al, 2007)
  – Key concept: Virtual sensor
    • Wraps SQL execution
    • Controls periodic evaluation
  – Limits expressiveness

Query Language: CQL (Arasu et al, 2006)
• Designed for receive-streams
• Windows used for blocking operators
• Contains type conversion operators
  – Stream -> Window
  – Window -> Stream
• Semantics defined by implementation
• Example system: STREAM (Arasu et al, 2010?)
  – Data Stream Management System
  – No support for stored data
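Why windows make blocking operators usable on streams can be shown with a minimal sketch: an average over an unbounded stream never completes, but an average over the tuples in a bounded time window can be emitted on every arrival. The class name `TimeWindowAverage` and its eviction rule are assumptions for illustration, not CQL or STREAM code.

```python
from collections import deque
from typing import Deque, Tuple


class TimeWindowAverage:
    """Average over the tuples whose timestamps fall in the last `width` time units."""

    def __init__(self, width: int) -> None:
        self.width = width
        self._items: Deque[Tuple[int, float]] = deque()  # (timestamp, value), oldest first

    def insert(self, timestamp: int, value: float) -> float:
        """Add a tuple, evict expired ones, and return the current window average."""
        self._items.append((timestamp, value))
        # Stream -> Window: keep only tuples with timestamp in (timestamp - width, timestamp].
        while self._items[0][0] <= timestamp - self.width:
            self._items.popleft()
        values = [v for _, v in self._items]
        return sum(values) / len(values)


window = TimeWindowAverage(width=10)
averages = [
    window.insert(0, 2.0),   # window holds {2.0}
    window.insert(5, 4.0),   # window holds {2.0, 4.0}
    window.insert(12, 6.0),  # tuple at t=0 has expired
    window.insert(30, 1.0),  # everything older than t=20 has expired
]
```

The aggregate is evaluated over a finite set each time, so it never has to wait for the end of the stream.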

Query Language: SNEEql (Brenninkmeijer et al, 2008)
• Designed for acquire-streams, receive-streams, and stored data
  – Based on CQL ideas
• Well-defined semantics
  – Independent of system
• Example system: SNEE (Galpin et al, 2009)
  – In-network query evaluation
  – Acquire-streams
  – Reactive/periodic operators
  – Controls network behaviour

SNEEql Query Syntax

SELECT {RSTREAM | DSTREAM | ISTREAM}+ attribute list
FROM extent list
WHERE expression

• *STREAM optional
  – Converts a window to a stream
• Extent list:
  – Streams with windows of the form [FROM t1 TO t2 SLIDE s unit]
  – Relations with windows of the form [SCAN EVERY t1 unit]
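The three window-to-stream operators can be sketched as follows, using the definitions given for CQL (Arasu et al, 2006): RSTREAM emits the whole window at each evaluation, ISTREAM emits the tuples that entered since the previous evaluation, and DSTREAM emits those that left. The sketch assumes that SNEEql follows the same definitions, since it is "based on CQL ideas"; the function `to_stream` and the sample windows are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def to_stream(windows: Sequence[Tuple[int, List[Hashable]]], mode: str) -> List[Tuple[int, Hashable]]:
    """Convert a time-ordered sequence of (time, window contents) into timestamped tuples."""
    previous: Counter = Counter()
    out: List[Tuple[int, Hashable]] = []
    for time, tuples in windows:
        current = Counter(tuples)  # windows are bags, so multiplicities matter
        if mode == "RSTREAM":
            emitted = current
        elif mode == "ISTREAM":
            emitted = current - previous  # in the window now, not at the previous evaluation
        elif mode == "DSTREAM":
            emitted = previous - current  # in the previous window, gone now
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.extend((time, item) for item in sorted(emitted.elements()))
        previous = current
    return out


windows = [(0, ["a"]), (1, ["a", "b"]), (2, ["b"])]
rstream = to_stream(windows, "RSTREAM")
istream = to_stream(windows, "ISTREAM")
dstream = to_stream(windows, "DSTREAM")
```

For these windows, RSTREAM repeats "a" at times 0 and 1, ISTREAM reports each tuple once when it first appears, and DSTREAM reports "a" only at time 2, when it drops out.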

An Example from Hydrology
• Investigating water drainage in the Peak District
  – Hilly terrain
  – Peat bogs
  – River in valley bottom
• WSN measuring
  – Rainfall and
  – River depth
• WSN schema:
  – river (rain: int, depth: int)
    • Sites (5, 6, 7, 9)
  – hilltop (rain: int)
    • Sites (4)

[Figure: sensor network topology with nodes numbered 0 to 9.]

Example Multi-source Query

Every 15 minutes, and within 24 hours of their being taken, we wish to obtain time-correlated measurements of the river depth now and the rainfall at the top of the hill 15 minutes before, provided that it is now raining less at the river than it was on the hilltop, and that the hilltop rainfall was above 5mm and at least the average rainfall for the region.

SELECT RSTREAM r.time, h.rain, r.depth
FROM River[NOW] r, Hilltop[AT NOW-15 MINUTES] h
WHERE h.rain > 5
  AND r.rain < h.rain
  AND h.rain >= (SELECT AVG(weather.rain)
                 FROM Weather[RESCAN EVERY DAY]
                 WHERE weather.region = 'Peak District');

Acquisition rate = 15 min; Max delivery time = 24 hours;
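The intent of the example query can be checked against toy data with a plain Python sketch. The function `correlate`, the dictionaries and the constant `avg_rain` (standing in for the result of the nested query over the stored Weather relation) are assumptions for illustration; this is not how SNEE evaluates the query.

```python
from typing import Dict, List, Tuple


def correlate(
    river: Dict[int, Tuple[int, int]],  # minute -> (rain, depth) measured at the river
    hilltop: Dict[int, int],            # minute -> rain measured on the hilltop
    avg_rain: float,                    # stand-in for AVG(weather.rain), rescanned daily
    lag_minutes: int = 15,
) -> List[Tuple[int, int, int]]:
    """Pair each river reading with the hilltop reading taken lag_minutes earlier."""
    results = []
    for t in sorted(river):
        river_rain, depth = river[t]
        hill_rain = hilltop.get(t - lag_minutes)
        if hill_rain is None:
            continue  # no hilltop reading for the earlier instant
        if hill_rain > 5 and river_rain < hill_rain and hill_rain >= avg_rain:
            results.append((t, hill_rain, depth))  # (r.time, h.rain, r.depth)
    return results


river = {15: (2, 30), 30: (7, 35), 45: (1, 50)}
hilltop = {0: 8, 15: 6, 30: 4}
matches = correlate(river, hilltop, avg_rain=5.5)
```

Only the reading at minute 15 qualifies: at minute 30 the river rain is not below the earlier hilltop rain, and at minute 45 the earlier hilltop rain is not above 5.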

SNEE-DQP Query Stack
• Metadata
  – Logical schema
  – Physical schema
• Source Allocation
  – Splitting the query into parts for each data source
• Source Planning
  – Physical operator selection
  – Generate plan for source

[Figure: query stack. Inputs: SNEEql query + QoS, and Metadata. Stages: Parsing, Logical Planning, Source Allocation, Source Planning. Output: Query Execution Plan.]
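The source allocation step, splitting a query into parts for each data source, can be sketched as a lookup of each extent in the physical schema. The schema contents, the source names and the function `allocate` are hypothetical; the real SNEE-DQP metadata format is not shown in the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical physical schema: extent name -> (data source, kind of source).
PHYSICAL_SCHEMA: Dict[str, Tuple[str, str]] = {
    "river": ("peak-district-wsn", "acquire-stream"),
    "hilltop": ("peak-district-wsn", "acquire-stream"),
    "weather": ("met-archive", "stored"),
}


def allocate(extents: List[str], physical_schema: Dict[str, Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group the extents referenced by a query by the data source that provides them."""
    parts: Dict[str, List[str]] = defaultdict(list)
    for extent in extents:
        source, _kind = physical_schema[extent.lower()]
        parts[source].append(extent)
    return dict(parts)


# Extents of the example multi-source query.
query_parts = allocate(["River", "Hilltop", "Weather"], PHYSICAL_SCHEMA)
```

Each group can then be handed to source planning, which picks physical operators suited to that kind of source.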

Example: Query Plan

SELECT RSTREAM r.time, h.rain, r.depth
FROM River[NOW] r, Hilltop[AT NOW - 15 MINUTES] h
WHERE h.rain > 5
  AND r.rain < h.rain
  AND h.rain >= (SELECT AVG(weather.rain)
                 FROM Weather[RESCAN EVERY DAY]
                 WHERE weather.region = 'Peak District');

[Figure: query plan operator tree. Operator labels:
  • DELIVER
  • JOIN, h.rain >= AVG(weather.rain)
  • EXCHANGE (two instances)
  • JOIN, river.rain < hilltop.rain
  • ACQUIRE[time, rain, depth], predicate true, extent river, EVERY 15 min
  • TIME_WINDOW[t-15, t-15, 15]
  • ACQUIRE[time, rain], predicate rain > 5, extent hilltop, EVERY 15 min
  • AVERAGE(rain)
  • SCAN[rain], predicate region = 'Peak District', extent weather, EVERY DAY]

In-Network SNEE
• Two-phase DQP
  – Single-site
    • Push /
  – Multi-site
    • Steiner tree routing to reduce energy
    • Operator placement
      – Location sensitive
      – Reduce transmission
    • Data buffering
    • Time division agenda
    • nesC code generated

[Figure: SNEE compilation stack, divided into a single-site phase and a multi-site phase. Inputs: <query, QoS-expectations> and <cost parameters, schemas, description(node, network)>. Steps (numbered 1 to 8 in the figure): parsing/type checking, translation/rewriting, algorithm assignment, routing, partitioning, where-scheduling, when-scheduling, code generation. Intermediate forms: abstract-syntactic tree, logical-algebraic form, physical-algebraic form (PAF), routing tree (RT), fragmented-algebraic form, distributed-algebraic form (DAF), agenda. Output: nesC code for nodes <N1, …, Nm>.]

SNEE now and future

SNEE Now
• In-network SNEE
  – Acquire-streams
  – Quality of service aware
    • Expected lifetime
    • Total energy consumption
    • Delivery time
    • Acquisition rate
  – Runs on:
    • Simulators
    • TMoteSkys
    • TinyNodes
• Out-of-network SNEE
  – Receive-streams
  – Pull-based data sources

SNEE Future
• SNEE-DQP (within 2010)
  – Combine in-network and out-of-network versions
  – Stored relations
  – Push-based data sources
• Beyond 2010
  – Model building inside queries
  – Greater resilience for in-network execution

SEMANTIC HETEROGENEITY

A Data Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Homogeneous view
  – Mediated global schema
• Mapping
  – Local-as-View
  – Global-as-View

Relies on agreement of a common global schema

[Figure: queries Query1 … Queryn are posed against a Global Schema, which is linked by Mappings to the wrapped sources Wrapper1/DB1, …, Wrapperi/DBi, …, Wrapperk/DBk.]
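A Global-as-View mapping can be sketched as a global relation defined by a view over the local sources, so that a query on the global schema is answered by unfolding the view. The two sources, their attribute names and units, and the functions below are all hypothetical; they only illustrate the mapping style, not any mapping used in SemSorGrid4Env.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical autonomous sources, each with its own local schema and units.
SOURCE_A: List[Dict] = [{"site": 4, "rain_mm": 8.0}, {"site": 5, "rain_mm": 2.0}]
SOURCE_B: List[Dict] = [{"station": "hilltop-2", "rain_cm": 0.75}]


def global_rainfall() -> Iterator[Tuple[str, float]]:
    """GAV mapping: the global relation rainfall(site, mm) is a view over both sources."""
    for row in SOURCE_A:
        yield (str(row["site"]), row["rain_mm"])
    for row in SOURCE_B:
        yield (row["station"], row["rain_cm"] * 10.0)  # convert cm to mm


def sites_with_heavy_rain(threshold_mm: float) -> List[str]:
    """A query posed purely against the global schema; the sources stay hidden."""
    return [site for site, mm in global_rainfall() if mm > threshold_mm]


heavy = sites_with_heavy_rain(5.0)
```

The user of `sites_with_heavy_rain` never sees `rain_cm` or `station`, which is the homogeneous view the slide describes. In Local-as-View the direction is reversed: each source is described as a view over the global schema, and queries must be rewritten in terms of those views.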

P2P Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Heterogeneous views
  – Multiple schemas
• Mappings
  – From sources to common schema
  – Between pairs of schemas
• Require common integration data model

[Figure: queries Query1 … Queryn are posed against multiple schemas (Schema1, …, Schemaj), which are linked by Mappings to the wrapped sources Wrapper1/DB1, …, Wrapperi/DBi, …, Wrapperk/DBk.]

Semantic Integrator

[Figure: Semantic Integrator architecture. Components: Query Resolver, Data Resolver, Query Translator, Data Translator, SNEE-DQP. Mappings: O2O Mappings, S2O Mappings. Query forms and their answers: Q and [[Q]], Q' and [[Q']], q and [[q]]. Tuples flow from the streaming sources and stored data.]

P2P Integration Approach
• Heterogeneous sources
  – Autonomous
  – Local schemas
• Heterogeneous views
  – Multiple schemas
• Mappings
  – From sources to common schema
  – Between pairs of schemas
• Require common integration data model

Can RDF do this?

[Figure: queries Query1 … Queryn are posed against multiple schemas (Schema1, …, Schemaj), which are linked by Mappings to the wrapped sources Wrapper1/DB1, …, Wrapperi/DBi, …, Wrapperk/DBk.]

CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?

A word of warning …

RDB2RDF: Two Approaches

Extract-Transform-Load
• Data replicated as RDF
  – Data can become stale
• Native SPARQL query support
  – Limited optimisation mechanisms
Existing RDF stores
• Jena
• Sesame

Query-driven Conversion
• Data stored as relations
• Native SQL query support
  – Highly optimised access methods
• SPARQL queries must be translated
Existing translation systems
• D2RQ
• SquirrelRDF

Experiment
• Time query evaluation
  – Astronomy data set
    • ~500MB
  – 5 real queries
  – No joins
• Systems compared:
  – Relational DB
    • MySQL v5.1.25
  – RDB2RDF tools
    • D2RQ v0.5.2
    • SquirrelRDF v0.1
  – RDF Triple stores
    • Jena v2.5.6 (SDB)
    • Sesame v2.1.3 (Native)

[Figure: the three configurations compared: SQL query over a Relational DB; SPARQL query over an RDB2RDF tool over a Relational DB; SPARQL query over a Triple store.]

Performance Results

[Figure: bar chart of query evaluation times for Query 1, Query 2, Query 3, Query 5 and Query 6, comparing MySQL, D2RQ, SqRDF, Jena and Sesame. The vertical axis runs from 0 to 1000. Bars that exceed the axis carry value labels: 3,450; 5,339; 21,492; 485,932; 2,733; 7,229; 4,090; 1,307; 17,793; 7,468; 19,984; 372,561.]

The Show Stopper: Query Translation
• Each bound variable resulted in a self-join
  – RDBMS cannot optimize for this
  – RDBMS perform badly with self-joins
• Each row retrieved with a separate query
  – 1 query becomes n queries, where n is cardinality of the relation
• Predicate selection in RDB2RDF tool
  – No optimization possible
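Two of these translation patterns can be reproduced on a toy table with Python's built-in sqlite3 module. The table `star`, the `triples` view and the three fetch functions are assumptions made for this sketch; they imitate the shape of the generated SQL described above and are not the SQL that D2RQ or SquirrelRDF actually emit.

```python
import sqlite3


def build_db(n_rows):
    """Create a toy in-memory table standing in for a science archive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE star (id INTEGER PRIMARY KEY, ra REAL, decl REAL)")
    conn.executemany(
        "INSERT INTO star VALUES (?, ?, ?)",
        [(i, i * 1.5, i * 0.5) for i in range(1, n_rows + 1)],
    )
    # Triple view of the table: one (subject, predicate, object) row per cell.
    conn.execute(
        "CREATE VIEW triples AS "
        "SELECT id AS s, 'ra' AS p, ra AS o FROM star "
        "UNION ALL SELECT id, 'decl', decl FROM star"
    )
    return conn


def fetch_direct(conn):
    """The hand-written SQL: one query, no joins."""
    return conn.execute("SELECT id, ra, decl FROM star ORDER BY id").fetchall()


def fetch_self_join(conn):
    """Pattern 1: each variable in the graph pattern adds a self-join on the triple view."""
    return conn.execute(
        "SELECT t1.s, t1.o, t2.o FROM triples t1 JOIN triples t2 ON t1.s = t2.s "
        "WHERE t1.p = 'ra' AND t2.p = 'decl' ORDER BY t1.s"
    ).fetchall()


def fetch_row_at_a_time(conn):
    """Pattern 2: one query lists the rows, then each row is fetched separately."""
    queries = 1
    ids = [row[0] for row in conn.execute("SELECT id FROM star ORDER BY id")]
    rows = []
    for star_id in ids:
        rows.append(
            conn.execute("SELECT id, ra, decl FROM star WHERE id = ?", (star_id,)).fetchone()
        )
        queries += 1
    return rows, queries


conn = build_db(20)
direct = fetch_direct(conn)
joined = fetch_self_join(conn)
row_wise, query_count = fetch_row_at_a_time(conn)
```

All three functions return the same rows, but the third issues 21 statements for a 20-row table, which is the "1 query becomes n queries" behaviour; on a ~500MB archive that multiplier dominates the evaluation time.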

CAN RDB2RDF TOOLS FEASIBLY EXPOSE LARGE SCIENCE ARCHIVES FOR DATA INTEGRATION?

NOT CURRENTLY!

More work needed on query translation… (Gray et al, 2009)

Conclusions
• Query-based access to distributed data sources, both streaming and stored
• SNEEql and SNEE-DQP overcome data source heterogeneity
• SPARQLSTR and Semantic Integrator overcome semantic heterogeneity

Manchester Team

RAs
• Ixent Galpin
• Alasdair J. G. Gray

PhDs
• Christian Y. A. Brenninkmeijer
• Farhana Jabeen

Academics
• Alvaro A. A. Fernandes
• Norman W. Paton

MSc Students
• Jamil Naja
• Varadarajan Rajagopalan

Acknowledgements
• This work was/is funded by
  – UK EPSRC through
    • DIAS-MC project
    • Explicator project (University of Glasgow)
  – European Commission as part of the SemSorGrid4Env project
• SNEE is released under a permissive open source license; please visit: http://code.google.com/p/snee/

References
1. K. Aberer, M. Hauswirth, and A. Salehi. Infrastructure for data processing in large-scale interconnected sensor networks. In MDM 2007, pp198–205, 2007.
2. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. STREAM: The Stanford data stream management system. In M. Garofalakis, J. Gehrke, and R. Rastogi, editors, Data Stream Management: Processing High-Speed Data Streams. Springer, (to appear).
3. A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. VLDB Journal 15(2):121–142, 2006.
4. C. Y. A. Brenninkmeijer, I. Galpin, A. A. A. Fernandes, and N. W. Paton. A semantics for a query language over sensors, streams and relations. In BNCOD 25, pp87–99, 2008.
5. I. Galpin, C. Y. A. Brenninkmeijer, F. Jabeen, A. A. A. Fernandes, and N. W. Paton. Comprehensive optimization of declarative sensor network queries. In SSDBM 2009, pp339–360, 2009.
6. A. J. G. Gray, N. Gray, and I. Ounis. Can RDB2RDF tools feasibly expose large science archives for data integration? In ESWC 2009, pp491–505, 2009.

