Integrating Distributed Data Streams


Alasdair J G Gray

Supervisors: M Howard Williams, Werner Nutt

7th June 2007

Overview

The problem

Limits of current technology

Proposed system: Architecture, Query Answering, Performance

Conclusions

Streams of Data

Main sources: sensors

Characteristics: Unbounded, Append only, Frequency

Managed by: Sensor networks, Network/Grid monitoring, Ubiquitous/Pervasive computing environments

Figure label: Reading

Streams are everywhere

Figure labels: Internet, Grid, Job progress, Bookkeeping, Monitoring data

Grid Monitoring

Resources supplied by various institutions

Resources publish status information

Scheduler must allocate jobs to resources

Bookkeeping tracks resource usage

Users track job progress

Requirements

Ability to:

Publish distributed streams of data

Query multiple streams with no knowledge of: existence of source streams, location of individual streams, access methods to individual streams

Scale to large numbers of users and sources

Data Integration System

Several distributed data sources

Users send query to Mediator

Mediator translates user query into sub-queries

Mediator combines results of sub-queries

Only for stored data sources

Figure labels: DB1, DB2, Mediator

Stream Management System

Data streams into the server

Server applies long-standing queries to the streams

Answers streamed out

Users need to know which streams exist

Centralised server

Solution

Need a system that combines:

Ability to access multiple sources without specific source knowledge (data integration)

Ability to process streams of data (stream processing)

A Stream Integration System!

Stream Integration System 1

Producer publishes streams of data

Consumer queries for streams of data

Registry matches consumer requests with publications

Figure labels: Producer (three), Consumer (two), Registry

Publishing Monitoring Data

Stream data can be represented in terms of relations with:

Keys: "what" and "where"

Measurements: the "value"

Timestamps: "when"

For example, Network ThroughPut:

NTP (from, to, tool, psize, latency, timestamp)

One reading is a tuple in the relation:

('hw', 'ral', 'ping', 32, 11.1, 2005-06-24-15:05:34)
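A minimal Python sketch of one NTP reading as a typed record. The class name, the field names `source` and `dest` (standing in for `from` and `to`, since `from` is a Python keyword) and the split into key and measurement attributes are assumptions made for illustration. The slide only gives the schema and the example tuple.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class NTPReading:
    # Key attributes, the "what" and "where" (assumed split, see lead-in).
    source: str   # the slide's `from`
    dest: str     # the slide's `to`
    tool: str
    psize: int
    # Measurement attribute, the "value".
    latency: float
    # Timestamp attribute, the "when".
    timestamp: datetime

    def key(self):
        """Return the key part of the reading."""
        return (self.source, self.dest, self.tool, self.psize)


def parse_reading(raw):
    """Build a reading from a 6-tuple whose timestamp is a string."""
    source, dest, tool, psize, latency, ts = raw
    return NTPReading(source, dest, tool, int(psize), float(latency),
                      datetime.strptime(ts, "%Y-%m-%d-%H:%M:%S"))


# The example tuple from the slide.
reading = parse_reading(("hw", "ral", "ping", 32, 11.1, "2005-06-24-15:05:34"))
```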

Consuming Monitoring Data

Users are interested in how the grid changes over time. For example,

1. Latency for large packets sent from hw

2. Links with a low latency as recorded by the PingER tool

These can be expressed as SQL selection queries

q1: σ_{from='hw' ∧ psize≥1024}(NTP)

q2: σ_{tool='ping' ∧ latency≤10.0}(NTP)
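The two selection queries can be sketched as predicates applied lazily to a stream of readings. This is an illustration only: the dictionary representation and the `select` helper are assumptions, and every sample reading except the first is made up.

```python
def select(stream, predicate):
    """Lazily yield the tuples of a (possibly unbounded) stream that satisfy the predicate."""
    for t in stream:
        if predicate(t):
            yield t


def q1(t):
    # Latency for large packets sent from hw.
    return t["from"] == "hw" and t["psize"] >= 1024


def q2(t):
    # Links with a low latency as recorded by the ping tool.
    return t["tool"] == "ping" and t["latency"] <= 10.0


# Illustrative readings; only the first one appears in the slides.
readings = [
    {"from": "hw", "to": "ral", "tool": "ping", "psize": 32, "latency": 11.1},
    {"from": "hw", "to": "ral", "tool": "udp", "psize": 2048, "latency": 14.2},
    {"from": "ral", "to": "an", "tool": "ping", "psize": 32, "latency": 4.7},
    {"from": "hw", "to": "an", "tool": "ping", "psize": 1024, "latency": 9.0},
]

q1_answers = list(select(readings, q1))
q2_answers = list(select(readings, q2))
```

Because `select` is a generator, it never needs the whole stream in memory, which matters for unbounded, append-only data.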

What is an Answer to a Query?

Global relations contain no tuples (virtual relation)

Need to translate into query over sources

An answer stream should be:

Sound

Complete

Duplicate free

Weakly ordered: all tuples that share the same key value will be in timestamp order

Order in general is difficult in a distributed setting

Weak order sufficient for more complex queries such as aggregates
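Two of these properties, duplicate freeness and weak order, can be checked on a finite prefix of an answer stream. The sketch below assumes tuples in the NTP layout with integer timestamps; the function name and the sample data are hypothetical. Soundness and completeness cannot be checked this way, since they depend on what the sources published.

```python
def check_answer_stream(tuples, key_of, timestamp_of):
    """Return (duplicate_free, weakly_ordered) for a finite prefix of a stream."""
    seen = set()        # every tuple seen so far
    latest = {}         # key -> largest timestamp seen for that key
    duplicate_free = True
    weakly_ordered = True
    for t in tuples:
        if t in seen:
            duplicate_free = False
        seen.add(t)
        key, ts = key_of(t), timestamp_of(t)
        # Weak order only compares tuples that share the same key.
        if key in latest and ts < latest[key]:
            weakly_ordered = False
        latest[key] = max(ts, latest.get(key, ts))
    return duplicate_free, weakly_ordered


def key_of(t):
    return t[:4]        # (from, to, tool, psize)


def timestamp_of(t):
    return t[5]


# Different keys interleaved out of global order: still weakly ordered.
interleaved = [
    ("hw", "ral", "ping", 32, 11.1, 1),
    ("hw", "an", "ping", 32, 3.0, 5),
    ("hw", "ral", "ping", 32, 10.9, 3),
]
# Same key, later timestamp first: weak order is violated.
out_of_order = [
    ("hw", "ral", "ping", 32, 10.9, 8),
    ("hw", "ral", "ping", 32, 11.1, 3),
]

result_interleaved = check_answer_stream(interleaved, key_of, timestamp_of)
result_out_of_order = check_answer_stream(out_of_order, key_of, timestamp_of)
```

The checker remembers every tuple it has seen, so it is a test helper for bounded samples rather than something to run against an unbounded stream.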


Query Planning: Consumer Query

Satisfiability used to find relevant producers

S1: from='hw' ∧ tool='udp'

S2: from='hw' ∧ tool='ping'

S3: from='ral' ∧ tool='ping'

S4: from='ral' ∧ tool='udp'

S5: from='an' ∧ tool='ping'

q1: from='hw' ∧ psize≥1024

q2: tool='ping' ∧ latency≤10.0
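The relevance test can be sketched as a satisfiability check on the conjunction of a producer's condition and the query's condition. The sketch assumes conditions are conjunctions of atoms using only `=`, `>=` and `<=`; the atom representation and function names are hypothetical and not taken from the system itself.

```python
import math


def satisfiable(*conditions):
    """True if the conjunction of all atoms in all conditions can be satisfied.

    Each condition is a list of (attribute, operator, value) atoms.
    """
    eq, lo, hi = {}, {}, {}
    for cond in conditions:
        for attr, op, value in cond:
            if op == "=":
                # Two different equalities on one attribute clash.
                if eq.setdefault(attr, value) != value:
                    return False
            elif op == ">=":
                lo[attr] = max(lo.get(attr, value), value)
            elif op == "<=":
                hi[attr] = min(hi.get(attr, value), value)
            else:
                raise ValueError(f"unsupported operator: {op}")
    for attr in set(lo) | set(hi):
        low, high = lo.get(attr, -math.inf), hi.get(attr, math.inf)
        if low > high:
            return False
        if attr in eq and not (low <= eq[attr] <= high):
            return False
    return True


producers = {
    "S1": [("from", "=", "hw"), ("tool", "=", "udp")],
    "S2": [("from", "=", "hw"), ("tool", "=", "ping")],
    "S3": [("from", "=", "ral"), ("tool", "=", "ping")],
    "S4": [("from", "=", "ral"), ("tool", "=", "udp")],
    "S5": [("from", "=", "an"), ("tool", "=", "ping")],
}
q1 = [("from", "=", "hw"), ("psize", ">=", 1024)]
q2 = [("tool", "=", "ping"), ("latency", "<=", 10.0)]


def relevant_producers(query):
    """Names of producers whose condition is jointly satisfiable with the query."""
    return [name for name, cond in producers.items() if satisfiable(cond, query)]


relevant_q1 = relevant_producers(q1)
relevant_q2 = relevant_producers(q2)
```

Under these assumptions q1 is matched with S1 and S2, and q2 with S2, S3 and S5.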

Scalability is an Issue

Problem: Every consumer contacting every producer of interest does not scale

Even a small Grid of fewer than a dozen sites has problems

Grids may contain thousands of resources, for example, the Large Hadron Collider Computing Grid (LCG)

Republishers Allow the System to Scale

A republisher:

Consumes answers to a selection query

Merges "trickles" into streams

Publishes: Answer stream, Latest-state answer, History

Problem: Choice in where to obtain information

Figure labels: Producer S1, Producer S2, Republisher

Query Planning in the Presence of Republishers

Find all relevant publishers

Rank according to data provided

Meta query plan contains choice

Query plan uses one of R1 or R3

R1: from='hw'

R2: from='ral'

R3: from='hw' ∧ tool='ping'

q2: tool='ping' ∧ latency≤10.0
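One way to picture the choice in the meta query plan: restrict each relevant republisher's condition by the query and group the republishers that end up offering the same data. The sketch below compares conditions syntactically, as sets of atoms. That is an assumption standing in for the ranking the system performs, which the slides do not spell out. All three republishers are taken to have been found relevant already.

```python
def meta_query_plan(query, publishers):
    """Group publishers that offer the same data once restricted to the query.

    Each inner list is a set of alternatives; a concrete plan picks one per list.
    """
    groups = {}
    for name, cond in publishers.items():
        # Data offered for this query: publisher condition AND query condition.
        offered = frozenset(cond) | frozenset(query)
        groups.setdefault(offered, []).append(name)
    return list(groups.values())


republishers = {
    "R1": [("from", "=", "hw")],
    "R2": [("from", "=", "ral")],
    "R3": [("from", "=", "hw"), ("tool", "=", "ping")],
}
q2 = [("tool", "=", "ping"), ("latency", "<=", 10.0)]

meta_plan = meta_query_plan(q2, republishers)
# A concrete query plan takes one alternative from each group.
query_plan = [alternatives[0] for alternatives in meta_plan]
```

For q2, R1 and R3 both reduce to from='hw' ∧ tool='ping' ∧ latency≤10.0, so they land in the same group and the plan may use either one, alongside R2.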

Weak Order is not Guaranteed

Tuples for same channel

(3) published before (8)

Arrive at consumer in wrong order

q2: tool='ping' ∧ latency≤10.0

Figure labels: latency≤5.0, latency>5.0, slow link; published order (3) (8), arrival order (8) (3)

Generating Well Formed Query Plans

A publisher is relevant for a global query if

1. Conditions are satisfiable, and

2. All measurements that agree on their key values come from the same publisher

The measurement condition can be checked using entailment.

Previous example was well formed.

Query Re-Planning

Queries are long-lived

Set of publishers can change

Query plans should reflect changes

How does a new Republisher affect our Consumers?

Find consumers for which R4 is relevant

Compare R4 to publishers in Meta Query Plan

R4: TRUE

q2: tool='ping' ∧ latency≤10.0

Planning a Republisher Query

Applying Consumer planning techniques results in a problem

R4: TRUE

Problem:

Hierarchy contains cycles

Republishers disconnected from Producers

Desirable Properties for a Hierarchy

Correctness: streams answer queries

Cycle freeness: loops can lead to duplicates

Uniqueness: hierarchy defined for a set of publishers

Local planning: Publishers and Consumers only need to communicate with the Registry

Generating Well Formed Hierarchies

Need a stricter relevance criterion

R1 can consume from R2 iff

1. Everything R2 offers is relevant to R1, and

2. R1 offers something R2 does not.

Can be checked by entailment

Ensures:

No loops in the hierarchy

Republishers connected to the Producers
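The stricter criterion can be sketched with entailment between conditions. The sketch assumes each republisher's condition is a satisfiable conjunction of equality atoms, stored as a dictionary with the empty dictionary standing for TRUE. The function names and this representation are hypothetical.

```python
def entails(a, b):
    """True if condition a entails condition b.

    For conjunctions of equality atoms, a entails b when every atom of b
    also appears in a.
    """
    return b.items() <= a.items()


def can_consume_from(consumer, source):
    """The stricter relevance criterion for republisher hierarchies."""
    # 1. Everything the source offers is relevant to the consumer.
    everything_relevant = entails(source, consumer)
    # 2. The consumer offers something the source does not.
    offers_more = not entails(consumer, source)
    return everything_relevant and offers_more


republishers = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
    "R4": {},  # TRUE
}

hierarchy = {
    name: [other for other, cond in republishers.items()
           if other != name and can_consume_from(republishers[name], cond)]
    for name in republishers
}
```

Under these assumptions R4 may consume from R1, R2 and R3, R1 may consume from R3, and nothing may consume from R4. Since the criterion amounts to strict containment of the offered data, two republishers can never consume from each other.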

Re-Planning, Re-Visited!

Stricter relevance criterion

Republishers only consume from publishers below them

R4: TRUE

R4 is not relevant for R1

Republishers' Effect on Latency

Tuple published by producer

Tuple passes through some number of republishers

Tuple arrives at consumer

Republishers add to the time taken!

Performance Measure

Plot axes: Number of Republishers, Average delivery time

Conclusions

Distributed streams of data are increasingly being made available

Distributed users interested in multiple streams

Developed a system for:

Publishing distributed data streams

Querying multiple stream sources without source knowledge

Republishers required to allow system to scale

Future Work

Increase complexity of query language

Integrate stored and stream sources
