Integrating Distributed Data Streams
Alasdair J G Gray
Supervisors: M Howard Williams, Werner Nutt
7th June 2007
Overview
The problem
Limits of current technology
Proposed system
  Architecture
  Query Answering
  Performance
Conclusions
Streams of Data
Main sources: sensors
Characteristics:
  Unbounded
  Append only
  Frequency
Managed by:
  Sensor networks
  Network/Grid monitoring
  Ubiquitous/Pervasive computing environments
Reading
Streams are everywhere
Internet
Grid
Job progress
Bookkeeping
Monitoring data
Grid Monitoring
Resources supplied by various institutions
Resources publish status information
Scheduler must allocate jobs to resources
Bookkeeping tracks resource usage
Users track job progress
Requirements
Ability to:
  Publish distributed streams of data
  Query multiple streams with no knowledge of
    Existence of source streams
    Location of individual streams
    Access methods to individual streams
  Scale to large numbers of users and sources
Data Integration System
Several distributed data sources
Users send query to Mediator
Mediator
  Translates user query into sub-queries
  Combines results of sub-queries
Only for stored data sources
[Figure: a Mediator over data sources DB1, DB2 and DB3]
Stream Management System
Data streams into the server
Server applies long-standing queries to the streams
Answers streamed out
Users need to know which streams exist
Centralised server
Solution
Need a system that combines:
  Ability to access multiple sources without specific source knowledge
    Data integration
  Ability to process streams of data
    Stream processing
A Stream Integration System!
Stream Integration System 1
Producers publish streams of data
Consumers query for streams of data
Registry matches consumer requests with publications
[Figure: three Producers, two Consumers and the Registry]
Publishing Monitoring Data
Stream data can be represented in terms of relations with
  Keys: "what" and "where"
  Measurements: the "value"
  Timestamps: "when"
For example, Network ThroughPut
  NTP (from, to, tool, psize, latency, timestamp)
One reading is a tuple in the relation
  ('hw', 'ral', 'ping', 32, 11.1, 2005-06-24-15:05:34)
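As a rough illustration of the relational view above, one NTP reading could be modelled in Python as follows. The field name `from_` (because `from` is a Python keyword), the helper `key_of`, and the choice of `from`, `to`, `tool` and `psize` as the key attributes are assumptions made for this sketch; the slides only state that latency is a measured value and that each reading carries a timestamp.

```python
from datetime import datetime
from typing import NamedTuple


class NTP(NamedTuple):
    """One reading of the Network ThroughPut relation."""
    from_: str           # key: source site ("from" is reserved in Python)
    to: str              # key: destination site
    tool: str            # key: measuring tool
    psize: int           # key (assumed): packet size
    latency: float       # measurement: the "value"
    timestamp: datetime  # the "when"


# Assumption for this sketch: these four attributes form the key.
KEY_ATTRS = ("from_", "to", "tool", "psize")


def key_of(reading: NTP) -> tuple:
    """Return the key part ("what" and "where") of a reading."""
    return tuple(getattr(reading, attr) for attr in KEY_ATTRS)


# The example tuple from the slide.
reading = NTP("hw", "ral", "ping", 32, 11.1, datetime(2005, 6, 24, 15, 5, 34))
print(key_of(reading))
```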
Consuming Monitoring Data
Users are interested in how the grid changes over time. For example,
1. Latency for large packets sent from hw
2. Links with a low latency as recorded by the PingER tool
These can be expressed as SQL selection queries
q1: σ_{from='hw' ∧ psize≥1024}(NTP)
q2: σ_{tool='ping' ∧ latency≤10.0}(NTP)
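The two selection queries can be sketched as Python predicates applied to a stream of readings. This is only an illustration: readings are plain dictionaries, `select` is a hypothetical helper, and all sample readings except the first are invented for the example.

```python
def q1(t):
    """Latency for large packets sent from hw."""
    return t["from"] == "hw" and t["psize"] >= 1024


def q2(t):
    """Links with a low latency as recorded by ping."""
    return t["tool"] == "ping" and t["latency"] <= 10.0


def select(predicate, stream):
    """Continuous selection: yield each tuple that satisfies the predicate."""
    for t in stream:
        if predicate(t):
            yield t


# Sample readings; only the first comes from the slides.
readings = [
    {"from": "hw", "to": "ral", "tool": "ping", "psize": 32, "latency": 11.1},
    {"from": "hw", "to": "ral", "tool": "udp", "psize": 2048, "latency": 25.0},
    {"from": "ral", "to": "hw", "tool": "ping", "psize": 32, "latency": 4.2},
]

q1_answers = list(select(q1, readings))
q2_answers = list(select(q2, readings))
```

Because `select` is a generator, it also works on an unbounded input and emits answers as readings arrive.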
What is an Answer to a Query?
Global relations contain no tuples (virtual relation)
Need to translate into query over sources
An answer stream should be
  Sound
  Complete
  Duplicate free
  Weakly ordered: all tuples that share the same key value will be in timestamp order
Order in general is difficult in a distributed setting
Weak order sufficient for more complex queries such as aggregates
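The weak-order property can be checked mechanically. The sketch below assumes readings are dictionaries with an integer `timestamp` and a single hypothetical key attribute `channel`; it is meant to make the definition concrete, not to describe how the system enforces it.

```python
def is_weakly_ordered(stream, key_attrs=("channel",)):
    """True if tuples sharing the same key value appear in timestamp order."""
    last_seen = {}  # key value -> latest timestamp seen so far
    for t in stream:
        key = tuple(t[a] for a in key_attrs)
        if key in last_seen and t["timestamp"] < last_seen[key]:
            return False
        last_seen[key] = t["timestamp"]
    return True


# Not globally ordered (3, 1, 8), but each channel is in order.
weak = [
    {"channel": "A", "timestamp": 3},
    {"channel": "B", "timestamp": 1},
    {"channel": "A", "timestamp": 8},
]

# Same channel, (8) delivered before (3): weak order is violated.
broken = [
    {"channel": "A", "timestamp": 8},
    {"channel": "A", "timestamp": 3},
]
```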
Query Planning: Consumer Query
Satisfiability used to find relevant producers
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'
q1: from='hw' ∧ psize≥1024
q2: tool='ping' ∧ latency≤10.0
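A minimal sketch of the satisfiability test, assuming each condition is a conjunction of atoms written as `(attribute, operator, value)` triples with operators `=`, `>=` and `<=`. A producer is treated as relevant when its condition and the query condition can hold together. The representation and the function names are assumptions for illustration; the slides do not specify how the Registry implements the check.

```python
def satisfiable(*conditions):
    """True if the conjunction of all given conditions can hold for some tuple."""
    eq, lo, hi = {}, {}, {}
    for cond in conditions:
        for attr, op, value in cond:
            if op == "=":
                if attr in eq and eq[attr] != value:
                    return False  # two different required values
                eq[attr] = value
            elif op == ">=":
                lo[attr] = max(lo.get(attr, value), value)
            elif op == "<=":
                hi[attr] = min(hi.get(attr, value), value)
            else:
                raise ValueError(f"unsupported operator: {op}")
    for attr in set(lo) | set(hi):
        low = lo.get(attr, float("-inf"))
        high = hi.get(attr, float("inf"))
        if low > high:
            return False  # empty numeric range
        if attr in eq and not (low <= eq[attr] <= high):
            return False  # required value falls outside the range
    return True


PRODUCERS = {
    "S1": [("from", "=", "hw"), ("tool", "=", "udp")],
    "S2": [("from", "=", "hw"), ("tool", "=", "ping")],
    "S3": [("from", "=", "ral"), ("tool", "=", "ping")],
    "S4": [("from", "=", "ral"), ("tool", "=", "udp")],
    "S5": [("from", "=", "an"), ("tool", "=", "ping")],
}
Q1 = [("from", "=", "hw"), ("psize", ">=", 1024)]
Q2 = [("tool", "=", "ping"), ("latency", "<=", 10.0)]


def relevant_producers(query):
    """Names of producers whose condition is jointly satisfiable with the query."""
    return [name for name, cond in PRODUCERS.items() if satisfiable(cond, query)]


print(relevant_producers(Q1))  # ['S1', 'S2']
print(relevant_producers(Q2))  # ['S2', 'S3', 'S5']
```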
Scalability is an Issue
Problem: Every consumer contacting every producer of interest does not scale
Even a small Grid of fewer than a dozen sites has problems
Grids may contain thousands of resources
For example, Large Hadron Collider Computing Grid (LCG)
Republishers Allow the System to Scale
A republisher
  Consumes answers to a selection query
  Merges "trickles" into streams
  Publishes
    Answer stream
    Latest-state answer
    History
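The three kinds of output a republisher offers can be pictured with a small sketch. The `Republisher` class, its attribute names, and the use of `heapq.merge` to combine trickles are assumptions made for illustration; in particular, the merge assumes each incoming trickle is already in timestamp order.

```python
import heapq


class Republisher:
    """Hypothetical republisher: merges input trickles and keeps three views."""

    def __init__(self, key_attrs):
        self.key_attrs = key_attrs
        self.history = []  # every tuple seen so far
        self.latest = {}   # latest-state answer: newest tuple per key value

    def republish(self, *trickles):
        """Merge timestamp-ordered trickles into a single answer stream."""
        merged = heapq.merge(*trickles, key=lambda t: t["timestamp"])
        for t in merged:
            key = tuple(t[a] for a in self.key_attrs)
            self.history.append(t)
            self.latest[key] = t  # merged order means t is the newest so far
            yield t


# Invented sample trickles from two producers.
s1 = [
    {"from": "hw", "tool": "udp", "latency": 20.0, "timestamp": 1},
    {"from": "hw", "tool": "udp", "latency": 22.0, "timestamp": 4},
]
s2 = [{"from": "hw", "tool": "ping", "latency": 11.1, "timestamp": 2}]

r1 = Republisher(key_attrs=("from", "tool"))
answer_stream = list(r1.republish(s1, s2))
```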
Problem: Choice in where to obtain information
[Figure: Producer S1, Producer S2 and a Republisher]
Meta query plan contains choice
Query plan uses one of R1 or R3
Query Planning in the Presence of Republishers
Find all relevant publishers
Rank according to data provided
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
q1: from='hw' ∧ psize≥1024
q2: tool='ping' ∧ latency≤10.0
Weak Order is not Guaranteed
Tuples for same channel
(3) published before (8)
Arrive at consumer in wrong order
S2: from='hw' ∧ tool='ping'
q2: tool='ping' ∧ latency≤10.0
[Figure: S2 reaches q2 over two paths labelled latency≤5.0 and latency>5.0, one of them over a slow link; tuples sent as (3) (8) arrive as (8) (3)]
Generating Well Formed Query Plans
A publisher is relevant for a global query if
1. Conditions are satisfiable, and
2. All measurements that agree on their key values come from the same publisher
The measurement condition can be checked using entailment.
Previous example was well formed.
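One possible reading of the measurement condition, sketched under stated assumptions: the query's restriction on the measurement attribute must entail the publisher's restriction, so that a publisher never filters out measurements the query needs for a given key. Latency restrictions are encoded as half-open intervals `(lo, hi]`, and the names `Ra` and `Rb` for two publishers that split S2's data by latency are hypothetical. This is an interpretation for illustration, not a confirmed statement of the system's test.

```python
import math

# A latency restriction as a half-open interval (lo, hi]: lo < latency <= hi.
NO_RESTRICTION = (-math.inf, math.inf)


def entails(c, d):
    """True if every latency allowed by c is also allowed by d.

    Assumes c is a non-empty interval.
    """
    return c[0] >= d[0] and c[1] <= d[1]


# q2 restricts the measurement to latency <= 10.0.
q2_measurement = (-math.inf, 10.0)

publishers = {
    "S2": NO_RESTRICTION,      # no restriction on latency
    "Ra": (-math.inf, 5.0),    # latency <= 5.0 (hypothetical name)
    "Rb": (5.0, math.inf),     # latency > 5.0  (hypothetical name)
}

# A publisher passes if the query's restriction entails its own.
passes = {name: entails(q2_measurement, m) for name, m in publishers.items()}
print(passes)
```

Under this reading, only S2 passes for q2. The two latency-split publishers are each satisfiable together with q2, yet using both would spread measurements for one key across two sources.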
Query Re-Planning
Queries are long-lived
Set of publishers can change
Query plans should reflect changes
How does a new Republisher affect our Consumers?
Find consumers for which R4 is relevant
Compare R4 to publishers in Meta Query Plan
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
R4: TRUE
q1: from='hw' ∧ psize≥1024
q2: tool='ping' ∧ latency≤10.0
Planning a Republisher Query
Applying Consumer planning techniques results in a problem
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
R4: TRUE
Problem:
  Hierarchy contains cycles
  Republishers disconnected from Producers
Desirable Properties for a Hierarchy
Correctness: streams answer queries
Cycle freeness: loops can lead to duplicates
Uniqueness: hierarchy defined for a set of publishers
Local planning: Publishers and Consumers only need to communicate with the Registry
Generating Well Formed Hierarchies
Need a stricter relevance criterion
R1 can consume from R2 iff
1. Everything R2 offers is relevant to R1, and
2. R1 offers something R2 does not.
Can be checked by entailment
Ensures
  No loops in the hierarchy
  Republishers connected to the Producers
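The two-part criterion amounts to a strict containment test, which can be sketched for republishers whose conditions are conjunctions of equalities. Each condition is a dictionary from attribute to required value, with the empty dictionary standing for TRUE. The encoding and function names are assumptions for this example, and the sketch covers only republisher-to-republisher links.

```python
def entails(c, d):
    """True if condition c implies condition d (every atom of d appears in c)."""
    return all(c.get(attr) == value for attr, value in d.items())


def can_consume(consumer, source):
    """Stricter relevance: source is strictly contained in consumer.

    1. Everything the source offers is relevant to the consumer.
    2. The consumer offers something the source does not.
    """
    return entails(source, consumer) and not entails(consumer, source)


REPUBLISHERS = {
    "R1": {"from": "hw"},
    "R2": {"from": "ral"},
    "R3": {"from": "hw", "tool": "ping"},
    "R4": {},  # TRUE: no restriction
}

# For each republisher, the other republishers it may consume from.
hierarchy = {
    name: [other for other, cond in REPUBLISHERS.items()
           if other != name and can_consume(REPUBLISHERS[name], cond)]
    for name in REPUBLISHERS
}
print(hierarchy)
```

Strict containment is asymmetric and transitive, so links built this way cannot form a cycle. In this example R4 may consume from R1, but R1 may not consume from R4.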
Re-Planning, Re-Visited!
Stricter relevance criterion
Republishers only consume from publishers below them
S1: from='hw' ∧ tool='udp'
S2: from='hw' ∧ tool='ping'
S3: from='ral' ∧ tool='ping'
S4: from='ral' ∧ tool='udp'
S5: from='an' ∧ tool='ping'
R1: from='hw'
R2: from='ral'
R3: from='hw' ∧ tool='ping'
R4: TRUE
R4 is not relevant for R1
Republishers' Effect on Latency
Tuple published by producer
Tuple passes through some number of republishers
Tuple arrives at consumer
Republishers add to the time taken!
Performance Measure
Number of Republishers    Average delivery time (ms)
0                         26
1                         49
2                         67
3                         82
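The cost of each extra hop can be read off the table with a little arithmetic. The snippet below uses only the four reported averages; the variable names are illustrative.

```python
# Average delivery time in ms, indexed by number of republishers (from the table).
delivery_ms = {0: 26, 1: 49, 2: 67, 3: 82}

# Extra delay introduced by each additional republisher.
added_ms = [delivery_ms[n] - delivery_ms[n - 1] for n in range(1, 4)]

# Mean added delay per republisher across the three hops.
mean_added_ms = (delivery_ms[3] - delivery_ms[0]) / 3

print(added_ms)  # [23, 18, 15]
```

In these measurements each republisher adds between 15 and 23 ms to the average delivery time.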
Conclusions
Distributed streams of data are increasingly being made available
Distributed users interested in multiple streams
Developed a system for
  Publishing distributed data streams
  Querying multiple stream sources without source knowledge
Republishers required to allow system to scale
Future Work
Increase complexity of query language
Integrate stored and stream sources