Mirek Riedewald
Department of Computer Science
Cornell University
Efficient Processing of Massive Data Streams for Mining and Monitoring
Acknowledgements
Al Demers
Abhinandan Das
Alin Dobra
Sasha Evfimievski
Johannes Gehrke
KD-D initiative (Art Becker et al.)
Introduction
Data streams versus databases
  Infinite stream, continuous queries
  Limited resources
Network monitoring
  High arrival rates, approximation [CGJSS02]
Stock trading
  Complex computation [ZS02]
Retail, E-business, Intelligence, Medical Surveillance
  Identify relevant information on-the-fly, archive for data mining
  Exact results, error guarantees
Information Spheres
Local Information Sphere
  Within each organization
  Continuous processing of distributed data streams
  Online evaluation of thousands of triggers
  Storage/archival of important data
Global Information Sphere
  Between organizations
  Share data in privacy preserving way
Local Information Sphere
Distributed data stream event processing and online data mining
Technical challenges
  Blocking operators, unbounded state
  Graceful degradation under increasing load
  Integration with archive
  Processing of physically distributed streams
Event Matching, Correlation
Join of data streams

Brand  Mpix  Price
Canon  3.0   200

Mpix   Price
>2.0   <250
Event Matching, Correlation
Join of data streams

Brand  Mpix  Price
Canon  3.0   200
Fuji   3.0   100

Mpix   Price
>2.0   <250
>4.0   <400
Event Matching, Correlation
Join of data streams
  Equi-join, text similarity, geographical proximity,…
Problem: unbounded state, computation

Brand  Mpix  Price
Canon  3.0   180
Fuji   3.0   220
Kodak  4.0   340

Mpix   Price
> 2.0  < 250
> 4.0  < 400
= 3.0  < 200
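The matching on this slide can be illustrated with a small sketch. It takes the three products and the three (Mpix, Price) conditions from the tables and reports which product satisfies which condition. The names `OPS`, `matches` and `match_events`, and the 0-based condition numbering, are assumptions made for illustration. The sketch works on static lists, so it deliberately ignores the unbounded-state problem named above.

```python
import operator

# Comparison operators that appear in the condition table.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# Product records (Brand, Mpix, Price) as shown on the slide.
products = [("Canon", 3.0, 180), ("Fuji", 3.0, 220), ("Kodak", 4.0, 340)]

# Conditions on (Mpix, Price) as shown on the slide.
conditions = [
    ((">", 2.0), ("<", 250)),
    ((">", 4.0), ("<", 400)),
    (("=", 3.0), ("<", 200)),
]


def matches(product, condition):
    """Return True if the product satisfies both parts of the condition."""
    _, mpix, price = product
    (mpix_op, mpix_bound), (price_op, price_bound) = condition
    return OPS[mpix_op](mpix, mpix_bound) and OPS[price_op](price, price_bound)


def match_events(products, conditions):
    """Nested-loop join: list of (brand, condition index) for every match."""
    return [
        (product[0], index)
        for product in products
        for index, condition in enumerate(conditions)
        if matches(product, condition)
    ]


print(match_events(products, conditions))
# [('Canon', 0), ('Canon', 2), ('Fuji', 0)]
```

Kodak matches nothing here: 340 is not below 250, and 4.0 is not strictly greater than 4.0.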
Window Joins
Restrict join to window of most recent records (tuples)
  Landmark window
  Sliding window based on time or number of records
Problem definition
  Window based on time: size w
  Synchronous record arrival
  Equi-join
Abstract Model
Data streams R(A,…), S(A,…)
Compute equi-join on A
  Match all r and s of streams R, S such that r.A=s.A
Sliding window of size w

R: 1 1 1
S: 2 3 1
Output: (r0,s2), (r1,s2), (r2,s2)
Abstract Model (cont.)

R: 1 1 1 3
S: 2 3 1 1
Output so far: (r0,s2), (r1,s2), (r2,s2)
New output: (r3,s1), (r1,s3), (r2,s3)
Abstract Model (cont.)

R: 1 1 1 3 2
S: 2 3 1 1 4
Output so far: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3)
No new output
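The abstract model can be traced with a short sketch. It assumes one record per stream per time step (the synchronous arrival from the problem definition) and a window of w=3, the value used in the flow-model example later in the deck; under that assumption it reproduces the outputs listed on these slides. The function name `sliding_window_join` and the pair encoding (index in R, index in S) are illustrative choices, not part of the original slides.

```python
from collections import deque


def sliding_window_join(R, S, w):
    """Equi-join of two synchronous streams over a sliding window of size w.

    Returns pairs (i, j) meaning R[i] == S[j], where both records lie within
    w time steps of each other (|i - j| < w), in the order they are produced.
    """
    window_r, window_s = deque(), deque()  # entries are (arrival time, value)
    output = []
    for t, (r, s) in enumerate(zip(R, S)):
        # Expire records that have fallen out of the window.
        while window_r and t - window_r[0][0] >= w:
            window_r.popleft()
        while window_s and t - window_s[0][0] >= w:
            window_s.popleft()
        # The new R record joins with the stored S records.
        output.extend((t, j) for j, value in window_s if value == r)
        # The new S record joins with the stored R records.
        output.extend((i, t) for i, value in window_r if value == s)
        # Records arriving at the same time step can also match.
        if r == s:
            output.append((t, t))
        window_r.append((t, r))
        window_s.append((t, s))
    return output


print(sliding_window_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], w=3))
# [(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)]
```

The last time step produces nothing: r4=2 would match s0=2, but s0 has already left the window.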
Limited Resources
Focus on limited memory M<2w
State of the art: random load shedding [KNV03]
  Random sample of streams
Desired approach: semantic load shedding
  Goal: graceful degradation
Approximation
  Set-valued result: Error measure?
Set-Approximation Error
What is a good error measure?
Information Retrieval, Statistics, Data Mining
  Matching coefficient: $|A \cap B|$
  Dice coefficient: $2|A \cap B| / (|A| + |B|)$
  Jaccard coefficient: $|A \cap B| / |A \cup B|$
  Cosine coefficient: $|A \cap B| / \sqrt{|A| \cdot |B|}$
  Overlap coefficient: $|A \cap B| / \min\{|A|, |B|\}$
  Earth Mover’s Distance (EMD) [RTG98]
  Match And Compare (MAC) [IP99]
Join: subset of output result
  EMD, Overlap coefficient trivially 0 or 1
  Others (except MAC) reduce to MAX-subset error measure
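To see why most of these coefficients collapse to a single measure for joins, the sketch below computes the five set coefficients using their standard definitions and applies them to an approximate result that is a subset of the exact result. The function name `set_coefficients` and the example sets are assumptions for illustration; both sets must be non-empty.

```python
import math


def set_coefficients(a, b):
    """Standard set-similarity coefficients for two non-empty sets."""
    common = len(a & b)
    return {
        "matching": common,
        "dice": 2 * common / (len(a) + len(b)),
        "jaccard": common / len(a | b),
        "cosine": common / math.sqrt(len(a) * len(b)),
        "overlap": common / min(len(a), len(b)),
    }


# Exact join result (six pairs) and an approximate result that kept three.
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
approx = {(0, 2), (1, 2), (3, 1)}

for name, value in set_coefficients(approx, exact).items():
    print(f"{name}: {value:.3f}")
```

Because the approximate result A is a subset of the exact result B, the intersection is A itself. The overlap coefficient is then always 1, and the matching, Dice, Jaccard and cosine coefficients all grow with |A| alone, so maximizing any of them amounts to maximizing the size of the produced subset.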
Optimization Problem
Select records to be kept in memory such that the result size is maximized subject to memory constraints
Lightweight online technique
Adaptivity in presence of memory fluctuations
Optimal Offline Algorithm
What is the best possible that can be achieved?
  Optimal sampling strategy for MAX-subset
  Bottom-line for evaluation of any online algorithm
Same optimization problem, but knows future
  Finite subsets of input streams
Formulate as linear flow problem
Generation of Flow Model
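Example: R=1,1,1,3; S=2,3,1,1; M=2, w=3
Fixed memory allocation
[Figure: flow network for the example. Node labels 3 and -3; arcs with capacity 0..1 and linear cost, with cost labels of -1; arc types "Keep in memory" and "Replace".]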
Correspondence to Windows
R=1,1,1,3
S=2,3,1,1
[Figure: build sequence relating the flow network to the windows over R=1,1,1,3 and S=2,3,1,1; arcs labeled with cost -1 are added step by step.]
Complexity
Integer solution exists
Optimal solution found in $O(n^2 m \log n)$
  N: input size of single stream
  #nodes: n < 2wN + N + 2
  #arcs: m < 2n + M + 1
Reasonable costs for benchmarking
  Approx. 1GB memory (w=800, M=800)
  Approx. 1h computation time
Optimal Flow
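Example: R=1,1,1,3; S=2,3,1,1; M=2, w=3
Fixed memory allocation
[Figure: optimal flow in the network for the example. Node labels 3 and -3; arcs with capacity 0..1 and linear cost, with cost labels of -1; arc types "Keep in memory" and "Replace".]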
Easy to Extend
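Example: R=1,1,1,3; S=2,3,1,1; M=2, w=3
Variable memory allocation
[Figure: flow network for the example, extended to variable memory allocation. Node labels 3 and -3; arcs with capacity 0..1 and linear cost, with cost labels of -1; arc types "Keep in memory" and "Replace".]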
Online Heuristics
Maximize expected output
  PROB: sort tuples by join partner arrival probability
  LIFE: sort tuples by product of partner arrival probability and remaining lifetime
Maintain stream statistics
  Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
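The two heuristics can be sketched as priority functions that decide which stored tuple to drop when memory is full. This is a sketch under stated assumptions, not the authors' implementation: the partner arrival probability is estimated here from simple value counts of the other stream (the slides mention histograms, wavelets and quantiles instead), and the names `prob_priority`, `life_priority` and `evict` are hypothetical.

```python
from collections import Counter


def prob_priority(value, partner_counts, total):
    """PROB: estimated probability that a partner-stream record has this value."""
    return partner_counts[value] / total if total else 0.0


def life_priority(value, arrival, now, w, partner_counts, total):
    """LIFE: partner arrival probability times remaining lifetime in the window."""
    remaining = max(0, w - (now - arrival))
    return prob_priority(value, partner_counts, total) * remaining


def evict(buffer, now, w, partner_counts, policy="PROB"):
    """Remove and return the lowest-priority (arrival, value) tuple in buffer."""
    total = sum(partner_counts.values())
    if policy == "PROB":
        def priority(record):
            return prob_priority(record[1], partner_counts, total)
    else:
        def priority(record):
            return life_priority(record[1], record[0], now, w, partner_counts, total)
    victim = min(buffer, key=priority)
    buffer.remove(victim)
    return victim


# Assumed statistics: value 1 is frequent on the partner stream, value 3 is rare.
partner_counts = Counter({1: 8, 3: 2})

# Stored tuples as (arrival time, join value); the old tuple is about to expire.
print(evict([(0, 1), (4, 3)], now=4, w=5, partner_counts=partner_counts, policy="PROB"))
# (4, 3)
print(evict([(0, 1), (4, 3)], now=4, w=5, partner_counts=partner_counts, policy="LIFE"))
# (0, 1)
```

The example shows where the two policies differ: PROB drops the tuple with the rare value, while LIFE drops the tuple with the frequent value because it has only one time step left in the window.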
Approximation Quality
Effect of Skew
Summary
Information sphere architecture
Optimal algorithm and fast efficient heuristic for sliding window joins
Open problems
  Other set error measures, resource models
  Other joins: compress records
  Complex queries
  Distributed processing
  Integration with other techniques into local information sphere
Related Work
Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
Memory requirements [ABBMW02,TM02]
Aggregation
  Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy
Other Results
[DGR03] Integration with archive
  Load smoothing, not shedding
  Novel “error” measure: archive access cost
Static join for sensor networks
  Maximize result size subject to constraints on energy consumption
  Polynomial dynamic programming solution
  Fast 2-approximation algorithms
  NP-hardness proof for join of 3 or more streams
Other Results (cont.)
[DGGR02] Computation of aggregates over streams for multiple joins
  Small pseudo-random sketch synopses (randomized linear projections)
  Explicit, tunable error guarantees
  Sketch partitioning to boost accuracy (intelligently partition join attribute space)
Thanks!
Questions?