Zahid Irfan & Dr. Asim Karim (Advisor)

(zahidi, akarim @lums.edu.pk)

CS-509 Masters of Science (CS) Project

Lahore University of Management Sciences, Lahore, Pakistan

8 May 2004

Approximate Query Processing (AQP) in Data Streams

Acknowledgement

This work is primarily based on the research paper “One-Pass Wavelet Decompositions of Data Streams” by Gilbert, Kotidis, Muthukrishnan and Strauss, IEEE Transactions on Knowledge and Data Engineering, May/June 2003.

It also builds on work by Muthukrishnan and Piotr Indyk and, of course, on the Johnson-Lindenstrauss lemma.

AQP in Data Streams

Introduction
Streams and Streaming Models
Wavelet Transform & Embedded Vectors
Pseudo-Random Number Generator
Implementation Details
Test Results
Conclusions and Future Work

Introduction

Let's solve a puzzle. Guess the missing number in a random sequence of the numbers [1…N] without repetition.

Space requirement: O(1). Time complexity: O(N).

What about two missing numbers, three missing numbers, and so on?
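One way to solve the single-missing-number puzzle within those bounds is to keep only a running sum. The sketch below is a minimal illustration; the function name `find_missing` is hypothetical, and it assumes the stream holds every value from 1 to n exactly once except for the missing one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the one value from 1..n that is absent from `stream`, which is
// assumed to hold the other n-1 values exactly once each, in any order.
// Only a running sum is kept, so extra space is O(1) and time is O(n).
std::uint64_t find_missing(const std::vector<std::uint64_t>& stream,
                           std::uint64_t n) {
    assert(stream.size() + 1 == n);
    const std::uint64_t expected = n * (n + 1) / 2;  // 1 + 2 + ... + n
    std::uint64_t seen = 0;
    for (std::uint64_t x : stream) seen += x;  // one pass, item by item
    return expected - seen;
}
```

For two missing numbers, one option is to track the sum of squares as well, which gives two equations in two unknowns.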

Data Streams

Data Stream: “A sequence of digitally encoded signals used to represent information in transmission”.

The input stream is the sequence a[i], which arrives sequentially, item by item.

Data Streams Applications

Networks
  Data monitoring, applied to traffic flow analysis
World Wide Web
  Website hits, statistics, etc.
Online Transaction Processing Systems
Large Databases
  Query processing

Stream Models

Time Series
  Comprises values of the same quantity over different time intervals.
  Typical examples
    Daily closing values of a stock exchange
    Traffic at an IP link at time intervals

Stream Models

Cash Register Model
  Positive updates arrive over a period of time.
  Typical examples
    Well… a cash register
    Cricket scores
    Internet website hits or other statistics

Stream Models

Turnstile Model
  Fully dynamic model
  Updates are both negative and positive, e.g. passengers in an airport
Relative Hardness
  Turnstile > Cash Register > Time Series
  “Depends and varies from application to application”.
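The three stream models differ only in what kind of update each arriving item may apply to the underlying signal a[i]. The following sketch makes that concrete; the `Signal` type and its method names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The underlying signal a[0..N-1] that the stream describes.
struct Signal {
    std::vector<long> a;
    explicit Signal(std::size_t n) : a(n, 0) {}

    // Time series model: item t of the stream is the value a[t] itself.
    void time_series(std::size_t t, long value) { a.at(t) = value; }

    // Cash register model: an update only adds a positive amount to a[i].
    void cash_register(std::size_t i, long amount) {
        assert(amount > 0);
        a.at(i) += amount;
    }

    // Turnstile model: an update may be positive or negative.
    void turnstile(std::size_t i, long delta) { a.at(i) += delta; }
};
```

An algorithm that copes with turnstile updates also handles the other two models, which is one way to read the relative hardness ordering.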

Wavelet Transform

Wavelets
  A hierarchical mathematical tool for the decomposition of signals/functions.
Types of Wavelets
  Haar Wavelets
  Daubechies Wavelets
  Many more…

Haar Wavelet Example

Resolution  Averages                      Detail Coefficients
3           D = [2, 2, 0, 2, 3, 5, 4, 4]  ----
2           [2, 1, 4, 4]                  [0, -1, -1, 0]
1           [1.5, 4]                      [0.5, 0]
0           [2.75]                        [-1.25]

Haar Wavelet Decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
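The example above can be reproduced with a few lines of code. This is a minimal sketch using the same convention as the example (pairwise averages and half-differences, no further normalization); the function name `haar_decompose` is hypothetical, and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Haar decomposition by repeated pairwise averaging and differencing.
// Output layout: [overall average, coarsest detail, ..., finest details].
std::vector<double> haar_decompose(std::vector<double> data) {
    const std::size_t n = data.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // length must be a power of two
    std::vector<double> out(n);
    for (std::size_t len = n; len > 1; len /= 2) {
        const std::size_t half = len / 2;
        std::vector<double> averages(half);
        for (std::size_t i = 0; i < half; ++i) {
            averages[i] = (data[2 * i] + data[2 * i + 1]) / 2.0;
            // Details of this resolution land in out[half .. len-1].
            out[half + i] = (data[2 * i] - data[2 * i + 1]) / 2.0;
        }
        data = averages;  // continue on the coarser resolution
    }
    out[0] = data[0];  // the single remaining average
    return out;
}
```

Calling it on [2, 2, 0, 2, 3, 5, 4, 4] walks through resolutions 3, 2, 1 and 0 exactly as in the example.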

Wavelet in <,> Space

Haar wavelet coefficients can be represented as inner products with basis vectors. Example: a vector A of length N = 4 has 4 coefficients.

W1 = 1/N*[1 1 1 1], W2 = 1/N*[1 1 -1 -1], W3 = 2/N*[1 -1 0 0], W4 = 2/N*[0 0 1 -1]

1st Coefficient = <A,W1>. Average Coefficient
2nd Coefficient = <A,W2>. Detail Coefficient
3rd Coefficient = <A,W3>. Detail Coefficient
4th Coefficient = <A,W4>. Detail Coefficient

Embedding Vectors

Any set of n points in Euclidean space can be embedded into an O(log n / є^2)-dimensional Euclidean space with 1+є distortion; comparable sketching results exist for the L1 metric.

f(v) = embedding for vector v = < <v, r1>, <v, r2>, …, <v, rk> >

Johnson-Lindenstrauss (JL) Lemma

Simply stated: <a,b> ~ <a,rj>*<b,rj>, averaged over j = 1…k, where k << N.
  rj is a random vector whose entries are 1 or -1 with equal probability.
Implications
  Represent a vector in R^N space in k-dimensional space.
  Benefits: approximate queries…?

AQP & JL-Lemma

<a,b> ~ <a,rj>*<b,rj>
Approximate queries can be answered by choosing a special b.
  To query the ith value, choose b = [0…010…0], where b[i] = 1.
  For a range query (i,j), choose b = [0…01…10…0], where b[x] = 1 for i <= x <= j.
What's the catch? Each rj is also of size N. So where do we store the random vectors?
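The estimate can be tried out directly. The following is a minimal sketch, not the project's code: the names `random_sign_vectors`, `embed` and `estimate_dot` are hypothetical, the random vectors are drawn with `std::mt19937` and stored explicitly (which is exactly the space overhead raised above), and the k products are averaged.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// k random vectors of length n with entries +1 or -1, each with probability 1/2.
std::vector<Vec> random_sign_vectors(std::size_t k, std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<Vec> r(k, Vec(n));
    for (Vec& row : r)
        for (double& x : row) x = (rng() & 1u) ? 1.0 : -1.0;
    return r;
}

// f(v) = < <v,r1>, <v,r2>, ..., <v,rk> >
Vec embed(const Vec& v, const std::vector<Vec>& r) {
    Vec f;
    for (const Vec& rj : r) f.push_back(dot(v, rj));
    return f;
}

// Estimate <a,b> from the two k-dimensional embeddings alone.
double estimate_dot(const Vec& fa, const Vec& fb) {
    return dot(fa, fb) / static_cast<double>(fa.size());
}
```

Choosing b as a unit vector or as a 0/1 indicator of a range turns `estimate_dot` into an approximate point or range query on a.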

Pseudo-Random Generator

The solution to the large space overhead is to generate the random vectors on the fly!

Such as:

    for (i = 0; i < k; i++) {
        srand(i);                  /* seed i always reproduces vector r_i */
        for (j = 0; j < N; j++) {
            rand();                /* next entry of r_i, mapped to 1 or -1 */
        }
    }

This solution works, but there is a more elegant solution to this problem: an extractor based on Reed-Muller codes.

Reed-Muller PR Generator

Reed-Muller Generator
  The matrix values represent RM codes.
  RM(x, y) = … mod 2 (a function of x and y)
  Replacing 0 with 1 and 1 with -1, we get wavelet basis vectors.
Benefits of the Reed-Muller pseudo-random generator
  Generated on the fly.
  Every value is computed independently of the previous values.
  Most nearly imitates wavelet basis vectors.
  Hence the sketch contains most of the energy of the signal.
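To illustrate what "computed independently of the previous values" can look like, here is a minimal sketch. It ASSUMES a first-order Reed-Muller (Hadamard) code, where the bit is the inner product of the binary digits of x and y taken mod 2; the exact formula used in the slides is not legible, and the cited paper may use a different (for example higher-order) construction. The names `rm_bit` and `rm_sign` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: first-order Reed-Muller (Hadamard) code bit, i.e. the parity
// of the number of bit positions where both x and y have a 1.
int rm_bit(std::uint32_t x, std::uint32_t y) {
    std::uint32_t v = x & y;
    int parity = 0;
    while (v != 0) {
        parity ^= 1;
        v &= v - 1;  // clear the lowest set bit
    }
    return parity;
}

// Replace 0 with +1 and 1 with -1 to obtain an entry of a sign vector.
int rm_sign(std::uint32_t x, std::uint32_t y) {
    return rm_bit(x, y) == 0 ? 1 : -1;
}
```

Under this assumption any entry (x, y) costs a handful of bit operations and needs no stored state, unlike the seeded `rand()` loop, which must step through a vector from its start.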

Lessons so far !!

Things learnt so far
  There is a way to embed the N data values into k << N values.
  JL-Lemma: <a,b> ~ <a,r><b,r>
  Reed-Muller codes are excellent imitators of both wavelet basis vectors and random vectors.
  Query processing is possible thanks to the JL-Lemma.

Implementation Details

Implementation Trivia
  Implemented in Visual C++ 6.0
  Design follows the classes-and-objects paradigm
  Test results and graphs from MS Excel

Data Flow Diagram

[Figure: data flow diagram with the components Dataset, Dataset Generator, Data Stream Generator, Wavelets-based Decomposition, Reed-Muller Generator, Sketch, and Query Processing Engine.]

Dataset Generator

The synthetic data set was generated using random distributions.
  Normal distribution
    Calling telephone number, 9497000~9497999 (1000 lines)
    Receiving telephone number
  Exponential distribution
    Call time, 0~512 minutes
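A generator along these lines could look like the sketch below. Everything specific in it is an ASSUMPTION made for illustration: the `CallRecord` layout, the function name `generate_calls`, the mean, standard deviation and rate, the use of the same number range for the receiving side, and the clamping of samples to the stated ranges. The slides give only the distributions and the value ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct CallRecord {
    long caller;     // calling telephone number
    long receiver;   // receiving telephone number
    double minutes;  // call time
};

// ASSUMPTION: distribution parameters are made up; samples falling outside
// the stated ranges are clamped to the nearest bound.
std::vector<CallRecord> generate_calls(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> number(9497500.0, 150.0);
    std::exponential_distribution<double> duration(1.0 / 30.0);  // mean 30 min
    auto phone = [&]() {
        const double v = std::round(number(rng));
        return static_cast<long>(std::min(9497999.0, std::max(9497000.0, v)));
    };
    std::vector<CallRecord> calls;
    calls.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const long caller = phone();
        const long receiver = phone();
        const double minutes = std::min(512.0, duration(rng));
        calls.push_back(CallRecord{caller, receiver, minutes});
    }
    return calls;
}
```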

Data Streamer

The data streaming class offers methods that imitate a real-time data stream by continuously presenting the program with data.

type DataStreamer::getData();

Pseudo Random Generator

This class calculates the Reed-Muller based pseudo-random numbers.

type PseudoRandomGenerator::getRandom (int X, int Y);

Uses the formula G(x, y) = … mod 2

Data Decomposition

The data is decomposed into a sketch by calculating the dot product of the data stream with O(log N) random vectors.

The sketch is stored in main memory to be used by the query processing engine.

Sketch[j] += Data[i] * Random(i, j);

Here i = 1…N and j = 1…k.
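The update rule above can be applied one stream item at a time, so the data itself never needs to be stored. The sketch below is a minimal illustration with hypothetical names (`Sketch`, `random_sign`). As a loudly labeled ASSUMPTION, `random_sign` is a stand-in for Random(i, j) built on the SplitMix64 mixing function; the original project uses its Reed-Muller generator instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: stand-in for Random(i, j). Mixes (i, j) with the SplitMix64
// finalizer and returns +1 or -1; not the Reed-Muller generator of the slides.
int random_sign(std::uint64_t i, std::uint64_t j) {
    std::uint64_t z = ((i << 32) ^ j) + 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    z ^= z >> 31;
    return (z & 1u) ? 1 : -1;
}

// A sketch of k counters. Each arriving update (i, delta) is folded into all
// k counters and then discarded, so the stream is read once and never stored.
class Sketch {
public:
    explicit Sketch(std::size_t k) : counters_(k, 0.0) {}

    // Sketch[j] += delta * Random(i, j) for every j
    void update(std::uint64_t i, double delta) {
        for (std::size_t j = 0; j < counters_.size(); ++j)
            counters_[j] += delta * random_sign(i, j);
    }

    const std::vector<double>& counters() const { return counters_; }

private:
    std::vector<double> counters_;
};
```

Because the sketch is linear, a negative update cancels an earlier positive one at the same position, which is what makes it usable in the turnstile model.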

Query Processing Engine

The Query Processing Engine uses the sketch and a new vector b.

It uses the same old JL-Lemma: <a,b> ~ <a,rj>*<b,rj>

Setting various values of b results in, theoretically, any sort of query.

Point Query Processing

Point Query
  A point query asks for any single value in the whole data stream.
Point Query Algorithm
  Prepare b[i] = {0 for i != q, 1 for i = q}, where q is the queried position, and generate <b,r>:
  QuerySketch[j] += b[i] * Random(i, j);
  Result = (DataSketch * QuerySketch) / k, where k is the sketch size

Range Query Processing

Range Query
  Range queries specify the low and high positions between which the query is to be processed.
  Even multiple ranges can be specified.
Query Algorithm
  Prepare b[i] = {1 for low <= i <= high, 0 otherwise} and generate <b,r>:
  QuerySketch[j] += b[i] * Random(i, j);
  Result = (DataSketch * QuerySketch) / k, where k is the sketch size
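Point and range queries can both be answered from the data sketch by sketching the query vector b with the same random signs. The following is a minimal sketch under stated assumptions: all function names are hypothetical, `random_sign` is a SplitMix64-based stand-in for Random(i, j) rather than the Reed-Muller generator, and the sum of products is divided by the sketch size k, which is the unbiased choice for +1/-1 random vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: SplitMix64-based stand-in for Random(i, j), returning +1 or -1.
int random_sign(std::uint64_t i, std::uint64_t j) {
    std::uint64_t z = ((i << 32) ^ j) + 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    z ^= z >> 31;
    return (z & 1u) ? 1 : -1;
}

// sketch[j] = sum over i of v[i] * Random(i, j)
std::vector<double> make_sketch(const std::vector<double>& v, std::size_t k) {
    std::vector<double> s(k, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == 0.0) continue;  // zero entries contribute nothing
        for (std::size_t j = 0; j < k; ++j) s[j] += v[i] * random_sign(i, j);
    }
    return s;
}

// (DataSketch * QuerySketch) / k
double estimate(const std::vector<double>& data_sketch,
                const std::vector<double>& query_sketch) {
    assert(!data_sketch.empty() && data_sketch.size() == query_sketch.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < data_sketch.size(); ++j)
        sum += data_sketch[j] * query_sketch[j];
    return sum / static_cast<double>(data_sketch.size());
}

// b[i] = 1 for low <= i <= high, 0 otherwise; n is the data length.
double range_query(const std::vector<double>& data_sketch, std::size_t n,
                   std::size_t low, std::size_t high) {
    assert(low <= high && high < n);
    std::vector<double> b(n, 0.0);
    for (std::size_t i = low; i <= high; ++i) b[i] = 1.0;
    return estimate(data_sketch, make_sketch(b, data_sketch.size()));
}

// A point query is a range query with low == high.
double point_query(const std::vector<double>& data_sketch, std::size_t n,
                   std::size_t q) {
    return range_query(data_sketch, n, q, q);
}
```

The answers are approximate in general; their error shrinks as the sketch size k grows.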

AQP Test

Time Complexity Analysis
Query Processing Accuracy with Data Size
Query Processing Accuracy with Sketch Size

Time Complexity

The following time complexities were found to be linear in the size of the data.
  Sketching time
  Query processing time

Time Complexity (Sketching)

[Figure: Sketching Time versus Data Size (sketch size assumed to be log N). X-axis: data size, 10,000 to 10,000,000. Y-axis: sketching time (seconds), 0 to 120.]

Time Complexity (Query)

[Figure: Querying Time versus Data Size (sketch size assumed to be log N). X-axis: data size, 10,000 to 10,000,000. Y-axis: querying time (seconds), 0 to 120.]

Accuracy versus Data Size

Data size versus accuracy of query
  PSNR (dB) versus data size
  Data size is increased by powers of 2
  Sketch size assumed to be log N
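The slides do not spell out how PSNR is computed, so the following is only a sketch of the conventional definition, PSNR = 10 * log10(peak^2 / MSE), applied to a set of exact and approximate query answers. The function name `psnr_db` is hypothetical, and taking the peak as the largest absolute exact answer is an ASSUMPTION.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// PSNR in dB between exact and approximate answers.
// ASSUMPTION: peak is the largest absolute exact value.
double psnr_db(const std::vector<double>& exact,
               const std::vector<double>& approx) {
    assert(!exact.empty() && exact.size() == approx.size());
    double peak = 0.0;
    double squared_error = 0.0;
    for (std::size_t i = 0; i < exact.size(); ++i) {
        peak = std::max(peak, std::fabs(exact[i]));
        const double e = exact[i] - approx[i];
        squared_error += e * e;
    }
    const double mse = squared_error / static_cast<double>(exact.size());
    // Infinite when the answers match exactly and peak > 0.
    return 10.0 * std::log10(peak * peak / mse);
}
```

A higher value means the approximate answers are closer to the exact ones.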

PSNR (dB) versus Data Size

[Figure: PSNR (dB) versus Data Size (sketch size assumed to be log N). X-axis: data size, 10 to 5,000,010. Y-axis: PSNR (dB), 100 to 120.]

Accuracy versus Sketch Size

Accuracy of query against the sketch size
  PSNR (dB) versus sketch size
  Data size is held constant at 32768
  Sketch size is varied

PSNR (dB) versus Sketch Size

[Figure: PSNR (dB) versus Sketch Size (data size N = 32768). X-axis: sketch size, 0 to 120. Y-axis: PSNR (dB), 100 to 150.]

Conclusions

Space Complexity Reduction
  A prohibitively large data stream is summarized in sub-linear space.
Time Complexity Reduction
  One-pass data stream algorithm.
Scalability to multi-dimensions

Applications and Future Work

Data Mining Streams
Multimedia & Databases
  Trying it with video coding might be fun or a disaster
Graph Theory Problems
  MST, matching etc. need to be solved in the streaming model.
Computational Geometry
Earth observation data streams or weather data streams
Solve any problem that can be modeled as a data stream

References

S. Acharya, P. B. Gibbons, V. Poosala and S. Ramaswamy, “Join Synopses for Approximate Query Answering”, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.

J. M. Hellerstein, P. J. Haas and H. J. Wang, “Online Aggregation”, In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, 1997.

Y. E. Ioannidis and V. Poosala, “Histogram-Based Approximation of Set-Valued Query Answers”, In Proceedings of the 25th International Conference on Very Large Databases, 1999.

K. Chakrabarti, M. Garofalakis, R. Rastogi and K. Shim, “Approximate Query Processing Using Wavelets”, In Proceedings of the 26th International Conference on Very Large Databases, Egypt, 2000.

F. Olken, “Random Sampling from Databases”, PhD Thesis, University of California at Berkeley, 1993.

A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, “One-Pass Wavelet Decompositions of Data Streams”, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003.

A. Ta-Shma, D. Zuckerman and S. Safra, “Extractors from Reed-Muller Codes”, In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, 2001.

Questions & Answers

Thanks to the following for their sincere help in this project: Dr. Asim Karim, Dr. Sarmad Abbasi, Dr. Asim Loan, Dr. Sohaib A. Khan and all my friends, specially Laeeq Aslam and Aimal Tariq Rextin.
