
Online Computation and Continuous Maintaining of Quantile Summaries


Tian Xia, Database Lab @ CCIS, Northeastern University

April 16, 2004


References

M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66, 2001.

X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, pages 362-373, 2004.


Outline of this talk

Quantile Estimation Overview
GK-quantile Summary Algorithm
    Data Structure
    Operations
    Space Complexity Analysis
Sliding Window Model


Problem Definitions

φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉.

Quantile Query: Given φ, find the data element with rank ⌈φN⌉ among all elements in the stream. Variation: the N most recent elements (sliding window model).

ε-approximate: Find an element whose rank lies within the interval [r - εN, r + εN], where r = ⌈φN⌉.


Example of A Quantile Query

The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.

0.5-quantile returns the element ranked 8, which is 8.

0.25-approximate 0.5-quantile returns one of the elements in {4,5,6,7,8,9,10}.

Data stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
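The definitions and the example above can be checked by brute force. The following is a minimal Python sketch, not part of the original slides; the helper names `exact_quantile` and `approximate_answers` are hypothetical. It sorts the whole stream, which is exactly the linear-space approach the rest of the talk tries to avoid.

```python
import math

# Data stream from the example slide, in arrival order t0..t15.
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]


def exact_quantile(data, phi):
    """Return the element with rank ceil(phi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]


def approximate_answers(data, phi, eps):
    """Return the distinct values whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo = max(1, math.ceil(r - eps * n))   # clamp to valid ranks
    hi = min(n, math.floor(r + eps * n))
    return sorted(set(ordered[lo - 1:hi]))


print(exact_quantile(STREAM, 0.5))             # 8
print(approximate_answers(STREAM, 0.5, 0.25))  # [4, 5, 6, 7, 8, 9, 10]
```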


Why Approximation?

Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm that exactly computes the φ-quantile of N data elements in p passes requires Ω(N^(1/p)) space. For a single pass (p = 1), this is space linear in N.

Approximate quantile techniques are necessary to achieve sub-linear space efficiency.


Quantile Summary

Quantile Summary: A small number of objects from the input data sequence, which can be used (by a quantile estimator) to answer quantile queries.

Other summary methods of large data sets include average, standard deviation, histogram, counting sketch (FM-sketch), etc.


Properties of A Good Quantile Estimator

Provide tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is ε-approximate.

Data independent.

Use as small a memory footprint as possible, which includes temporary storage.


Previous Work

Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an ε-approximate quantile summary.

Space complexity: O((1/ε) log²(εN)).

It requires advance knowledge of N, the size of the data set.

It won't work in a data stream environment.


Contributions of GK-algorithm

Dynamically adjusts the quantile summary with the growth of N, the total number of data elements in the data stream.

Space complexity is reduced to O((1/ε) log(εN)).


Assumptions

A new data element arrives after each unit of time.

n denotes both the number of elements in the data sequence and the current time.

A data element is represented by its value v.

rmin(v) and rmax(v) denote respectively the lower and upper bounds on the actual rank r of v among the elements seen so far.


The Summary Data Structure

GK-algorithm maintains a summary data structure S=S(n) at any point in time n.

S(n) consists of an ordered (non-decreasing) sequence of tuples which corresponds to a subset of the elements seen thus far.


The Summary Data Structure

S = {t0, t1, …, ts-1}, where ti = (vi, gi, Δi). vi is the value of one of the elements seen so far.

gi = rmin(vi) - rmin(vi-1)

Δi = rmax(vi) - rmin(vi)

v0 and vs-1 always correspond to the minimum and the maximum elements seen so far.


The Summary Data Structure

Given gi = rmin(vi) - rmin(vi-1) and Δi = rmax(vi) - rmin(vi):

rmin(vi) = Σj≤i gj

rmax(vi) = Σj≤i gj + Δi

gi + Δi - 1 is an upper bound on the total number of elements that may have fallen between vi-1 and vi.

rmin(vs-1) = Σi gi = n.


Example of A Quantile Summary

{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples.

For clarity, re-write the tuples of the above summary in the form ti = (vi, rmin(vi), rmax(vi)) as follows: {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}.

Data stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
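The conversion from (vi, gi, Δi) tuples to explicit rank bounds is a running sum over the g values. A minimal Python sketch (the function name `rank_bounds` is hypothetical, not from the slides) that applies it to the example summary:

```python
from itertools import accumulate

# Example summary from the slide, as (v, g, delta) tuples.
SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]


def rank_bounds(summary):
    """Rewrite (v, g, delta) tuples as (v, rmin, rmax).

    rmin(v_i) is the prefix sum of g up to i; rmax(v_i) = rmin(v_i) + delta_i.
    """
    rmins = accumulate(g for _, g, _ in summary)
    return [(v, rmin, rmin + delta) for (v, _, delta), rmin in zip(summary, rmins)]


print(rank_bounds(SUMMARY))
# [(1, 1, 1), (2, 2, 9), (3, 3, 10), (4, 4, 10), (10, 10, 10), (12, 16, 16)]
```

The last rmin equals n = 16, the number of elements seen, as the data structure slide requires.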


Error Rate?

PROPOSITION 1: Given a quantile summary S, a φ-quantile can always be identified to within an error of maxi(gi + Δi)/2.

COROLLARY 1: If at any time n, the summary S(n) satisfies the property that maxi(gi + Δi) ≤ 2εn, then we can answer any φ-quantile query to within εn precision.
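As a quick numeric check of the proposition and corollary, the sketch below evaluates the error bound on the six-tuple example summary used throughout the talk, with n = 16. The function names are hypothetical and the code only restates the two formulas.

```python
# Example summary from the talk, as (v, g, delta) tuples, built from n = 16 elements.
SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]


def max_error(summary):
    """Proposition 1: worst-case rank error is max_i(g_i + delta_i) / 2."""
    return max(g + delta for _, g, delta in summary) / 2


def is_eps_approximate(summary, n, eps):
    """Corollary 1: the summary is eps-approximate if max_i(g_i + delta_i) <= 2*eps*n."""
    return max(g + delta for _, g, delta in summary) <= 2 * eps * n


print(max_error(SUMMARY))                     # 4.0, which equals eps * n for eps = 0.25
print(is_eps_approximate(SUMMARY, 16, 0.25))  # True
print(is_eps_approximate(SUMMARY, 16, 0.2))   # False
```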


QUANTILE (φ)

QUANTILE(φ): To compute an ε-approximate φ-quantile from the summary S(n) after n data elements, compute the rank r = ⌈φn⌉. Find i such that both r - rmin(vi) ≤ εn and rmax(vi) - r ≤ εn, and return vi; i.e. r - εn ≤ rmin(vi) ≤ rmax(vi) ≤ r + εn.
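A minimal Python sketch of this query procedure follows. The names `candidates` and `quantile` are hypothetical; the summary is assumed to be a list of (v, g, Δ) tuples in non-decreasing order of v, and the example summary is the six-tuple one from the earlier slide.

```python
import math

SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]


def candidates(summary, n, phi, eps):
    """Return every v_i with r - eps*n <= rmin(v_i) and rmax(v_i) <= r + eps*n."""
    r = math.ceil(phi * n)
    found = []
    rmin = 0
    for v, g, delta in summary:
        rmin += g                 # rmin(v_i) is the running sum of g
        rmax = rmin + delta
        if r - rmin <= eps * n and rmax - r <= eps * n:
            found.append(v)
    return found


def quantile(summary, n, phi, eps):
    """Return the first qualifying value, or None if the summary is too coarse."""
    found = candidates(summary, n, phi, eps)
    return found[0] if found else None


print(candidates(SUMMARY, 16, 0.5, 0.25))  # [4, 10]
print(quantile(SUMMARY, 16, 0.5, 0.25))    # 4
```

Returning the first match is an arbitrary choice made for this sketch; any qualifying tuple is a valid answer.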


Example of A Quantile Summary

{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the data stream.

A 0.25-approximate 0.5-quantile query returns the value of the tuple (4,1,6) or (10,6,0).

Data stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.


How does their algorithm work?

Insert a tuple in the summary corresponding to a new incoming element.

Periodically sweep over the summary to "merge" some of the tuples into their neighbors. This keeps the space requirement bounded.

At all times maxi(gi + Δi) ≤ 2εn.

What to merge & how to merge?


INSERT (v)

INSERT(v): Find the smallest i such that vi-1 ≤ v < vi, and insert the tuple (v, 1, ⌊2εn⌋) between ti-1 and ti. Increment s. As a special case, if v is the new minimum or the maximum element seen, then insert (v, 1, 0).


Example of INSERT

S={(12, 1, 0)}, n=1
S={(6, 1, 0), (12, 1, 0)}, n=2
S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3
S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4

Stream elements shown: t0 = 12, t3 = 10, t4 = 1, t8 = 6 (ε = 0.25).
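The INSERT step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name and list-of-tuples representation are assumptions, n is taken to be the number of elements seen before the insertion, and a value equal to the current maximum is treated as a new maximum (an assumption about tie handling). With ε = 0.25 and the insertion order 12, 6, 10, 1 implied by the summaries above, it reproduces the four states of the example.

```python
import bisect
import math


def insert(summary, n, v, eps):
    """Insert value v into summary (a sorted list of (v, g, delta) tuples).

    n is the number of elements seen before this insertion.
    Returns the new element count n + 1.
    """
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)   # smallest i with v_{i-1} <= v < v_i
    if i == 0 or i == len(summary):
        delta = 0                        # new minimum or maximum: rank is known exactly
    else:
        delta = math.floor(2 * eps * n)
    summary.insert(i, (v, 1, delta))
    return n + 1


summary, n = [], 0
for value in [12, 6, 10, 1]:
    n = insert(summary, n, value, eps=0.25)
    print(summary, n)
# Final line printed: [(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)] 4
```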


Merge

Space will increase with insertions.

Intuitively, two tuples (vi, gi, Δi) and (vj, gj, Δj) can be merged into a new tuple (vk, gk, Δk), as long as gk + Δk < 2εn.

An individual tuple is full if gk + Δk ≥ 2εn.

Capacity and Band are introduced.


Capacity and Band

The capacity of a tuple is the maximum number of elements that can be counted by gi before the tuple becomes full (gi ≤ 2εn - Δi).

The merge phase will free up space by merging tuples with small capacities into tuples with similar or larger capacities.

Bands: Roughly speaking, divide the Δs into bands that lie between elements of (0, ½·2εn, ¾·2εn, …, ((2^i - 1)/2^i)·2εn, …, 2εn - 1, 2εn).

The larger the capacity (with smaller Δ), the larger the band.


Example of A Quantile Summary

{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples.

(2,1,7) and (3,1,7) are in the lowest band.

(1,1,0), (10,6,0) and (12,6,0) are in the highest band.

Data stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.


Band

Strictly: given α from 1 to ⌈log 2εn⌉ and p = ⌊2εn⌋, bandα is the set of all Δ such that p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)).

If two Δs are ever in the same band, they never appear in different bands as n increases.

In band0, Δ = ⌊2εn⌋.

A tree structure is imposed to facilitate merges between bands.
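The band definition translates directly into a small search over α. The following Python sketch is an illustration under the assumption that p = ⌊2εn⌋ and 0 ≤ Δ ≤ p; the function name `band` is hypothetical. Note that Δ = 0 falls out of the same formula as the highest band.

```python
def band(delta, p):
    """Return the band index of delta for p = floor(2 * eps * n).

    band 0 holds delta == p; band alpha >= 1 holds the deltas with
    p - 2^alpha - (p mod 2^alpha) < delta <= p - 2^(alpha-1) - (p mod 2^(alpha-1)).
    """
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1   # intervals tile downward, so this ends once lower < 0


# n = 16, eps = 0.25 gives p = 8 (the six-tuple example summary).
print([band(d, 8) for d in (8, 7, 6, 0)])  # [0, 1, 2, 4]
```

For p = 8 this places Δ = 7 in band 1 (the lowest band present in the example summary) and Δ = 0 in band 4 (the highest), matching the example slide.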


Tree Representation

Given a summary S = {t0, t1, …, ts-1}, the tree T associated with S contains a node Vi for each ti and a special root node R.

The parent of a node Vi is the node Vj such that j is the least index greater than i with band(tj) > band(ti). If no such index exists, R is the parent.
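To make the parent rule concrete, the sketch below computes the parent index of every tuple in the six-tuple example summary, taking the parent to be the nearest tuple to the right with a strictly higher band. The names `band` and `parents` are hypothetical, `None` stands for the root R, and p = ⌊2εn⌋ = 8 assumes n = 16 and ε = 0.25.

```python
def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); assumes 0 <= delta <= p."""
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1


def parents(summary, p):
    """For each tuple, return the index of its parent, or None for the root R."""
    bands = [band(delta, p) for _, _, delta in summary]
    result = []
    for i, b in enumerate(bands):
        # Least j > i whose band is strictly greater than band(t_i).
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result


SUMMARY = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
print(parents(SUMMARY, 8))  # [None, 3, 3, 4, None, None]
```

Here (2,1,7) and (3,1,7) hang under (4,1,6), which hangs under (10,6,0); the three Δ = 0 tuples are children of R.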


Tree Representation

PROPOSITION 3: The children of any node in T are always arranged in non-increasing order of band in S.

PROPOSITION 4: For any node V, the set of all its descendants arranged in T forms a contiguous segment in S.

Figure: tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0).


Merge Actually

GK-algorithm will merge together a node and all its descendants into either its parent node or into its right sibling.

The tuple that results after the merge must not be full, i.e. gi + Δi < 2εn.

The operation is called COMPRESS().


COMPRESS ( )

The operation COMPRESS tries to merge together a node and all its descendants into either parent node or into its right sibling.

COMPRESS()
    for i from s-2 to 0 do
        if ((BAND(Δi, 2εn) ≤ BAND(Δi+1, 2εn)) && (g*i + gi+1 + Δi+1 < 2εn)) then
            DELETE all descendants of ti and the tuple ti itself;
        end if
    end for
end COMPRESS

g* denotes the sum of g-values of the tuple ti and all its descendants in T.


DELETE (vi)

DELETE(vi): To delete the tuple (vi, gi,Δi) from S, replace (vi, gi,Δi) and (vi+1, gi+1,Δi+1) by the new tuple (vi+1, gi+ gi+1,Δi+1), and decrement s.


Example of COMPRESS and DELETE

S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6

Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0).

S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6

Data stream so far (t0 to t5): 12, 10, 11, 10, 1, 10 (ε = 0.25).
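COMPRESS and DELETE can be sketched together in Python. This is a simplified illustration, not the algorithm exactly as published: it merges a single tuple into its right neighbor and uses gi in place of g* (descendants are not tracked), and it never merges away index 0 so that the minimum is preserved. Those simplifications, the strict `<` test, and the function names are assumptions of this sketch. On the examples in this talk it reproduces the compressed summaries shown.

```python
import math


def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); assumes 0 <= delta <= p."""
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1


def compress(summary, n, eps):
    """Merge tuples into their right neighbor while the result is not full."""
    p = math.floor(2 * eps * n)
    i = len(summary) - 2
    while i >= 1:                          # assumption: keep the minimum at index 0
        _, g, d = summary[i]
        v2, g2, d2 = summary[i + 1]
        if band(d, p) <= band(d2, p) and g + g2 + d2 < 2 * eps * n:
            # DELETE(v_i): fold g_i into the right neighbor and drop t_i.
            summary[i + 1] = (v2, g + g2, d2)
            del summary[i]
        i -= 1


s = [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)]
compress(s, n=6, eps=0.25)
print(s)  # [(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)]
```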


Pseudo-Code for the whole algorithm

Initial State
    S ← ∅; s = 0; n = 0;

Algorithm
To add the n+1st element, v, to summary S(n):
    if (n ≡ 0 mod 1/(2ε)) then
        COMPRESS();
    end if
    INSERT(v);
    n = n + 1;
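Putting the pieces together, the whole loop can be sketched as a short Python program. It is a simplified illustration under stated assumptions rather than a faithful implementation: COMPRESS merges single tuples into their right neighbor (descendants and g* are not tracked), index 0 is never merged away, ties with the current maximum are inserted with Δ = 0, and all function names are hypothetical. For the 16-element stream of this talk and ε = 0.25 it ends in the six-tuple summary used in the examples; intermediate states of other implementations may differ slightly.

```python
import bisect
import math


def band(delta, p):
    """Band index of delta for p = floor(2 * eps * n); assumes 0 <= delta <= p."""
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1


def compress(summary, n, eps):
    """Simplified COMPRESS: merge t_i into t_{i+1} when the result is not full."""
    p = math.floor(2 * eps * n)
    i = len(summary) - 2
    while i >= 1:                          # assumption: never merge the minimum
        _, g, d = summary[i]
        v2, g2, d2 = summary[i + 1]
        if band(d, p) <= band(d2, p) and g + g2 + d2 < 2 * eps * n:
            summary[i + 1] = (v2, g + g2, d2)
            del summary[i]
        i -= 1


def insert(summary, n, v, eps):
    """INSERT(v) with delta = floor(2*eps*n), or 0 for a new extreme value."""
    i = bisect.bisect_right([t[0] for t in summary], v)
    is_extreme = i == 0 or i == len(summary)
    delta = 0 if is_extreme else math.floor(2 * eps * n)
    summary.insert(i, (v, 1, delta))


def build_summary(stream, eps):
    """Run the add-one-element loop over the whole stream."""
    summary, n = [], 0
    period = round(1 / (2 * eps))          # compress every 1/(2*eps) elements
    for v in stream:
        if n % period == 0:
            compress(summary, n, eps)
        insert(summary, n, v, eps)
        n += 1
    return summary


STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]
print(build_summary(STREAM, 0.25))
# [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
```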


A Complete Example

S={(10, 1, 0), (12, 1, 0)}, n=2
S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6

Perform compress when t6 comes.

S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5

Data stream so far (t0 to t6): 12, 10, 11, 10, 1, 10, 11 (ε = 0.25).


A Complete Example

S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7

Perform compress when t8 comes.

S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5

Data stream so far (t0 to t8): 12, 10, 11, 10, 1, 10, 11, 9, 6 (ε = 0.25).


A Complete Example

S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5

Perform compress.

S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4

Finally S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6

Data stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3 (ε = 0.25).


Band Property

Observe that the number of bands and the number of elements in a band determine the space complexity.

PROPOSITION 2: At any point in time n and for any α ≥ 1, bandα(n) contains either 2^α or 2^(α-1) distinct values of Δ.

Since no more than 1/(2ε) elements with any given Δ are inserted, bandα is a summary of at most 2^α/(2ε) elements in the stream.


LEMMAs

LEMMA 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value of α.

Only a small number of nodes can have a child with band α. See Proposition 3.


LEMMAs

A full pair of tuples (ti-1, ti): band(ti-1) ≤ band(ti). The tuple ti-1 is the left partner and ti is the right partner in this full pair.

LEMMA 4: At any time n and for any given α, there are at most 4/ε tuples from bandα(n) that are right partners in a full tuple pair.


Full Pair Example

{(2,1,7), (3,1,7)} is a full pair.

{(1,1,0), (2,1,7)} is not a full pair.

(2,1,7) can only be a left partner!

Figure: tree T with root R over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0).


Space Efficiency

Any bandα(n) node either is a right partner of a full pair, or can only be a left partner.

By Proposition 3, a bandα(n) node that can only be a left partner occurs only once for every parent of nodes from bandα(n).

By Lemmas 3 and 4, the number of nodes in any band is bounded by 3/(2ε) + 4/ε = 11/(2ε).


Space Efficiency

The number of bands is ⌈log(2εn)⌉ + 1.

THEOREM: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn).

GK-algorithm's space complexity is O((1/ε) log(εN)).
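To get a feel for the magnitude of this bound, the sketch below simply evaluates (11/(2ε))·log(2εn) for a few stream lengths. The function name is hypothetical and the logarithm is assumed to be base 2, in line with the power-of-two band boundaries. The bound is a worst case: it grows only logarithmically in n, and for very short streams it exceeds n itself.

```python
import math


def gk_tuple_bound(eps, n):
    """Worst-case tuple count (11 / (2*eps)) * log2(2*eps*n) from the theorem."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)


print(gk_tuple_bound(0.25, 16))  # 66.0, far above the 6 tuples of the running example
for n in (10**4, 10**6, 10**8):
    print(n, round(gk_tuple_bound(0.01, n)))
```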


Sliding Window Model

Under the sliding window model, a summary is maintained for the N most recently seen data elements.

Eliminating out-dated elements exactly requires O(N) space.

Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.


n-of-N Model

A summary is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0,1] and any n ≤ N, we can return φ-quantiles among the n most recent elements in the data stream seen so far.

Lin et al. (ICDE 2004) proposed a one-pass summary algorithm that combines the EH partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.


Example of n-of-N model

Assume the sliding window size is 16 in an n-of-N model. A quantile query can be answered for any 1 ≤ n ≤ 16.

0.5-quantile returns 6 for n=12 and 3 for n=4.

FYI: The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.

Data stream in arrival order (t0 to t15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
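The n-of-N answers in this example can be reproduced by brute force. The Python sketch below keeps the whole window and sorts the n most recent elements, so it uses O(N) space and is not the summary algorithm of Lin et al.; it only illustrates what an exact n-of-N query should return. The function name is hypothetical.

```python
import math

# Data stream in arrival order t0..t15; the window size N is 16.
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]


def quantile_of_recent(stream, n, phi):
    """Exact phi-quantile among the n most recent elements (rank ceil(phi * n))."""
    window = sorted(stream[-n:])
    return window[math.ceil(phi * n) - 1]


print(quantile_of_recent(STREAM, 12, 0.5))  # 6
print(quantile_of_recent(STREAM, 4, 0.5))   # 3
print(quantile_of_recent(STREAM, 16, 0.5))  # 8
```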


Thank you!

