Algorithms for Large Data Sets
Ziv Bar-Yossef, Lecture 12
June 18, 2006
http://www.ee.technion.ac.il/courses/049011
Data Streams
Outline
- The data stream model
- Approximate counting
- Distinct elements
- Frequency moments
The Data Stream Model

f: A^n → B
- A, B: arbitrary sets; typically, A and B are "small" (constant-size) sets
- n: a positive integer (think of n as large)
- Given x ∈ A^n, each entry x_i is called an "element"

Goal: given x ∈ A^n, compute f(x)
- Frequently, an approximation of f(x) suffices
- Usually, we will use randomization

Streaming access to the input:
- The algorithm reads the input in "sequential passes"; in each pass, x is read in the order x_1, x_2, …, x_n
- Impossible: random access, going backwards
- Possible: storing portions of x (or other functions of x) in memory
Complexity Measures

Space:
- Objective: use as little memory as possible
- Note: if we allow unlimited space, the data stream model is the same as the standard RAM model
- Ideally, at most O(log^c n) for some constant c

Number of passes:
- Objective: use as few passes as possible
- Ideally, only a single pass; usually, no more than a constant number of passes

Running time:
- Objective: use as little time as possible
- Ideally, at most O(n log^c n) for some constant c
Motivation

Types of large data sets:
- Pre-stored: kept on magnetic or optical media (tapes, disks, DVDs, …)
- Generated on the fly: data feeds, streaming media, packet streams, …

Access to large data sets:
- Random access: costly (if the data is pre-stored) or infeasible (if the data is generated on the fly)
- Streaming access: the only feasible option

Resources:
- Memory: the primary bottleneck
- Number of passes: a few (if the data is pre-stored); a single one (if the data is generated on the fly)
- Time: cannot be more than quasi-linear
Approximate Counting [Morris 77, Flajolet 85]

Input: a bitstring x ∈ {0,1}^n
Goal: find H = the number of 1's in x

Naïve solution: just count them! This takes O(log H) bits of space.

Can we do better?
- Answer 1: No! Information theory implies an Ω(log H) lower bound for exact counting.
- Answer 2: Yes! But only approximately: output the power of 1+ε closest to H. The number of possible outputs is then O(log_{1+ε} H) = O((1/ε) · log H), hence only O(log log H + log(1/ε)) bits of space suffice.
Approximate Counting (ε = 1)

k ← 0
for i = 1 to n do:
  if x_i = 1, then with probability 1/2^k set k ← k + 1
output 2^k − 1

General idea: the expected number of 1's needed to increment k to k + 1 is 2^k:
- k = 0 → k = 1: after seeing 1 one
- k = 1 → k = 2: after seeing 2 additional 1's
- k = 2 → k = 3: after seeing 4 additional 1's
- …
- k = i−1 → k = i: after seeing 2^{i−1} additional 1's
Therefore, we expect k to become i after seeing 1 + 2 + 4 + … + 2^{i−1} = 2^i − 1 ones.
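A minimal Python sketch of this pseudocode (the function name approx_count is mine); it simulates the probabilistic increment directly:

    import random

    def approx_count(bits):
        # Morris counter for eps = 1: the counter k needs O(log log H) bits
        k = 0
        for b in bits:
            # on each 1, increment k with probability 1/2^k
            if b == 1 and random.random() < 0.5 ** k:
                k += 1
        return 2 ** k - 1

A single run can be off by roughly a factor of 2; averaging several independent counters is the standard way to reduce the variance.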
Approximate Counting: Analysis

For m = 0,…,H, let K_m = the value of the counter after seeing m 1's.
For i = 0,…,m, let p_{m,i} = Pr[K_m = i].

Recursion:
- p_{0,0} = 1
- p_{m,0} = 0, for m = 1,…,H
- p_{m,i} = p_{m−1,i} · (1 − 1/2^i) + p_{m−1,i−1} · 1/2^{i−1}, for m = 1,…,H and i = 1,…,m−1
- p_{m,m} = p_{m−1,m−1} · 1/2^{m−1}, for m = 1,…,H
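The recursion is easy to check numerically; this short sketch (my own, using exact rationals) tabulates p_{m,i} and verifies the lemma of the next slide, E[2^{K_m}] = m + 1:

    from fractions import Fraction

    def counter_distribution(H):
        # p[m][i] = Pr[K_m = i], computed exactly via the recursion above
        p = [[Fraction(0)] * (H + 1) for _ in range(H + 1)]
        p[0][0] = Fraction(1)
        for m in range(1, H + 1):
            for i in range(1, m + 1):
                stay = p[m - 1][i] * (1 - Fraction(1, 2 ** i))
                move = p[m - 1][i - 1] * Fraction(1, 2 ** (i - 1))
                p[m][i] = stay + move
        return p

    H = 10
    p = counter_distribution(H)
    for m in range(H + 1):
        assert sum(p[m][i] * 2 ** i for i in range(m + 1)) == m + 1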
Approximate Counting: Analysis

Define: C_m = 2^{K_m}

Lemma: E[C_m] = m + 1.
Therefore, C_H − 1 is an unbiased estimator for H. One can also show that Var[C_H] is small, and therefore, w.h.p.,

H/2 ≤ C_H − 1 ≤ 2H.

Proof of lemma: by induction on m.
Basis: E[C_0] = 1, E[C_1] = 2.
Induction step: suppose m ≥ 2 and E[C_{m−1}] = m.
Approximate Counting: Analysis

Condition on the counter value after m − 1 ones: if K_{m−1} = i, then with probability 1/2^i the counter is incremented and C_m = 2^{i+1}; otherwise C_m = 2^i. Hence

E[C_m | K_{m−1} = i] = (1 − 1/2^i) · 2^i + (1/2^i) · 2^{i+1} = 2^i + 1.

Taking expectation over K_{m−1} gives E[C_m] = E[C_{m−1}] + 1 = m + 1, completing the induction.
Better Approximation

So far we have a factor-2 approximation. How do we obtain a (1+ε)-approximation? Use base 1+ε instead of base 2 (a sketch follows below):

k ← 0
for i = 1 to n do:
  if x_i = 1, then with probability 1/(1+ε)^k set k ← k + 1
output ((1+ε)^k − 1)/ε
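The same Python sketch, generalized to base 1+ε (again, the name is mine); smaller ε gives better accuracy at the cost of slightly more counter bits:

    import random

    def approx_count_eps(bits, eps):
        # increments become rarer as k grows, at rate (1+eps)^(-k)
        k = 0
        for b in bits:
            if b == 1 and random.random() < (1.0 + eps) ** (-k):
                k += 1
        return ((1.0 + eps) ** k - 1.0) / eps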
Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]

Input: a vector x ∈ {1,2,…,m}^n
Goal: find D = the number of distinct elements of x
Example: if x = (1,2,3,1,2,3), then D = 3

Naïve solution: use a bit vector of size m and track the values that appear at least once. This takes O(m) bits of space.

Can we do better?
- Answer 1: No! If we want the exact number, Ω(m) bits of space are needed. (Information theory gives only Ω(log m); the Ω(m) bound requires communication complexity arguments.)
- Answer 2: Yes! But only approximately, using only O(log m) bits of space.
Estimating the Size of a Random Set

Suppose we choose D ≪ M^{1/2} elements uniformly and independently from {1,…,M}:
- X_1 is uniformly chosen from {1,…,M}
- X_2 is uniformly chosen from {1,…,M}
- …
- X_D is uniformly chosen from {1,…,M}

For each k = 1,…,D, how many elements of {1,…,M} do we expect to be smaller than min{X_1,…,X_k}?
- k = 1: we expect M/2 elements to be less than X_1
- k = 2: we expect M/3 elements to be less than min{X_1,X_2}
- k = 3: we expect M/4 elements to be less than min{X_1,X_2,X_3}
- …
- k = D: we expect M/(D+1) elements to be less than min{X_1,…,X_D}

Conversely, suppose S is a set of randomly chosen elements from {1,…,M} whose size is unknown. Then, if t = min S, we can estimate |S| as M/t − 1.
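A quick simulation of this estimate (all numbers here are my own and purely illustrative):

    import random

    M, D, trials = 10**12, 1000, 201
    estimates = []
    for _ in range(trials):
        smallest = min(random.randrange(1, M + 1) for _ in range(D))
        estimates.append(M / smallest - 1)
    # the median estimate is typically within a small constant factor of D
    print(sorted(estimates)[trials // 2])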
Distinct Elements, 1st Attempt

Let M ≫ m².
Pick a random "hash function" h: {1,…,m} → {1,…,M}: that is, h(1),…,h(m) are chosen uniformly and independently from {1,…,M}. Since M ≫ m², the probability of collisions is tiny.

min ← M
for i = 1 to n do:
  if h(x_i) < min, min ← h(x_i)
output M/min
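A direct Python rendering of this attempt (names are mine). The dictionary stands in for a truly random hash function, and storing it is exactly the problem addressed a few slides below:

    import random

    def distinct_elements_first_attempt(stream, m):
        M = 100 * m * m                  # M >> m^2 keeps collisions unlikely
        h = {}                           # lazily sampled "random function"
        smallest = M
        for x in stream:
            if x not in h:
                h[x] = random.randrange(1, M + 1)
            if h[x] < smallest:
                smallest = h[x]
        return M / smallest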
Distinct Elements: Analysis

Space: O(log M) = O(log m). (Not quite; we'll discuss this later.)

Correctness: let a_1,…,a_D be the distinct values among x_1,…,x_n. Then S = { h(a_1),…,h(a_D) } is a set of D random and independent elements of {1,…,M}. Note: min = min S, so the algorithm outputs M/(min S).

Lemma: with probability at least 2/3, D/6 ≤ M/min ≤ 6D.
Distinct Elements: Correctness

Part 1: show that Pr[M/min > 6D] ≤ 1/6.

Define for k = 1,…,D: Z_k = 1 if h(a_k) < M/(6D), and Z_k = 0 otherwise.
Define: Z = Z_1 + … + Z_D.
Note: M/min > 6D iff min < M/(6D) iff Z ≥ 1, and E[Z] = Σ_k Pr[h(a_k) < M/(6D)] ≤ D · 1/(6D) = 1/6.
Markov's Inequality

X ≥ 0: a non-negative random variable; t > 1. Then: Pr[X ≥ t · E[X]] ≤ 1/t.

Need to show: Pr[Z ≥ 1] ≤ 1/6. By Markov's inequality, Pr[Z ≥ 1] ≤ E[Z] ≤ 1/6, so with probability at least 5/6, M/min ≤ 6D.
Distinct Elements: Correctness

Part 2: show that Pr[M/min < D/6] ≤ 1/6.

Define for k = 1,…,D: Y_k = 1 if h(a_k) < 6M/D, and Y_k = 0 otherwise.
Define: Y = Y_1 + … + Y_D.
Note: M/min < D/6 iff min > 6M/D iff Y = 0, and E[Y] = Σ_k Pr[h(a_k) < 6M/D] ≈ D · 6/D = 6.
Chebyshev's Inequality

X: an arbitrary random variable; t > 0. Then: Pr[|X − E[X]| ≥ t] ≤ Var[X]/t².

Need to show: Pr[Y = 0] ≤ 1/6. By Chebyshev's inequality,

Pr[Y = 0] ≤ Pr[|Y − E[Y]| ≥ 6] ≤ Var[Y]/36.

By independence of Y_1,…,Y_D: Var[Y] = Σ_k Var[Y_k] ≤ Σ_k E[Y_k] = E[Y] = 6.
Hence, Pr[Y = 0] ≤ 6/36 = 1/6.
How to Store the Hash Function?

How many bits are needed to represent a random hash function h: [m] → [M]? O(m log M) = O(m log m) bits: more than the naïve algorithm!

Solution: use "small" families of hash functions:
- H will be a family of functions h: [m] → [M]
- |H| = O(m^c) for some constant c
- Each h ∈ H can be represented in O(log m) bits
- H must be "explicit": given the representation of h, we can compute h(x) efficiently for any x

How do we make sure H has the "random-like" properties of totally random hash functions?
Universal Hash Functions [Carter, Wegman 79]

H is a 2-universal family of hash functions if for all x ≠ y ∈ [m] and for all z,w ∈ [M], when h is chosen from H at random,

Pr[h(x) = z and h(y) = w] = 1/M²

Conclusions:
- For each x, h(x) is uniform in [M]
- For all x ≠ y, h(x) and h(y) are independent
- h(1),…,h(m) is a sequence of uniform pairwise-independent random variables

k-universal families: a straightforward generalization.
Construction of a Universal Family

Suppose m = M and m is a prime power. [m] can then be identified with the finite field F_m. Each pair of elements a,b ∈ F_m defines one hash function in H, so |H| = |F_m|² = m²:

h_{a,b}(x) = ax + b (operations in F_m)

Note: if x ≠ y ∈ [m] and z,w ∈ [m], then h_{a,b}(x) = z and h_{a,b}(y) = w iff

ax + b = z
ay + b = w

Since x ≠ y, this system has a unique solution (a,b), and thus if we choose a,b at random, the probability of hitting the solution is exactly 1/m².
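A Python sketch of this construction over a prime field (the class name AffineHash is mine; a prime modulus is the simplest way to realize F_m):

    import random

    class AffineHash:
        # h_{a,b}(x) = (a*x + b) mod p, over the field F_p with p prime
        def __init__(self, p):
            self.p = p
            self.a = random.randrange(p)     # storing h means storing (a, b):
            self.b = random.randrange(p)     # only O(log p) bits
        def __call__(self, x):
            return (self.a * x + self.b) % self.p

    h = AffineHash(2**31 - 1)                # 2^31 - 1 is prime
    print(h(42))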
Distinct Elements, 2nd Attempt

Use a random hash function from a 2-universal family of hash functions rather than a totally random hash function.

Space:
- O(log m) for tracking the minimum
- O(log m) for storing the hash function

Correctness:
- Part 1: h(a_1),…,h(a_D) are still uniform in [M], and linearity of expectation holds regardless of whether Z_1,…,Z_D are independent or not.
- Part 2: h(a_1),…,h(a_D) are still uniform in [M]. Main point: the variance of a sum of pairwise-independent variables is additive: Var[Y] = Var[Y_1] + … + Var[Y_D].
Distinct Elements, Better Approximation

So far we have a factor-6 approximation. How do we get a better one?

(1+ε)-approximation algorithm (a sketch follows below):
- Track the t = O(1/ε²) smallest hash values, rather than just the smallest one.
- If v is the largest among these, output tM/v.

Space: O((1/ε²) · log m). A better algorithm achieves O(1/ε² + log m).
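A Python sketch of the t-smallest-values variant, reusing the affine hash from above (the function name and constants are mine, not the course's reference implementation):

    import heapq, random

    def distinct_elements_t_smallest(stream, eps):
        t = max(1, int(1.0 / eps ** 2))      # t = O(1/eps^2)
        p = 2**61 - 1                        # a large prime, playing the role of M
        a = random.randrange(1, p)
        b = random.randrange(p)
        smallest = []                        # max-heap (negated) of the t smallest h(x)
        for x in stream:
            hx = (a * x + b) % p
            if -hx in smallest:              # duplicate of an already-kept value
                continue
            if len(smallest) < t:
                heapq.heappush(smallest, -hx)
            elif hx < -smallest[0]:
                heapq.heapreplace(smallest, -hx)
        if len(smallest) < t:
            return len(smallest)             # fewer than t distinct values: exact
        v = -smallest[0]                     # the t-th smallest hash value
        return t * p / v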
Frequency Moments [Alon, Matias, Szegedy 96]

Input: a vector x ∈ {1,2,…,m}^n
Goal: find F_k = the k-th frequency moment of x:

F_k = Σ_{j=1}^{m} (f_j)^k, where for each j ∈ {1,…,m}, f_j = the number of occurrences of j in x.

Example: if x = (1,1,1,2,2,3), then f_1 = 3, f_2 = 2, f_3 = 1.

Examples:
- F_1 = n (counting)
- F_0 = the number of distinct elements
- F_2 = a measure of "pairwise collisions"
- F_k = a measure of "k-wise collisions"
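For reference, a naive non-streaming computation of F_k in Python (a helper of my own), checked against the slide's example:

    from collections import Counter

    def frequency_moment(x, k):
        # F_k = sum over values j of f_j^k
        return sum(f ** k for f in Counter(x).values())

    x = (1, 1, 1, 2, 2, 3)
    assert frequency_moment(x, 1) == 6       # F_1 = n
    assert frequency_moment(x, 2) == 14      # 3^2 + 2^2 + 1^2
    assert frequency_moment(x, 0) == 3       # F_0 = # distinct elements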
Frequency Moments: Data Stream Algorithms

Space bounds:
- F_0: O(1/ε² + log m)
- F_1: O(log log n + log(1/ε))
- F_2: O((1/ε²) · (log m + log n))
- F_k, k > 2: O((1/ε²) · m^{1−2/k})
End of Lecture 12