Post on 09-Dec-2014
1
Bloom Filter
xuanzi.wp@taobao.com
2011-11-18
2
• A Membership Query Problem
• What is Bloom Filter
• Bloom Filter Math Theory
• Compression
• Application Scenario
Agenda
3
Problem Description
Given an element E, query whether it
belongs to a large set of elements S.
– As fast as possible
– As small as possible
Membership Query Problem
4
Some Solutions
Hash table
fast, but a big data structure
Bitmap index
can it be smaller?
Membership Query Problem
5
Tradeoff Solutions
To obtain speed and size improvements,
allow some probability of error.
Bloom Filter
Membership Query Problem
6
Support approximate set membership
Given a set S = {x1,x2,…,xn}, construct a data
structure to answer queries of the form “Is y in S?”
Data structure should be:
– Fast (faster than searching through S).
– Small (smaller than explicit representation).
To obtain speed and size improvements, allow some probability of error.
– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S
What is Bloom Filter
7
Start with an m-bit array B, filled with 0s.
B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
To check if y is in S, check B at Hi(y). All k values must be 1.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive: all k values are 1, but y is not in S.
B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
n items, m = cn bits, k hash functions
What is Bloom Filter
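The construction and query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation the slides have in mind: the class name `BloomFilter` is hypothetical, and deriving the k hash functions by salting SHA-256 with the index i is an assumption made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array B and k hash functions."""

    def __init__(self, m, k):
        self.m = m                            # number of bits in B
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # B, initially all 0s

    def _positions(self, item):
        # Assumption: H_i(x) is SHA-256 of (i || x), reduced modulo m.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # For each H_i(x) = a, set B[a] = 1.
        for a in self._positions(item):
            self.bits[a // 8] |= 1 << (a % 8)

    def __contains__(self, item):
        # y is reported as a member only if all k bits are 1.
        return all(self.bits[a // 8] & (1 << (a % 8))
                   for a in self._positions(item))


bf = BloomFilter(m=8000, k=6)
bf.add("taobao")
print("taobao" in bf)   # an added item is always reported as present
```

Because bits are only ever set and never cleared, an added item can never be reported as absent; the only possible error is a false positive.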
False Positive
8
[Figure: element A is mapped by hash1, hash2, and hash3 to three positions of bit array B = 0 0 1 0 1 0 0 0 0 1 0]
Bloom Filter Math Theory
9
Pr(specific bit of filter is 0) is
$p' \equiv (1 - 1/m)^{kn} \approx e^{-kn/m} \equiv p$
If $\rho$ is the fraction of 0 bits in the filter, then the false positive probability is
$(1 - \rho)^k \approx (1 - p')^k \approx (1 - p)^k = (1 - e^{-k/c})^k$
Approximations are valid as $\rho$ is concentrated around $E[\rho]$.
– Martingale argument suffices.
Find the optimum at k = (ln 2)m/n by calculus.
– So the optimal fpp is about $(0.6185)^{m/n}$
n items, m = cn bits, k hash functions
Bloom Filter Math Theory
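The "by calculus" step can be filled in as follows. This is one standard way to reach the optimum, written with the slide's symbols; the helper function g is introduced only for this derivation.

```latex
\begin{align*}
p &= e^{-kn/m} \;\Longrightarrow\; k = -\frac{m}{n}\ln p \\
\ln f &= k \ln(1-p) = -\frac{m}{n}\,\ln p \,\ln(1-p) \\
g(p) &= \ln p \,\ln(1-p), \qquad
g'(p) = \frac{\ln(1-p)}{p} - \frac{\ln p}{1-p} = 0
\;\Longrightarrow\; p = \tfrac{1}{2} \\
k &= \frac{m}{n}\ln 2, \qquad
f = \left(\tfrac{1}{2}\right)^{k} = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}
\end{align*}
```

Since g(p) is symmetric under swapping p and 1 - p, the false positive rate is minimized exactly when half of the bits are 0.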
10
[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for m/n = 8]
Opt k = 8 ln 2 = 5.545...
n items, m = cn bits, k hash functions
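The curve for m/n = 8 can be reproduced numerically from the approximation (1 - e^{-k/c})^k. The sketch below is only a check of that formula; the function name `false_positive_rate` is made up for this example.

```python
import math


def false_positive_rate(c, k):
    """Approximate false positive rate for c = m/n bits per item and k hashes."""
    return (1 - math.exp(-k / c)) ** k


c = 8  # bits per item, as in the plot

# Tabulate the rate for k = 1..10 hash functions.
for k in range(1, 11):
    print(k, round(false_positive_rate(c, k), 5))

# Real-valued optimum from the formula k = (ln 2) m/n.
k_opt = c * math.log(2)

# In practice k must be an integer, so pick the best one in the plotted range.
best_integer_k = min(range(1, 11), key=lambda k: false_positive_rate(c, k))
print(k_opt, best_integer_k)
```

The rate is flat near the optimum, so rounding k to a neighboring integer costs very little.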
Bloom Filter Compression
Use BF in Network Transmission
A BF sent as a message should be small
enough
to be transmitted over the network
Compressing a bit vector is easy
Arithmetic coding gets close to entropy.
Can Bloom filters be compressed?
11
Bloom Filter Compression
• Optimize to minimize false positive.
• At k = m (ln 2) /n, p = 1/2.
• Bloom filter looks like a random string.
– Can’t compress it.
– H(p) = -p log2 p – (1-p) log2 (1-p), so H(1/2) = 1 bit per table bit
12
$p = \Pr[\text{cell is empty}] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
$f = \Pr[\text{false pos}] = (1 - p)^k \approx (1 - e^{-kn/m})^k$
$k = (\ln 2)\, m/n$
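The claim that a filter tuned for minimum false positives looks like a random string can be checked empirically. The sketch below makes two assumptions that are not in the slides: the hash functions are modeled as uniformly random bit positions, and zlib stands in for a general-purpose compressor (it is not the arithmetic coder mentioned earlier).

```python
import math
import random
import zlib


def random_filter(m, n, k, seed=0):
    """Simulate a filled filter: n items, each setting k uniformly random bits."""
    rng = random.Random(seed)
    bits = bytearray(m // 8)          # m is assumed to be a multiple of 8
    for _ in range(n * k):
        pos = rng.randrange(m)
        bits[pos // 8] |= 1 << (pos % 8)
    return bits


def zero_fraction(bits):
    """Fraction of 0 bits in the array."""
    ones = sum(bin(b).count("1") for b in bits)
    return 1 - ones / (8 * len(bits))


def entropy(p):
    """Binary entropy H(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


n, m, k = 10_000, 80_000, 6           # m/n = 8, k near 8 ln 2
bits = random_filter(m, n, k)
p = zero_fraction(bits)
ratio = len(zlib.compress(bytes(bits), 9)) / len(bits)
print(p, entropy(p), ratio)
```

With k close to the optimum, roughly half of the bits are 0, H(p) is close to 1, and the compressed size stays about as large as the raw bit array.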
Bloom Filter Compression
With a larger uncompressed size (storage),
we can achieve compression.
13
• Assumption: optimal compressor, z = mH(p).
– H(p) is the entropy function; optimally get H(p) compressed bits per original table bit.
– Arithmetic coding is close to optimal.
• Optimization: Given z bits for the compressed filter and n elements, choose table size m and number of hash functions k to minimize f.
$p = e^{-kn/m}; \quad f = (1 - e^{-kn/m})^k; \quad z = mH(p)$
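The optimization can be explored numerically by parametrizing on the zero-bit fraction p rather than on k, which avoids solving z = mH(p) for m. This is a sketch under the slide's optimal-compressor assumption; the function names are hypothetical.

```python
import math


def entropy(p):
    """Binary entropy H(p) in bits, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compressed_filter(z_per_item, p):
    """Given z/n compressed bits per item and a target zero-bit fraction p,
    return (m/n, k, f) assuming an optimal compressor, z = m H(p)."""
    m_per_item = z_per_item / entropy(p)   # uncompressed table bits per item
    k = -m_per_item * math.log(p)          # from p = e^{-kn/m}
    f = (1 - p) ** k                       # false positive rate
    return m_per_item, k, f


# z/n = 8 compressed bits per item; p = 0.5 is the uncompressed optimum.
for p in (0.5, 0.7, 0.9):
    m_per_item, k, f = compressed_filter(8, p)
    print(p, round(m_per_item, 2), round(k, 2), round(f, 5))
```

Moving p away from 1/2 makes the uncompressed table larger and sparser, yet the false positive rate for the same number of transmitted bits goes down, and k can be smaller.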
Bloom Filter Compression
14
[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for z/n = 8, comparing the original and the compressed Bloom filter]
Bloom Filter Compression
• At k = m (ln 2) /n, false positives are maximized with a compressed Bloom filter.
– Best case without compression is worst case with compression; compression always helps.
– Side benefit: Use fewer hash functions with compression; possible speedup.
15
Conclusion
Application Scenario
Speed up answers in a key-value-like system
16
[Figure: lookups pass through filter(memory) before reaching storage(memory)]
key1: filter says no
key2: filter says yes, disk access, success
key3: filter says yes, disk access, fail
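The lookup path in this scenario can be sketched as follows. Everything here is illustrative: `TinyBloom` and `FilteredStore` are hypothetical names, the SHA-256 based hashing is an assumption, and a Python dict stands in for the slow on-disk storage.

```python
import hashlib


class TinyBloom:
    """Small in-memory Bloom filter (m bits, k salted SHA-256 hashes)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, key)).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for a in self._positions(key):
            self.bits[a // 8] |= 1 << (a % 8)

    def __contains__(self, key):
        return all(self.bits[a // 8] & (1 << (a % 8))
                   for a in self._positions(key))


class FilteredStore:
    """Key-value store that consults the filter before touching the disk."""

    def __init__(self, m=1024, k=3):
        self.filter = TinyBloom(m, k)
        self.disk = {}          # stand-in for slow on-disk storage
        self.disk_reads = 0     # counts simulated disk accesses

    def put(self, key, value):
        self.filter.add(key)
        self.disk[key] = value

    def get(self, key):
        if key not in self.filter:
            return None         # filter says "no": skip the disk entirely
        self.disk_reads += 1    # filter says "yes": pay for a disk access
        return self.disk.get(key)   # None here means a false positive


store = FilteredStore()
store.put("key2", "value2")
print(store.get("key2"), store.get("key1"), store.disk_reads)
```

A "no" from the filter is always correct, so only keys that are present, plus the occasional false positive, cost a disk access.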
Application Scenario
Web Cache
17
cache1 cache2 cache3……
Web Server
Q&A
18
Q&A