Bloom filters
L. Becchetti
Dictionaries and Bloom filters
The maths of Bloom filters
Applications of Bloom filters
Compact data structures: Bloom filters
Luca Becchetti
“Sapienza” Universita di Roma – Rome, Italy
April 7, 2010
1 Dictionaries and Bloom filters
2 The maths of Bloom filters
3 Applications of Bloom filters
Dictionaries
A dynamic set S of objects from a discrete universe U, on which (at least) the following operations are possible:
Item insertion
Item deletion
Set membership: decide whether item x ∈ S
Typically, it is assumed that each element in S is uniquely identified by a key. Let obj(k) be the object with key k:
Operations
insert(x, S): insert item x
delete(k, S): delete the item whose key is k
retrieve(k, S): retrieve obj(k)
This is a minimal set of operations. Any database implements a (greatly augmented) dictionary
Testing for membership
Dictionaries are often large or huge in many applications
Any of the operations above potentially involves access to secondary storage
Set membership
Retrieval (deletion) can be restated as follows: if obj(k) ∈ S then retrieve(k, S) (delete(k, S))
Set membership ismember(k, S): if false then obj(k) ∉ S
Why this: membership can be tested efficiently using compact data structures
The check is often done in main memory
No need to access secondary storage if false
Example: spell-checker
Provide first level of spell checking for a text editor
Must quickly report spelling mistakes to the user
Exact check
Need efficient data structure
Trees are typically used
Terms correspond to nodes (typically leaves) of the tree
Thesaurus in the order of $10^5$ to $10^6$ terms
May be too large for quick response times
Idea: trade accuracy for efficiency
Bloom filters
Used to provide a compact summary of a set of keys
Key k hashed t times on [m] = {0, . . . , m − 1} using t “independent” hash functions
Binary array B of size m (m typically a prime)
For the moment: only insertions and set membership
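[Figure: Bloom filter. Key k is hashed by h_1(k), h_2(k), ..., h_t(k) into positions of the binary array B, indexed 0 to m−1; those positions are set to 1.]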
Use of Bloom filters (object retrieval)
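[Figure: time diagram of retrieve(k). Steps (1) and (2): ismember(k) is evaluated on the Bloom filter in main memory. Steps (3) and (4): if the answer is true, the database is accessed and obj(k) is returned.]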
Potential savings for retrieval (insertion/deletion)
- (3) and (4) do not occur if ismember(k) returns false
- Bloom filter stored in main memory
Bloom filters: insertion and set membership
insert(k)
Require: k: object key
for j = 1, ..., t do
    i = h_j(k)
    if B_i == 0 then
        B_i = 1
    end if
end for
ismember(k)
Require: k: object key
member = true; j = 1
while member == true && j <= t do
    i = h_j(k)
    if B_i == 0 then
        member = false
    end if
    j = j + 1
end while
return member
Figure: Bloom filter: insertion and set membership (S is implicit)
Initially, B_i = 0 for every i
B is a compact summary of keys of elements in S
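The two procedures above can be sketched in Python. This is a minimal illustration, not a reference implementation: the class name `BloomFilter` and the way the hash functions h_j are simulated (SHA-256 salted with the index j) are assumptions made for the example, since the slides leave the hash functions unspecified.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m bits and t hash functions (illustrative)."""

    def __init__(self, m, t):
        self.m = m
        self.t = t
        self.bits = [0] * m  # initially B_i = 0 for every i

    def _h(self, j, key):
        # Assumed stand-in for h_j: SHA-256 of the key salted with j, mod m.
        digest = hashlib.sha256(f"{j}:{key}".encode()).digest()
        return int.from_bytes(digest, "big") % self.m

    def insert(self, key):
        # Set the t positions selected by the hash functions.
        for j in range(self.t):
            self.bits[self._h(j, key)] = 1

    def ismember(self, key):
        # all() stops at the first 0 bit, like the while loop above.
        # False means the key was certainly never inserted.
        return all(self.bits[self._h(j, key)] for j in range(self.t))


bf = BloomFilter(m=1024, t=5)
bf.insert("bloom")
print(bf.ismember("bloom"))  # True: inserted keys are never missed
```

A `True` answer for a key that was never inserted remains possible; only `False` answers are definitive.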
False positives
- No false negatives but...
- Assume h1(k) = 2k + 1 mod 5, h2(k) = k + 2 mod 5
- ismember(4) returns true → false positive
[Figure: Bloom filter with t = 2 and m = 5 after insertion of keys (5, 2, 3). Positions 0, 1, 2 and 4 are set to 1; position 3 is still 0.]
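The example can be replayed with a few lines of Python. The sketch reads the second hash function as h2(k) = k + 2 mod 5; the helper names `build` and `ismember` are chosen for the illustration only.

```python
M = 5  # size of the bit array


def h1(k):
    return (2 * k + 1) % M


def h2(k):
    return (k + 2) % M


def build(keys):
    """Insert every key by setting bits h1(k) and h2(k)."""
    bits = [0] * M
    for k in keys:
        bits[h1(k)] = 1
        bits[h2(k)] = 1
    return bits


def ismember(bits, k):
    return bits[h1(k)] == 1 and bits[h2(k)] == 1


B = build([5, 2, 3])
print(B)               # [1, 1, 1, 0, 1]
print(ismember(B, 4))  # True, although 4 was never inserted
print(ismember(B, 1))  # False: h1(1) = 3 and B[3] is still 0
```

Key 4 maps to positions 4 and 1, which were set by keys 2 and 5 respectively, hence the false positive.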
The mathematics of Bloom filters
Having false positives means that we might access the database even if it contains no element with the searched key
This can be acceptable if P[false positive] is small
Probability of false positives
Assume n elements in the Bloom filter
Assume every h_j(·) is “ideal”, i.e., it hashes every item uniformly at random and independently of the others (for the sake of the analysis)
Consider ismember(k), with obj(k) ∉ S
What is P[ismember(k) == true]?
“Small” if m is large enough
Fraction of 0’s
Assume ideal h(·)’s
Assume that, after n insertions, the fraction of 0’s in B is p
Consider a key k that was never inserted (obj(k) ∉ S):
$P[\mathrm{ismember}(k) = \mathrm{true}] = (1 - p)^t$
The fraction of 0’s determines the probability of a false positive
p is itself a random variable that depends on t and m
Fraction of 0’s cont.
The $B_i$’s are random variables that depend on the input and the hash functions
After n insertions we have:
$P[B_i = 0] = \left(1 - \frac{1}{m}\right)^{tn}$
$E[p] = \frac{1}{m} \sum_{i=0}^{m-1} P[B_i = 0] = \left(1 - \frac{1}{m}\right)^{tn} \approx e^{-tn/m}$
If X = number of 0’s then X = mp and E[X] = m E[p]
Theorem ([Mitzenmacher, 2002])
Let X denote the number of 0’s in the Bloom filter after n insertions. Then
$P[|p - E[p]| > \epsilon] = P[|X - mE[p]| > \epsilon m] \le 2e^{-2\epsilon^2 m^2/(tn)}$
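A small simulation can make the concentration of p around E[p] tangible. The sketch below assumes ideal hash functions, modelled by drawing each of the tn positions uniformly at random; the parameter values and the function name `zero_fraction` are arbitrary choices for the illustration.

```python
import math
import random


def zero_fraction(m, t, n, rng):
    """Fraction of 0's after n insertions with t ideal hash functions."""
    bits = [0] * m
    for _ in range(n * t):  # each of the n keys sets t random positions
        bits[rng.randrange(m)] = 1
    return bits.count(0) / m


rng = random.Random(0)  # fixed seed so that the run is repeatable
m, t, n = 8000, 5, 1000
p = zero_fraction(m, t, n, rng)
expected = (1 - 1 / m) ** (t * n)  # E[p]
approx = math.exp(-t * n / m)      # e^{-tn/m}
print(p, expected, approx)
```

With these values the theorem bounds the probability of a deviation larger than 0.05 by 2e^{-64}, so the three printed numbers should be close to each other.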
Fraction of 0’s cont.
The $B_i$’s are not statistically independent (why?)
Proof uses an extension of Chernoff bounds
Remarks
Note that p is very close to E[p] with high probability.
Example: if $m \ge 17\sqrt{nt}$, then $p \in [E[p] - 0.1, E[p] + 0.1]$ with probability at least 99% → verify
In practice (see further) the condition above, or a similar one, is easy to satisfy
In the rest of this section we assume that $p \approx E[p] \approx e^{-tn/m}$ deterministically
This can be made rigorous at the cost of some complication in the analysis
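A short check of the example, applying the theorem with ε = 0.1 read as an absolute deviation from E[p] (this reading is an interpretation, chosen because it is the one under which the constant 17 gives the stated 99%):

```latex
\[
m \ge 17\sqrt{nt} \;\Longrightarrow\; \frac{m^2}{tn} \ge 289,
\qquad
P\bigl[\,|p - E[p]| > 0.1\,\bigr]
\le 2e^{-2(0.1)^2 \cdot 289} = 2e^{-5.78} \approx 0.0062 < 0.01 .
\]
```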
Choice of m and t
We have seen that, to a good approximation:
$P[\mathrm{ismember}(k) = \mathrm{true}] = (1 - p)^t \approx (1 - e^{-tn/m})^t$
We can play with the parameters m (size of the Bloom filter) and t (number of hash functions)
In the remainder of the analysis, we fix m and minimize the expression $f(t) = (1 - e^{-tn/m})^t$ w.r.t. t (n is given, m is fixed)
We next take $g(t) = \ln f(t) = t \ln(1 - e^{-tn/m})$. Minimizing f(t) is equivalent to minimizing g(t), but the latter is easier
Choice of m and t cont.
We have:
$\frac{dg}{dt} = \ln(1 - e^{-tn/m}) + \frac{tn}{m} \cdot \frac{e^{-tn/m}}{1 - e^{-tn/m}}$
The derivative is 0 when $t = \frac{m \ln 2}{n}$, and this is a global minimum
With this choice: $P[\mathrm{ismember}(k) = \mathrm{true}] \approx f(t) = \frac{1}{2^t} \approx (0.6185)^{m/n}$
Of course, the number t of hash functions has to be an integer
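As a worked check of the stationary point (a verification sketch only, it does not prove that the minimum is global): at t = (m ln 2)/n the exponential equals 1/2, so the two terms of the derivative cancel, and the value of f follows directly.

```latex
\[
t = \frac{m \ln 2}{n} \;\Longrightarrow\; e^{-tn/m} = \frac{1}{2},
\qquad
\frac{dg}{dt} = \ln\frac{1}{2} + \ln 2 \cdot \frac{1/2}{1 - 1/2} = -\ln 2 + \ln 2 = 0,
\]
\[
f(t) = \left(\frac{1}{2}\right)^{t} = 2^{-(m/n)\ln 2} = e^{-(\ln 2)^2\, m/n} \approx (0.6185)^{m/n}.
\]
```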
Recap
n is given
For any given m, $t = \frac{m \ln 2}{n}$ ideally, $\lceil \frac{m \ln 2}{n} \rceil$ or $\lfloor \frac{m \ln 2}{n} \rfloor$ in practice
Bloom filters are highly effective if m = cn, with c a small constant
Example: c = 8, t = 5 or 6 → false positive probability ≈ 0.02
Fixing m: in practice, choose a value a few times higher than the maximum predictable size of your database
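The recap can be turned into a small calculator. The function names `false_positive_rate` and `best_t` are made up for this sketch, and the error probability is the approximation $(1 - e^{-tn/m})^t$ used in the analysis, not an exact value.

```python
import math


def false_positive_rate(m, n, t):
    """Approximate false positive probability (1 - e^{-tn/m})^t."""
    return (1 - math.exp(-t * n / m)) ** t


def best_t(m, n):
    """Integer neighbour of m ln 2 / n with the smaller approximate error."""
    ideal = m * math.log(2) / n
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda t: false_positive_rate(m, n, t))


n = 1000
m = 8 * n  # c = 8 as in the example
for t in (5, 6):
    # Both values come out close to 0.02.
    print(t, round(false_positive_rate(m, n, t), 4))
print(best_t(m, n))
```

Rounding the ideal t up or down changes the error only marginally here, which is why either 5 or 6 hash functions is a reasonable choice for c = 8.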
Recap cont.
Assume a database with $n = 10^6$ documents; keys are document digests of size 1 Kbit each → about 125 MBytes for the keys alone
A retrieve operation can be very expensive; caching can only partly mitigate this
Using m = 8n, we have a Bloom filter of size 1 MByte that occupies only a small fraction of main memory
Still missing...
Deletions
Can be implemented at the expense of a moderate increase in memory
Handling deletions
Substitute the binary array with a counter array (counting Bloom filter)
[Figure: counting Bloom filter with t = 2 and m = 5 after insertion of keys (5, 2, 3). The counters for positions 0 to 4 are (2, 1, 2, 0, 1).]
Counting Bloom filters: insertion and deletion
insert(k)
Require: k: object key
for j = 1, ..., t do
    i = h_j(k)
    C_i = C_i + 1
end for
delete(k)
Require: k: object key
if ismember(k) then
    for j = 1, ..., t do
        i = h_j(k)
        C_i = C_i − 1
    end for
end if
Figure: Counting Bloom filters: insertion and deletion (S is implicit)
It is possible to prove that 4 bits per counter suffice for most applications [Broder and Mitzenmacher, 2004]
ismember(k) is unchanged
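A possible Python rendering of the counting variant follows. It is a sketch under stated assumptions: the class name `CountingBloomFilter` is invented for the example, the hash functions are simulated with salted SHA-256, and the counters are unbounded Python integers rather than the 4-bit counters mentioned above.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter with m counters and t hash functions (sketch)."""

    def __init__(self, m, t):
        self.m = m
        self.t = t
        self.counters = [0] * m

    def _positions(self, key):
        # Assumed stand-in for h_1..h_t: SHA-256 of the key salted with j.
        for j in range(self.t):
            digest = hashlib.sha256(f"{j}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, key):
        for i in self._positions(key):
            self.counters[i] += 1

    def ismember(self, key):
        # Same test as before: a zero counter plays the role of a 0 bit.
        return all(self.counters[i] > 0 for i in self._positions(key))

    def delete(self, key):
        # Guarded by ismember, as in the pseudocode above.
        if self.ismember(key):
            for i in self._positions(key):
                self.counters[i] -= 1


cbf = CountingBloomFilter(m=1024, t=4)
cbf.insert("a")
cbf.insert("b")
cbf.delete("a")
print(cbf.ismember("b"))  # True: deleting "a" leaves the counts of "b" in place
```

The ismember guard does not protect against deleting a key that is only a false positive; such a deletion decrements counters that belong to other keys and can introduce false negatives.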
Applications
[Broder and Mitzenmacher, 2004]
Database maintenance (since the early 80’s)
Cooperative distributed caching (see also [Fan et al., 2000])
P2P/Overlay networks
Resource routing
Packet routing
Summary cache [Fan et al., 2000]
Internet Cache Protocol (ICP)
Proxies cooperate
Summary cache cont.
On a cache miss, a proxy contacts its neighbour proxies instead of requesting the page from the Web server
ICP traffic can cause great overhead even for few proxies
Idea
Each proxy stores a (counting) Bloom filter of every other proxy’s contents
Keys are the URLs
On a cache miss:
1 Check locally stored Bloom filters for key membership
2 Contact a proxy whose relevant Bloom filter is positive for the key
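The lookup performed on a cache miss can be sketched as follows. This is a toy model, not the protocol of [Fan et al., 2000]: the summaries are plain bit arrays rather than counting filters, and the proxy names, URLs, filter size `M`, number of hash functions `T` and helper names are all hypothetical.

```python
import hashlib

M, T = 4096, 4  # assumed filter size and number of hash functions


def positions(url):
    # Assumed stand-in for the hash functions: salted SHA-256.
    for j in range(T):
        digest = hashlib.sha256(f"{j}:{url}".encode()).digest()
        yield int.from_bytes(digest, "big") % M


def summarize(urls):
    """Bloom filter summary of the URLs cached by one proxy."""
    bits = [0] * M
    for url in urls:
        for i in positions(url):
            bits[i] = 1
    return bits


def candidates(url, summaries):
    """Neighbour proxies whose locally stored summary is positive for url."""
    return [name for name, bits in summaries.items()
            if all(bits[i] for i in positions(url))]


# Summaries of the neighbours, as stored locally by the proxy that missed.
summaries = {
    "proxy_a": summarize(["http://example.org/a", "http://example.org/b"]),
    "proxy_b": summarize(["http://example.org/c"]),
}
print(candidates("http://example.org/c", summaries))  # contains "proxy_b"
```

Only the proxies returned by `candidates` need to be contacted; a false positive costs one useless query, while a negative summary rules the proxy out without any network traffic.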
Questions
Q1
Consider two dictionaries over the same universe of objects (and therefore keys)
Describe how and why Bloom filters make it easy to construct a compact summary of their union
Q2
Dictionary in secondary storage with n items, no insertions/deletions
retrieve(k) costs ∆ (time to access disk)
Access to main memory negligible
70% of requested items not in dictionary
Let T be the response time
Design a Bloom filter such that $\Delta / E[T] \ge 2$, i.e., a 100% speed-up
Example: spell-checker
Text editor spell-checker
Must quickly report spelling mistakes to the user
Thesaurus contains $10^5$ terms
Average term length: 10 bytes
Design a Bloom filter that performs spell-checking with probability of error 0.01
Solution
Impose that $(0.6185)^{m/n} \le 0.01$ → $\frac{m}{n} \ge 9.59$
$t = \frac{m}{n} \ln 2 \approx 6.65$
We can use a Bloom filter of size ≈ 1 Mbit using 7 hash functions
Note that storing all words requires 1 MByte + data structure
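The arithmetic of the solution can be reproduced in a few lines. The variable names are chosen for the illustration, and the final error estimate uses the approximation $(1 - e^{-tn/m})^t$ from the analysis, so it is indicative rather than exact.

```python
import math

n = 10**5      # number of terms
target = 0.01  # desired probability of error

# Smallest ratio m/n such that 0.6185^(m/n) <= target.
ratio = math.log(target) / math.log(0.6185)
m = math.ceil(ratio * n)             # bits in the filter
t = math.ceil(ratio * math.log(2))   # ideal t rounded up to an integer

# Approximate error with the rounded t.
fp = (1 - math.exp(-t * n / m)) ** t

print(round(ratio, 2), m, t)
print(round(fp, 4))
print(n * 10 / 10**6, "MBytes to store the words themselves")
```

Because t is rounded to 7, the approximate error stays very close to the 0.01 target rather than matching it exactly; rounding m up to a full 1 Mbit brings it slightly lower.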
References
Broder, A. and Mitzenmacher, M. (2004).
Network applications of Bloom filters: A survey.
Internet Mathematics, A K Peters, Ltd., volume 1.
Fan, L., Cao, P., Almeida, J., and Broder, A. Z. (2000).
Summary cache: a scalable wide-area Web cache sharing protocol.
IEEE/ACM Transactions on Networking, 8(3):281–293.
Mitzenmacher, M. (2002).
Compressed Bloom filters.
IEEE/ACM Transactions on Networking, 10(5):604–612.