Date post: | 17-Jun-2015 |
Upload: | kira |
Bloom Filters
Kira Radinsky
Slides based on material from:
Michael Mitzenmacher and Hanoch Levy
Motivation - Cache
• Lookup questions: Does item “x” exist in a set?
• Data set may be very big or expensive to access. Filter lookup questions with negative results before accessing data.
• Allow false positive errors, as they only cost us an extra data access.
• Don’t allow false negative errors, because they result in wrong answers.
Application of Bloom Filters: Distributed Web Caches
[Figure: six cooperating Web caches, Web Cache 1 through Web Cache 6.]
• Send Bloom filters of URLs.
• False positives do not hurt much.
– Get errors from cache changes anyway.
Web Caching
• Summary Cache: [Fan, Cao, Almeida, & Broder]
If local caches know each other’s content...
…try local cache before going out to Web
• Sending/updating lists of URLs too expensive.
• Solution: use Bloom filters.
• False positives
– Local requests go unfulfilled.
– Small cost, big potential gain.
The Problem Solved by BF: Approximate Set Membership
• Lookup Problem: Given a set S = {x1,x2,…,xn}, construct a data structure to answer queries of the form “Is y in S?”
• Data structure should be:
– Fast (Faster than searching through S).
– Small (Smaller than explicit representation).
• To obtain speed and size improvements, allow some probability of error.
– False positives: y ∉ S but we report y ∈ S
– False negatives: y ∈ S but we report y ∉ S
Bloom Filters
Start with an m bit array, filled with 0s.
Hash each item xj in S k times. If Hi(xj) = a, set B[a] = 1.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
To check if y is in S, check B at Hi(y). All k values must be 1.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
Possible to have a false positive; all k values are 1, but y is not in S.
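The insert and lookup steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter` and the choice of deriving the k hash functions Hi by salting SHA-256 with the index i are assumptions made for the example; the slides do not prescribe any particular hash family.

```python
import hashlib


class BloomFilter:
    """Bit array B of m bits, probed by k hash functions H_1..H_k."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # start with an m-bit array filled with 0s

    def _positions(self, item: str):
        # Assumption: H_i(x) is SHA-256 of "i:x", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1  # if H_i(x) = a, set B[a] = 1

    def __contains__(self, item: str) -> bool:
        # All k probed bits must be 1; a single 0 bit proves absence.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=16, k=3)
for url in ("a.example", "b.example"):  # hypothetical items
    bf.add(url)
print("a.example" in bf)  # inserted items are always reported as present
print("c.example" in bf)  # usually False, but may be a false positive
```

A lookup that returns False is always correct; a lookup that returns True only means that all k bits happen to be set.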
Bloom Filter
[Figure: an element x is hashed by h1(x), h2(x), h3(x), …, hk(x) into positions of the bit vector V0 … Vm-1 (01000 10100 00010).]
Advantages
• No Overflow
• Union and intersection of Bloom filters
– Simple bitwise OR and AND operations (for filters with the same size and hash functions).
• Applications:
– Google BigTable
– The Squid Web Proxy Cache uses Bloom filters for cache digests.
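The union and intersection operations listed under Advantages can be sketched as below. The sketch assumes both filters use the same m and the same hash functions, and represents each filter as a plain list of bits; the function names are illustrative. OR yields exactly the filter that inserting both sets would have produced. AND yields a filter with no false negatives for the intersection, but it may have more bits set than a filter built directly from the intersection.

```python
def bloom_union(b1, b2):
    """Bitwise OR of two equally sized bit arrays."""
    if len(b1) != len(b2):
        raise ValueError("filters must have the same size m")
    return [x | y for x, y in zip(b1, b2)]


def bloom_intersection(b1, b2):
    """Bitwise AND of two equally sized bit arrays."""
    if len(b1) != len(b2):
        raise ValueError("filters must have the same size m")
    return [x & y for x, y in zip(b1, b2)]


b1 = [0, 1, 0, 0, 1, 0, 1, 0]
b2 = [0, 1, 1, 0, 0, 0, 1, 0]
print(bloom_union(b1, b2))         # [0, 1, 1, 0, 1, 0, 1, 0]
print(bloom_intersection(b1, b2))  # [0, 1, 0, 0, 0, 0, 1, 0]
```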
Bloom Errors
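[Figure: elements a, b, c, d have been inserted into the bit vector V0 … Vm-1 (01000 10100 00010); the positions h1(x), h2(x), h3(x), …, hk(x) of a query x all fall on bits that are already set.]
x was never inserted, yet all of its bits are already set.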
Example
[Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10).]
m/n = 8
Opt k = 8 ln 2 = 5.545...
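The curve in this example can be reproduced numerically. The sketch below uses the standard approximation f = (1 - e^(-kn/m))^k for the false positive rate, which assumes independent, uniformly distributed hash functions; the function name and the range of k are choices made for illustration.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Approximate false positive rate (1 - e^(-k / (m/n)))^k."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8  # m/n = 8, as in the example
rates = {k: false_positive_rate(bits_per_item, k) for k in range(1, 11)}
best_k = min(rates, key=rates.get)

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")
print("best integer k:", best_k)
print("real-valued optimum:", bits_per_item * math.log(2))
```

Because k must be an integer, the filter uses k = 5 or k = 6 in practice; under this approximation k = 6 is marginally better, at a rate of roughly 0.0216.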
Tradeoffs
• Three parameters.
– Size m/n: bits per item.
• |S| = n: number of elements to encode.
• hi: U → [1..m]: maintain a bit vector V of size m.
– Time k: number of hash functions.
• Use k hash functions (h1..hk).
– Error f: false positive probability.
Bloom Filter Tradeoffs
• Three factors: m, k, and n.
• Normally, n and m are given, and we select k.
• Small k
– Fewer computations.
– Actual number of bits accessed (nk) is smaller, so the chance of a “step over” is smaller too.
– However, fewer bits need to be stepped over to generate an error.
• For big k, the exact opposite holds.
• Not surprisingly, when k is optimal, the “hit ratio” (fraction of bits set in the array) is about 0.5.
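The tradeoff between small and large k can be made precise with a short derivation. This is the standard analysis, sketched here under the assumption that the hash functions are independent and uniform; p denotes the probability that a given bit is still 0 after all n items are inserted, a symbol introduced for this sketch.

```latex
p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m},
\qquad
f = (1 - p)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}.

% Minimizing \ln f = k \ln\left(1 - e^{-kn/m}\right) over k gives
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2,
\qquad
p = \frac{1}{2},
\qquad
f = \left(\frac{1}{2}\right)^{k_{\mathrm{opt}}} \approx (0.6185)^{m/n}.
```

At the optimum half of the bits are 0 and half are 1, which is the "hit ratio" of 0.5 mentioned above.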
Alternative Approach for Bloom Filters: Perfect Hashing Approach
[Figure: Elements 1 to 5 are mapped one-to-one onto five cells, which hold, in order, Fingerprint(4), Fingerprint(5), Fingerprint(2), Fingerprint(1), Fingerprint(3).]
Perfect Hashing Approach
• Folklore Bloom filter construction.
– Recall: Given a set S = {x1,x2,x3,…xn} on a universe U, we want to answer membership queries.
– Method: Find an n-cell perfect hash function for S.
• Maps set of n elements to n cells in a 1-1 manner.
– Then keep a $\log_2(1/\epsilon)$-bit fingerprint of the item in each cell. Lookups have false positive probability < $\epsilon$.
– Advantage: each bit per item reduces false positives by a factor of 1/2, versus a factor of $(1/2)^{\ln 2} \approx 0.6185$ for a standard Bloom filter.
• Negatives:
– Perfect hash functions non-trivial to find.
– Cannot handle on-line insertions.
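The space advantage of the fingerprint approach can be checked with a few lines of arithmetic. The sketch below compares bits per item for a target false positive rate eps; it assumes the fingerprint length is rounded up to whole bits and that the standard Bloom filter uses the optimal k, so that f = (1/2)^((m/n) ln 2). The function names are illustrative, and the space for the perfect hash function itself is not counted.

```python
import math


def fingerprint_bits(eps: float) -> int:
    """Fingerprint length per cell, rounded up to whole bits (assumption)."""
    return math.ceil(math.log2(1.0 / eps))


def bloom_bits_per_item(eps: float) -> float:
    """m/n needed by a standard Bloom filter with optimal k."""
    return math.log2(1.0 / eps) / math.log(2)


for eps in (0.1, 0.01, 0.001):
    print(eps, fingerprint_bits(eps), round(bloom_bits_per_item(eps), 2))
```

For eps = 0.01 this gives 7 fingerprint bits per item against about 9.59 bits per item for the standard filter, a ratio of roughly 1/ln 2 ≈ 1.44 before rounding.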
Bloom Filters and Deletions
• Cache contents change
– Items both inserted and deleted.
• Insertions are easy – add bits to BF
• Can Bloom filters handle deletions?
– Use Counting Bloom Filters to track insertions/deletions at hosts;
– Send Bloom filters.
Handling Deletions
• Bloom filters can handle insertions, but not deletions.
• If deleting xi means resetting 1s to 0s, then deleting xi will “delete” xj.
B = 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
[Figure: xi and xj hash to overlapping positions in B.]
Counting Bloom Filters
Start with an array of m counters, all set to 0.
Hash each item xj in S k times. If Hi(xj) = a, add 1 to B[a].
B = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B = 0 3 0 0 1 0 2 0 0 3 2 1 0 2 1 0
To delete xj, decrement the corresponding counters.
B = 0 2 0 0 0 0 2 0 0 3 2 1 0 1 1 0
Can obtain a corresponding Bloom filter by reducing each counter to 0/1.
B = 0 1 0 0 0 0 1 0 0 1 1 1 0 1 1 0
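The counting variant can be sketched in the same style. As before, the class name and the salted SHA-256 hash functions are assumptions made for illustration. Deletion is only safe for items that were actually inserted, so this sketch refuses to decrement a counter below zero.

```python
import hashlib
from collections import Counter


class CountingBloomFilter:
    """Array B of m counters, probed by k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m  # all counters start at 0

    def _positions(self, item: str):
        # Assumption: H_i(x) is SHA-256 of "i:x", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.counters[a] += 1  # if H_i(x) = a, add 1 to B[a]

    def remove(self, item: str) -> None:
        needed = Counter(self._positions(item))
        if any(self.counters[a] < c for a, c in needed.items()):
            raise KeyError(item)  # item cannot have been inserted
        for a, c in needed.items():
            self.counters[a] -= c

    def __contains__(self, item: str) -> bool:
        return all(self.counters[a] > 0 for a in self._positions(item))

    def to_bloom_bits(self):
        # Reduce each counter to 0/1 to obtain the plain Bloom filter.
        return [1 if c > 0 else 0 for c in self.counters]


cbf = CountingBloomFilter(m=16, k=3)
cbf.add("xi")
cbf.add("xj")
cbf.remove("xi")
print("xj" in cbf)  # True: deleting xi does not "delete" xj
print(cbf.to_bloom_bits())
```

In the Web caching setting described earlier, each host would keep the counters locally and send only the reduced 0/1 array to its peers.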
Counting Bloom Filters: Overflow
• Must choose counters large enough to avoid overflow.
• Poisson approximation suggests 4 bits/counter.
– With k = (ln 2) m/n hash functions, the average load per counter is ln 2.
– Probability a counter has load at least 16: $\approx e^{-\ln 2} (\ln 2)^{16} / 16! \approx 6.78 \times 10^{-17}$
• Failsafes possible.
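The overflow estimate can be verified numerically. The sketch below treats the load of a counter as Poisson with mean ln 2, as in the approximation above, and evaluates both the single term at 16 (the value quoted on the slide) and the tail from 16 upward; truncating the tail sum at 60 terms is an arbitrary choice that is far past the point where terms matter.

```python
import math


def poisson_pmf(lam: float, j: int) -> float:
    """Probability that a Poisson(lam) variable equals j."""
    return math.exp(-lam) * lam ** j / math.factorial(j)


lam = math.log(2)  # average load per counter at optimal k
p_16 = poisson_pmf(lam, 16)
p_tail = sum(poisson_pmf(lam, j) for j in range(16, 60))

print(f"P(load = 16)  ~ {p_16:.3e}")
print(f"P(load >= 16) ~ {p_tail:.3e}")
```

A 4-bit counter holds values up to 15, so a load of 16 is the first that overflows. The tail is dominated by its first term, which is why the single term is a good estimate.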
Variations and Extensions
• Distance-Sensitive Bloom Filters
• Bloomier Filter
Extension: Distance-Sensitive Bloom Filters
• Instead of answering questions of the form “Is y ∈ S?”, we would like to answer questions of the form “Is y ≈ x ∈ S?”
• That is, is the query close to some element of the set, under some metric and some notion of close.
• Applications:
– DNA matching
– Virus/worm matching
– Databases
• Some initial results [Kirsch, Mitzenmacher]. Hard.
Extension: Bloomier Filter
• Bloom filters handle set membership.
• Counters to handle multi-set/count tracking.
• Bloomier filter [Chazelle, Kilian, Rubinfeld, Tal]:
– Extend to handle approximate functions.
– Each element of set has associated function value.
– Non-set elements should return null.
– Want to always return correct function value for set elements.
– A false positive returns a (non-null) function value for a non-set element.