
Bloom Filters for K-mer Counting CSCIE58 JLIN

Date posted: 19-Sep-2015
Uploaded by: jonzlin
Description: Bloom filters for sequence repeats
Bloom Filters for K-mer Counting
Jon Z. Lin
CSCI E-58
May 13, 2015
https://youtu.be/HUc_wmFQlbQ
Transcript

Slide 1

Bloom Filters for K-mer Counting
Jon Z. Lin
CSCI E-58
May 13, 2015
https://youtu.be/HUc_wmFQlbQ

I am Jon, and I am here to talk about Bloom filters for k-mer counting.

Slide 2: Motivation

Problem: k-mer counting (pattern matching) is simple yet demanding in memory and speed.
Solution: Bloom filters (1970)
Question: is it useful?

http://www.imdb.com/title/tt0832266/

K-mer counting is a variation on the repeated-pattern search we have seen in class. It is used, for example, in sequence assembly to toss out non-repeating k-mers, which are thought to be defective reads.
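To make the setup concrete, here is a minimal sketch of k-mer extraction with a sliding window (the read and k below are made-up examples, not from the slides):

```python
def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# A made-up 8-base read and k = 4 give 8 - 4 + 1 = 5 overlapping 4-mers.
print(list(kmers("ATGGCATG", 4)))  # ['ATGG', 'TGGC', 'GGCA', 'GCAT', 'CATG']
```

A read of length L yields L - k + 1 k-mers, which is why the search space grows so quickly.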

Remember that the solution can be naïve, but as the search space grows in size, naïve solutions fail to complete in reasonable time or run out of memory.

I came across a 2011 paper that claimed Bloom filters can solve the problem with half the memory.

Bloom filters are three years older than suffix trees.

Slide 3: What does it do?

Space-efficient and probabilistic
Probable positives + false positives
No false negatives (100% recall!)

[Diagram: a Bloom filter takes patterns of length k (k-mers) and answers either "Definitely not here" or "Maybe here".]

So what can it do for us? It is a memory-efficient way to test set membership. For example, every Chrome browser uses a Bloom filter to store all known malicious URLs and warns the user by testing the membership of each URL.

In k-mer counting, we can test any k-mer to tell us one of two things: either the k-mer is definitely not a repeat, or the k-mer is probably a repeat.

Slide 4: How does it work?

https://www.jasondavies.com/bloomfilter/

https://youtu.be/AgvJUfTviHo

Simple hashing

[Diagram: a hashing function maps a pattern of length k (a k-mer) to a digest of length m (a hash).]

[Figure: number of bits vs. number of patterns (log scale), for k = 30 and p = 0.001.]

In order to understand this quirky "maybe" answer, we take a look at its construction. It is built on simple hashed digests.

So, with simple hashing, we can reduce the k-mers down to a shorter sequence of bits.
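As an illustration of this reduction (my own sketch; SHA-1 is an arbitrary stand-in for the hash function, and the k-mer and m are example values):

```python
import hashlib

def digest(kmer, m):
    """Hash a k-mer down to an m-bit digest."""
    h = int.from_bytes(hashlib.sha1(kmer.encode()).digest(), "big")
    return h % (2 ** m)  # keep only the low m bits

# A 30-mer (240 bits as text) reduced to a 16-bit digest.
print(digest("ATGGCATGGCATGGCATGGCATGGCATGGC", 16))
```

Any fixed hash gives the same digest for the same k-mer, so digests can stand in for the k-mers themselves.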

As I will explain in the next couple slides, hashing creates a probability that two different k-mers might be hashed into the same digest, thereby creating a collision condition. And if we are to use the digest to test set membership, we would have false positives.

The figure here shows the reduction in memory requirement for 30-mers while maintaining a false-positive rate of 0.1%.

Slide 6: Consequence: false positives

Collision: two patterns xi and xj, where xi ≠ xj, are hashed to the same key, so h(xi) = h(xj) even though xi ≠ xj.

Good hashing gives, for an m-bit digest:

  Prob[two distinct patterns collide] = (1/2)^m

so the probability of at least one collision among n patterns is

  Prob[h(xi) = h(xj) for some i, j] = 1 - (1 - (1/2)^m)^n

Again, we can write the probability of hash collision in a compact form.

Slide 7: A little bit of algebra

  d = n log2(1 / [1 - (1 - p)^(1/n)])    (d_digest)
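Plugging illustrative numbers into that collision formula (m and n here are my own example values, not from the slides):

```python
def collision_prob(m, n):
    """Probability of at least one collision among n patterns,
    per the slide: 1 - (1 - (1/2)**m)**n."""
    return 1 - (1 - 0.5 ** m) ** n

# One million patterns hashed to 32-bit digests.
print(collision_prob(32, 1_000_000))  # roughly 2.3e-4
```

Shrinking m saves memory but drives this collision (false-positive) probability up, which is the trade-off the rest of the talk manages.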

  d = m n

where:
  d = retained size
  n = number of patterns (k-mers)
  p = probability of hash collision (false positive)
  m = digest size

And we can derive the memory requirement d as a function of the size of the genome n and the probability of false positives p.

Slide 8: Hash + bit array = 1 filter

One step forward, one step back.

[Diagram: a digest function maps each k-mer to an m-bit digest; the n digests (e.g. n = 1,000,000) are then hashed to a bit index in a bit array of length 2^m.]

Once we have the digests, with another level of indirection, we can store the set membership in a bit array by hashing the digests into a linear array, where 0 indicates negative set membership and 1 indicates a probable positive.
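A minimal sketch of such a single-hash filter (my own illustration; the hash choice and the value of m are assumptions, not from the slides):

```python
import hashlib

class SingleHashFilter:
    """Digest + bit array with one hash function:
    0 means definitely absent, 1 means probably present."""

    def __init__(self, m):
        self.m = m
        self.bits = bytearray(max(2 ** m // 8, 1))  # 2**m bits

    def _index(self, kmer):
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest(), "big")
        return h % (2 ** self.m)

    def add(self, kmer):
        i = self._index(kmer)
        self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, kmer):
        i = self._index(kmer)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

f = SingleHashFilter(m=16)
f.add("ATGGC")
print("ATGGC" in f)   # True: an added k-mer is always "maybe here"
print("CCCCC" in f)   # almost certainly False ("definitely not here")
```

Note that a set bit can only say "probably present": two k-mers may hash to the same index, which is exactly the false-positive behavior described above.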

We will call this digest + bit array structure a filter.

Slide 9: Digest vs. single filter

[Figure: number of bits vs. number of patterns (log scale), with p = 0.001 and bit array size = 2^m.]

When we do this, that is, combine a digest representation of the k-mers with a bit array to keep track of the set memberships, we can in fact lose the space savings. Why? Because at this point we have not optimized the size of the bit array (2^m).

Green is the filter and red is the digest, and we can see the memory cost of adding the bit array.

Slide 10: Another step forward: multiple hashing functions

[Diagram: the n m-bit digests (e.g. n = 1,000,000) are each hashed by several different hash functions to array indexes in the same bit array of length 2^m.]

Here is where the Bloom filter begins to emerge: it uses multiple hashing functions to map the digests to the same bit array. This sounds expensive, because you are adding computing time.

Slide 11: Digest => 1 filter => multi-filters
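A compact sketch of the emerging Bloom filter (my own illustration; deriving the f indexes from one SHA-256 digest by double hashing is my choice, not something the slides specify):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.f = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _indexes(self, item):
        # Derive f indexes from one digest via double hashing: h1 + i*h2.
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:16], "big")
        h2 = int.from_bytes(d[16:], "big") | 1  # force h2 odd
        return [(h1 + i * h2) % self.size for i in range(self.f)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item):
        return all(self.bits[i // 8] & (1 << (i % 8))
                   for i in self._indexes(item))

bf = BloomFilter(size_bits=10_000, num_hashes=7)
for kmer in ["ATGGC", "TGGCA", "GGCAT"]:
    bf.add(kmer)
print("ATGGC" in bf)  # True: no false negatives
```

A query reports "maybe here" only if all f bits are set, which is why several hash functions can tolerate a much smaller bit array than one hash alone.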

[Figure: number of bits vs. number of patterns (log scale).]

But what you get back is memory savings: multiple filters use less memory than simple digests.

Slide 12: Savings so far

  d = n log2(1 / [1 - (1 - p)^(1/n)])      (d_digest)
  d = 1 / [1 - (1 - p)^(1/n)]              (d_single filter)
  d = 1 / [1 - (1 - p^(1/f))^(1/(f n))]    (d_multiple filter)

Optimization choices:

1. Choose the storage size (d), then find the optimal number of filters (f) that minimizes the probability of false positives (p).
2. Choose the probability of false positives (p), then find the optimal number of filters (f) that minimizes the storage size (d).
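Both choices come down to the standard Bloom-filter formulas; here is a quick calculator (these are standard results, and the values of n and p are example inputs of mine):

```python
import math

def optimal_f_for_storage(d, n):
    """Given d bits and n k-mers, the f that minimizes p is (d/n) * ln 2."""
    return (d / n) * math.log(2)

def storage_for_p(n, p):
    """Given a target false-positive rate p, the minimal bit count is
    d = -n * ln(p) / (ln 2)**2, about 1.44 * n * log2(1/p)."""
    return -n * math.log(p) / (math.log(2) ** 2)

n = 1_000_000
d = storage_for_p(n, 0.001)
print(round(d / n, 2))                     # ~14.38 bits per k-mer
print(round(optimal_f_for_storage(d, n)))  # 10 hash functions
```

Either quantity can be held fixed while the other is solved for, matching the two choices above.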

How does this work? If we choose the hashing functions properly, so that they are independent of one another, we can use much smaller digests for the k-mers and a smaller bit array to store the set membership.

In fact, we have two optimization choices: we can specify the probability of false positives and find the optimal number of filters to minimize the memory footprint, or dictate the memory size and minimize the probability of false positives.

Slide 13: Space savings

[Figure: number of bits vs. number of k-mers (log scale).]

  optimal f ≈ (d/n) ln 2 ≈ 7

http://dl.acm.org/citation.cfm?id=1070548

We can do this optimally for p = 0.1% and see that the optimal number of filters gives us, in orange, the best memory savings so far.

Slide 14: Bloom Filters

Choose p; then optimal f ≈ (d/n) ln 2.

  d = (n / ln 2) log2(1/p)

For each k-mer: 1.44 log2(1/p) bits
Theoretical limit: log2(1/p) bits
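Checking those two numbers for the running example of p = 0.001:

```python
import math

p = 0.001
theoretical = math.log2(1 / p)  # information-theoretic lower bound
bloom = 1.44 * theoretical      # an optimally tuned Bloom filter
print(round(theoretical, 2), round(bloom, 2))  # 9.97 14.35
```

So at p = 0.1%, a tuned Bloom filter spends about 14.35 bits per k-mer against a floor of about 9.97, regardless of k.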

To understand just how good this memory efficiency is, one can derive that the theoretical filter storage of a k-mer takes log2(1/p) bits, and an optimally tuned Bloom filter takes 1.44 times that. So, pretty good!

Slide 15: Time and Space

http://i.kinja-img.com/gawker-media/image/upload/s--N4QNKzS5--/18lruniwftax7jpg.jpg

The other problem we have observed is that even if we can be memory efficient, we might not be able to complete the analysis fast enough.

Slide 16: Time efficiency

[Figure: seconds (log scale) vs. number of patterns (30-mers).]

https://github.com/jaybaird/python-bloomfilter
https://code.google.com/p/pysuffix/

Here, I concocted a few simple experiments. This figure compares a Bloom filter solution and a suffix array method against a naïve solution for finding repeated 30-mers.
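A rough reconstruction of the Bloom-filter arm of such an experiment (my own sketch; the original code is not shown). A first sighting goes into the filter; a later "maybe here" answer flags a probable repeat. A plain Python set stands in for the filter here so the example is self-contained:

```python
def probable_repeats(seq, k, filt=None):
    """Flag k-mers seen more than once; filt may be any object with
    `add` and `in` support (e.g. a Bloom filter), default a set."""
    seen = filt if filt is not None else set()
    repeats = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:        # "maybe here" -> probable repeat
            repeats.add(kmer)
        else:                   # "definitely not here" -> first sighting
            seen.add(kmer)
    return repeats

print(sorted(probable_repeats("ATGATGATG", 3)))  # ['ATG', 'GAT', 'TGA']
```

With a Bloom filter in place of the set, a false positive can flag a unique k-mer as a repeat, but a true repeat is never missed.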

At least in this rough experiment, Bloom filters and suffix arrays are comparable in execution time, and both give two orders of magnitude of time savings over a naïve method.

Slide 17: Time

http://nbviewer.ipython.org/github/ged-lab/2013-khmer-counting/blob/master/notebook/khmer-counting.ipynb

[Chart labels: Bloom Filter, KMC, scTurtle.]

Great minds think alike, and I am just a little late to the game.

Here is a benchmark of a number of current k-mer counting methods.

Slide 18: Memory

http://nbviewer.ipython.org/github/ged-lab/2013-khmer-counting/blob/master/notebook/khmer-counting.ipynb

[Chart label: Bloom Filter.]

So Bloom filters are not the fastest, but they are the most memory efficient. So if you would like to tinker with k-mer counting on your own laptop, a Bloom filter might be your best choice.

