Date posted: 03-Jul-2015
Category: Data & Analytics
Uploaded by: open-world-forum
Abstract Algebra for Analytics
Sam BESSALAH
@samklr
What do we want?
• We want to build scalable systems.
• Preferably by leveraging distributed computing.
• A lot of analytics amounts to some form of counting or adding.
• Example : Finding TopK Elements
Read Input
Sort, Filter and take top K records
Write Output
Input: 11, 12, 0, 3, 56, 48 | K = 3 | Output: 56, 48, 12
Hadoop Map-Reduce
In Scalding
Problems
• Curse of the last reducer
• Network chatter hinders performance
• Inefficient ordering of the map and reduce steps
• Multiple jobs, with a sync barrier at the reducer
But in Scalding, « sortWithTake » uses a Priority Queue:
• Can be empty
• Two priority queues can be added in any order
• Associative + Commutative
PQ1: 55, 45, 21, 3 | PQ2: 100, 80, 40, 3 | K = 4
PQ1 (+) PQ2: 100, 80, 55, 45
In a single Pass
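The priority-queue merge above can be sketched as a monoid (illustrative only, not Scalding's actual implementation, which is backed by Algebird's priority-queue monoid):

```scala
// A bounded top-K list as a commutative monoid: plus merges two "queues"
// and keeps only the K largest elements, so partial results computed on
// different machines can be combined in any order.
object TopKMonoid {
  val K = 4
  val zero: List[Int] = Nil // the empty queue is the identity
  def plus(a: List[Int], b: List[Int]): List[Int] =
    (a ++ b).sorted(Ordering[Int].reverse).take(K)
}

// The PQ1 (+) PQ2 example from the slide:
// TopKMonoid.plus(List(55, 45, 21, 3), List(100, 80, 40, 3))
// == List(100, 80, 55, 45)
```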
Why is it better and faster?
Associativity allows parallelism
Do we have data structures that are intrinsically parallelizable?
Abstract Algebra Redux
• Semigroup
An associative operation (grouping doesn't matter)
• Monoid
A semigroup with a zero (zeros get ignored)
• Group
A monoid with inverses
• Abelian Group
A commutative operation (ordering doesn't matter)
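A minimal Scala encoding of this hierarchy (hypothetical trait names, though Algebird's traits look very similar):

```scala
// Each level adds one law on top of the previous one.
trait Semigroup[T] { def plus(a: T, b: T): T }            // plus is associative
trait Monoid[T] extends Semigroup[T] { def zero: T }      // plus(zero, x) == x
trait Group[T] extends Monoid[T] { def negate(t: T): T }  // plus(x, negate(x)) == zero

// Integer addition satisfies all of the above, and is also commutative,
// so it forms an abelian group:
object IntAddition extends Group[Int] {
  def plus(a: Int, b: Int): Int = a + b
  def zero: Int = 0
  def negate(t: Int): Int = -t
}
```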
Stream mining challenges
• Update predictions after every observation
• Single pass : can’t read old data or replay the stream
• Limited time for computation per observation
• Sublinear, o(n), memory: you can't store the whole stream
Existing solutions
• Knuth’s Reservoir Sampling works on evolving stream of data and in fixed memory.
• Stream subsampling
• Adaptive sliding windows: build decision trees on these windows, e.g. Hoeffding Trees
• Use time series analysis methods …
• Etc
Approximate algorithms for stream analytics
Idea : Hash, don’t Sample
Bloom filters
• Approximate data structure for set membership
• Like an approximate set
BloomFilter.contains(x) => Maybe | NO
P(False Positive) > 0
P(False Negative) = 0
• Bit Array of fixed size
add(x): for each hash function i, set b[h(x,i)] = 1
contains(x): TRUE if b[h(x,i)] == 1 for all i
• Bloom Filters
Adding an element uses a boolean OR
Querying uses a boolean AND
Both are Monoids
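A toy version in Scala (illustrative only; real implementations such as Algebird's BloomFilter size the bit array and the number of hash functions for a target false-positive rate):

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter: the bit array is modelled as the Set of set-bit indices.
case class Bloom(bits: Set[Int], width: Int = 64, numHashes: Int = 3) {
  private def indices(x: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      ((MurmurHash3.stringHash(x, seed) % width) + width) % width
    }
  def add(x: String): Bloom = copy(bits = bits ++ indices(x)) // boolean OR
  def contains(x: String): Boolean = indices(x).forall(bits)  // boolean AND
  def plus(other: Bloom): Bloom =                             // monoid plus:
    copy(bits = bits ++ other.bits)                           // bitwise OR (union)
}
```

Because plus is just set union, filters built on different machines merge in any order, and the empty filter `Bloom(Set.empty)` is the zero.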
HyperLogLog
Intuition
• Long runs of trailing 0s in a random bit string are rare
• But the more bit strings you look at, the more likely you are to find a long one
• The longest run of trailing 0-bits seen is an estimator of the number of unique bit strings observed
HyperLogLog
• Popular sketch for cardinality estimation
HLL.size = Approx[Number]
We know the distribution of the error.
http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
• HyperLogLog
Adding an element uses MAX, which is a
monoid (an ordered semigroup, really...)
Querying uses a harmonic sum: also a monoid.
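The trailing-zeros intuition can be sketched as a crude single-register estimator (not real HyperLogLog, which keeps many registers and combines them with a harmonic mean):

```scala
import scala.util.hashing.MurmurHash3

// Crude cardinality estimator: if the longest run of trailing zero bits
// seen among the hashed items is r, roughly 2^r distinct items were seen.
def trailingZeros(h: Int): Int = java.lang.Integer.numberOfTrailingZeros(h)

def crudeCardinality(items: Seq[String]): Long = {
  // MAX over the observed runs -- associative and commutative, so partial
  // sketches from different workers merge in any order.
  val maxRun = items.map(x => trailingZeros(MurmurHash3.stringHash(x))).max
  1L << maxRun
}
```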
Min Hash
• Estimates how similar two sets are.
• Essentially amounts to
|A ∩ B| / |A ∪ B|
• Jaccard Similarity
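A toy MinHash in Scala (illustrative parameters; the key fact is that the probability two sets share a minimum hash equals their Jaccard similarity):

```scala
import scala.util.hashing.MurmurHash3

// Signature: the minimum hash of the set under k seeded hash functions.
// Signatures merge with element-wise MIN, which is associative.
def minHashSig(set: Set[String], k: Int = 128): Vector[Int] =
  Vector.tabulate(k)(seed => set.map(x => MurmurHash3.stringHash(x, seed)).min)

// The fraction of agreeing positions estimates |A ∩ B| / |A ∪ B|.
def estimateJaccard(a: Set[String], b: Set[String], k: Int = 128): Double = {
  val (sa, sb) = (minHashSig(a, k), minHashSig(b, k))
  sa.zip(sb).count { case (x, y) => x == y }.toDouble / k
}
```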
Count min Sketch
Gives an approximation of the number of occurrences of an element in a set.
• Count min sketch
Adding an element is a numerical addition
Querying uses a MIN function.
Both are associative.
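A toy count-min sketch in Scala (fixed depth and width for illustration; real implementations derive these from the desired error bound):

```scala
import scala.util.hashing.MurmurHash3

// table(row)(col) holds counters; each row uses its own seeded hash function.
case class CMS(table: Vector[Vector[Long]]) {
  val depth: Int = table.length
  val width: Int = table.head.length
  private def col(x: String, row: Int): Int =
    ((MurmurHash3.stringHash(x, row) % width) + width) % width
  def add(x: String): CMS =                       // numeric addition per row
    CMS(Vector.tabulate(depth, width) { (r, c) =>
      table(r)(c) + (if (c == col(x, r)) 1L else 0L)
    })
  def count(x: String): Long =                    // MIN across rows
    (0 until depth).map(r => table(r)(col(x, r))).min
  def plus(other: CMS): CMS =                     // monoid plus: element-wise +
    CMS(Vector.tabulate(depth, width) { (r, c) => table(r)(c) + other.table(r)(c) })
}
object CMS { def empty: CMS = CMS(Vector.fill(4, 64)(0L)) }
```

Note that count-min only over-counts (hash collisions inflate counters); it never under-counts.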
Anomaly Detection
- Online Summarizer : Approximate data structure to find quantiles in a continuous stream of data.
- Many exist : Q-Tree, Q-Digest, T-Digest
- All of those are associative.
- Another neat thing: types your data uniformly.
Many more sketches and tricks
• FM Counters, KMV
• Histograms
• Ball Sketches : streaming k-means, clustering
• SGD : fit online machine learning algorithms
Algebird
Conclusion
• Hashed data structures can be resolved to the usual data structures (Set, Map, etc.), which are easier for developers to reason about
• As data size grows, sampling becomes painful; hashing provides a more cost-effective solution
• Abstract algebra with sketched data is a no-brainer, and guarantees bounded error and better scalability of analytics systems.
http://speakerdeck.com/samklr
DON’T BE SCARED ANYMORE.
Bibliography
• Great intro into Algebird
http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
• Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
• Probabilistic data structures for web analytics.
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
Algebird : github.com/twitter/algebird
Algebra for analytics https://speakerdeck.com/johnynek/algebra-for-analytics
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf