Date post: | 02-Jul-2015 |
Category: |
Software |
Upload: | samir-bessalah |
Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr
Algebird
Abstract Algebra for
Analytics
Sam BESSALAH
@samklr
Abstract Algebra
From Wikipedia
Algebraic Structure
"A set of values, coupled with one or more finite operations, and a set of laws those operations must obey."
e.g. Sum, Magma, Semigroup, Group, Monoid, Abelian Group, Semilattice, Ring, Monad, etc.
Semigroup
Semigroup Law:
(x <> y) <> z = x <> (y <> z)   (associativity)
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}
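A minimal sketch of the trait above in use, assuming only the Scala standard library. The `maxSemigroup` instance and the `reduceAll` helper are illustrative names, not Algebird's API.

```scala
// Mirrors the slide's trait: `aggregate` plays the role of `<>`.
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

object SemigroupExample {
  // Hypothetical instance: max over BigInt is associative,
  // but unbounded integers have no identity element for max.
  val maxSemigroup: Semigroup[BigInt] = new Semigroup[BigInt] {
    def aggregate(x: BigInt, y: BigInt): BigInt = x.max(y)
  }

  // Reduces a non-empty collection; grouping does not matter thanks to associativity.
  def reduceAll[T](xs: Seq[T], sg: Semigroup[T]): T =
    xs.reduce((a, b) => sg.aggregate(a, b))
}
```

Without an identity element, `reduceAll` has nothing to return for an empty collection, which is the gap a Monoid fills.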
Monoids
Monoid Laws:
(x <> y) <> z = x <> (y <> z)   (associativity)
identity <> x = x
x <> identity = x   (identity / zero)
trait Monoid[T] {
  def identity: T
  def aggregate(x: T, y: T): T
}
// equivalently, inheriting aggregate from Semigroup:
trait Monoid[T] extends Semigroup[T] {
  def identity: T
}
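A small sketch of why the identity matters, assuming plain Scala and the traits from the slides. The instances `intAddition` and `setUnion` and the helper `sumAll` are illustrative names.

```scala
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

trait Monoid[T] extends Semigroup[T] {
  def identity: T
}

object MonoidExample {
  // Int addition: identity is 0.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
  }

  // Set union: identity is the empty set.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def identity: Set[A] = Set.empty[A]
    def aggregate(x: Set[A], y: Set[A]): Set[A] = x.union(y)
  }

  // Folding from the identity gives a well-defined result even for empty input.
  def sumAll[T](xs: Seq[T], m: Monoid[T]): T =
    xs.foldLeft(m.identity)((a, b) => m.aggregate(a, b))
}
```

The same `sumAll` works for any type with a Monoid instance, which is what lets one aggregation routine serve counters, sets and sketches alike.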
Groups
Group Laws:
(x <> y) <> z = x <> (y <> z)   (associativity)
identity <> x = x
x <> identity = x   (identity)
x <> inverse x = identity
inverse x <> x = identity   (invertibility)
trait Group[T] extends Monoid[T] {
  def inverse(v: T): T
}
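One practical reading of invertibility is that a value already folded into an aggregate can be taken out again. A minimal sketch in plain Scala; `intAddition` and `remove` are illustrative names.

```scala
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

trait Monoid[T] extends Semigroup[T] {
  def identity: T
}

trait Group[T] extends Monoid[T] {
  def inverse(v: T): T
}

object GroupExample {
  // Int addition forms a group: the inverse of x is -x.
  val intAddition: Group[Int] = new Group[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
    def inverse(v: Int): Int = -v
  }

  // Undo a previous aggregation by combining with the inverse.
  def remove[T](total: T, x: T, g: Group[T]): T =
    g.aggregate(total, g.inverse(x))
}
```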
Many More
- Abelian groups (commutative groups)
- Rings
- Semilattices
- Ordered semigroups
- Fields ...
Many of those are in Algebird ...
Examples
- (a min b) min c = a min (b min c), with Int
- a max (b max c) = (a max b) max c
- a or (b or c) = (a or b) or c
- a and (b and c) = (a and b) and c
- Int addition
- Set union
- Harmonic sum
- Integer mean
- Priority queue
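The mean is the least obvious entry in the list, since averaging two averages is not associative. A common way around this, sketched here in plain Scala with illustrative names (`Avg`, `of`, `aggregate`), is to aggregate a (sum, count) pair and divide only at the end.

```scala
object MeanExample {
  // Carry sum and count together; the pair is what gets aggregated.
  final case class Avg(sum: Long, count: Long) {
    def mean: Double = if (count == 0L) 0.0 else sum.toDouble / count
  }

  val identity: Avg = Avg(0L, 0L)

  def of(x: Long): Avg = Avg(x, 1L)

  // Component-wise addition is associative and commutative.
  def aggregate(a: Avg, b: Avg): Avg = Avg(a.sum + b.sum, a.count + b.count)
}
```

Usage: `Seq(1L, 2L, 3L, 6L).map(MeanExample.of).foldLeft(MeanExample.identity)(MeanExample.aggregate).mean` yields the mean of the four values, whatever the grouping of the partial aggregates.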
Why do we need these algebraic structures?
We want to :
- Build scalable analytics systems
- Leverage distributed computing to perform aggregation on really large data sets.
- A lot of operations in analytics are just sorting and counting at the end of the day
Distributed Computing → Parallelism
Associativity enables parallelism
Identity means we can ignore some data
Commutativity helps us ignore order
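A rough illustration of the first two points, in plain Scala and run sequentially for simplicity. Each chunk stands in for a partition that could be reduced on a different machine; `chunkedReduce` is a hypothetical helper, not a Scalding or Algebird function.

```scala
object ChunkedReduce {
  // Reduce each chunk independently, then combine the partial results.
  // Associativity guarantees the same answer as a single left-to-right pass;
  // `zero` (the identity) covers empty chunks and empty input.
  def chunkedReduce[T](xs: Seq[T], chunkSize: Int, zero: T)(op: (T, T) => T): T = {
    require(chunkSize > 0, "chunkSize must be positive")
    xs.grouped(chunkSize)
      .map(chunk => chunk.foldLeft(zero)(op))
      .foldLeft(zero)(op)
  }
}
```

Because the chunks are combined in their original order here, associativity alone is enough (string concatenation works). Commutativity becomes necessary once partial results may arrive in any order.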
Typical Map Reduce ...
Finding Top-K Elements in Scalding ...
class TopKJob(args: Args) extends Job(args) {
  Tsv(args("input"), visitScheme)
    .filter( ... )
    .leftJoinWithTiny( ... )
    .filter( ... )
    .groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }
    .write(Tsv( ... ))
}
.sortWithTake( ... )
Looking into .sortWithTake in Scalding, there's one nice thing:
class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]
Let's take a look:
PQ1: 55, 45, 21, 3
PQ2: 100, 80, 40, 3
top-4 (PQ1 U PQ2): 100, 80, 55, 45
Priority Queue:
- Can be empty
- Two Priority Queues can be "added" in any order
- Associative + Commutative
Makes Scalding go fast, by doing sorting, filtering and extracting in one single "map" step.
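The PQ1/PQ2 example can be reproduced with a simplified stand-in for the priority queue monoid. This sketch is plain Scala and is not Algebird's implementation: it models a bounded queue as a descending sorted `List` of at most `max` elements, and `TopKMonoid` here is an illustrative class name.

```scala
object TopKSketch {
  // A bounded "priority queue" modelled as a descending list of at most `max` items.
  final class TopKMonoid[T](max: Int)(implicit ord: Ordering[T]) {
    // The empty queue is the identity.
    val identity: List[T] = Nil

    // Merge two queues and keep only the `max` largest elements.
    def aggregate(a: List[T], b: List[T]): List[T] =
      (a ++ b).sorted(ord.reverse).take(max)
  }
}
```

Usage: with `val m = new TopKSketch.TopKMonoid[Int](4)`, `m.aggregate(List(55, 45, 21, 3), List(100, 80, 40, 3))` gives `List(100, 80, 55, 45)`, and swapping the arguments gives the same result.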
Stream Mining Challenges
- Update predictions after each observation
- Single pass: can't read old data or replay the stream
- Full size of the stream often unknown
- Limited time for computation per observation
- O(1) memory size
http://radar.oreilly.com/2013/10/stream-mining-essentials.html
Tradeoff: space and speed over accuracy.
Use sketches.
Sketches
Probabilistic data structures that store a summary (mostly hashed) of a data set that would be costly to store in its entirety, thus providing, most of the time, sublinear algorithmic properties.
E.g. Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes
Bloom filters
Approximate data structure for set membership
Behaves like an approximate set
BloomFilter.contains(x) => NO | Maybe
P(False Positive) > 0
P(False Negative) = 0
Internally:
Bit array of fixed size
add(x): for each hash function i, set b[h(x, i)] = 1 (Boolean OR => associative)
contains(x): TRUE if b[h(x, i)] == 1 for all i (Boolean AND => associative)
Both are associative => BF can be designed as a Monoid
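The internals described above can be sketched in a few lines of plain Scala. This is a toy version under stated assumptions, not Algebird's `BloomFilterMonoid`: the `Bloom` case class, its `merge` method and the use of `MurmurHash3.stringHash` with the hash index as seed are all choices made for this illustration.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  final case class Bloom(numHashes: Int, width: Int, bits: BitSet) {
    // One bit position per hash function; the hash index doubles as the seed.
    private def indexes(item: String): Seq[Int] =
      (0 until numHashes).map { i =>
        val h = MurmurHash3.stringHash(item, i)
        ((h % width) + width) % width // map to a non-negative position
      }

    // add: set every selected bit to 1.
    def add(item: String): Bloom =
      copy(bits = indexes(item).foldLeft(bits)((b, i) => b + i))

    // contains: true only if all selected bits are set (may be a false positive).
    def contains(item: String): Boolean =
      indexes(item).forall(i => bits.contains(i))

    // merge: bitwise OR of the two bit arrays, the monoid operation.
    def merge(other: Bloom): Bloom = {
      require(numHashes == other.numHashes && width == other.width, "incompatible filters")
      copy(bits = bits | other.bits)
    }
  }

  // The empty filter is the identity for merge.
  def empty(numHashes: Int, width: Int): Bloom = Bloom(numHashes, width, BitSet.empty)
}
```

Since `merge` is just a bitwise OR, filters built on separate partitions can be combined in any order and grouping, and the result equals the filter built over all the data at once.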
Bloom filters
import com.twitter.algebird._
import com.twitter.algebird.Operators._
// generate a list
val A = (1 to 300).toList
// Generate a Bloom filter
val NUM_HASHES = 6
val WIDTH = 6000 // bits
val SEED = 1
implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
// approximate set with Bloom filter
val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)
val approxBool = A_bf.contains("150") // ---> ApproximateBoolean(true, 0.9995…)
Count Min Sketch
Gives an approximation of the number of occurrences of an element in a data set.
Adding an element is a numerical addition.
Querying uses a MIN function.
Both are associative.
Useful for detecting heavy hitters, top-K, LSH.
Algebird provides this as CountMinSketchMonoid.
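Separately from Algebird's own implementation, the mechanics can be sketched in plain Scala. The `CMS` case class, its method names and the hashing scheme (`MurmurHash3.stringHash` seeded with the row index) are assumptions made for this illustration.

```scala
import scala.util.hashing.MurmurHash3

object CountMinSketch {
  // `depth` rows of `width` counters; each row uses a differently seeded hash.
  final case class CMS(depth: Int, width: Int, table: Vector[Vector[Long]]) {
    private def column(item: String, row: Int): Int = {
      val h = MurmurHash3.stringHash(item, row)
      ((h % width) + width) % width
    }

    // Adding: increment one counter per row (numerical addition).
    def add(item: String, count: Long = 1L): CMS =
      copy(table = table.zipWithIndex.map { case (cells, row) =>
        val c = column(item, row)
        cells.updated(c, cells(c) + count)
      })

    // Querying: the minimum over the rows; collisions can only inflate a counter,
    // so the estimate never falls below the true count.
    def estimate(item: String): Long =
      (0 until depth).map(row => table(row)(column(item, row))).min

    // Merging: cell-wise addition, which is associative and commutative.
    def merge(other: CMS): CMS = {
      require(depth == other.depth && width == other.width, "incompatible sketches")
      copy(table = table.zip(other.table).map { case (a, b) =>
        a.zip(b).map { case (x, y) => x + y }
      })
    }
  }

  // All-zero table: the identity for merge. Assumes depth > 0 and width > 0.
  def empty(depth: Int, width: Int): CMS =
    CMS(depth, width, Vector.fill(depth, width)(0L))
}
```

Merging the sketches of two partitions gives exactly the sketch of the combined stream, which is what makes the structure usable as a Monoid.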
HyperLogLog
Popular sketch for cardinality estimation. Gives the number of distinct values in a data set, within a probabilistic error bound.
HLL.size = Approx[Number]
Intuition:
- Long runs of trailing 0s in a random bit chain are rare
- But the more bit chains you look at, the more likely you are to find a long one
- The longest run of trailing 0-bits seen can be an estimator of the number of unique bit chains observed
Adding an element uses a Max and a Sum function. Both are associative and Monoids. (Max is really an ordered semigroup in Algebird.)
Querying the cardinality uses a harmonic mean, which is a Monoid.
Algebird provides this as HyperLogLogMonoid.
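Separately from Algebird's own implementation, here is a stripped-down sketch of the idea in plain Scala. It is an assumption-laden toy: a 32-bit `MurmurHash3` hash, the raw HyperLogLog estimator without the small-range and large-range corrections, and a bias constant that is only reasonable for 128 registers or more. `HLL`, `merge` and `estimate` are illustrative names.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  // 2^bits registers, each holding the largest "rank" seen for its bucket.
  final case class HLL(bits: Int, registers: Vector[Int]) {
    private val m: Int = 1 << bits

    def add(item: String): HLL = {
      val h = MurmurHash3.stringHash(item, 0)
      val idx = h & (m - 1)   // low bits choose the register
      val rest = h >>> bits   // remaining bits feed the zero-run count
      // rank = trailing zeros + 1, capped at the number of remaining bits + 1
      val rank = math.min(Integer.numberOfTrailingZeros(rest), 32 - bits) + 1
      copy(registers = registers.updated(idx, math.max(registers(idx), rank)))
    }

    // Merging: register-wise max, which is associative, commutative and idempotent.
    def merge(other: HLL): HLL = {
      require(bits == other.bits, "incompatible sketches")
      copy(registers = registers.zip(other.registers).map { case (x, y) => math.max(x, y) })
    }

    // Raw estimate: bias constant times m^2 over the sum of 2^(-register),
    // i.e. a scaled harmonic mean of the per-register estimates.
    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / m)
      alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    }
  }

  // All-zero registers: the identity for merge.
  def empty(bits: Int): HLL = HLL(bits, Vector.fill(1 << bits)(0))
}
```

Because merging is a register-wise max, sketching two partitions separately and merging gives the same registers as sketching the whole stream, and merging a sketch with itself changes nothing.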
Many more juicy sketches ...
- MinHashes to compute Jaccard similarity
- QTree for quantile estimation. Neat for anomaly detection.
- SpaceSaverMonoid, awesome to find the approximate most frequent and top-K elements.
- TopKMonoid
- SGD, PriorityQueues, Histograms, etc.
Summingbird: Lambda in a box
Heard of the Lambda Architecture?
Summingbird
Same code for both batch and real-time processing.
But works only on Monoids.
Uses Storehaus as a mergeable store layer.
http://github.com/twitter/algebird
These slides:
http://bit.ly/1szncAZ
http://slidesha.re/1zhhXKU
Links
- Algebra for Analytics by Oscar Boykin (creator of Algebird): http://speakerdeck.com/johnynek/algebra-for-analytics
- Take a look into HLearn: https://github.com/mikeizbicki/HLearn
- Great intro into Algebird by Michael Noll: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
- Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
- Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf