Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr

Algebird

Abstract Algebra for Analytics

Sam BESSALAH

@samklr





Abstract Algebra


From WikiPedia



Algebraic Structure

"Set of values, coupled with one or more finite operations, and a set of laws those operations must obey."

e.g. Sum, Magma, Semigroup, Groups, Monoid, Abelian Group, Semi Lattices, Rings, Monads, etc.



Semigroup

Semigroup Law:

(x <> y) <> z = x <> (y <> z)   (associativity)

trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}


Monoids

Monoid Laws:

(x <> y) <> z = x <> (y <> z)   (associativity)

identity <> x = x
x <> identity = x   (identity)


trait Monoid[T] {
  def identity: T                // the identity element, also called "zero"
  def aggregate(x: T, y: T): T
}


Or, building on Semigroup:

trait Monoid[T] extends Semigroup[T] {
  def identity: T
}
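To make the traits concrete, here is a minimal sketch of two monoid instances and a generic fold over them. The names `MonoidDemo`, `intAddition`, `setUnion` and `combineAll` are made up for illustration; they mirror the traits on the slides and are not Algebird's API.

```scala
// Hypothetical traits mirroring the slides (not Algebird's own definitions).
trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

trait Monoid[T] extends Semigroup[T] {
  def identity: T
}

object MonoidDemo {
  // Integer addition: the identity is 0.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
  }

  // Set union: the identity is the empty set.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def identity: Set[A] = Set.empty[A]
    def aggregate(x: Set[A], y: Set[A]): Set[A] = x union y
  }

  // Fold any sequence with a monoid; an empty sequence yields the identity.
  def combineAll[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.identity)(m.aggregate)

  val total: Int = combineAll(Seq(1, 2, 3, 4))(intAddition)          // 10
  val merged: Set[String] =
    combineAll(Seq(Set("a"), Set("b", "a")))(setUnion[String])       // Set("a", "b")
  val empty: Int = combineAll(Seq.empty[Int])(intAddition)           // 0
}
```

The same `combineAll` works for every instance, which is the point of abstracting over the structure rather than over the data type.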


Groups

Group Laws:

(x <> y) <> z = x <> (y <> z)   (associativity)

identity <> x = x
x <> identity = x   (identity)

x <> inverse x = identity
inverse x <> x = identity   (invertibility)


trait Group[T] extends Monoid[T] {
  def inverse(v: T): T
}
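One practical use of the inverse is retracting a value that was already aggregated, for example when an observation leaves a sliding window. The sketch below assumes a flattened `Group` trait (the hierarchy is collapsed to keep the block self-contained) and hypothetical names `GroupDemo`, `windowTotal` and `afterExpiry`.

```scala
// Hypothetical, flattened version of the Group trait from the slides.
trait Group[T] {
  def identity: T
  def aggregate(x: T, y: T): T
  def inverse(v: T): T
}

object GroupDemo {
  // Integers under addition form a group: the inverse of v is -v.
  val intAddition: Group[Int] = new Group[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
    def inverse(v: Int): Int = -v
  }

  // Aggregate a window of values.
  val windowTotal: Int =
    Seq(5, 7, 9).foldLeft(intAddition.identity)(intAddition.aggregate)   // 21

  // Retract the expired value 5 by aggregating its inverse.
  val afterExpiry: Int =
    intAddition.aggregate(windowTotal, intAddition.inverse(5))           // 16
}
```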


Many More

- Abelian groups (commutative groups)
- Rings
- Semi Lattices
- Ordered Semigroups
- Fields ...

Many of those are in Algebird ...


Examples

- (a min b) min c = a min (b min c) with Int
- a max (b max c) = (a max b) max c
- a or (b or c) = (a or b) or c
- a and (b and c) = (a and b) and c
- Int addition
- Set union
- Harmonic sum
- Integer mean
- Priority queue
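The associativity claims above can be spot-checked by brute force over a handful of sample values. This is only a sanity check on samples, not a proof; `LawCheck` and `isAssociative` are hypothetical helper names.

```scala
object LawCheck {
  // True when op is associative on every triple drawn from the sample values.
  def isAssociative[T](samples: Seq[T])(op: (T, T) => T): Boolean =
    samples.forall { a =>
      samples.forall { b =>
        samples.forall { c => op(op(a, b), c) == op(a, op(b, c)) }
      }
    }

  val ints: Seq[Int] = Seq(-3, 0, 1, 7, 42)
  val sets: Seq[Set[Int]] = Seq(Set.empty[Int], Set(1), Set(2, 3))

  val minOk: Boolean = isAssociative(ints)((a, b) => a min b)   // true
  val maxOk: Boolean = isAssociative(ints)((a, b) => a max b)   // true
  val plusOk: Boolean = isAssociative(ints)(_ + _)              // true
  val unionOk: Boolean = isAssociative(sets)(_ union _)         // true

  // Counter-example: subtraction is not associative, e.g. (0 - 0) - 1 != 0 - (0 - 1).
  val minusOk: Boolean = isAssociative(ints)(_ - _)             // false
}
```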



Why do we need those algebraic structures ?


We want to :

- Build scalable analytics systems

- Leverage distributed computing to perform aggregation on really large data sets.

- A lot of operations in analytics are just sorting and counting at the end of the day


Distributed Computing → Parallelism







Associativity enables parallelism

Identity means we can ignore some data

Commutativity helps us ignore order
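The three properties can be illustrated with a plain-Scala sketch in which each chunk of a collection stands in for a partition processed on a separate machine. The names and the chunk size are arbitrary assumptions.

```scala
object ChunkedSum {
  val data: Vector[Int] = (1 to 1000).toVector

  // Sequential baseline: a single left-to-right pass.
  val sequential: Int = data.foldLeft(0)(_ + _)

  // Associativity: reduce each chunk independently, then combine the partial results.
  val partials: Vector[Int] = data.grouped(100).map(_.foldLeft(0)(_ + _)).toVector
  val combined: Int = partials.foldLeft(0)(_ + _)

  // Commutativity: partial results may arrive in any order.
  val shuffled: Int = partials.reverse.foldLeft(0)(_ + _)

  // Identity: an empty partition reduces to 0 and changes nothing.
  val emptyPartition: Int = Vector.empty[Int].foldLeft(0)(_ + _)
  val withEmpty: Int = (partials :+ emptyPartition).foldLeft(0)(_ + _)
}
```

All four values are equal (500500), whichever way the work is split.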


Typical Map Reduce ...


Finding Top-K Elements in Scalding ...

class TopKJob(args: Args) extends Job(args) {
  Tsv(args("input"), visitScheme)
    .filter(…)
    .leftJoinWithTiny(…)
    .filter(…)
    .groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }
    .write(Tsv(...))
}


.sortWithTake( … )

Looking into .sortWithTake in Scalding, there’s one nice thing :

class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]





Let's take a look:
PQ1: 55, 45, 21, 3
PQ2: 100, 80, 40, 3
top-4 (PQ1 U PQ2): 100, 80, 55, 45

Priority Queue:
- Can be empty
- Two Priority Queues can be "added" in any order
- Associative + Commutative

Makes Scalding go fast, by doing sorting, filtering and extracting in one single "map" step.
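The PQ1/PQ2 example can be reproduced with a small sketch of a bounded top-k structure. It uses a sorted list instead of a real priority queue and is not Scalding's or Algebird's implementation; `TopK`, `empty` and `plus` are hypothetical names.

```scala
object TopKDemo {
  // A bounded "priority queue", kept as a descending list of at most k elements.
  final case class TopK(k: Int, items: List[Int])

  // The identity: an empty queue.
  def empty(k: Int): TopK = TopK(k, Nil)

  // "Adding" two queues: merge them and keep only the k largest elements.
  // Assumes both queues were built with the same k.
  def plus(a: TopK, b: TopK): TopK =
    TopK(a.k, (a.items ++ b.items).sorted(Ordering[Int].reverse).take(a.k))

  val pq1: TopK = TopK(4, List(55, 45, 21, 3))
  val pq2: TopK = TopK(4, List(100, 80, 40, 3))

  val merged: TopK = plus(pq1, pq2)   // items: List(100, 80, 55, 45)
}
```

Because `plus` is associative and commutative, each mapper can keep its own top-k and the partial queues can be merged in any order.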


Stream Mining Challenges

- Update predictions after each observation
- Single pass: can't read old data or replay the stream
- Full size of the stream often unknown
- Limited time for computation per observation
- O(1) memory size


http://radar.oreilly.com/2013/10/stream-mining-essentials.html



Tradeoff : Space and speed over accuracy.

Use sketches.


Sketches

Probabilistic data structures that store a summary (mostly hashed) of a data set that would be costly to store in its entirety, thus providing, most of the time, sublinear algorithmic properties.

E.g. Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes


Bloom filters

Approximate data structure for set membership.
Behaves like an approximate set.

BloomFilter.contains(x) => NO | Maybe
P(False Positive) > 0
P(False Negative) = 0

Internally: bit array of fixed size

add(x): for each hash function i, set b[h(x,i)] = 1 (Boolean OR => associative)
contains(x): TRUE if b[h(x,i)] == 1 for all i (Boolean AND => associative)

Both are associative => BF can be designed as a Monoid
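The description above can be sketched in plain Scala, with an immutable `BitSet` as the bit array and bitwise OR as the monoid operation. The width, the number of hashes, and the choice of seeding `MurmurHash3` with `i` to obtain `h(x, i)` are assumptions made for illustration; this is not Algebird's implementation.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  val Width = 1024     // number of bits (assumed)
  val NumHashes = 4    // number of hash functions (assumed)

  // h(x, i): the i-th hash function, obtained by seeding MurmurHash3 with i.
  def h(x: String, i: Int): Int = {
    val raw = MurmurHash3.stringHash(x, i)
    ((raw % Width) + Width) % Width   // map to a non-negative index
  }

  final case class Bloom(bits: BitSet) {
    // "Maybe" (true) only if every hashed position is set; otherwise a definite "no".
    def contains(x: String): Boolean =
      (0 until NumHashes).forall(i => bits(h(x, i)))
  }

  // Identity: the empty filter.
  val zero: Bloom = Bloom(BitSet.empty)

  // A filter holding a single element.
  def create(x: String): Bloom =
    Bloom(BitSet((0 until NumHashes).map(i => h(x, i)): _*))

  // Monoid operation: bitwise OR of the two bit arrays.
  def plus(a: Bloom, b: Bloom): Bloom = Bloom(a.bits | b.bits)

  val filter: Bloom = (1 to 100).map(i => create(i.toString)).foldLeft(zero)(plus)
}
```

Every inserted element is reported as present (no false negatives); elements that were never inserted may still be reported as present with some probability.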


Bloom filters

import com.twitter.algebird._
import com.twitter.algebird.Operators._

// generate a list
val A = (1 to 300).toList

// generate a Bloom filter
val NUM_HASHES = 6
val WIDTH = 6000 // bits
val SEED = 1
implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

// approximate set with a Bloom filter
val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)
val approxBool = A_bf.contains("150") // ---> ApproximateBoolean(true, 0.9995…)


Count Min Sketch

Gives an approximation of the number of occurrences of an element in a set.


Adding an element is a numerical addition.
Querying uses a MIN function.
Both are associative.

Useful for detecting heavy hitters, top-K, LSH.
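A minimal plain-Scala sketch of the idea follows: a small table of counters, one hash function per row, cell-wise addition to merge, and a minimum over rows to query. The dimensions and the seeded `MurmurHash3` hashing are assumptions, and the code is deliberately unoptimized; it is not Algebird's implementation.

```scala
import scala.util.hashing.MurmurHash3

object CmsSketch {
  val Depth = 4     // number of rows, one hash function per row (assumed)
  val Width = 256   // counters per row (assumed)

  // Column of x in a given row, using the row number as the hash seed.
  private def col(x: String, row: Int): Int = {
    val raw = MurmurHash3.stringHash(x, row)
    ((raw % Width) + Width) % Width
  }

  final case class Cms(table: Vector[Vector[Long]]) {
    // Query: minimum over the rows. Collisions can only inflate a counter,
    // so the result never underestimates the true count.
    def frequency(x: String): Long =
      (0 until Depth).map(r => table(r)(col(x, r))).min
  }

  // Identity: all counters at zero.
  val zero: Cms = Cms(Vector.fill(Depth, Width)(0L))

  // A sketch holding a single occurrence of x.
  def create(x: String): Cms =
    Cms(Vector.tabulate(Depth, Width)((r, c) => if (c == col(x, r)) 1L else 0L))

  // Monoid operation: cell-wise numerical addition.
  def plus(a: Cms, b: Cms): Cms =
    Cms(a.table.zip(b.table).map { case (ra, rb) =>
      ra.zip(rb).map { case (x, y) => x + y }
    })

  val words: Seq[String] = Seq("a", "b", "a", "c", "a", "b")
  val sketch: Cms = words.map(create).foldLeft(zero)(plus)
}
```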

We have in Algebird :


HyperLogLog

Popular sketch for cardinality estimation.
Estimates the number of distinct values in a data set, within a probabilistic error bound.

HLL.size = Approx[Number]

Intuition:
- Long runs of trailing 0s in a random bit chain are rare.
- But the more bit chains you look at, the more likely you are to find a long one.
- The longest run of trailing 0-bits seen can be an estimator of the number of unique bit chains observed.

Adding an element uses a Max and a Sum function. Both are associative and Monoids. (Max is really an ordered semigroup in Algebird.)

Querying uses a harmonic mean, which is a Monoid.
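The following is a simplified sketch of the mechanics: registers updated with a maximum, merged element-wise with a maximum, and queried with a harmonic-mean estimator. It assumes 1024 registers and a 32-bit `MurmurHash3` hash, and it omits the small-range and large-range corrections of the full algorithm, so estimates for small sets are biased. It is not Algebird's implementation.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  val IndexBits = 10
  val M: Int = 1 << IndexBits   // 1024 registers (assumed)

  final case class Hll(registers: Vector[Int]) {
    // Raw harmonic-mean estimator, without range corrections.
    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / M)
      val harmonicSum = registers.map(r => math.pow(2.0, -r)).sum
      alpha * M * M / harmonicSum
    }
  }

  // Identity: every register at 0.
  val zero: Hll = Hll(Vector.fill(M)(0))

  // A sketch holding a single element.
  def create(x: String): Hll = {
    val hash = MurmurHash3.stringHash(x)
    val index = hash & (M - 1)       // low bits pick a register
    val rest = hash >>> IndexBits    // remaining 22 bits
    // Rank = 1 + number of trailing zero bits in the remainder, capped at 23.
    val rank = math.min(Integer.numberOfTrailingZeros(rest), 32 - IndexBits) + 1
    Hll(zero.registers.updated(index, rank))
  }

  // Monoid operation: element-wise max of the registers.
  def plus(a: Hll, b: Hll): Hll =
    Hll(a.registers.zip(b.registers).map { case (x, y) => math.max(x, y) })
}
```

Since max is associative, commutative and idempotent, merging the sketches of two overlapping sets gives exactly the sketch of their union, which is what makes distinct counts mergeable across partitions.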

In Algebird :


Many More juicy sketches ...

- MinHashes to compute Jaccard similarity
- QTree for quantiles estimation. Neat for anomaly detection.
- SpaceSaverMonoid. Awesome to find the approximate most frequent and top K elements.
- TopKMonoid
- SGD, PriorityQueues, Histograms, etc.


SummingBird : Lambda in a box


Heard of Lambda Architecture ?



SummingBird

Same code, for both batch and real time processing.

But works only on Monoids.
Uses Storehaus as a mergeable store layer.
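Why the monoid restriction matters can be shown with a plain-Scala sketch: the same counting function runs over the batch data and over the recent events, and the two results are merged with the monoid operation. This is not Summingbird or Storehaus code; `LambdaMerge`, `count`, `plus` and the event names are hypothetical.

```scala
object LambdaMerge {
  type Counts = Map[String, Long]

  // Map monoid: merge two count maps by adding values key by key.
  def plus(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // One counting function, reused unchanged for both layers.
  def count(events: Seq[String]): Counts =
    events.map(e => Map(e -> 1L)).foldLeft(Map.empty[String, Long])(plus)

  val batchEvents: Seq[String] = Seq("home", "cart", "home")      // already processed offline
  val realtimeEvents: Seq[String] = Seq("home", "checkout")       // arrived since the last batch run

  // Serving layer: merge the batch view with the real-time view.
  val served: Counts = plus(count(batchEvents), count(realtimeEvents))

  // Same answer as recomputing everything from scratch.
  val recomputed: Counts = count(batchEvents ++ realtimeEvents)
}
```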


http://github.com/twitter/algebird




These slides :


http://bit.ly/1szncAZ

http://slidesha.re/1zhhXKU



Links

- Algebra for analytics, by Oscar Boykin (creator of Algebird): http://speakerdeck.com/johnynek/algebra-for-analytics
- Take a look into HLearn: https://github.com/mikeizbicki/HLearn
- Great intro into Algebird by Michael Noll: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
- Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
- Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
