Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Date post: 02-Jul-2015
Upload: samir-bessalah
Description:
Algebird; abstract algebra for analytics. Devoxx 2014. Antwerp. Belgium
Transcript
Page 1: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Room 4 #Devoxx #algebird #scalding #monoid #hadoop @samklr

Algebird

Abstract Algebra for Analytics

Sam BESSALAH

@samklr

Page 5: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Abstract Algebra

Page 6: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

From Wikipedia

Page 8: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Algebraic Structure

"A set of values, coupled with one or more finite operations, and a set of laws those operations must obey."

e.g. Sum, Magma, Semigroup, Group, Monoid, Abelian Group, Semilattice, Ring, Monad, etc.

Page 10: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Semigroup

Semigroup Law:

(x <> y) <> z = x <> (y <> z) (associativity)

trait Semigroup[T] {
  def aggregate(x: T, y: T): T
}

Page 12: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Monoids

Monoid Laws:

(x <> y) <> z = x <> (y <> z) (associativity)

identity <> x = x
x <> identity = x (identity / zero)

trait Monoid[T] {
  def identity: T
  def aggregate(x: T, y: T): T
}

Page 13: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Monoids

Monoid Laws:

(x <> y) <> z = x <> (y <> z) (associativity)

identity <> x = x
x <> identity = x

trait Monoid[T] extends Semigroup[T] {
  def identity: T
}
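
As a minimal sketch of how such a trait might be used, the block below restates the two traits from the slides and adds two illustrative instances. The names `MonoidSketch`, `IntAddition`, `StringConcat` and `combineAll` are assumptions made for this example; they are not Algebird's API.

```scala
object MonoidSketch {
  // Traits as defined on the slides.
  trait Semigroup[T] {
    def aggregate(x: T, y: T): T
  }

  trait Monoid[T] extends Semigroup[T] {
    def identity: T
  }

  // Illustrative instance: integers under addition, with 0 as identity.
  object IntAddition extends Monoid[Int] {
    val identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
  }

  // Illustrative instance: strings under concatenation, with "" as identity.
  object StringConcat extends Monoid[String] {
    val identity: String = ""
    def aggregate(x: String, y: String): String = x + y
  }

  // Folds a whole collection with any monoid; an empty input yields the identity.
  def combineAll[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.identity)(m.aggregate)
}
```

The identity element is what lets `combineAll` return a sensible value for an empty collection, which a plain semigroup cannot do.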

Page 14: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Groups

Group Laws:

(x <> y) <> z = x <> (y <> z) (associativity)

identity <> x = x
x <> identity = x (identity)

x <> inverse x = identity
inverse x <> x = identity (invertibility)

Page 15: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Groups

Group Laws:

(x <> y) <> z = x <> (y <> z)

identity <> x = x
x <> identity = x

x <> inverse x = identity
inverse x <> x = identity

trait Group[T] extends Monoid[T] {
  def inverse(v: T): T
}
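
To make the role of the inverse concrete, here is a small sketch with integers under addition. The traits are restated so the block stands alone; `GroupSketch`, `IntAdditionGroup` and `remove` are hypothetical names chosen for illustration.

```scala
object GroupSketch {
  // Traits as defined on the slides.
  trait Semigroup[T] { def aggregate(x: T, y: T): T }
  trait Monoid[T] extends Semigroup[T] { def identity: T }
  trait Group[T] extends Monoid[T] { def inverse(v: T): T }

  // Integers under addition form a group: the inverse of v is -v.
  object IntAdditionGroup extends Group[Int] {
    val identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
    def inverse(v: Int): Int = -v
  }

  // With an inverse, a value can be taken back out of a running aggregate
  // without recomputing the aggregate from scratch.
  def remove[T](total: T, v: T)(g: Group[T]): T =
    g.aggregate(total, g.inverse(v))
}
```

For example, `GroupSketch.remove(10, 3)(GroupSketch.IntAdditionGroup)` yields 7, which is the kind of "subtract a contribution" step that a monoid alone does not offer.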

Page 16: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Many More

- Abelian groups (commutative groups)
- Rings
- Semilattices
- Ordered semigroups
- Fields
- ...

Many of those are in Algebird.

Page 17: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Examples

- (a min b) min c = a min (b min c), with Int
- a max (b max c) = (a max b) max c
- a or (b or c) = (a or b) or c
- a and (b and c) = (a and b) and c
- Int addition
- Set union
- Harmonic sum
- Integer mean
- Priority queue
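
Several of the examples above can be checked mechanically. The sketch below tests associativity by brute force over a few sample values; the helper `isAssociative` and the sample data are assumptions made for this illustration. It also shows one common way to make a mean mergeable, by carrying a (sum, count) pair, which is an assumption about what "Integer mean" refers to.

```scala
object AssociativityExamples {
  // Checks (a op b) op c == a op (b op c) for every triple drawn from `samples`.
  def isAssociative[T](samples: Seq[T])(op: (T, T) => T): Boolean =
    samples.forall { a =>
      samples.forall { b =>
        samples.forall { c => op(op(a, b), c) == op(a, op(b, c)) }
      }
    }

  val ints: Seq[Int] = Seq(-3, 0, 1, 7, 42)
  val bools: Seq[Boolean] = Seq(true, false)
  val sets: Seq[Set[Int]] = Seq(Set.empty[Int], Set(1), Set(1, 2), Set(3))

  val minIsAssociative: Boolean = isAssociative(ints)((a, b) => a min b)
  val maxIsAssociative: Boolean = isAssociative(ints)((a, b) => a max b)
  val orIsAssociative: Boolean = isAssociative(bools)((a, b) => a || b)
  val andIsAssociative: Boolean = isAssociative(bools)((a, b) => a && b)
  val additionIsAssociative: Boolean = isAssociative(ints)((a, b) => a + b)
  val unionIsAssociative: Boolean = isAssociative(sets)((a, b) => a union b)

  // Counter-example: subtraction is not associative.
  val subtractionIsAssociative: Boolean = isAssociative(ints)((a, b) => a - b)

  // A mean becomes mergeable once it is carried as a (sum, count) pair.
  type Mean = (Long, Long)
  def mergeMeans(a: Mean, b: Mean): Mean = (a._1 + b._1, a._2 + b._2)
  def meanValue(m: Mean): Double = m._1.toDouble / m._2
}
```

A sample-based check like this can only refute associativity, not prove it, but it is a cheap sanity test before plugging an operation into a distributed aggregation.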

Page 19: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Why do we need those algebraic structures ?

Page 20: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

We want to :

- Build scalable analytics systems

- Leverage distributed computing to perform aggregation on really large data sets.

- A lot of operations in analytics are just sorting and counting at the end of the day

Page 24: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Distributed Computing → Parallelism

Associativity enables parallelism

Identity means we can ignore some data

Commutativity helps us ignore order
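
A small sketch of why associativity matters here: the input is cut into chunks, each chunk is reduced on its own (as separate workers could do), and the partial results are then combined. `ChunkedSum` and `chunkedReduce` are hypothetical names, and the chunks are processed sequentially in this example to keep it self-contained.

```scala
object ChunkedSum {
  // Reduces each chunk independently, then combines the partial results.
  // Associativity guarantees the same answer as a single sequential fold;
  // the identity (`zero`) gives a well-defined result for empty input.
  def chunkedReduce[T](xs: Seq[T], chunkSize: Int, zero: T)(op: (T, T) => T): T = {
    require(chunkSize > 0, "chunkSize must be positive")
    val partials = xs.grouped(chunkSize).map(chunk => chunk.foldLeft(zero)(op)).toSeq
    partials.foldLeft(zero)(op)
  }
}
```

Because the chunks are combined in their original order, this works even for operations that are associative but not commutative, such as string concatenation. Commutativity would additionally allow the partial results to arrive in any order.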

Page 25: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Typical Map Reduce ...

Page 26: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Finding Top-K Elements in Scalding ...

class TopKJob(args: Args) extends Job(args) {
  Tsv(args("input"), visitScheme)
    .filter(...)
    .leftJoinWithTiny(...)
    .filter(...)
    .groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }
    .write(Tsv(...))
}

Page 27: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

.sortWithTake( … )

Looking into .sortWithTake in Scalding, there's one nice thing:

class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]

Page 29: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]]

Let's take a look:

PQ1: 55, 45, 21, 3
PQ2: 100, 80, 40, 3
top-4 (PQ1 U PQ2): 100, 80, 55, 45

Priority Queue:
- Can be empty
- Two priority queues can be "added" in any order
- Associative + commutative

Makes Scalding go fast, by doing sorting, filtering and extracting in one single "map" step.
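
The merge of two bounded priority queues can be sketched with plain lists. This is a simplified stand-in written for illustration, not Algebird's `PriorityQueueMonoid`; the names `TopKSketch`, `TopK`, `zero` and `plus` are assumptions.

```scala
object TopKSketch {
  // A bounded "priority queue" kept as a descending list of at most k items.
  final class TopK[T](k: Int)(implicit ord: Ordering[T]) {
    // The empty queue is the identity element.
    val zero: List[T] = Nil

    // "Adding" two queues: pool the items, keep only the k largest.
    def plus(a: List[T], b: List[T]): List[T] =
      (a ++ b).sorted(ord.reverse).take(k)
  }

  // The example from the slide.
  val top4 = new TopK[Int](4)
  val pq1: List[Int] = List(55, 45, 21, 3)
  val pq2: List[Int] = List(100, 80, 40, 3)
  val merged: List[Int] = top4.plus(pq1, pq2) // List(100, 80, 55, 45)
}
```

Since each partial queue never holds more than k items, a mapper can emit its local top-k and the final result is still the global top-k.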

Page 30: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Stream Mining Challenges

- Update predictions after each observation
- Single pass: can't read old data or replay the stream
- Full size of the stream often unknown
- Limited time for computation per observation
- O(1) memory size

Page 31: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Stream Mining Challenges

http://radar.oreilly.com/2013/10/stream-mining-essentials.html

Page 33: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Tradeoff: space and speed over accuracy.

Use sketches.

Page 34: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Sketches

Probabilistic data structures that store a summary (mostly hashed) of a data set that would be costly to store in its entirety, thus providing, most of the time, sublinear algorithmic properties.

E.g. Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes

Page 35: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Bloom filters

Approximate data structure for set membership
Behaves like an approximate set

BloomFilter.contains(x) => No | Maybe
P(False Positive) > 0
P(False Negative) = 0

Page 36: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Internally: bit array of fixed size

add(x): for each hash function i, set b[h(x, i)] = 1
contains(x): TRUE if b[h(x, i)] == 1 for all i (Boolean AND => associative)

Both are associative => BF can be designed as a Monoid
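
The two operations above, plus the merge that makes the structure a monoid, can be sketched in a few lines. This is a hypothetical minimal Bloom filter, not Algebird's implementation: the names are invented for the example, and seeding the Scala standard library's `MurmurHash3` with the hash index is an assumption about how to obtain several hash functions.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // `bits` holds the positions of the bit array that are set to 1.
  final case class Bloom(numHashes: Int, width: Int, bits: BitSet) {
    // Position chosen by the i-th hash function (MurmurHash3 seeded with i).
    private def position(x: String, i: Int): Int =
      java.lang.Math.floorMod(MurmurHash3.stringHash(x, i), width)

    // add(x): set one bit per hash function.
    def add(x: String): Bloom =
      copy(bits = (0 until numHashes).foldLeft(bits)((acc, i) => acc + position(x, i)))

    // contains(x): true only if every hashed position is set (a Boolean AND over all i).
    def contains(x: String): Boolean =
      (0 until numHashes).forall(i => bits(position(x, i)))

    // Merging two filters of the same shape is a bitwise OR.
    def merge(other: Bloom): Bloom = {
      require(numHashes == other.numHashes && width == other.width)
      copy(bits = bits | other.bits)
    }
  }

  // The empty filter is the identity element for merge.
  def empty(numHashes: Int, width: Int): Bloom = Bloom(numHashes, width, BitSet.empty)
}
```

Because merge is a bitwise OR, filters built on separate partitions can be combined in any grouping and any order, with the empty filter as identity.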

Page 37: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Bloom filters

import com.twitter.algebird._
import com.twitter.algebird.Operators._

// generate a list
val A = (1 to 300).toList

// generate a Bloom filter
val NUM_HASHES = 6
val WIDTH = 6000 // bits
val SEED = 1
implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

// approximate set with a Bloom filter
val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)
val approxBool = A_bf.contains("150") // ---> ApproximateBoolean(true, 0.9995…)

Page 38: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Count Min Sketch

Gives an approximation of the number of occurrences of an element in a data set.

Page 39: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Count Min Sketch

- Adding an element is a numerical addition
- Querying uses a MIN function
- Both are associative

Useful for detecting heavy hitters, top-K, LSH

We have in Algebird:
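
Independently of Algebird's own implementation, the operations described on this slide can be sketched from scratch. The block below is a hypothetical minimal Count-Min Sketch: all names are invented for illustration, and using `MurmurHash3` seeded with the row index as the per-row hash function is an assumption.

```scala
import scala.util.hashing.MurmurHash3

object CountMinSketchExample {
  // `depth` rows of `width` counters; each row has its own hash function.
  final case class CMS(depth: Int, width: Int, table: Vector[Vector[Long]]) {
    private def column(x: String, row: Int): Int =
      java.lang.Math.floorMod(MurmurHash3.stringHash(x, row), width)

    // Adding an element increments one counter per row.
    def add(x: String, count: Long = 1L): CMS =
      copy(table = table.zipWithIndex.map { case (counters, row) =>
        val c = column(x, row)
        counters.updated(c, counters(c) + count)
      })

    // Querying takes the minimum across rows: collisions can only inflate a
    // counter, so the result may overestimate but never underestimates.
    def estimate(x: String): Long =
      table.zipWithIndex.map { case (counters, row) => counters(column(x, row)) }.min

    // Merging two sketches of the same shape is element-wise addition.
    def merge(other: CMS): CMS = {
      require(depth == other.depth && width == other.width)
      copy(table = table.zip(other.table).map { case (a, b) =>
        a.zip(b).map { case (x, y) => x + y }
      })
    }
  }

  // The all-zero sketch is the identity element for merge.
  def empty(depth: Int, width: Int): CMS =
    CMS(depth, width, Vector.fill(depth)(Vector.fill(width)(0L)))
}
```

Element-wise addition is associative and commutative, so per-partition sketches can be summed like any other monoid value.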

Page 40: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

HyperLogLog

Popular sketch for cardinality estimation. Gives the number of distinct values in a data set, within a probabilistic error bound.

HLL.size = Approx[Number]

Intuition

- Long runs of trailing 0s in a random bit chain are rare
- But the more bit chains you look at, the more likely you are to find a long one
- The longest run of trailing 0-bits seen can be an estimator of the number of unique bit chains observed
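
The intuition above can be turned into a deliberately crude, single-register estimator. This sketch is not HyperLogLog itself: it only tracks the longest run of trailing zero bits, and the names and the choice of `MurmurHash3` as the hash are assumptions made for illustration.

```scala
import scala.util.hashing.MurmurHash3

object TrailingZeros {
  // Hashes an item to 32 pseudo-random bits.
  def hash(x: String): Int = MurmurHash3.stringHash(x)

  // Longest run of trailing zero bits over all hashed items (0 for empty input).
  def maxTrailingZeros(items: Iterable[String]): Int =
    items.foldLeft(0)((best, x) => best max Integer.numberOfTrailingZeros(hash(x)))

  // A run of r trailing zeros suggests roughly 2^r distinct items were seen.
  def roughCardinality(items: Iterable[String]): Long =
    1L << maxTrailingZeros(items)
}
```

Duplicates hash to the same bits and therefore never change the maximum, which is why the estimate depends on distinct values only. A single register has very high variance, which is what the many registers of HyperLogLog are meant to reduce.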

Page 41: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Adding an element uses a Max and Sum function. Both are associative and Monoids. (Max is an ordered semigroup in Algebird, really.)

Querying the cardinality uses a harmonic mean, which is a Monoid.

In Algebird:
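
Separately from Algebird's implementation, the max-based update and the harmonic-mean query can be sketched as follows. This is a hypothetical, minimal HyperLogLog with m = 2^p registers: the names are invented, `MurmurHash3` is an assumed hash choice, only the raw estimate is computed (no small-range or large-range corrections), and the constant used for alpha is the usual approximation for m >= 128.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  final case class HLL(p: Int, registers: Vector[Int]) {
    private val m: Int = 1 << p

    // Adding an element keeps the max rank seen in the register picked by the hash.
    def add(x: String): HLL = {
      val h = MurmurHash3.stringHash(x)
      val index = h & (m - 1) // low p bits pick the register
      val rest = h >>> p      // remaining 32 - p bits
      // Rank = position of the first 1-bit in `rest`, capped when `rest` is 0.
      val rank = math.min(Integer.numberOfTrailingZeros(rest), 32 - p) + 1
      copy(registers = registers.updated(index, registers(index) max rank))
    }

    // Merging is an element-wise max, which is associative and commutative.
    def merge(other: HLL): HLL = {
      require(p == other.p)
      copy(registers = registers.zip(other.registers).map { case (a, b) => a max b })
    }

    // Raw estimate: alpha * m^2 / sum(2^-register), a normalized harmonic mean.
    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / m)
      alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    }
  }

  // The all-zero register array is the identity element for merge.
  def empty(p: Int): HLL = HLL(p, Vector.fill(1 << p)(0))
}
```

Since merging two sketches gives exactly the registers of a sketch built over the combined input, cardinalities of unions can be estimated without revisiting the raw data.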

Page 42: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Many More juicy sketches ...

- MinHashes to compute Jaccard similarity
- QTree for quantile estimation. Neat for anomaly detection.
- SpaceSaverMonoid, awesome for finding the approximate most frequent and top-K elements.
- TopKMonoid
- SGD, PriorityQueues, Histograms, etc.

Page 43: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

SummingBird: Lambda in a box

Page 44: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Heard of Lambda Architecture ?

Page 46: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

SummingBird

Same code for both batch and real-time processing.

But works only on Monoids.
Uses Storehaus as a mergeable store layer.

Page 47: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

http://github.com/twitter/algebird

Page 49: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

These slides:

http://bit.ly/1szncAZ
http://slidesha.re/1zhhXKU

Page 51: Algebird : Abstract Algebra for big data analytics. Devoxx 2014

Links

- Algebra for analytics, by Oscar Boykin (creator of Algebird): http://speakerdeck.com/johnynek/algebra-for-analytics
- Take a look into HLearn: https://github.com/mikeizbicki/HLearn
- Great intro into Algebird by Michael Noll: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
- Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
- Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

