
from Bloom filters to Data pipelines

Building Data applications with Go: from Bloom filters to Data pipelines. Sergii Khomenko, Data Scientist, @lc0d3r. FOSDEM, January 31, 2016
Transcript
Page 1: from Bloom filters to Data pipelines

Building Data applications with Go

01

from Bloom filters to Data pipelines

Sergii Khomenko, Data Scientist, @lc0d3r

FOSDEM - January 31, 2016

Page 2: from Bloom filters to Data pipelines

Sergii Khomenko

2

Data scientist at one of the biggest fashion communities, Stylight.

Data analysis and visualisation hobbyist who works on data problems not only at work but also in his free time, for fun and for personal data visualisations.

First encountered Go around 2010 and fell in love with its channels and core concepts.

Speaker at Berlin Buzzwords 2014, ApacheCon Europe 2014, Puppet Camp London, Berlin Buzzwords 2015, Tableau Conference on Tour, Budapest BI Forum 2015, Crunchconf 2015 and others.

Page 3: from Bloom filters to Data pipelines

3

Munich, Germany
Founded on Apr 5, 2014
Gophers: 323

Page 4: from Bloom filters to Data pipelines

4

Page 5: from Bloom filters to Data pipelines

5

https://www.pinterest.com/pin/38351034303708696/

Page 6: from Bloom filters to Data pipelines

Profitable Leads: Stylight provides its partners with high-quality leads, enabling partner shops to leverage Stylight as an ROI-positive traffic channel.

Inspiration: Stylight offers shoppable inspiration that makes it easy to know what to buy and how to style it.

Branding & Reach: Stylight offers a unique opportunity for brands to reach an audience that is actively looking for style online.

Shopping: Stylight helps users search and shop fashion and lifestyle products smarter across hundreds of shops.

6

Stylight – Make Style Happen

Core Target Group

Stylight helps aspiring women between 18 and 35 to evolve their style through shoppable inspiration.

Page 7: from Bloom filters to Data pipelines

Stylight – acting on a global scale

Page 8: from Bloom filters to Data pipelines

Experienced & Ambitious Team

Innovative cross-functional organisation with flat hierarchy builds a unique team spirit.

• 200+ employees
• 40 PhDs/Engineers
• 28 years average age
• 63% female
• 23 nationalities
• 0 suits

8

Page 9: from Bloom filters to Data pipelines

Agenda

9

Probabilistic data structures: Bloom filters, Count-Min or HyperLogLog

Open Source stack

Amazon AWS

Google Cloud

Serverless architecture

Page 10: from Bloom filters to Data pipelines

The Nature of Data

10

Page 11: from Bloom filters to Data pipelines

Sources of data:

11

• Web tracking
• Metrics tracking
• Behaviour tracking
• Business intelligence ETL
• Internal Services
• ML tagging service

Page 12: from Bloom filters to Data pipelines

Access patterns

12

• Real-time
• Nearly real-time
• Daily batches

Page 13: from Bloom filters to Data pipelines

Probabilistic data structures

13

Page 14: from Bloom filters to Data pipelines

14

Data structures that use different probabilistic approaches to compactly store data

Page 15: from Bloom filters to Data pipelines

15

Page 16: from Bloom filters to Data pipelines

16

Page 17: from Bloom filters to Data pipelines

Bloom filter

17

Approximate Membership

Page 18: from Bloom filters to Data pipelines

18

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

Page 19: from Bloom filters to Data pipelines

19


Page 20: from Bloom filters to Data pipelines

20

• bit array of m bits
• k different hash functions with a uniform random distribution
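The two ingredients above can be sketched in a few lines of Go. This is a minimal illustration, not the code of any library listed in this talk: the names `Bloom`, `NewBloom`, `Add` and `Test` are made up for the example, and the k hash functions are simulated by double hashing over a single FNV-1a digest, which is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a minimal Bloom filter: a bit array of m bits probed at k positions.
type Bloom struct {
	bits []uint64 // m bits packed into 64-bit words
	m    uint64
	k    uint64
}

// NewBloom allocates a filter with m bits and k probes per element.
func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes from one 64-bit FNV-1a digest (sketch assumption).
func hashes(data []byte) (uint64, uint64) {
	h := fnv.New64a()
	h.Write(data)
	sum := h.Sum64()
	return sum & 0xffffffff, sum>>32 | 1 // the second hash is forced to be odd
}

// index returns the i-th probe position: (h1 + i*h2) mod m.
func (b *Bloom) index(h1, h2, i uint64) uint64 {
	return (h1 + i*h2) % b.m
}

// Add sets the k bits that belong to data.
func (b *Bloom) Add(data []byte) {
	h1, h2 := hashes(data)
	for i := uint64(0); i < b.k; i++ {
		idx := b.index(h1, h2, i)
		b.bits[idx/64] |= 1 << (idx % 64)
	}
}

// Test reports whether all k bits for data are set.
// true means "probably in the set", false means "definitely not in the set".
func (b *Bloom) Test(data []byte) bool {
	h1, h2 := hashes(data)
	for i := uint64(0); i < b.k; i++ {
		idx := b.index(h1, h2, i)
		if b.bits[idx/64]&(1<<(idx%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(1<<16, 4)
	f.Add([]byte("gopher"))
	fmt.Println(f.Test([]byte("gopher"))) // added elements are always found
	fmt.Println(f.Test([]byte("python"))) // usually false, but a false positive is possible
}
```

A bit that was set is never cleared, which is why false negatives cannot happen in this structure.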

Page 21: from Bloom filters to Data pipelines

21

Page 22: from Bloom filters to Data pipelines

22

Page 23: from Bloom filters to Data pipelines

23

https://www.jasondavies.com/bloomfilter/

Page 24: from Bloom filters to Data pipelines

Size estimation

24

memory usage

hash functions

n - estimated number of elements
p - false positive probability
m - required bit array length

Example: n = 1,000,000
FPR 10% ~= 4,800,000 bit ~= 600 kByte
FPR 0.1% ~= 14,400,000 bit ~= 1.8 MByte
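The formulas on this slide did not survive extraction. The standard sizing formulas for a Bloom filter are m = -n ln(p) / (ln 2)^2 for the bit array length and k = (m/n) ln 2 for the number of hash functions, and they reproduce the example figures above. A small sketch, with the hypothetical helper names `OptimalM` and `OptimalK`:

```go
package main

import (
	"fmt"
	"math"
)

// OptimalM returns the bit array length for n elements and false positive
// probability p: m = -n * ln(p) / (ln 2)^2, rounded up.
func OptimalM(n uint64, p float64) uint64 {
	return uint64(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
}

// OptimalK returns the number of hash functions k = (m/n) * ln 2,
// rounded to the nearest integer and never below 1.
func OptimalK(m, n uint64) uint64 {
	k := uint64(math.Round(float64(m) / float64(n) * math.Ln2))
	if k < 1 {
		return 1
	}
	return k
}

func main() {
	const n = 1000000
	for _, p := range []float64{0.1, 0.001} {
		m := OptimalM(n, p)
		// Roughly 4.8 million bits for p = 0.1 and 14.4 million bits for p = 0.001.
		fmt.Printf("p=%.3f: m=%d bits (~%d kByte), k=%d\n", p, m, m/8/1000, OptimalK(m, n))
	}
}
```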

Page 25: from Bloom filters to Data pipelines

Use-cases

25

• Caches
• Databases
  • HBase
  • Cassandra
• Networking

https://github.com/willf/bloom
https://github.com/reddragon/bloomfilter.go
https://github.com/seiflotfy/dlCBF
https://github.com/patrickmn/go-bloom
https://github.com/armon/bloomd
https://github.com/geetarista/go-bloomd

Page 26: from Bloom filters to Data pipelines

Extensions

26

• Cardinality estimate (increment a counter whenever a new element is added)
• Scalable Bloom filters (add another filter on top when the current one fills up)
• Counting Bloom filters
  • increment a counter every time we see an element
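As an illustration of the counting variant, the sketch below replaces every bit with a small counter, which makes removal possible. All names (`CountingBloom`, `Add`, `Remove`, `Test`) are hypothetical, and the probe positions again come from double hashing over FNV-1a, an assumption of this example rather than something the slides prescribe.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountingBloom keeps a counter per slot instead of a single bit.
type CountingBloom struct {
	counts []uint8
	k      uint32
}

// NewCountingBloom allocates m counters probed at k positions per element.
func NewCountingBloom(m, k uint32) *CountingBloom {
	return &CountingBloom{counts: make([]uint8, m), k: k}
}

// positions returns the k counter indexes for an item.
func (c *CountingBloom) positions(item string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)|1
	pos := make([]uint32, c.k)
	for i := range pos {
		pos[i] = (h1 + uint32(i)*h2) % uint32(len(c.counts))
	}
	return pos
}

// Add increments every counter of the item, saturating at 255.
func (c *CountingBloom) Add(item string) {
	for _, p := range c.positions(item) {
		if c.counts[p] < 255 {
			c.counts[p]++
		}
	}
}

// Remove decrements every counter of the item.
// It must only be called for items that were added before.
func (c *CountingBloom) Remove(item string) {
	for _, p := range c.positions(item) {
		if c.counts[p] > 0 {
			c.counts[p]--
		}
	}
}

// Test reports whether all counters of the item are non-zero.
func (c *CountingBloom) Test(item string) bool {
	for _, p := range c.positions(item) {
		if c.counts[p] == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewCountingBloom(1024, 3)
	f.Add("dress")
	fmt.Println(f.Test("dress")) // true
	f.Remove("dress")
	fmt.Println(f.Test("dress")) // false: the filter is empty again
}
```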

Page 27: from Bloom filters to Data pipelines

Count-Min

27

Frequency estimator

Page 28: from Bloom filters to Data pipelines

28

• matrix of w columns and d rows
• hash function associated with every row
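A minimal sketch of that layout in Go follows. The type and method names are invented for this example, and the per-row hash function is approximated by seeding FNV-1a with the row number, which is an assumption of the sketch. The estimate is the minimum over all rows, so it can overcount because of collisions but never undercounts.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountMin is a d x w matrix of counters with one hash function per row.
type CountMin struct {
	w, d  uint32
	table [][]uint64
}

// NewCountMin allocates a sketch with w columns and d rows.
func NewCountMin(w, d uint32) *CountMin {
	t := make([][]uint64, d)
	for i := range t {
		t[i] = make([]uint64, w)
	}
	return &CountMin{w: w, d: d, table: t}
}

// column picks the column for item in the given row by hashing the row
// number together with the item (stand-in for d independent hash functions).
func (c *CountMin) column(row uint32, item string) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(row), byte(row >> 8), byte(row >> 16), byte(row >> 24)})
	h.Write([]byte(item))
	return h.Sum32() % c.w
}

// Add increases the counter of item by count in every row.
func (c *CountMin) Add(item string, count uint64) {
	for r := uint32(0); r < c.d; r++ {
		c.table[r][c.column(r, item)] += count
	}
}

// Estimate returns the smallest counter found for item across all rows.
func (c *CountMin) Estimate(item string) uint64 {
	est := c.table[0][c.column(0, item)]
	for r := uint32(1); r < c.d; r++ {
		if v := c.table[r][c.column(r, item)]; v < est {
			est = v
		}
	}
	return est
}

func main() {
	cm := NewCountMin(256, 4)
	cm.Add("shoes", 3)
	cm.Add("bags", 5)
	fmt.Println(cm.Estimate("shoes"), cm.Estimate("bags")) // at least 3 and at least 5
}
```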

Page 29: from Bloom filters to Data pipelines

29

Page 30: from Bloom filters to Data pipelines

HyperLogLog

30

Cardinality estimator

Page 31: from Bloom filters to Data pipelines

31

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.

Page 32: from Bloom filters to Data pipelines

32

The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical error rate of 2%, using 1.5 kB of memory [3]

Page 33: from Bloom filters to Data pipelines

33

Hash function

hash(x) -> stream of bits {1,0,0,1,0,1..}
• hash generates uniformly distributed values
• every bit is independent

Page 34: from Bloom filters to Data pipelines

34

Bit probability

p(first bit = 0) = 1/2
p(second bit = 0) = 1/2
5 consecutive zeros: (1/2)^5
N consecutive zeros: (1/2)^N

Page 35: from Bloom filters to Data pipelines

35

Guessing bits

p(first bit = 0) = 1/2
p(second bit = 0) = 1/2
5 consecutive zeros: (1/2)^5
N consecutive zeros: (1/2)^N
N = 32, odds = 1/4294967296 -> expected 4294967296 samples
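The intuition above can be turned into a very crude counter: hash every element, remember the longest run of leading zero bits, and guess 2^(longest run). The sketch below does only that. FNV-1a is an arbitrary hash choice for the example and the function names are hypothetical. A single run length is a noisy signal, which is what the later register slides address.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
)

// leadingZeros returns the number of consecutive zero bits at the start of a 32-bit hash.
func leadingZeros(h uint32) int {
	return bits.LeadingZeros32(h)
}

// hash32 maps a string to 32 bits with FNV-1a (arbitrary choice for this sketch).
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// roughCount guesses the number of distinct items as 2^(longest zero run).
// Duplicates hash to the same value, so they do not change the guess.
func roughCount(items []string) uint64 {
	longest := 0
	for _, it := range items {
		if z := leadingZeros(hash32(it)); z > longest {
			longest = z
		}
	}
	return 1 << uint(longest)
}

func main() {
	items := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		items = append(items, fmt.Sprintf("user-%d", i))
	}
	fmt.Println("rough guess for 1000 distinct items:", roughCount(items))
}
```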

Page 36: from Bloom filters to Data pipelines

36

Storing bits

N = 32 = {1,0,0,0,0,0} = 6 bit
With 6 bits we can store run lengths up to 63, enough to count up to about 2^64
Where the name comes from: log2(log2(2^64)) = log2(64) = 6

Page 37: from Bloom filters to Data pipelines

37

Multiple registers

• Create m registers
• Partition the bit stream
  • first log2(m) bits: register index
  • remaining bits: used for the actual values

Page 38: from Bloom filters to Data pipelines

38

HyperLogLog - add
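The register partition and the add step can be sketched as follows, assuming a 32-bit hash and m = 16 registers to keep the numbers readable. The names `split`, `rank` and `add` are made up for this example. Each register stores the largest rank seen, where the rank is the position of the first 1 bit in the remaining bits.

```go
package main

import (
	"fmt"
	"math/bits"
)

const p = 4      // log2(m): number of leading hash bits used as register index
const m = 1 << p // number of registers

// split partitions a 32-bit hash into a register index and the remaining bits.
func split(hash uint32) (index uint32, rest uint32) {
	return hash >> (32 - p), hash << p
}

// rank is the position of the first 1 bit in rest, counting from 1.
// It is capped at 32-p+1 when all remaining bits are zero.
func rank(rest uint32) uint8 {
	r := bits.LeadingZeros32(rest) + 1
	if r > 32-p+1 {
		r = 32 - p + 1
	}
	return uint8(r)
}

// add updates the registers with one hashed element:
// the selected register keeps the maximum rank seen so far.
func add(reg *[m]uint8, hash uint32) {
	idx, rest := split(hash)
	if r := rank(rest); r > reg[idx] {
		reg[idx] = r
	}
}

func main() {
	var reg [m]uint8
	add(&reg, 0xF0000001) // register 15, remaining bits 0x00000010, rank 28
	add(&reg, 0x28000000) // register 2, remaining bits 0x80000000, rank 1
	fmt.Println(reg[15], reg[2]) // 28 1
}
```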

Page 39: from Bloom filters to Data pipelines

39

HyperLogLog - size

• Given m registers
• Estimate aggregated value
  • Min? Max? Avg? Median?
  • Geometric/Harmonic mean!
• Estimate A*m*H
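Reading A as the bias correction constant and H as the harmonic mean of 2^register over all m registers, the estimate A*m*H can be sketched end to end as below. Several details are assumptions of this example and not taken from the slides: a 64-bit FNV-1a hash followed by the MurmurHash3 finalizer for better mixing, m = 4096 registers, the usual constant A = 0.7213 / (1 + 1.079/m), and the linear counting correction for small cardinalities. `Add`, `Estimate` and `fill` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const p = 12     // log2 of the register count
const m = 1 << p // number of registers

// hash64 hashes a string with FNV-1a and applies the MurmurHash3 64-bit
// finalizer so that the leading bits are well mixed (sketch assumption).
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	x := h.Sum64()
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// Add records one element: the first p bits select the register, the rank
// of the remaining bits is stored if it exceeds the current register value.
func Add(reg []uint8, item string) {
	x := hash64(item)
	idx := x >> (64 - p)
	// The extra bit below the shifted hash caps the rank at 64-p+1.
	rank := uint8(bits.LeadingZeros64(x<<p|1<<(p-1)) + 1)
	if rank > reg[idx] {
		reg[idx] = rank
	}
}

// Estimate computes A * m * H, where H is the harmonic mean of 2^register.
func Estimate(reg []uint8) float64 {
	sum, zeros := 0.0, 0
	for _, r := range reg {
		sum += math.Pow(2, -float64(r))
		if r == 0 {
			zeros++
		}
	}
	n := float64(len(reg))
	alpha := 0.7213 / (1 + 1.079/n) // bias correction constant A, valid for m >= 128
	e := alpha * n * n / sum        // equals A * m * (m / sum)
	if e <= 2.5*n && zeros > 0 {
		e = n * math.Log(n/float64(zeros)) // small range: fall back to linear counting
	}
	return e
}

// fill adds n distinct synthetic items to the registers.
func fill(reg []uint8, n int) {
	for i := 0; i < n; i++ {
		Add(reg, fmt.Sprintf("item-%d", i))
	}
}

func main() {
	reg := make([]uint8, m)
	fill(reg, 100000)
	fmt.Printf("estimated %.0f distinct items using %d bytes\n", Estimate(reg), len(reg))
}
```

With 4096 registers the expected relative error is around 1.04/sqrt(4096), or about 1.6%, so the printed value should land close to 100000 without being exact.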

Page 40: from Bloom filters to Data pipelines

40

http://content.research.neustar.biz/blog/hll.html

Page 41: from Bloom filters to Data pipelines

Use-cases

41

• Databases
• Redis
• PostgreSQL
• Redshift
• Impala
• Hive
• Spark

https://github.com/clarkduvall/hyperloglog
https://github.com/armon/hlld

Page 42: from Bloom filters to Data pipelines

42

In computing, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one.
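In Go this definition maps naturally onto goroutines connected by channels. The sketch below is only an illustration with made-up stage names (`source`, `double`, `above`, `collect`): every stage reads from an input channel, writes to an output channel, and closes its output when its input is exhausted.

```go
package main

import "fmt"

// source emits the given values and closes its output when done.
func source(values ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, v := range values {
			out <- v
		}
	}()
	return out
}

// double is a stage that multiplies every input by two.
func double(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * 2
		}
	}()
	return out
}

// above is a stage that forwards only values greater than limit.
func above(limit int, in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			if v > limit {
				out <- v
			}
		}
	}()
	return out
}

// collect drains the final stage into a slice.
func collect(in <-chan int) []int {
	var result []int
	for v := range in {
		result = append(result, v)
	}
	return result
}

func main() {
	// The output of each stage is the input of the next one.
	fmt.Println(collect(above(4, double(source(1, 2, 3, 4, 5))))) // [6 8 10]
}
```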

Page 43: from Bloom filters to Data pipelines

Open Source Stack

43

Page 44: from Bloom filters to Data pipelines

44

http://lambda-architecture.net/

Page 45: from Bloom filters to Data pipelines

45

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

Page 46: from Bloom filters to Data pipelines

46

Page 47: from Bloom filters to Data pipelines

47

Page 48: from Bloom filters to Data pipelines

48

Libraries

• Sarama is an MIT-licensed Go client library for Apache Kafka version 0.8 (and later)
  https://github.com/Shopify/sarama
• Go Kafka Client
  https://github.com/elodina/go_kafka_client

Page 49: from Bloom filters to Data pipelines

49

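// Assumes the imports "log" and "github.com/Shopify/sarama".
// Connect to a local broker; a nil config means the library defaults.
producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, nil)
if err != nil {
	panic(err)
}

// Close the producer on exit so buffered messages are flushed.
defer func() {
	if err := producer.Close(); err != nil {
		log.Fatalln(err)
	}
}()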

Page 50: from Bloom filters to Data pipelines

50

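// Assumes the imports "log", "os", "os/signal" and "github.com/Shopify/sarama".
// Trap SIGINT to trigger a shutdown.
signals := make(chan os.Signal, 1)
signal.Notify(signals, os.Interrupt)

var enqueued, errors int
ProducerLoop:
for {
	select {
	case producer.Input() <- &sarama.ProducerMessage{Topic: "my_topic", Key: nil, Value: sarama.StringEncoder("testing 123")}:
		enqueued++
	case err := <-producer.Errors():
		log.Println("Failed to produce message", err)
		errors++
	case <-signals:
		break ProducerLoop
	}
}

log.Printf("Enqueued: %d; errors: %d\n", enqueued, errors)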

Page 51: from Bloom filters to Data pipelines

51

http://www.ipponusa.com/wp-content/uploads/2014/10/spark-architecture.jpg

dmrgo is a Go library for writing map/reduce jobs.
https://github.com/dgryski/dmrgo

Page 52: from Bloom filters to Data pipelines

Results

52

• Scalable
• Flexible
• High maintenance costs
• Not so easy to set up

Page 53: from Bloom filters to Data pipelines

53

A programming language is low level when its programs require attention to the irrelevant.

Alan Jay Perlis / Epigrams on Programming

Page 54: from Bloom filters to Data pipelines

Amazon AWS

54

Page 55: from Bloom filters to Data pipelines

Kinesis Streams

Page 56: from Bloom filters to Data pipelines

56

Page 57: from Bloom filters to Data pipelines

57

Page 58: from Bloom filters to Data pipelines

58

Libraries

• AWS SDK for Go
  https://github.com/aws/aws-sdk-go

Page 59: from Bloom filters to Data pipelines

59

Page 60: from Bloom filters to Data pipelines

60

Page 61: from Bloom filters to Data pipelines

Kinesis Firehose

Page 62: from Bloom filters to Data pipelines

Kinesis Analytics

Page 63: from Bloom filters to Data pipelines

63

Diagram labels: Product events (variety of event types and structures); custom unification pipeline; Product Processing; Business Intelligence; ML/Tagging

Page 64: from Bloom filters to Data pipelines

Google Cloud

64

Page 65: from Bloom filters to Data pipelines

65

Page 66: from Bloom filters to Data pipelines

66

Page 67: from Bloom filters to Data pipelines

67

Libraries

• Google APIs Client Library for Go
  https://github.com/GoogleCloudPlatform/gcloud-golang

Page 68: from Bloom filters to Data pipelines

68

Page 69: from Bloom filters to Data pipelines

69

Page 70: from Bloom filters to Data pipelines
Page 71: from Bloom filters to Data pipelines

71

Page 72: from Bloom filters to Data pipelines

Serverless architecture

72

Page 73: from Bloom filters to Data pipelines

73

Page 74: from Bloom filters to Data pipelines

74

Page 75: from Bloom filters to Data pipelines

75

Page 76: from Bloom filters to Data pipelines

76

Page 77: from Bloom filters to Data pipelines

77

Page 78: from Bloom filters to Data pipelines

78

Page 79: from Bloom filters to Data pipelines

79

Page 80: from Bloom filters to Data pipelines

80

Possibilities

• all Lambdas in one place with version control
• integration tests with real events
• proper CI/CD setup

Page 81: from Bloom filters to Data pipelines

81

Page 83: from Bloom filters to Data pipelines

Related links

83

1. Burton H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970

2. Interactive visualisation: Bloom Filters

3. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

4. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm

5. HyperLogLog — Cornerstone of a Big Data Infrastructure

6. Armon Dadgar on Bloom Filters and HyperLogLog

Page 84: from Bloom filters to Data pipelines

Related links

84

7. https://github.com/willf/bloom

8. Google’s Cloud Pub/Sub Real-Time Messaging Service Is Now In Public Beta

9. Streaming Data Processing with Amazon Kinesis and AWS Lambda

10. Google Cloud Dataflow Two Worlds Become a Much Better One

11. https://github.com/apex/apex

