Date posted: 27-Jan-2017
Category: Engineering
Uploaded by: kendrick-lo
Count me once, count me fast!
Probabilistic methods in real-time streaming (Hyperloglog, Bloom filters)
Kendrick Lo
Insight Data Engineering, NYC
Summer 2016
[Diagram: real-time viewing data, shown as a stream of records that each carry an Ad ID, a Unique User ID, and a Time stamp]
Unique User ID, Unique User ID, Unique User ID, Unique User ID, ... → ...?
[Diagram: the same real-time viewing data, with the memory needed to count its unique user IDs]
● bitmap (for exact counting): 13 MB for 100 million uniques
● hyperloglog: 4 KB for billions of uniques
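The two sizes above can be checked with a little arithmetic. This is a rough sketch, not the author's calculation: it assumes the bitmap uses one bit per possible user ID, and that the 4 KB Hyperloglog corresponds to 2**12 one-byte registers (the `bits=12` setting is an assumption).

```python
def bitmap_bytes(universe_size: int) -> int:
    """Bytes for an exact-count bitmap with one bit per possible ID."""
    return (universe_size + 7) // 8


def hll_bytes(bits: int, register_bytes: int = 1) -> int:
    """Bytes for a Hyperloglog with 2**bits registers (one byte each, assumed)."""
    return (2 ** bits) * register_bytes


print(bitmap_bytes(100_000_000) / 1e6)  # 12.5 (MB), i.e. roughly 13 MB
print(hll_bytes(12) / 1024)             # 4.0 (KB)
```

The bitmap grows linearly with the number of possible IDs, while the Hyperloglog size is fixed by the chosen number of registers.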
Hyperloglog
Count-distinct problem (a.k.a. cardinality estimation problem)
● counting unique elements in a data stream with repeated elements
● calculates an approximate number
○ typical error purported to be less than 2%
What it can’t do:
● give an exact count
● track frequency of occurrence
● confirm whether a certain element was seen
Hyperloglog - a probabilistic method
General Idea: Count leading zeros in a randomly generated binary number
Given a random number, what is the probability of seeing…?
1 x x x x x x x x… → 0.5 (1 out of every 2)
0 1 x x x x x x x… → 0.25 (1 out of every 4)
0 0 1 x x x x x x… → 0.125 (1 out of every 8)
…
0 0 0 0 0 0 1 x x… → 0.008 (1 out of every 128)
...
Hyperloglog - a probabilistic method
Question:
I have a list of N unique numbers. The one with the longest string of leading zeros is
0 0 0 0 0 0 1 x x…
What is N?
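The reasoning behind the question can be sketched in a few lines. A prefix of k zeros followed by a 1 shows up with probability 2^-(k+1), so if the longest run seen is k zeros, a crude guess is that about 2^(k+1) distinct numbers were drawn. The helper names, the 32-bit width, and the fixed seed below are illustrative assumptions.

```python
import random


def leading_zeros(value: int, width: int = 32) -> int:
    """Number of leading zero bits in a width-bit unsigned integer."""
    return width - value.bit_length()


def longest_run(n: int, width: int = 32, seed: int = 0) -> int:
    """Longest run of leading zeros among n random width-bit numbers."""
    rng = random.Random(seed)
    return max(leading_zeros(rng.getrandbits(width), width) for _ in range(n))


# Probability of seeing k leading zeros followed by a 1.
for k in (0, 1, 2, 6):
    print(k, 2.0 ** -(k + 1))

# Six leading zeros followed by a 1 suggests N is about 2**7 = 128.
print(2 ** (6 + 1))

# A single observation is very noisy: rerun with other seeds to see the spread.
z = longest_run(128)
print(z, 2 ** (z + 1))
```

Because one maximum is so noisy, a practical estimator needs to combine many such observations.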
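Hyperloglog
[Diagram: IDs stream in and the run of leading zeros is recorded]
● Single estimate: longest run of leading zeros is 6 => 128 unique viewers
● Many estimates: 5 6 7 4 6 8 ... combined with a (harmonic) MEAN: 6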
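A minimal Hyperloglog sketch shows how the many per-register observations are combined with a harmonic mean. This is an illustrative implementation under stated assumptions, not the Algebird code used in the pipeline: it hashes with SHA-1 truncated to 64 bits, uses 2**12 registers by default, and applies the commonly published bias constant and small-range correction.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, bits: int = 12):
        self.bits = bits
        self.m = 1 << bits                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.bits)                        # first bits pick a register
        rest = h & ((1 << (64 - self.bits)) - 1)           # remaining bits
        rank = (64 - self.bits) - rest.bit_length() + 1    # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic mean of 2**register across all registers, scaled by alpha * m.
        estimate = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for repeat in range(2):                 # every ID is seen twice
    for i in range(50_000):
        hll.add(f"user-{i}")
print(round(hll.count()))               # close to 50,000 despite the repeats
```

Repeated IDs hash to the same register and rank, so they never change the state; that is what makes the structure count each viewer once.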
Pipeline
● Record fields: Ad ID, Unique User ID, Gender, Age segments, Time stamp
● Algebird
● 4 x m4.large
● 1 sec mini-batches
Pushed 1 billion records with unique user IDs
● Throughput can reach an average of 5M records/min
● Streams of <1M records processed within a minute
● Beyond 1M uniques, delays accumulate and the system becomes unstable when exact sets are used
Extension: counting unique viewers in a subgroup
● Associating segments with user IDs
○ Challenge: Can we avoid database accesses when processing data in real-time?
○ Bloom filter: another fixed-size probabilistic data structure that trades off (tunable) accuracy for size
   e.g. Bloom filter + Hyperloglog, count males: error 1.2%
○ needed to overcome challenges in combining aspects of Spark (batch) and Spark Streaming
Sample record: Ad ID | Unique User ID | Gender | Age segment (e.g. 18-34) | Time stamp
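The idea of replacing a database lookup with a Bloom filter can be sketched as follows. This is a minimal illustration, not the implementation used in the project: the class, the double-hashing scheme built on SHA-256, the filter size, and the `user-N` IDs are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size membership test: false positives possible, false negatives not."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Load the segment once (hypothetical: even-numbered users are male).
males = BloomFilter(num_bits=8 * 1024, num_hashes=5)
for i in range(0, 1000, 2):
    males.add(f"user-{i}")

# Route incoming user IDs without touching a database.
stream = ["user-2", "user-3", "user-2", "user-10"]
male_views = [uid for uid in stream if uid in males]
print(male_views)
```

IDs that pass the filter would then be fed to the Hyperloglog for that segment; an occasional false positive slightly inflates the segment count, which is the accuracy that is traded for size.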
About me
Master of Science, Harvard University Computational Science and Engineering (graduated May 2016)
J.D. / MBA, University of Toronto
Bachelor of Applied Science, University of Toronto Engineering Science (Computer)
Thank you for listening!
appendix
[Set structures]
[HLL structures]
Results: error rate in counts
● Error < 2% for subgroups; slightly higher for main group
● Error for intersection calculation (purple) tends to be higher on average
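One plausible reason the intersection error runs higher: the slides do not say how the intersection was computed, but a common approach with Hyperloglog is inclusion-exclusion, |A ∩ B| = |A| + |B| - |A ∪ B|. Assuming that approach, small errors in three large estimates land on one smaller number. The sizes and the 1% errors below are hypothetical, chosen only to show the effect.

```python
def intersection_estimate(count_a: float, count_b: float, count_union: float) -> float:
    """Inclusion-exclusion estimate of |A ∩ B| from three cardinality estimates."""
    return count_a + count_b - count_union


# Hypothetical true sizes: |A| = 500k, |B| = 400k, |A ∪ B| = 700k, so |A ∩ B| = 200k.
true_a, true_b, true_union = 500_000, 400_000, 700_000
true_intersection = true_a + true_b - true_union

# Unlucky case: both sets overestimated by 1%, the union underestimated by 1%.
estimate = intersection_estimate(true_a * 1.01, true_b * 1.01, true_union * 0.99)
relative_error = (estimate - true_intersection) / true_intersection
print(round(relative_error, 3))  # 0.08: 1% input errors become an 8% intersection error
```

The smaller the true intersection relative to the sets, the more this amplification grows.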
Use cases
● Advertising
○ ad viewership, website views, television viewership, app engagement, etc.
● Any application where you would want to count a large number of unique things fast
○ stock trades, network traffic, twitter responses, election data, real-time voting data, etc.
● Well suited to real-time analytics
○ intermediate state of HLL structure provides for a running count
○ trivially parallelizable
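The "trivially parallelizable" point follows from how Hyperloglog states merge: taking the register-wise maximum of two sketches gives exactly the sketch of the combined stream. The sketch below illustrates this under assumptions (SHA-1 truncated to 64 bits, 2**10 registers, made-up `user-N` IDs); it is not tied to any particular library.

```python
import hashlib

BITS = 10  # 1024 registers (assumed for the example)


def registers_for(ids):
    """Build Hyperloglog registers for one partition of the stream."""
    regs = [0] * (1 << BITS)
    for uid in ids:
        h = int.from_bytes(hashlib.sha1(uid.encode()).digest()[:8], "big")
        idx = h >> (64 - BITS)
        rest = h & ((1 << (64 - BITS)) - 1)
        rank = (64 - BITS) - rest.bit_length() + 1
        regs[idx] = max(regs[idx], rank)
    return regs


def merge(a, b):
    """Combine two sketches of the same size with a register-wise maximum."""
    if len(a) != len(b):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(a, b)]


# Two workers see overlapping slices of the stream.
part1 = [f"user-{i}" for i in range(0, 6000)]
part2 = [f"user-{i}" for i in range(4000, 10000)]

merged = merge(registers_for(part1), registers_for(part2))
print(merged == registers_for(part1 + part2))  # True: same state as one big pass
```

Because the merge is associative and commutative, partial sketches from mini-batches or workers can be combined in any order to keep a running count.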
Future exploration
● Associating segments with user IDs
○ quantifying incremental error associated with introduction of Bloom filters
● Apache Storm versus Spark
○ Does Storm (a “pure” streaming technology) perform much better?
● Spark DataFrames API
○ seemed to introduce significant delay: would like to quantify this
Bloom Filters
● Experiment with 1 million records
○ Employed 2 bloom filters (1 MB each), one for each segment (male, 18-34) to store segment data to be matched with incoming user IDs, continued processing with Hyperloglog
○ estimated error for hyperloglog: 2%; estimated error for bloom filter: 3%
● Actual error:
○ Bloom filter + Hyperloglog: count males: 1.2%; count 18-34: 0.6%; intersection: 5.9%
○ Hyperloglog only: count males: 1.4%; count 18-34: 0.7%; intersection: 5.6%
● Time to process:
○ Bloom filter + Hyperloglog: 17s (+55%)
○ Hyperloglog only: 11s
Bloom Filters
Source: Wikipedia
Tuning Probabilistic Structures
Hyperloglog (source: Twitter Algebird source code: HyperLogLog.scala)
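For Hyperloglog, the usual tuning knob is the number of registers. The standard error commonly quoted for the algorithm is about 1.04 / sqrt(m) for m = 2**bits registers; the sketch below assumes that relation (the function names are illustrative, not Algebird's API).

```python
import math


def hll_standard_error(bits: int) -> float:
    """Approximate relative standard error with 2**bits registers."""
    return 1.04 / math.sqrt(2 ** bits)


def bits_for_error(target_error: float) -> int:
    """Smallest number of index bits whose standard error is at most target_error."""
    return math.ceil(2 * math.log2(1.04 / target_error))


print(hll_standard_error(12))   # 0.01625, i.e. about 1.6% with 4096 registers
print(bits_for_error(0.02))     # 12
```

Each extra bit doubles the memory and shrinks the error by a factor of about 1.41, so a 2% target lands on 12 bits.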
Bloom Filters (source: https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
e.g. n = 1 M (capacity), p = 0.03 (error)
=> k = 5 (# of hash functions), m = 891 kB
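The example values can be reproduced with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The function name below is illustrative, and the result in kB assumes 1 kB = 1024 bytes.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (bits, hash functions) for capacity n and false-positive rate p."""
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    k = (m_bits / n) * math.log(2)
    return math.ceil(m_bits), round(k)


m_bits, k = bloom_parameters(1_000_000, 0.03)
print(k)                          # 5
print(round(m_bits / 8 / 1024))   # 891 (kB)
```

Tightening p to 0.01 with the same capacity raises both numbers, which is the accuracy-for-size trade-off mentioned earlier in the deck.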