Date post: | 15-May-2015 |
Category: | Technology |
Upload: | nati-shalom |
Real Time Analytics for Big Data
A Twitter Inspired Case Study
@natishalom
Big Data Predictions
® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting Correlating Research
Analytics @ Twitter – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics: countries and cities, gender, age groups, device types, …
Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
Analytics @ Twitter – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
It’s All about Timing
“Real time” (< a few seconds)
Reasonably Quick (seconds - minutes)
Batch (hours/days)
It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss
Challenge – Word Count
Tweets → Count? → Word:Count
• Hottest topics
• URL mentions
• etc.
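The word count challenge can be sketched in a few lines of Python. This is a minimal single-process illustration, not the implementation behind the deck: the sample tweets, the regular expressions, and the function name `count_tweet` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample tweets, invented for illustration only.
TWEETS = [
    "Big data is big http://example.com/a",
    "Real time analytics for big data",
    "Read this: http://example.com/a #bigdata",
]

# Assumed tokenization rules: URLs are counted separately from words.
URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z0-9#@']+")


def count_tweet(tweet, word_counts, url_counts):
    """Update the word and URL counters with the contents of one tweet."""
    url_counts.update(URL_PATTERN.findall(tweet))
    # Strip URLs before splitting into words so they are not counted twice.
    text = URL_PATTERN.sub(" ", tweet).lower()
    word_counts.update(WORD_PATTERN.findall(text))


word_counts = Counter()
url_counts = Counter()
for tweet in TWEETS:
    count_tweet(tweet, word_counts, url_counts)

# "Hottest topics" and "URL mentions" are simply the largest counters.
print(word_counts.most_common(2))
print(url_counts.most_common(1))
```

Every tweet updates the counters as it arrives, which is the "high resolution" end of the timing scale; the hard part discussed in the rest of the deck is doing this at Twitter's volume.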
URL Mentions – Here’s One Use Case
Twitter in Numbers (March 2011)
Source: http://blog.twitter.com/2011/03/numbers.html
• It takes a week for users to send 1 billion Tweets.
• On average, 140 million tweets get sent every day.
• The highest throughput to date is 6,939 tweets/sec.
• 460,000 new accounts are created daily.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
• 5% of the users generate 75% of the content.
Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant
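A quick back-of-the-envelope check shows how the March 2011 figures quoted above translate into per-second rates. The sketch below uses only the numbers from the deck for the averages; the target rate and per-partition throughput at the end are hypothetical values chosen purely to illustrate the sizing arithmetic.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY

# Figures quoted in the deck (Twitter, March 2011).
TWEETS_PER_DAY = 140_000_000
TWEETS_PER_WEEK = 1_000_000_000
PEAK_TWEETS_PER_SEC = 6_939

# Average rates implied by the daily and weekly figures.
avg_from_daily = TWEETS_PER_DAY / SECONDS_PER_DAY
avg_from_weekly = TWEETS_PER_WEEK / SECONDS_PER_WEEK

# How far the recorded peak sits above the daily average.
peak_to_average = PEAK_TWEETS_PER_SEC / avg_from_daily


def partitions_needed(target_rate, per_partition_rate):
    """Partitions required if throughput scales linearly with partitions."""
    return math.ceil(target_rate / per_partition_rate)


# Hypothetical sizing inputs, not taken from the deck.
TARGET_TWEETS_PER_SEC = 10_000
ASSUMED_RATE_PER_PARTITION = 2_500

print(round(avg_from_daily), round(avg_from_weekly))
print(round(peak_to_average, 1))
print(partitions_needed(TARGET_TWEETS_PER_SEC, ASSUMED_RATE_PER_PARTITION))
```

The daily and weekly figures agree on an average of roughly 1,600 tweets per second, with the recorded peak about four times higher, so a design has to be sized for bursts rather than for the average.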
Key Elements in Real Time Big Data Analytics
Sharding (Partitioning)
Tokenizer 1, Tokenizer 2, Tokenizer 3, … Tokenizer n
Filterer 1, Filterer 2, Filterer 3, … Filterer n
Counter Updater 1, Counter Updater 2, Counter Updater 3, … Counter Updater n
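One way to picture sharding the counters is to route each word to a partition by a stable hash, so that every word is owned by exactly one counter updater. The sketch below is an in-process illustration under that assumption; the class name `ShardedCounter` and the choice of CRC32 as the hash are hypothetical, and a real deployment would place each shard on a separate node.

```python
import zlib
from collections import Counter


class ShardedCounter:
    """Word counters split across a fixed number of in-memory shards."""

    def __init__(self, num_shards):
        self.shards = [Counter() for _ in range(num_shards)]

    def shard_for(self, word):
        # CRC32 is stable across processes, unlike Python's built-in
        # hash() for strings, so routing stays consistent between runs.
        return zlib.crc32(word.encode("utf-8")) % len(self.shards)

    def increment(self, word, amount=1):
        self.shards[self.shard_for(word)][word] += amount

    def count(self, word):
        return self.shards[self.shard_for(word)][word]

    def total(self):
        """Merge all shards into one Counter (for reporting only)."""
        merged = Counter()
        for shard in self.shards:
            merged.update(shard)
        return merged


words = "big data is big and real time data is fast".split()
counter = ShardedCounter(num_shards=4)
for word in words:
    counter.increment(word)

print(counter.count("big"))
print(counter.total().most_common(3))
```

Because a word always lands on the same shard, updates to different words never contend with each other, which is what allows the counting stage to scale by adding partitions.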
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
Use EDA (Event Driven Architecture)
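As a rough illustration of the event driven style, the sketch below chains a tokenizer, a filterer, and a counter updater through a tiny in-process event bus. The `EventBus` class, the event names, and the stop-word list are all assumptions made for this example; in a real system the bus would be a distributed messaging or data grid layer rather than a local queue.

```python
from collections import Counter, defaultdict, deque


class EventBus:
    """Minimal in-process publish/subscribe bus with a FIFO queue."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run(self):
        # Deliver events until no handler publishes anything new.
        while self.queue:
            event_type, payload = self.queue.popleft()
            for handler in self.handlers[event_type]:
                handler(payload)


STOP_WORDS = {"a", "the", "is", "for"}  # hypothetical filter list
counts = Counter()
bus = EventBus()


def tokenizer(tweet):
    # Each incoming tweet is split into one "token" event per word.
    for word in tweet.lower().split():
        bus.publish("token", word)


def filterer(word):
    # Only words worth counting are passed on to the next stage.
    if word not in STOP_WORDS:
        bus.publish("filtered_token", word)


def counter_updater(word):
    counts[word] += 1


bus.subscribe("tweet", tokenizer)
bus.subscribe("token", filterer)
bus.subscribe("filtered_token", counter_updater)

bus.publish("tweet", "Big data is the big thing")
bus.run()
print(counts)
```

Each stage only reacts to the events it subscribes to, so stages can be scaled or replaced independently without changing the others.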
Putting it all together
Know Your Toolset
References
Writing your own twitter analytics: http://ht.ly/d8j4I
Detailed blog post: http://bit.ly/gs-bigdata-analytics
Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
Twitter Storm: http://bit.ly/twitter-storm
Apache S4: http://incubator.apache.org/s4/