Date post: | 27-Jun-2015 |
Category: | Technology |
Upload: | dfilppi |
Real Time Analytics for Big Data – Lessons from Twitter (and beyond)
DeWayne Filppi, @dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity, Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting, Correlating, Research
Analytics @ Twitter – Counting
How many signups, tweets, retweets for a topic?
What’s the average latency?
Demographics: countries and cities, gender, age groups, device types, …
Analytics @ Twitter – Correlating
What devices fail at the same time?
What features get users hooked?
What places on the globe are “happening”?
Analytics @ Twitter – Research
Sentiment analysis: “Obama is popular”
Trends: “People like to tweet after watching American Idol”
Spam patterns: How can you tell when a user spams?
It’s All about Timing
“Real time” (< a few seconds)
Reasonably Quick (seconds - minutes)
Batch (hours/days)
It’s All about Timing
• Event driven / stream processing • High resolution – every tweet gets counted
• Ad-hoc querying • Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce) • Low resolution (trends & patterns)
This is what we’re here to discuss
TWITTER REAL-TIME ANALYTICS SYSTEM
It takes a week for users to send 3 billion Tweets.
Twitter in Numbers (Jan 2013)
Source: http://blog.twitter.com/2011/03/numbers.html
On average, 500 million tweets get sent every day.
Twitter in Numbers (Jan 2013)
Source: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/l
The highest throughput to date is 33,388 tweets/sec.
Twitter in Numbers (Jan 2013)
Source: http://www.huffingtonpost.com/2013/01/02/tweets-per-second-record_n_2396915.html
1,000,000 new accounts are created daily.
Twitter in Numbers (March 2011)
Source: http://www.mediabistro.com/alltwitter/50-twitter-fun-facts_b33589
5% of the users generate 75% of the content.
Twitter in Numbers
Source: http://www.sysomos.com/insidetwitter/
Analyze the Problem
• (Tens of) thousands of tweets per second to process
• Assumption: Need to process in near real time
• Aggregate counters for each word
• A few 10s of thousands of words (or hundreds of thousands if we include URLs)
• System needs to linearly scale
• System needs to be fault tolerant
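As a rough sizing check on the figures quoted in the preceding slides, the sketch below converts the daily volume into an average per-second rate and compares it with the quoted peak. The variable names are illustrative only, and the numbers are simply the ones shown on the slides.

```python
# Back-of-the-envelope sizing from the figures quoted in the slides.
TWEETS_PER_DAY = 500_000_000    # "500 million tweets get sent every day"
PEAK_TWEETS_PER_SEC = 33_388    # "highest throughput to date"
SECONDS_PER_DAY = 24 * 60 * 60

# Average load the system sees over a whole day.
avg_tweets_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY

# How much larger the record burst is than the daily average.
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_tweets_per_sec

print(f"average: {avg_tweets_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
```

The average works out to a little under 6,000 tweets per second, with the record burst several times higher, so capacity has to be planned for the peaks rather than the mean. That is one way to read the requirement that the system scale linearly.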
Key Elements in Real Time Big Data Analytics
Sharding (Partitioning)
Tokenizer 1 Filterer 1
Tokenizer 2 Filterer 2
Tokenizer 3 Filterer 3
Tokenizer n Filterer n
Counter Updater 1
Counter Updater 2
Counter Updater 3
Counter Updater n
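One common way to realize this kind of partitioning is to route each word to a counter updater by hashing the word. The sketch below is a minimal illustration under that assumption. `NUM_PARTITIONS`, `partition_for`, and `update` are hypothetical names, and the slides do not say which hash or routing scheme was actually used.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical number of counter updaters


def partition_for(word: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a word to a partition with a stable hash (CRC32), so the same
    word always reaches the same counter updater."""
    return zlib.crc32(word.encode("utf-8")) % num_partitions


# One independent counter per partition; partitions share no state.
counters = [Counter() for _ in range(NUM_PARTITIONS)]


def update(word: str) -> None:
    """Increment the word's count in the single partition that owns it."""
    counters[partition_for(word)][word] += 1


for w in "storm storm hadoop storm grid hadoop".split():
    update(w)

# Merging the partitions gives the global view; each word's full count
# lives in exactly one partition, so nothing is double counted.
total = sum(counters, Counter())
```

CRC32 is used here instead of Python's built-in `hash()` because string hashing is randomized per process, which would send the same word to different partitions on different machines.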
Use EDA (Event Driven Architecture)
Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter/Aggregator
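The event-driven flow can be sketched as handlers that subscribe to one event type and publish the next. This is a toy, in-process illustration only. The `subscribe` and `publish` helpers, the event type names, and the stop-word list are all assumptions made for the example, not part of any product API.

```python
from collections import Counter, defaultdict

handlers = defaultdict(list)            # event type -> subscribed handlers
counts = Counter()                      # state owned by the counter stage
STOP_WORDS = {"a", "the", "is", "to"}   # illustrative filter list


def subscribe(event_type):
    """Decorator that registers a function as a handler for an event type."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register


def publish(event_type, payload):
    """Deliver an event to every handler subscribed to its type."""
    for fn in handlers[event_type]:
        fn(payload)


@subscribe("raw")
def tokenizer(tweet):
    # Raw tweet in, tokenized event out.
    publish("tokenized", tweet.lower().split())


@subscribe("tokenized")
def filterer(words):
    # Drop uninteresting words, then emit a filtered event.
    publish("filtered", [w for w in words if w not in STOP_WORDS])


@subscribe("filtered")
def counter(words):
    # Final stage: aggregate per-word counts.
    counts.update(words)


publish("raw", "The grid is fast")
publish("raw", "Storm is fast")
```

Dispatch here is synchronous and single-process. In a distributed deployment each stage would typically consume from a queue or stream instead, but the stage boundaries stay the same.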
Twitter Storm
Twitter Storm With Hadoop
Storm Overview
Storm Concepts
Streams: Unbounded sequence of tuples
Spouts: Source of streams (Queues)
Bolts: Functions, Filters, Joins, Aggregations
Topologies
Challenge – Word Count
Tweets → Count? → Word:Count
• Hottest topics
• URL mentions
• etc.
Streaming word count with Storm
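The original slide's code is not present in this transcript. As a stand-in, here is a minimal Python sketch of the same idea using generators that mimic the spout, bolt, and topology roles. It does not use Storm's actual API, and the function names (`tweet_spout`, `split_bolt`, `count_bolt`, `run_topology`) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def tweet_spout(tweets: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: source of the stream; emits one tuple per tweet."""
    for tweet in tweets:
        yield (tweet,)


def split_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: splits each tweet tuple into one tuple per word."""
    for (tweet,) in stream:
        for word in tweet.lower().split():
            yield (word,)


def count_bolt(stream: Iterator[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps running counts and emits (word, count) after each update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(tweets: Iterable[str]) -> Dict[str, int]:
    """Topology: wire spout -> split bolt -> count bolt and drain the stream."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(tweet_spout(tweets))):
        latest[word] = count
    return latest


result = run_topology(["storm counts words", "storm scales"])
```

In a real Storm topology the stream is unbounded and the counting bolt runs as many parallel tasks, with tuples grouped by word so each word is always counted by the same task. This sketch runs everything in one process on a finite list.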
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it.
• Storm optimizes IO to slow persistence (e.g. databases) using batching.
• Storm processes streams. The stream provider itself needs to support persistence, batching, and reliability.
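To illustrate why batching helps with slow persistence, the sketch below buffers counter updates in memory and hands them to the store in groups. It is a generic illustration, not Storm's own batching mechanism. `BatchingWriter`, its `batch_size` threshold, and the list standing in for a database are all assumptions.

```python
from typing import Callable, Dict, List


class BatchingWriter:
    """Buffers counter updates and writes them to a slow store in batches."""

    def __init__(self, write_batch: Callable[[Dict[str, int]], None],
                 batch_size: int = 3) -> None:
        self.write_batch = write_batch      # one call = one round trip to the store
        self.batch_size = batch_size        # hypothetical flush threshold
        self.pending: Dict[str, int] = {}
        self.updates_buffered = 0

    def add(self, word: str, delta: int = 1) -> None:
        # Merge updates to the same key so one write carries many increments.
        self.pending[word] = self.pending.get(word, 0) + delta
        self.updates_buffered += 1
        if self.updates_buffered >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered, then start a fresh buffer.
        if self.pending:
            self.write_batch(self.pending)
        self.pending = {}
        self.updates_buffered = 0


store_writes: List[Dict[str, int]] = []   # stands in for a database
writer = BatchingWriter(store_writes.append, batch_size=3)
for w in ["storm", "storm", "grid", "storm"]:
    writer.add(w)
writer.flush()  # push the final partial batch
```

Four updates reach the store in two writes. The trade-off is that anything still buffered is lost if the process dies before a flush, which is presumably why the slide asks the stream provider to supply reliability.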
Tweets, events, whatever…
XAP Real Time Analytics
Two Layer Approach
• Advantage: Minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in memory cache for interactive requests.
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
[Diagram labels: In Memory Compute Cluster, NoSQL Cluster, Raw Event Stream (×3), Raw And Derived Events, Reporting Engine, SCALE]
Simplified Architecture
• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
Key Concepts
Keep Things In Memory
Facebook keeps 80% of its data in Memory (Stanford research)
RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
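To put these latencies in terms of throughput, the sketch below converts each quoted random-access latency into the number of sequential lookups a single device could serve per second. The helper name `lookups_per_second` is illustrative, and the figures are the slide's own order-of-magnitude estimates.

```python
# Random-access latencies as quoted on the slide, in milliseconds.
DISK_SEEK_MS = (5.0, 10.0)   # "Disk: 5-10 ms"
RAM_ACCESS_MS = 0.001        # "RAM: ~0.001 ms"


def lookups_per_second(latency_ms: float) -> float:
    """Sequential random lookups one device can serve per second."""
    return 1000.0 / latency_ms


# Roughly 200 and 100 lookups/sec for the fast and slow ends of the disk range.
disk_rates = [lookups_per_second(ms) for ms in DISK_SEEK_MS]

# Roughly one million lookups/sec for RAM.
ram_rate = lookups_per_second(RAM_ACCESS_MS)

# Speed-up implied by the quoted latencies alone.
ratios = [ms / RAM_ACCESS_MS for ms in DISK_SEEK_MS]

print(disk_rates, ram_rate, ratios)
```

A single disk doing one random seek per counter update would manage on the order of 100 to 200 updates per second, far below the tens of thousands of tweets per second quoted earlier. Taken at face value, the quoted latencies imply a gap of several thousand times, somewhat above the headline range, so both are best read as order-of-magnitude figures.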
Take Aways
A data grid can serve different needs for big data analytics:
• Supercharge a dedicated stream processing cluster like Storm.
  – Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
  – Roll your own
• Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS
References
Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
Twitter Storm: http://storm-project.net
XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/