Real Time Analytics for Big Data: A Twitter Case Study

Date post: 15-May-2015
Upload: nati-shalom
Description:
Learn how to build a Twitter-like analytics system, designed to meet real-time needs, in a simple way, using frameworks such as Spring Social, an active in-memory data grid for Big Data event processing, and a NoSQL database. Hadoop's batch-oriented processing is sufficient for many use cases, especially where data reporting doesn't need to be up to the minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with conditions that change in real time, such as finance and advertising. In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for streaming or real-time analysis and processing has emerged to handle the needs of large-scale location-aware mobile, social, and sensor use. Do we want to limit ourselves to just these use cases? Facebook, Twitter, and Google have been pioneers in that arena and recently launched new analytics services designed to meet real-time needs. In this session we will review the common patterns and architecture that drive these platforms and learn how to build a Twitter-like analytics system in a simple way using frameworks such as Spring Social, an active in-memory data grid for Big Data event processing, and a NoSQL database such as Cassandra or HBase for managing the historical data.
Transcript
Page 1: Real Time Analytics for Big Data: A Twitter Case Study

Real Time Analytics for Big Data: A Twitter Inspired Case Study

@natishalom

Page 2

Big Data Predictions

® Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 3

The Two Vs of Big Data

Velocity, Volume

Page 4

We’re Living in a Real Time World…

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services

Page 5

The Flavors of Big Data Analytics

Counting, Correlating, Research

Page 6

Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics: countries and cities, gender, age groups, device types, …

Page 7

Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?

Page 8

Analytics @ Twitter – Research

Sentiment analysis: “Obama is popular”

Trends: “People like to tweet after watching American Idol”

Spam patterns: How can you tell when a user spams?

Page 9

It’s All about Timing

“Real time” (< a few seconds)

Reasonably quick (seconds to minutes)

Batch (hours/days)

Page 10

It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

This is what we’re here to discuss

Page 11

Challenge – Word Count

Word:Count

Tweets

Count?

• Hottest topics
• URL mentions
• etc.
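
The word-count challenge on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not the implementation from the talk; the sample tweets, the token pattern, and the `count_words` helper are assumptions made up for the example.

```python
import re
from collections import Counter

# Assumed token pattern: match URLs first so they stay whole,
# then hashtags, mentions, and plain words.
TOKEN_PATTERN = re.compile(r"https?://\S+|[#@]?\w+")


def count_words(tweets):
    """Return a Counter mapping each lower-cased token to its occurrence count."""
    counts = Counter()
    for tweet in tweets:
        counts.update(token.lower() for token in TOKEN_PATTERN.findall(tweet))
    return counts


# Made-up sample input, for illustration only.
sample_tweets = [
    "Big data is big",
    "Real time analytics for big data http://example.com/talk",
]

word_counts = count_words(sample_tweets)
print(word_counts.most_common(3))  # candidates for the "hottest topics"
```

Because URLs are kept as single tokens, the same counter covers both the hottest-topics and the URL-mentions cases listed above.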

Page 12

URL Mentions – Here’s One Use Case

Page 13

Twitter in Numbers (March 2011)

It takes a week for users to send 1 billion Tweets.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 14

Twitter in Numbers (March 2011)

On average, 140 million tweets get sent every day.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 15

Twitter in Numbers (March 2011)

The highest throughput to date is 6,939 tweets/sec.

Source: http://blog.twitter.com/2011/03/numbers.html
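
To relate the weekly, daily, and per-second figures quoted on these slides, here is a small back-of-the-envelope calculation in Python. It uses only the numbers stated above; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the "Twitter in Numbers (March 2011)" slides.
tweets_per_week = 1_000_000_000   # "It takes a week for users to send 1 billion Tweets"
tweets_per_day = 140_000_000      # "140 million tweets get sent every day"
peak_per_sec = 6_939              # "The highest throughput to date"

# Average sustained rates implied by the daily and weekly figures.
avg_per_sec_daily = tweets_per_day / SECONDS_PER_DAY
avg_per_sec_weekly = tweets_per_week / (7 * SECONDS_PER_DAY)

# How far the recorded peak sits above the daily average.
peak_to_avg = peak_per_sec / avg_per_sec_daily

print(f"average (daily figure):  {avg_per_sec_daily:,.0f} tweets/sec")
print(f"average (weekly figure): {avg_per_sec_weekly:,.0f} tweets/sec")
print(f"peak / average:          {peak_to_avg:.1f}x")
```

Both averages come out at roughly 1,600 tweets per second, and the recorded peak is a little over four times that, so a system sized only for the average rate would fall behind during bursts.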

Page 16

Twitter in Numbers (March 2011)

460,000 new accounts are created daily.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 17

Twitter in Numbers

5% of the users generate 75% of the content.

Source: http://www.sysomos.com/insidetwitter/

Page 18

Analyze the Problem

• (Tens of) thousands of tweets per second to process
• Assumption: need to process in near real time
• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant

Page 19

Key Elements in Real Time Big Data Analytics

Page 20

Sharding (Partitioning)

[Diagram: Tokenizers 1 to n, Filterers 1 to n, and Counter Updaters 1 to n]
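
One way to read the sharding diagram is that each word is always routed to the same counter updater, so no two partitions ever hold a counter for the same word. The Python sketch below illustrates that idea under stated assumptions: the partition count, the CRC32-based routing, and the `CounterUpdater` class are hypothetical and are not taken from the talk.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # assumed value; the slide only says "n"


def partition_for(word, num_partitions=NUM_PARTITIONS):
    """Map a word to a partition index.

    CRC32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings, which is randomized per process.
    """
    return zlib.crc32(word.encode("utf-8")) % num_partitions


class CounterUpdater:
    """Hypothetical stand-in for one 'Counter Updater' box in the diagram."""

    def __init__(self):
        self.counts = Counter()

    def update(self, word):
        self.counts[word] += 1


updaters = [CounterUpdater() for _ in range(NUM_PARTITIONS)]


def route(words):
    """Send each word to the updater that owns its partition."""
    for word in words:
        updaters[partition_for(word)].update(word)


route("big data big analytics data big".split())

# Partitions are disjoint by word, so the global view is a plain sum.
total = sum((u.counts for u in updaters), Counter())
print(total)
```

Because a word maps to exactly one partition, each updater can increment its counters without coordinating with the others, which is what lets throughput grow as partitions are added.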

Page 21

Keep Things In Memory

Facebook keeps 80% of its data in memory (Stanford research)

RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Page 22

Use EDA (Event Driven Architecture)
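
As a rough illustration of the event-driven style, the Python sketch below chains the tokenize, filter, and count stages named on the sharding slide through a tiny in-process publish/subscribe bus. Everything here is an assumption for the example: the `EventBus` class, the event names, and the stop-word list are hypothetical, and dispatch is synchronous, whereas a production system would typically rely on asynchronous messaging or a data grid.

```python
from collections import Counter, defaultdict


class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Each handler reacts to the event; none is called directly by its producer.
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()
counts = Counter()
STOP_WORDS = {"a", "the", "is", "for"}  # assumed filter list


def tokenize(tweet):
    """React to a raw tweet by emitting its tokens."""
    bus.publish("tokens", tweet.lower().split())


def filter_tokens(tokens):
    """React to tokens by dropping stop words and emitting the rest."""
    bus.publish("filtered", [t for t in tokens if t not in STOP_WORDS])


def update_counters(tokens):
    """React to filtered tokens by incrementing their counters."""
    counts.update(tokens)


bus.subscribe("tweet", tokenize)
bus.subscribe("tokens", filter_tokens)
bus.subscribe("filtered", update_counters)

bus.publish("tweet", "Big data is the big topic")
print(counts)
```

Each stage knows only the event it consumes and the event it emits, so stages can be added, replaced, or scaled out without changing their neighbours.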

Page 23

Putting it all together

Page 24

Know Your Toolset

Page 25

References

• Writing your own Twitter analytics: http://ht.ly/d8j4I
• Detailed blog post: http://bit.ly/gs-bigdata-analytics
• Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
• Twitter Storm: http://bit.ly/twitter-storm
• Apache S4: http://incubator.apache.org/s4/
