Transcript
Page 1: Real Time Analytics for Big Data a Twitter Case Study

Real Time Analytics for Big Data: A Twitter Inspired Case Study

@natishalom

Page 2: Real Time Analytics for Big Data a Twitter Case Study

Big Data Predictions

© Copyright 2011 Gigaspaces Ltd. All Rights Reserved

Page 3: Real Time Analytics for Big Data a Twitter Case Study


The Two Vs of Big Data

Velocity

Volume

Page 4: Real Time Analytics for Big Data a Twitter Case Study

We’re Living in a Real Time World…

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services


Page 5: Real Time Analytics for Big Data a Twitter Case Study

The Flavors of Big Data Analytics

Counting

Correlating

Research


Page 6: Real Time Analytics for Big Data a Twitter Case Study

Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics: countries and cities, gender, age groups, device types, …


Page 7: Real Time Analytics for Big Data a Twitter Case Study

Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?


Page 8: Real Time Analytics for Big Data a Twitter Case Study

Analytics @ Twitter – Research

Sentiment analysis: “Obama is popular”

Trends: “People like to tweet after watching American Idol”

Spam patterns: How can you tell when a user spams?


Page 9: Real Time Analytics for Big Data a Twitter Case Study

It’s All about Timing

“Real time” (< a few seconds)

Reasonably Quick (seconds - minutes)

Batch (hours/days)


Page 10: Real Time Analytics for Big Data a Twitter Case Study

It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)


This is what we’re here to discuss


Page 11: Real Time Analytics for Big Data a Twitter Case Study

Challenge – Word Count

Word:Count

Tweets

Count??

• Hottest topics
• URL mentions
• etc.
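
The word-count challenge above can be illustrated with a minimal single-process sketch. It is not the implementation from the talk: the token pattern, the helper names `count_tokens` and `hottest`, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical token pattern: match URLs first, then hashtags, mentions, and plain words.
TOKEN = re.compile(r"https?://\S+|[#@]?\w+")

def count_tokens(tweets):
    """Return a Counter mapping each token to the number of times it was mentioned."""
    counts = Counter()
    for text in tweets:
        for tok in TOKEN.findall(text):
            # URLs can be case-sensitive, so only non-URL tokens are lowercased.
            counts[tok if tok.startswith("http") else tok.lower()] += 1
    return counts

def hottest(counts, n=3, urls_only=False):
    """Top-n tokens by count (ties broken alphabetically), optionally URLs only."""
    items = ((t, c) for t, c in counts.items()
             if not urls_only or t.startswith("http"))
    return sorted(items, key=lambda tc: (-tc[1], tc[0]))[:n]

# Made-up sample tweets, for illustration only.
sample = [
    "Big data talk at http://example.com/a #bigdata",
    "Loving the #BigData session http://example.com/a",
    "real time analytics http://example.com/b",
]
counts = count_tokens(sample)
print(hottest(counts, n=1, urls_only=True))
```

A single in-process counter like this only shows the shape of the Word:Count output; it does not address throughput or fault tolerance.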

Page 12: Real Time Analytics for Big Data a Twitter Case Study


URL Mentions – Here’s One Use Case

Page 13: Real Time Analytics for Big Data a Twitter Case Study

It takes a week for users to send 1 billion Tweets.


Twitter in Numbers (March 2011)

Source: http://blog.twitter.com/2011/03/numbers.html

Page 14: Real Time Analytics for Big Data a Twitter Case Study

On average, 140 million tweets get sent every day.


Twitter in Numbers (March 2011)

Source: http://blog.twitter.com/2011/03/numbers.html

Page 15: Real Time Analytics for Big Data a Twitter Case Study

The highest throughput to date is 6,939 tweets/sec.

Twitter in Numbers (March 2011)

Source: http://blog.twitter.com/2011/03/numbers.html
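
To relate the figures quoted so far (1 billion tweets per week, 140 million per day, a peak of 6,939 tweets/sec), here is a small worked calculation. The variable names are illustrative; the only inputs are the numbers from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_week = 1_000_000_000   # "It takes a week ... to send 1 billion Tweets"
tweets_per_day = 140_000_000      # "140 million tweets get sent every day"
peak_per_sec = 6_939              # "highest throughput to date"

# Daily volume implied by the weekly figure.
daily_from_weekly = tweets_per_week / 7
# Average per-second rate implied by the daily figure.
avg_per_sec = tweets_per_day / SECONDS_PER_DAY
# How far the recorded peak sits above the average rate.
peak_to_avg = peak_per_sec / avg_per_sec

print(f"{daily_from_weekly:,.0f} tweets/day implied by 1 billion per week")
print(f"{avg_per_sec:,.0f} tweets/sec on average")
print(f"peak is about {peak_to_avg:.1f}x the average rate")
```

The weekly and daily figures agree to within a few percent, and the peak is roughly four times the average, which suggests that a system sized only for the average rate would fall behind during bursts.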

Page 16: Real Time Analytics for Big Data a Twitter Case Study

460,000 new accounts are created daily.


Twitter in Numbers (March 2011)

Source: http://blog.twitter.com/2011/03/numbers.html

Page 17: Real Time Analytics for Big Data a Twitter Case Study

5% of the users generate 75% of the content.


Twitter in Numbers

Source: http://www.sysomos.com/insidetwitter/

Page 18: Real Time Analytics for Big Data a Twitter Case Study

• (Tens of) thousands of tweets per second to process
• Assumption: need to process in near real time

• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)

• System needs to scale linearly
• System needs to be fault tolerant

Analyze the Problem


Page 19: Real Time Analytics for Big Data a Twitter Case Study

Key Elements in Real Time Big Data Analytics


Page 20: Real Time Analytics for Big Data a Twitter Case Study

Sharding (Partitioning)

[Diagram: Tokenizer 1 … n, Filterer 1 … n, Counter Updater 1 … n]
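
The sharding slide names three stages (tokenizer, filterer, counter updater), each replicated 1 … n. A minimal sketch of one way to partition the counters follows. The slide does not give the routing rule, so hashing each word to a partition is an assumption, as are the partition count, the stop-word list, and every function name below. Tokenizing and filtering run inline here for brevity; only the counters are sharded.

```python
import re
import zlib
from collections import Counter

N_PARTITIONS = 4                                   # hypothetical partition count
STOP_WORDS = {"a", "the", "to", "of", "in", "is"}  # hypothetical filter list

def tokenize(tweet):
    """Tokenizer stage: split a tweet into lowercase words."""
    return re.findall(r"\w+", tweet.lower())

def keep(word):
    """Filterer stage: drop words that are not worth counting."""
    return word not in STOP_WORDS

def partition_of(word, n=N_PARTITIONS):
    """Route a word to a partition using a stable hash (CRC32)."""
    return zlib.crc32(word.encode("utf-8")) % n

# One counter per partition; each stands in for a "Counter Updater".
counters = [Counter() for _ in range(N_PARTITIONS)]

def process(tweet):
    """Tokenize, filter, then update the counter owned by each word's partition."""
    for word in filter(keep, tokenize(tweet)):
        counters[partition_of(word)][word] += 1

def lookup(word):
    """Read a word's count from the single partition that owns it."""
    return counters[partition_of(word)][word]

process("Big data is the new oil")
process("Real time big data analytics")
print(lookup("data"))
```

Because a given word always hashes to the same partition, each counter is updated by exactly one owner, which is what lets partitions be added without coordinating updates across them.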

Page 21: Real Time Analytics for Big Data a Twitter Case Study

Keep Things In Memory

Facebook keeps 80% of its data in Memory (Stanford research)

RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Page 22: Real Time Analytics for Big Data a Twitter Case Study


Use EDA (Event Driven Architecture)
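
The slide gives only the heading, so the following is a generic sketch of the event-driven style rather than the architecture shown in the talk. The `EventBus` class, the event names `"tweet"` and `"word"`, and the handlers are all hypothetical.

```python
from collections import defaultdict, deque

class EventBus:
    """Tiny in-process event bus: handlers react to events as they arrive."""

    def __init__(self):
        self._handlers = defaultdict(list)  # event type -> list of callbacks
        self._queue = deque()               # pending (event type, payload) pairs

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self._queue.append((event_type, payload))

    def run(self):
        """Dispatch queued events until none remain; handlers may publish more."""
        while self._queue:
            event_type, payload = self._queue.popleft()
            for handler in self._handlers[event_type]:
                handler(payload)

bus = EventBus()
counts = defaultdict(int)

def on_tweet(text):
    # A tweet event fans out into one word event per token.
    for word in text.lower().split():
        bus.publish("word", word)

def on_word(word):
    # Each word event increments that word's counter.
    counts[word] += 1

bus.subscribe("tweet", on_tweet)
bus.subscribe("word", on_word)
bus.publish("tweet", "real time big data")
bus.publish("tweet", "big data analytics")
bus.run()
print(dict(counts))
```

Each stage only reacts to the events it subscribes to, so stages can be added or replaced without changing the publishers.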

Page 23: Real Time Analytics for Big Data a Twitter Case Study


Putting it all together

Page 24: Real Time Analytics for Big Data a Twitter Case Study


Know Your Toolset

Page 25: Real Time Analytics for Big Data a Twitter Case Study

Writing your own Twitter analytics: http://ht.ly/d8j4I

Detailed blog post: http://bit.ly/gs-bigdata-analytics

Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html

Twitter Storm: http://bit.ly/twitter-storm

Apache S4: http://incubator.apache.org/s4/


References

Page 26: Real Time Analytics for Big Data a Twitter Case Study


