
Bigdata analytics-twitter

Date post: 27-Jun-2015
Upload: dfilppi
Description:
Describes how to build real-time big data analytics by combining in-memory computing technology with big data storage technology. Twitter analytics serves as the entry point and example. The deck then describes how Storm functions, the combination of a data grid with Storm for ultimate performance, and a real-world production system for real-time big data analytics that combines GigaSpaces XAP, Apache Cassandra (DataStax), and Apache Hadoop (Cloudera).
Transcript
Page 1: Bigdata analytics-twitter

Real Time Analytics for Big Data – Lessons from Twitter (and beyond)..

DeWayne Filppi @dfilppi

Page 2: Bigdata analytics-twitter

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

® Copyright 2013 Gigaspaces Ltd. All Rights Reserved

Page 3: Bigdata analytics-twitter

The Two Vs of Big Data

• Velocity
• Volume

Page 4: Bigdata analytics-twitter

We’re Living in a Real Time World…

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services


Page 5: Bigdata analytics-twitter

The Flavors of Big Data Analytics

• Counting
• Correlating
• Research

Page 6: Bigdata analytics-twitter

Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics: countries and cities, gender, age groups, device types, …

Page 7: Bigdata analytics-twitter

Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?


Page 8: Bigdata analytics-twitter

Analytics @ Twitter – Research

Sentiment analysis: “Obama is popular”

Trends: “People like to tweet after watching American Idol”

Spam patterns: How can you tell when a user spams?

Page 9: Bigdata analytics-twitter

It’s All about Timing

“Real time” (< a few seconds)

Reasonably Quick (seconds - minutes)

Batch (hours/days)


Page 10: Bigdata analytics-twitter

It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

This is what we’re here to discuss

Page 11: Bigdata analytics-twitter

TWITTER REAL-TIME ANALYTICS SYSTEM

Page 12: Bigdata analytics-twitter

Twitter in Numbers (Jan 2013)

It takes a week for users to send 3 billion Tweets.

Source: http://blog.twitter.com/2011/03/numbers.html

Page 13: Bigdata analytics-twitter

Twitter in Numbers (Jan 2013)

On average, 500 million tweets get sent every day.

Source: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/l

Page 14: Bigdata analytics-twitter

Twitter in Numbers (Jan 2013)

The highest throughput to date is 33,388 tweets/sec.

Source: http://www.huffingtonpost.com/2013/01/02/tweets-per-second-record_n_2396915.html

Page 15: Bigdata analytics-twitter

Twitter in Numbers (March 2011)

1,000,000 new accounts are created daily.

Source: http://www.mediabistro.com/alltwitter/50-twitter-fun-facts_b33589

Page 16: Bigdata analytics-twitter

Twitter in Numbers

5% of the users generate 75% of the content.

Source: http://www.sysomos.com/insidetwitter/

Page 17: Bigdata analytics-twitter

Analyze the Problem

• (Tens of) thousands of tweets per second to process
• Assumption: need to process in near real time
• Aggregate counters for each word
• A few tens of thousands of words (or hundreds of thousands if we include URLs)
• System needs to scale linearly
• System needs to be fault tolerant
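The sizing on this slide can be sanity-checked with back-of-the-envelope arithmetic, using the figures quoted on the earlier "Twitter in Numbers" slides (500 million tweets per day and a record of 33,388 tweets/sec). This is a minimal sketch; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 500_000_000   # average daily volume quoted in the deck
peak_tweets_per_sec = 33_388   # record throughput quoted in the deck

# Average rate the system must sustain around the clock.
avg_tweets_per_sec = tweets_per_day / SECONDS_PER_DAY

# Headroom needed to absorb the record burst.
peak_to_avg = peak_tweets_per_sec / avg_tweets_per_sec

print(f"average: {avg_tweets_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
```

The average works out to roughly 5,800 tweets/sec and the record peak to a little under six times that, which is consistent with "(tens of) thousands of tweets per second".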

Page 18: Bigdata analytics-twitter

Key Elements in Real Time Big Data Analytics


Page 19: Bigdata analytics-twitter

Sharding (Partitioning)

Tokenizer 1 → Filterer 1 → Counter Updater 1
Tokenizer 2 → Filterer 2 → Counter Updater 2
Tokenizer 3 → Filterer 3 → Counter Updater 3
Tokenizer n → Filterer n → Counter Updater n
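One common way to shard a word-count workload is to route each word to a partition by a stable hash, so that every occurrence of the same word lands on the same counter updater. The sketch below assumes hash-based routing with CRC32 and four partitions; the deck does not specify the routing rule, and `partition_for` and `shard` are hypothetical helper names.

```python
import zlib

NUM_PARTITIONS = 4  # assumed value for "n" on the slide


def partition_for(word: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a word to a partition index using a stable hash (CRC32)."""
    return zlib.crc32(word.encode("utf-8")) % num_partitions


def shard(words, num_partitions: int = NUM_PARTITIONS):
    """Group words by the partition that owns their counter."""
    shards = [[] for _ in range(num_partitions)]
    for word in words:
        shards[partition_for(word, num_partitions)].append(word)
    return shards


words = "storm hadoop storm xap cassandra storm hadoop".split()
shards = shard(words)
for index, bucket in enumerate(shards):
    print(index, bucket)
```

Because the hash is deterministic, a counter for a given word is only ever updated in one partition, so partitions do not need to coordinate with each other.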

Page 20: Bigdata analytics-twitter

Use EDA (Event Driven Architecture)

Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter/Aggregator
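The event-driven pipeline can be sketched as handlers that subscribe to one event type and publish the next. This is a minimal in-process illustration, not the deck's implementation: the `subscribe`/`publish` helpers, the event type names, and the stop-word filter rule are all assumptions, and dispatch here is synchronous, whereas in a distributed system the stages would typically run separately and exchange events through queues or a data grid.

```python
from collections import Counter, defaultdict

handlers = defaultdict(list)            # event type -> subscribed callbacks
counts = Counter()                      # state owned by the counter stage
STOP_WORDS = {"the", "a", "is", "to"}   # assumed filter rule


def subscribe(event_type, handler):
    handlers[event_type].append(handler)


def publish(event_type, payload):
    for handler in handlers[event_type]:
        handler(payload)


def tokenizer(tweet):
    """Raw tweet in, one 'tokenized' event per word out."""
    for token in tweet.lower().split():
        publish("tokenized", token)


def filterer(token):
    """Drop uninteresting tokens, pass the rest on as 'filtered'."""
    if token not in STOP_WORDS:
        publish("filtered", token)


def counter(token):
    """Aggregate a running count per word."""
    counts[token] += 1


subscribe("raw", tokenizer)
subscribe("tokenized", filterer)
subscribe("filtered", counter)

publish("raw", "Storm is a stream processor")
publish("raw", "the storm is coming")
print(counts.most_common(3))
```

Each stage knows only the event type it consumes and the one it emits, which is what allows stages to be scaled or replaced independently.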

Page 21: Bigdata analytics-twitter

Twitter Storm


Page 22: Bigdata analytics-twitter


Twitter Storm With Hadoop

Page 23: Bigdata analytics-twitter


Storm Overview

Page 24: Bigdata analytics-twitter

Storm Concepts

• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies
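The four concepts can be modeled in plain Python to show how they fit together. This sketch does not use the Storm API: a generator stands in for a spout, generator functions stand in for bolts, and function composition stands in for a topology. All names are illustrative.

```python
from itertools import islice


def sentence_spout():
    """Spout: an unbounded source of tuples (here, an endless cycle)."""
    sentences = ["big data", "real time big data"]
    while True:
        for sentence in sentences:
            yield (sentence,)


def split_bolt(stream):
    """Bolt: a function that turns each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: a running aggregation that emits (word, count) tuples."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def build_topology():
    """Topology: the wiring of the spout into the chain of bolts."""
    return count_bolt(split_bolt(sentence_spout()))


# The stream never ends, so take only the first few output tuples.
first_six = list(islice(build_topology(), 6))
print(first_six)
```

The stream is unbounded, so results are consumed incrementally; nothing in the pipeline waits for the input to finish.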

Page 25: Bigdata analytics-twitter

Challenge – Word Count

Tweets → Count? → Word:Count

• Hottest topics
• URL mentions
• etc.

Page 26: Bigdata analytics-twitter


Streaming word count with Storm
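The slide's diagram is not recoverable from the transcript. As a stand-in, here is a minimal streaming word count in plain Python (Storm itself is not used) that keeps counts over the most recent tweets only, so the "hottest topics" reflect current activity. The window-by-tweet-count policy and the `WindowedWordCount` class are assumptions made for illustration.

```python
from collections import Counter, deque


class WindowedWordCount:
    """Counts words over the most recent `window_size` tweets."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()     # tokenized tweets currently in the window
        self.counts = Counter()   # word -> occurrences inside the window

    def add(self, tweet):
        words = tweet.lower().split()
        self.window.append(words)
        self.counts.update(words)
        if len(self.window) > self.window_size:
            # Retire the oldest tweet and undo its contribution.
            for word in self.window.popleft():
                self.counts[word] -= 1
                if self.counts[word] == 0:
                    del self.counts[word]

    def hottest(self, k):
        return self.counts.most_common(k)


wc = WindowedWordCount(window_size=2)
for tweet in ["storm storm hadoop", "storm xap", "xap cassandra"]:
    wc.add(tweet)
print(wc.hottest(1))
```

After the third tweet the first one has left the window, so "xap" overtakes "storm" as the hottest word.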

Page 27: Bigdata analytics-twitter

Supercharging Storm

• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching.
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.

Tweets, events, whatever…
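The batching idea can be illustrated without Storm: accumulate counter updates in memory and write them to the slow store in groups, so many updates cost one round trip. This is a generic sketch, not Storm's mechanism; `BatchingWriter` is a hypothetical class and a plain dict stands in for the database.

```python
class BatchingWriter:
    """Buffers counter increments and writes them to a slow store in batches."""

    def __init__(self, store, batch_size):
        self.store = store            # stand-in for a database (a dict here)
        self.batch_size = batch_size  # distinct keys to buffer before flushing
        self.pending = {}             # word -> buffered increment
        self.flushes = 0              # number of round trips to the store

    def increment(self, word, amount=1):
        self.pending[word] = self.pending.get(word, 0) + amount
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        for word, delta in self.pending.items():
            self.store[word] = self.store.get(word, 0) + delta
        self.pending.clear()
        self.flushes += 1


store = {}
writer = BatchingWriter(store, batch_size=2)
for word in ["storm", "storm", "xap", "storm"]:
    writer.increment(word)
writer.flush()  # write whatever is still buffered
print(store, writer.flushes)
```

Four updates reach the store in two writes. The trade-off is that buffered updates are lost if the process fails before a flush, which is why the slide asks for a stream provider that supports persistence and reliability.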

Page 28: Bigdata analytics-twitter

XAP Real Time Analytics


Page 29: Bigdata analytics-twitter

Two Layer Approach

• Advantage: minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests.
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.

[Diagram labels: Raw Event Stream (×3), In Memory Compute Cluster, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE (×2)]
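The grid layer's role as an in-memory cache in front of the NoSQL cluster can be sketched as a read-through cache with a bounded size. This is an illustration under assumptions, not GigaSpaces XAP's API: `GridCache` is a hypothetical class, a dict stands in for the NoSQL store, and least-recently-used eviction is an assumed policy standing in for "limited to allocated memory".

```python
from collections import OrderedDict


class GridCache:
    """Bounded in-memory layer that reads through to a slower store."""

    def __init__(self, backing_store, capacity):
        self.backing_store = backing_store  # stand-in for the NoSQL cluster
        self.capacity = capacity            # stand-in for allocated memory
        self.entries = OrderedDict()        # key -> value, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)   # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.backing_store[key]     # slow path: go to the store
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value


nosql = {"storm": 3, "xap": 1, "hadoop": 2}
grid = GridCache(nosql, capacity=2)
for key in ["storm", "storm", "xap", "hadoop"]:
    grid.get(key)
print(grid.hits, grid.misses, list(grid.entries))
```

Repeated interactive requests for the same key are served from memory, and only keys outside the working set fall through to the slower layer.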

Page 30: Bigdata analytics-twitter


Simplified Architecture

Page 31: Bigdata analytics-twitter

Key Concepts

• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
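"Raw events flushed, aggregations/derivations retained" can be sketched as follows: each event updates an aggregate as it passes through memory, and the raw events are periodically moved out to the long-term store while the aggregates stay resident. The `AggregatingGrid` class, the `country` field, and the list used as an archive are all hypothetical.

```python
class AggregatingGrid:
    """Holds raw events briefly; keeps only derived aggregates in memory."""

    def __init__(self, archive):
        self.archive = archive   # stand-in for the NoSQL / batch store
        self.raw_events = []     # transient: flushed periodically
        self.aggregates = {}     # retained: country -> event count

    def on_event(self, event):
        # The side effect of the event flowing through memory.
        self.raw_events.append(event)
        key = event["country"]
        self.aggregates[key] = self.aggregates.get(key, 0) + 1

    def flush_raw(self):
        """Move raw events to the archive and free the memory they used."""
        self.archive.extend(self.raw_events)
        self.raw_events.clear()


archive = []
grid = AggregatingGrid(archive)
for country in ["US", "IL", "US"]:
    grid.on_event({"country": country, "text": "..."})
grid.flush_raw()
print(grid.aggregates, len(grid.raw_events), len(archive))
```

Memory use then grows with the number of distinct aggregate keys rather than with the number of events, while the full event history remains available to batch jobs.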

Page 32: Bigdata analytics-twitter

Keep Things In Memory

• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
  – Disk: 5-10 ms
  – RAM: ~0.001 ms
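To make the latency figures concrete, the sketch below takes the two numbers on this slide at face value (5-10 ms per disk seek, about 0.001 ms per RAM access) and works out the per-access ratio and the time for one million random reads. The workload size is an arbitrary illustration. Note that these two figures imply a ratio above the quoted 100-1000x headline range.

```python
DISK_SEEK_US = (5_000, 10_000)  # 5-10 ms per random seek, from the slide
RAM_ACCESS_US = 1               # ~0.001 ms per access, from the slide

# Per-access ratio implied by the two figures.
ratio_low = DISK_SEEK_US[0] // RAM_ACCESS_US
ratio_high = DISK_SEEK_US[1] // RAM_ACCESS_US
print(f"disk is {ratio_low:,}x to {ratio_high:,}x slower per random access")

# Time for an illustrative workload of one million random reads.
reads = 1_000_000
ram_seconds = reads * RAM_ACCESS_US / 1_000_000
disk_seconds = tuple(reads * us / 1_000_000 for us in DISK_SEEK_US)
print(f"RAM:  {ram_seconds:.0f} s")
print(f"disk: {disk_seconds[0]:,.0f} s to {disk_seconds[1]:,.0f} s")
```

At these figures, a workload that takes about a second from memory takes between roughly 1.4 and 2.8 hours of pure seek time from disk, which is the motivation for keeping the working set in memory.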

Page 33: Bigdata analytics-twitter

Take Aways

A data grid can serve different needs for big data analytics:

• Supercharge a dedicated stream processing cluster like Storm.
  – Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
  – Roll your own
• Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS

Page 34: Bigdata analytics-twitter

References

• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/

Page 35: Bigdata analytics-twitter
