
Analyzing twitter data with hadoop

Date post: 27-Jan-2015
Upload: open-analytics
Description:
Cloudera's OA
Page 1: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop
Open Analytics Summit, March 2013
Joey Echeverria | Principal Solutions Architect
[email protected] | @fwiffo

©2012 Cloudera, Inc.

Page 2: Analyzing twitter data with hadoop

About Joey

• Principal Solutions Architect
• 18 months
• 4+ years
• Local

Page 3: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

BUILDING A BIG DATA SOLUTION

Page 4: Analyzing twitter data with hadoop

Big Data

• Big
  • Larger volume than you’ve handled before
  • No litmus test
  • High value, underutilized
• Data
  • Structured
  • Unstructured
  • Semi-structured
• Hadoop
  • Distributed file system
  • Distributed, batch computation

Page 5: Analyzing twitter data with hadoop

Data Management Systems

Diagram: Data Source → Data Ingestion → Data Storage → Data Processing

Page 6: Analyzing twitter data with hadoop

Relational Data Management Systems

Diagram: Data Source → ETL → RDBMS → Reporting

Page 7: Analyzing twitter data with hadoop

A Canonical Hadoop Architecture

Diagram: Data Source → Flume → HDFS → Hive (Impala)

Page 8: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

AN EXAMPLE USE CASE

Page 9: Analyzing twitter data with hadoop

Analyzing Twitter

• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Who is influential?
  • Tweets
  • Followers
  • Retweets
    • Similar to e-mail forwarding
• Which Twitter user gets the most retweets?
• Who is influential in our industry?

Page 10: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

HOW DO WE ANSWER THESE QUESTIONS?

Page 11: Analyzing twitter data with hadoop

Techniques

• SQL
  • Filtering
  • Aggregation
  • Sorting
• Complex data
  • Deeply nested
  • Variable schema

Page 12: Analyzing twitter data with hadoop

Architecture

Diagram: Twitter → Flume → HDFS → Hive

• Twitter → Flume: custom Flume source
• Flume → HDFS: sink to HDFS
• HDFS → Hive: JSON SerDe parses data
• Oozie: add partitions hourly

Page 13: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

TWITTER SOURCE

Page 14: Analyzing twitter data with hadoop

Flume

• Streaming data flow
• Sources
  • Push or pull
• Sinks
• Event based

Page 15: Analyzing twitter data with hadoop

Pulling Data From Twitter

• Custom source, using twitter4j
• Sources process data as discrete events

Page 16: Analyzing twitter data with hadoop

Loading Data Into HDFS

• HDFS Sink comes stock with Flume
• Easily separate files by creation time
  • hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
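
The time escapes in the sink path are filled in from each event's timestamp, which is what puts every hour of tweets in its own directory. A minimal Java sketch of that expansion follows. It assumes the timestamp is epoch milliseconds and that hours are bucketed in UTC; the class and method names are hypothetical and this is not Flume's own code (the real sink's time zone depends on its configuration).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Flume: shows how a %Y/%m/%d/%H path
// pattern maps an event timestamp to an hourly directory.
class TweetPathSketch {
    // yyyy/MM/dd/HH mirrors the %Y/%m/%d/%H escapes in the sink path.
    // Assumption: UTC bucketing, chosen here only to keep the sketch deterministic.
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Assumption: timestampMillis is epoch milliseconds, as in the
    // "timestamp" header the source attaches to each event.
    static String hourlyPath(String baseDir, long timestampMillis) {
        return baseDir + "/" + HOURLY.format(Instant.ofEpochMilli(timestampMillis)) + "/";
    }

    public static void main(String[] args) {
        String base = "hdfs://hadoop1:8020/user/flume/tweets";
        // 1362182400000L is 2013-03-02T00:00:00Z.
        System.out.println(hourlyPath(base, 1362182400000L));
    }
}
```

Because the directory is derived from the event's own timestamp rather than the wall clock, a tweet that arrives a little late would still land in the hour it was created.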

Page 17: Analyzing twitter data with hadoop

Flume Source

public class TwitterSource extends AbstractSource
    implements EventDrivenSource, Configurable {
  ...
  // The initialization method for the Source. The context contains all
  // the Flume configuration info
  @Override
  public void configure(Context context) {
    ...
  }
  ...
  // Start processing events. Uses the Twitter Streaming API to sample
  // Twitter, and process tweets.
  @Override
  public void start() {
    ...
  }
  ...
  // Stops Source's event processing and shuts down the Twitter stream.
  @Override
  public void stop() {
    ...
  }
}

Page 18: Analyzing twitter data with hadoop

Twitter API

• Callback mechanism for catching new tweets

/** The actual Twitter stream. It's set up to collect raw JSON data */
private final TwitterStream twitterStream = new TwitterStreamFactory(
    new ConfigurationBuilder().setJSONStoreEnabled(true).build())
    .getInstance();
...
// The StatusListener is a twitter4j API that can be added to a stream,
// and will call a method every time a message is sent to the stream.
StatusListener listener = new StatusListener() {
  // The onStatus method is executed every time a new tweet comes in.
  public void onStatus(Status status) {
    ...
  }
};
...
// Set up the stream's listener (defined above), and set any necessary
// security information.
twitterStream.addListener(listener);
twitterStream.setOAuthConsumer(consumerKey, consumerSecret);
AccessToken token = new AccessToken(accessToken, accessTokenSecret);
twitterStream.setOAuthAccessToken(token);

Page 19: Analyzing twitter data with hadoop

JSON Data

• JSON data is processed as an event and written to HDFS

public void onStatus(Status status) {
  // The EventBuilder is used to build an event using the headers and
  // the raw JSON of a tweet
  headers.put("timestamp",
      String.valueOf(status.getCreatedAt().getTime()));
  Event event = EventBuilder.withBody(
      DataObjectFactory.getRawJSON(status).getBytes(), headers);
  channel.processEvent(event);
}

Page 20: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

FLUME DEMO

Page 21: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

HIVE

Page 22: Analyzing twitter data with hadoop

What is Hive?

• Created at Facebook
• HiveQL
  • SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client

Page 23: Analyzing twitter data with hadoop

Hive Details

• Schema on read
• Scalar types (int, float, double, boolean, string)
• Complex types (struct, map, array)
• Metastore contains table definitions
  • Stored in a relational database
  • Similar to catalog tables in other DBs

Page 24: Analyzing twitter data with hadoop

Complex Data

SELECT
  t.retweeted_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweeted_screen_name,
        retweeted_status.text,
        max(retweet_count) AS retweets
      FROM tweets
      GROUP BY
        retweeted_status.user.screen_name,
        retweeted_status.text) t
GROUP BY t.retweeted_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
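
The query works in two levels. The inner SELECT keeps the highest retweet_count seen for each (screen name, text) pair, and the outer SELECT sums those maxima per user, counts the distinct tweets, and keeps the top rows. A rough in-memory Java sketch of the same logic on a few made-up rows is below. The class names, fields, and sample data are hypothetical and for illustration only; Hive runs this as distributed jobs, not as a loop like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of the two-level aggregation.
class RetweetRankingSketch {
    // One row of the tweets table, reduced to the fields the query reads.
    static final class Retweet {
        final String screenName;
        final String text;
        final int retweetCount;
        Retweet(String screenName, String text, int retweetCount) {
            this.screenName = screenName;
            this.text = text;
            this.retweetCount = retweetCount;
        }
    }

    // One output row: screen name, total_retweets, tweet_count.
    static final class UserTotal {
        final String screenName;
        int totalRetweets;
        int tweetCount;
        UserTotal(String screenName) { this.screenName = screenName; }
        @Override public String toString() {
            return screenName + " total_retweets=" + totalRetweets
                + " tweet_count=" + tweetCount;
        }
    }

    static List<UserTotal> topRetweeted(List<Retweet> rows, int limit) {
        // Inner query: max(retweet_count) per (screen_name, text).
        Map<List<String>, Integer> maxPerTweet = new HashMap<>();
        for (Retweet r : rows) {
            maxPerTweet.merge(Arrays.asList(r.screenName, r.text),
                r.retweetCount, Integer::max);
        }
        // Outer query: sum(retweets) and count(*) per screen name.
        Map<String, UserTotal> perUser = new HashMap<>();
        for (Map.Entry<List<String>, Integer> e : maxPerTweet.entrySet()) {
            UserTotal u = perUser.computeIfAbsent(e.getKey().get(0), UserTotal::new);
            u.totalRetweets += e.getValue();
            u.tweetCount += 1;
        }
        // ORDER BY total_retweets DESC, then LIMIT.
        List<UserTotal> ranked = new ArrayList<>(perUser.values());
        ranked.sort(Comparator.comparingInt((UserTotal u) -> u.totalRetweets).reversed());
        return ranked.subList(0, Math.min(limit, ranked.size()));
    }

    public static void main(String[] args) {
        List<Retweet> rows = Arrays.asList(
            new Retweet("alice", "hello", 3),
            new Retweet("alice", "hello", 5),  // same tweet, seen again later
            new Retweet("alice", "world", 2),
            new Retweet("bob", "hi", 4));
        topRetweeted(rows, 10).forEach(System.out::println);
    }
}
```

Taking the maximum first matters because the same retweeted tweet can appear many times in the collected stream; summing the raw rows would count its retweets repeatedly.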

Page 25: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

JSON INTERLUDE

Page 26: Analyzing twitter data with hadoop

What is JSON?

• Complex, semi-structured data
• Based on JavaScript’s data syntax
• Rich, nested data types:
  • number
  • string
  • array
  • object
  • true, false
  • null

Page 27: Analyzing twitter data with hadoop

What is JSON?

{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {
          "text": "Crowdsourcing",
          "indices": [0, 14]
        },
        {
          "text": "bigdata",
          "indices": [129, 137]
        }
      ],
      "user_mentions": []
    }
  }
}

Page 28: Analyzing twitter data with hadoop

Hive Serializers and Deserializers

• Instructs Hive on how to interpret data
• JSONSerDe

Page 29: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

HIVE DEMO

Page 30: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

IT’S A TRAP

Page 31: Analyzing twitter data with hadoop

Not a Database

                      RDBMS                     Hive
Language              Generally >= SQL-92       Subset of SQL-92 plus Hive specific extensions
Update Capabilities   INSERT, UPDATE, DELETE    INSERT OVERWRITE; no UPDATE, DELETE
Transactions          Yes                       No
Latency               Sub-second                Minutes
Indexes               Yes                       Yes
Data size             Terabytes                 Petabytes

Page 32: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

IMPALA ASIDE

Page 33: Analyzing twitter data with hadoop

Cloudera Impala
Real-Time Query for Data Stored in Hadoop

FAMILIAR          Supports Hive SQL
FAST              4-30X faster than Hive over MapReduce
INTEGRATED        Uses existing drivers, integrates with existing metastore, works with leading BI tools
100% OPEN SOURCE  Flexible, cost-effective, no lock-in
EASY TO USE       Deploy & operate with Cloudera Enterprise RTQ
FLEXIBLE          Supports multiple storage engines & file formats

Page 34: Analyzing twitter data with hadoop

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos

COST SAVINGS
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas

FULL FIDELITY ANALYSIS
• All data available for interactive queries
• No loss of fidelity from fixed data schemas

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Page 35: Analyzing twitter data with hadoop

Cloudera Impala Details

Diagram components:
• Client: SQL App, ODBC
• Each node (shown three times): Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, HBase
• Shared services: State Store, HDFS NN, Hive Metastore, YARN

Diagram callouts:
• Fully MPP Distributed
• Local Direct Reads
• Common Hive SQL and interface
• Unified metadata and scheduler
• Low-latency scheduler and cache (low-impact failures)

Page 36: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

OOZIE AUTOMATION

Page 37: Analyzing twitter data with hadoop

Oozie: Everything in its Right Place

Page 38: Analyzing twitter data with hadoop

Oozie for Partition Management

• Once an hour, add a partition
• Takes advantage of advanced Hive functionality
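
Each hourly run comes down to registering the directory Flume just finished writing as a new Hive partition. Below is a minimal Java sketch of the statement such a job might issue. It assumes a table named `tweets` partitioned by a single integer column called `datehour`, with directories laid out as in the HDFS sink path; the column name, class, and method are assumptions for illustration and do not come from the slides.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds the HiveQL an hourly job could run.
class AddPartitionSketch {
    // Assumption: the partition key is the hour encoded as yyyyMMddHH.
    private static final DateTimeFormatter KEY =
        DateTimeFormatter.ofPattern("yyyyMMddHH");
    // Directory layout matching the %Y/%m/%d/%H sink path.
    private static final DateTimeFormatter DIR =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    static String addPartitionStatement(String baseDir, LocalDateTime hour) {
        // IF NOT EXISTS keeps a re-run of the same hour from failing.
        return "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = "
            + KEY.format(hour) + ") LOCATION '"
            + baseDir + "/" + DIR.format(hour) + "'";
    }

    public static void main(String[] args) {
        System.out.println(addPartitionStatement(
            "/user/flume/tweets", LocalDateTime.of(2013, 3, 2, 13, 0)));
    }
}
```

Adding the partition only after the hour has closed means queries never see a directory that Flume is still writing to.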

Page 39: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

OOZIE DEMO

Page 40: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

PUTTING IT ALL TOGETHER

Page 41: Analyzing twitter data with hadoop

Complete Architecture

Diagram: Twitter → Flume → HDFS → Hive

• Twitter → Flume: custom Flume source
• Flume → HDFS: sink to HDFS
• HDFS → Hive: JSON SerDe parses data
• Oozie: add partitions hourly

Page 42: Analyzing twitter data with hadoop

Analyzing Twitter Data with Hadoop

MORE DEMOS

Page 44: Analyzing twitter data with hadoop

My personal preference

• Cloudera Manager
  • https://ccp.cloudera.com/display/SUPPORT/Downloads
  • Free up to 50 nodes

Page 46: Analyzing twitter data with hadoop

Questions?

• Contact me!
  • Joey Echeverria
  • [email protected]
  • @fwiffo
• We’re hiring!
