Analyzing Twitter Data with Hadoop
Gwen Shapira, Software Engineer
@Gwenshap

©2012 Cloudera, Inc.

IOUG SIG Meetings at OpenWorld

All meetings located in Moscone South - Room 208

Monday, September 29
Exadata SIG: 2:00 p.m. - 3:00 p.m.
BIWA SIG: 5:00 p.m. - 6:00 p.m.

Tuesday, September 30
Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
Storage SIG: 4:00 p.m. - 5:00 p.m.
SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.

Wednesday, October 1
Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
Big Data SIG: 10:30 a.m. - 11:30 a.m.
Oracle 12c SIG: 2:00 p.m. - 3:00 p.m.
Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)

COLLABORATE 15 – IOUG Forum
April 12-16, 2015
Mandalay Bay Resort and Casino
Las Vegas, NV

COLLABORATE 15 Call for Speakers
Ends October 10

The IOUG Forum Advantage

• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more

www.collaborate.ioug.org

Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!

©2014 Cloudera, Inc. All rights reserved.

I have 15 years of experience in moving data around


In my spare time…

• Oracle ACE Director
• Member of Oak Table
• Blogger
• Presenter – Hotsos, IOUG, OOW, OSCON
• NoCOUG board
• Contributor to Apache Oozie, Sqoop, Kafka
• Author – Hadoop Application Architectures

Analyzing Twitter Data with Hadoop

BUILDING A HADOOP APPLICATION



High Level Architecture

Data Source → Flume → HDFS → Hive + Oozie → Impala / Oracle


AN EXAMPLE USE CASE



Analyzing Twitter

• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
• Which topics are trending?


HOW DO WE ANSWER THESE QUESTIONS?



Techniques

• Bring Data with Flume
• Complex data
  • Deeply nested
  • Variable schema
• Clean, Standardize, Partition, etc.
• SQL
  • Filtering
  • Aggregation
  • Sorting


FLUME


Flume Agent design


In our case…

• Twitter source
  • Pulls JSON format files from Twitter
• Memory Channel
• HDFS Sink – directory per hour

What is JSON?


{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        { "text": "Crowdsourcing", "indices": [0, 14] },
        { "text": "bigdata", "indices": [129, 137] }
      ],
      "user_mentions": []
    }
  }
}
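To make the "deeply nested" point concrete, here is a minimal Python sketch that walks this structure and pulls out the hashtags. It is an illustration only: the tweet text is shortened, the helper name `hashtags` is hypothetical, and nothing here is part of the Flume or Hive pipeline itself.

```python
import json

# Trimmed copy of the sample tweet above (text shortened for brevity).
raw_tweet = """
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing drivers already generate traffic data #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {"text": "Crowdsourcing", "indices": [0, 14]},
        {"text": "bigdata", "indices": [129, 137]}
      ],
      "user_mentions": []
    }
  }
}
"""


def hashtags(tweet):
    """Return the hashtag texts of the retweeted status.

    Fields may be absent (variable schema), so every level is looked up
    defensively and an empty list is returned when nothing is found.
    """
    status = tweet.get("retweeted_status") or {}
    entities = status.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags", [])]


tweet = json.loads(raw_tweet)
print(hashtags(tweet))  # ['Crowdsourcing', 'bigdata']
```

The defensive lookups matter because not every tweet is a retweet, so `retweeted_status` may be missing entirely.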


But Wait! There’s More!

• Many sources – directory, files, log4j, net, JMS
• Interceptors – process data in flight
• Selectors – choose which sink
• Many channels – Memory, file
• Many sinks – HDFS, HBase, Solr

High Level Pipeline Architecture

[Diagram. Components: Web App with Flume Avro Client, four Flume Agents, HDFS, Spark Streaming, HBase, Report App]

Diagram annotations:

• Client providing multi-threading, compression, encryption, and batching
• Fan-in pattern
• Multiple agents for failover and rolling restarts
• Spark Streaming data is a subset of whole events
• ML Map/Reduce jobs
• Batch report updates
• Pull near-real-time results
• Query with HBase API or Impala

Configuration

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume, sqoop, oracle, oow

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.serializer = text

TwitterAgent.channels.MemChannel.type = memory
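The `%Y/%m/%d/%H` escapes in `hdfs.path` are what produce the "directory per hour" layout. The Python sketch below only illustrates how such a path expands for a given event time. It relies on Python's `strftime`, which happens to share these four codes, and is not Flume's own implementation. The function name and the sample timestamp are made up for the example.

```python
from datetime import datetime, timezone

# Path portion of hdfs.path from the configuration above.
HDFS_PATH_TEMPLATE = "/user/flume/tweets/%Y/%m/%d/%H/"


def hourly_directory(event_time):
    """Expand the year/month/day/hour escapes for one event timestamp."""
    return event_time.strftime(HDFS_PATH_TEMPLATE)


# Hypothetical event time, chosen only for illustration.
example = datetime(2014, 9, 30, 11, 5, tzinfo=timezone.utc)
print(hourly_directory(example))  # /user/flume/tweets/2014/09/30/11/
```

Every event whose timestamp falls in the same hour lands in the same directory, which later makes it natural to map one directory to one Hive partition.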



FLUME DEMO




HIVE



What is Hive?

• Created at Facebook
• HiveQL
  • SQL like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client

Hive Details

• Metastore contains table definitions
  • Stored in a relational database
  • Basically a data dictionary
• SerDes parse data and convert it to table/column structure
• SerDe:
  • CSV, XML, JSON, Avro, Parquet, ORC files
  • Or write your own (We created one for CopyBook)

Complex Data


SELECT t.retweet_screen_name,
       sum(retweets) AS total_retweets,
       count(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name,
             retweeted_status.text,
             max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name,
               retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
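For readers less familiar with nested GROUP BY queries, the same logic can be sketched in plain Python over a small in-memory list. This is an illustration rather than how Hive executes the query. The function name `top_retweeted` and the sample records are hypothetical, and unlike the query, the sketch simply skips tweets that have no `retweeted_status`.

```python
from collections import defaultdict


def top_retweeted(tweets, limit=10):
    """Mirror the query: max retweets per (user, text), then sum per user."""
    # Inner query: the same retweeted text can be captured many times,
    # so keep only the highest retweet_count seen for each (user, text).
    max_per_tweet = {}
    for tweet in tweets:
        status = tweet.get("retweeted_status")
        if not status:
            continue  # not a retweet; skipped in this sketch
        key = (status["user"]["screen_name"], status["text"])
        max_per_tweet[key] = max(max_per_tweet.get(key, 0), status["retweet_count"])

    # Outer query: total retweets and number of distinct tweets per user.
    totals = defaultdict(lambda: [0, 0])
    for (screen_name, _text), retweets in max_per_tweet.items():
        totals[screen_name][0] += retweets
        totals[screen_name][1] += 1

    # ORDER BY total_retweets DESC LIMIT 10
    ranked = sorted(
        ((name, total, count) for name, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical sample data, shaped like the JSON tweets shown earlier.
sample = [
    {"retweeted_status": {"user": {"screen_name": "alice"}, "text": "a1", "retweet_count": 5}},
    {"retweeted_status": {"user": {"screen_name": "alice"}, "text": "a1", "retweet_count": 8}},
    {"retweeted_status": {"user": {"screen_name": "alice"}, "text": "a2", "retweet_count": 2}},
    {"retweeted_status": {"user": {"screen_name": "bob"}, "text": "b1", "retweet_count": 9}},
    {"text": "not a retweet"},
]
print(top_retweeted(sample))  # [('alice', 10, 2), ('bob', 9, 1)]
```

Taking the maximum before summing avoids counting the same original tweet once per captured retweet.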



HIVE DEMO




IT’S A TRAP



Not a Database


                     RDBMS                   Hive                                            Impala
Language             Generally >= SQL-92     Subset of SQL-92 plus Hive specific extensions  Subset of SQL-92
Update Capabilities  INSERT, UPDATE, DELETE  Bulk INSERT, UPDATE, DELETE                     Insert, truncate
Transactions         Yes                     Yes                                             No
Latency              Sub-second              Minutes                                         Sub-second
Indexes              Yes                     Yes                                             No
Data size            Few Terabytes           Petabytes                                       Lots of Terabytes


DATA FORMATS


I don’t like our data

• Lots of small files
• JSON – requires parsing
• Can’t compress
• Sensitive to changes

I’d rather use Avro

• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages
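The "schema evolution" bullet can be illustrated without the Avro library. The sketch below mimics, in plain Python, two rules from Avro's schema resolution: a field the reader's schema does not know is ignored, and a field missing from older data takes the reader's default (or is an error if no default exists). The field names, `READER_FIELDS`, and `resolve` are all hypothetical and chosen only for this example.

```python
# Sentinel marking a reader field that has no default value.
MISSING = object()

# Hypothetical reader schema: field name -> default value.
READER_FIELDS = {
    "id": MISSING,      # present since the first version, no default
    "text": MISSING,    # present since the first version, no default
    "lang": "unknown",  # added later, so it carries a default
}


def resolve(record, reader_fields=READER_FIELDS):
    """Project a record written with an older schema onto the reader schema."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not MISSING:
            resolved[name] = default  # field added after the data was written
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields in the record that the reader does not declare are dropped.
    return resolved


old_record = {"id": 1, "text": "hello", "source": "web"}
print(resolve(old_record))  # {'id': 1, 'text': 'hello', 'lang': 'unknown'}
```

This is the property that makes Avro less "sensitive to changes" than raw JSON: old files stay readable after a column is added, as long as the new column has a default.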


Let's convert

• Create table AVRO_TWEETS
• Insert into Avro_tweets select …. from tweets


IMPALA ASIDE



Cloudera Impala
Real-Time Query for Data Stored in Hadoop.

• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ
• FLEXIBLE: Supports multiple storage engines & file formats

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos

COST SAVINGS
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas

FULL FIDELITY ANALYSIS
• All data available for interactive queries
• No loss of fidelity from fixed data schemas

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos


Cloudera Impala Details

[Architecture diagram]

Clients: SQL App, ODBC

Each of three nodes: Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, HBase

Shared services: Hive Metastore, HDFS NN, YARN, State Store

Diagram annotations:

• Common Hive SQL and interface
• Unified metadata and scheduler
• Low-latency scheduler and cache (low-impact failures)
• Fully MPP Distributed
• Local Direct Reads

LOAD DATA TO ORACLE

Oracle Connectors for Hadoop

• Oracle Loader for Hadoop
• Oracle SQL Connector for Hadoop
• Big Data SQL

Oracle Loader for Hadoop

• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression

Oracle SQL Connector for Hadoop

• Run a Java app
• Creates an external table
• Runs MapReduce when external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression

Big Data SQL

• Also external table
• Can also use Hive metastore for schema
• But …. NO MapReduce
• Instead – an agent will do SMART SCANS
  • Bloom filters
  • Storage indexes
  • Filters
• Supports any Hadoop data format


PUTTING IT ALL TOGETHER



High Level Architecture

Data Source → Flume → HDFS → Hive + Oozie → Impala / Oracle

What next?

• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
  • https://github.com/cloudera/cdh-twitter-example
