Analyzing Twitter Data with Hadoop
Gwen Shapira, Software Engineer
@Gwenshap
©2012 Cloudera, Inc.
IOUG SIG Meetings at OpenWorld
All meetings located in Moscone South - Room 208
Monday, September 29
• Exadata SIG: 2:00 p.m. - 3:00 p.m.
• BIWA SIG: 5:00 p.m. - 6:00 p.m.
Tuesday, September 30
• Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
• Storage SIG: 4:00 p.m. - 5:00 p.m.
• SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.
Wednesday, October 1
• Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
• Big Data SIG: 10:30 a.m. - 11:30 a.m.
• Oracle 12c SIG: 2:00 p.m. - 3:00 p.m.
• Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)
The IOUG Forum Advantage
• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more
COLLABORATE 15 – IOUG Forum
April 12-16, 2015
Mandalay Bay Resort and Casino
Las Vegas, NV
www.collaborate.ioug.org
COLLABORATE 15 Call for Speakers ends October 10
Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!
©2014 Cloudera, Inc. All rights reserved.
I have 15 years of experience in moving data around
In my spare time…
• Oracle ACE Director
• Member of Oak Table
• Blogger
• Presenter – Hotsos, IOUG, OOW, OSCON
• NoCOUG board
• Contributor to Apache Oozie, Sqoop, Kafka
• Author – Hadoop Application Architectures
BUILDING A HADOOP APPLICATION
High Level Architecture
Data Source → Flume → HDFS → Hive + Oozie → Impala / Oracle
AN EXAMPLE USE CASE
Analyzing Twitter
• Social media is popular with marketing teams
• Twitter is an effective tool for promotion
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
• Which topics are trending?
HOW DO WE ANSWER THESE QUESTIONS?
Techniques
• Bring data with Flume
• Complex data
  • Deeply nested
  • Variable schema
• Clean, standardize, partition, etc.
• SQL
  • Filtering
  • Aggregation
  • Sorting
FLUME
Flume Agent design
In our case…
• Twitter source
  • Pulls JSON-formatted tweets from Twitter
• Memory Channel
• HDFS Sink – directory per hour
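The source → channel → sink flow can be pictured with a toy model. This is a minimal Python sketch, not Flume code: `run_source`, `run_sink`, and the sample events are hypothetical, a standard-library queue stands in for the memory channel, and a dict of lists stands in for the hourly HDFS directories. Bucketing in UTC is also an assumption.

```python
import queue
from collections import defaultdict
from datetime import datetime, timezone

def run_source(events, channel):
    """Source: push each incoming event onto the channel."""
    for event in events:
        channel.put(event)

def run_sink(channel, store):
    """Sink: drain the channel, filing each event body under its hour."""
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            break
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        store[ts.strftime("%Y/%m/%d/%H")].append(event["body"])

# Hypothetical events: epoch seconds plus a raw JSON body.
events = [
    {"timestamp": 1412000000, "body": '{"text": "first"}'},
    {"timestamp": 1412000100, "body": '{"text": "second"}'},
    {"timestamp": 1412003700, "body": '{"text": "third"}'},
]

channel = queue.Queue()    # stands in for the memory channel
store = defaultdict(list)  # stands in for one HDFS directory per hour
run_source(events, channel)
run_sink(channel, store)
hourly_counts = {path: len(bodies) for path, bodies in store.items()}
```

The point of the channel is decoupling: the source never needs to know where events end up, and the sink never needs to know where they came from.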
What is JSON?
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        { "text": "Crowdsourcing", "indices": [0, 14] },
        { "text": "bigdata", "indices": [129, 137] }
      ],
      "user_mentions": []
    }
  }
}
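To make the nesting concrete, here is a small Python sketch that pulls the hashtags out of a tweet like the one above. The `hashtags` helper is hypothetical, and the `text` field is shortened for readability. Because not every tweet is a retweet, the helper tolerates missing keys, which is one face of the "variable schema" problem.

```python
import json

# Shortened copy of the sample tweet (the "text" value is abbreviated).
raw = """
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing and #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {"text": "Crowdsourcing", "indices": [0, 14]},
        {"text": "bigdata", "indices": [129, 137]}
      ],
      "user_mentions": []
    }
  }
}
"""

def hashtags(tweet):
    """Return hashtag texts from the retweeted status, or [] if absent."""
    status = tweet.get("retweeted_status") or {}
    entities = status.get("entities") or {}
    return [tag["text"] for tag in entities.get("hashtags", [])]

tweet = json.loads(raw)
tags = hashtags(tweet)
plain = hashtags({"text": "not a retweet"})  # no retweeted_status at all
```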
But Wait! There’s More!
• Many sources – directory, files, log4j, net, JMS
• Interceptors – process data in flight
• Selectors – choose which sink
• Many channels – memory, file
• Many sinks – HDFS, HBase, Solr
High Level Pipeline Architecture
[Diagram: eight web apps, each with a Flume Avro client, feed four Flume agents. The agents write to HDFS and to Spark Streaming / HBase, and a report app reads the results.]
• Fan-in pattern
• Client provides multi-threading, compression, encryption, and batching
• Multiple agents for failover and rolling restarts
• Spark Streaming data is a subset of the whole event stream
• ML MapReduce jobs
• Batch report updates
• Pull near-real-time results
• Query with the HBase API or Impala
Configuration
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume, sqoop, oracle, oow

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.serializer = text

TwitterAgent.channels.MemChannel.type = memory
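A common mistake in a file like this is a source or sink that points at a channel the agent never declares. The Python sketch below checks that wiring on a subset of the configuration. It is an illustration only, not Flume's own validation: `parse_properties` and `unwired` are hypothetical helpers, and the parser handles only plain `key = value` lines.

```python
CONFIG = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.channels.MemChannel.type = memory
"""

def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def unwired(props, agent):
    """List sources and sinks that reference an undeclared channel."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for name in props.get(f"{agent}.sources", "").split():
        # A source may list several channels, separated by whitespace.
        used = props.get(f"{agent}.sources.{name}.channels", "").split()
        if not used or not set(used) <= declared:
            problems.append(f"source {name}")
    for name in props.get(f"{agent}.sinks", "").split():
        # A sink reads from exactly one channel.
        used = props.get(f"{agent}.sinks.{name}.channel", "")
        if used not in declared:
            problems.append(f"sink {name}")
    return problems

props = parse_properties(CONFIG)
ok = unwired(props, "TwitterAgent")

# Deliberately break the sink to show what the check reports.
broken = dict(props, **{"TwitterAgent.sinks.HDFS.channel": "FileChannel"})
bad = unwired(broken, "TwitterAgent")
```

Note the asymmetry the check relies on: a source uses the plural `channels` key, while a sink uses the singular `channel`.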
FLUME DEMO
HIVE
What is Hive?
• Created at Facebook
• HiveQL
  • SQL-like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client
Hive Details
• Metastore contains table definitions
  • Stored in a relational database
  • Basically a data dictionary
• SerDes parse data and convert it to a table/column structure
• SerDes exist for:
  • CSV, XML, JSON, Avro, Parquet, ORC files
  • Or write your own (we created one for CopyBook)
Complex Data
SELECT t.retweet_screen_name,
       sum(retweets) AS total_retweets,
       count(*) AS tweet_count
FROM (SELECT retweeted_status.user.screen_name AS retweet_screen_name,
             retweeted_status.text,
             max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name,
               retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
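The two-level grouping is easier to follow on a handful of rows. The Python sketch below mirrors the query's logic on hypothetical data; `top_retweeted` and the sample rows are illustrative only. The inner step keeps the highest `retweet_count` seen per (user, text) pair, presumably because the same original tweet shows up once per captured retweet with a growing count. The outer step then sums and counts per user.

```python
from collections import defaultdict

# Hypothetical observations: (screen_name, text, retweet_count).
rows = [
    ("alice", "tweet A", 3),
    ("alice", "tweet A", 7),
    ("alice", "tweet B", 2),
    ("bob", "tweet C", 10),
    ("bob", "tweet C", 4),
]

def top_retweeted(rows, limit=10):
    # Inner query: max(retweet_count) grouped by (screen_name, text).
    peak = {}
    for name, text, count in rows:
        key = (name, text)
        peak[key] = max(peak.get(key, 0), count)

    # Outer query: sum(retweets) and count(*) grouped by screen_name.
    totals = defaultdict(lambda: [0, 0])
    for (name, _text), count in peak.items():
        totals[name][0] += count
        totals[name][1] += 1

    # ORDER BY total_retweets DESC LIMIT n.
    ranked = sorted(
        ((name, total, tweets) for name, (total, tweets) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:limit]

result = top_retweeted(rows)
```

Summing the raw counts without the inner `max` would credit alice with 3 + 7 for a single tweet, which is the double counting the subquery avoids.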
HIVE DEMO
IT’S A TRAP
Not a Database
                    | RDBMS                  | Hive                                           | Impala
Language            | Generally >= SQL-92    | Subset of SQL-92 plus Hive-specific extensions | Subset of SQL-92
Update capabilities | INSERT, UPDATE, DELETE | Bulk INSERT, UPDATE, DELETE                    | INSERT, TRUNCATE
Transactions        | Yes                    | Yes                                            | No
Latency             | Sub-second             | Minutes                                        | Sub-second
Indexes             | Yes                    | Yes                                            | No
Data size           | Few terabytes          | Petabytes                                      | Lots of terabytes
DATA FORMATS
I don’t like our data
• Lots of small files
• JSON – requires parsing
• Can’t compress
• Sensitive to changes
I’d rather use Avro
• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages
Let’s convert
• CREATE TABLE avro_tweets
• INSERT INTO avro_tweets SELECT … FROM tweets
IMPALA ASIDE
Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ
• FLEXIBLE: Supports multiple storage engines & file formats
Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
SPEED TO INSIGHT
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos
COST SAVINGS
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas
FULL FIDELITY ANALYSIS
• All data available for interactive queries
• No loss of fidelity from fixed data schemas
DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos
Cloudera Impala Details
[Diagram: a SQL app connects over ODBC to one of three identical nodes. Each node runs a Query Planner, a Query Coordinator, and a Query Exec Engine next to an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, YARN, and the State Store.]
• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP distributed
• Local direct reads
• Low-latency scheduler and cache (low-impact failures)
LOAD DATA TO ORACLE
Oracle Connectors for Hadoop
• Oracle Loader for Hadoop
• Oracle SQL Connector for Hadoop
• Big Data SQL
Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• MapReduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression
Oracle SQL Connector for Hadoop
• Run a Java app
• Creates an external table
• Runs MapReduce when the external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression
Big Data SQL
• Also an external table
• Can also use Hive metastore for schema
• But … NO MapReduce
• Instead – an agent will do SMART SCANS
  • Bloom filters
  • Storage indexes
  • Filters
• Supports any Hadoop data format
PUTTING IT ALL TOGETHER
High Level Architecture
Data Source → Flume → HDFS → Hive + Oozie → Impala / Oracle
What next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Demo+VM
• Clone the source repo
  • https://github.com/cloudera/cdh-twitter-example