Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
MySQL Applier for Apache Hadoop
Real-Time Event Streaming to HDFS
Mats Kindahl, Neha Kumari, Shubhangi Garg
2013-09-21
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended
for information purposes only, and may not be incorporated into any contract.
It is not a commitment to deliver any material, code, or functionality, and
should not be relied upon in making purchasing decisions. The development,
release, and timing of any features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Presentation Outline
● Why Big Data?
● Working with Big Data
● MySQL Applier for Hadoop
● Road map
Why Big Data?
Traditional Approach
● Reporting
  ● Predefined data
● Viewing history
  ● Past occurrences
● Using Sales Data
  ● Typically in database
Big Data
● Analytics
  ● Data mining
● Predicting future
  ● Trends
● Using all available data
  ● Sales
  ● Click stream
  ● Likes/Tweets
Why Big Data?
● Web Recommendations
● Sentiment Analysis
● Marketing Campaign Analysis
● Customer Churn Modeling
● Fraud Detection
● Research and Development
● Risk Modeling
● Machine Learning
90% with Pilot Projects at end of 2012
Poor Data Costs 35% in Annual Revenues
10% Improvement in Data Usability Drives $2bn in Revenue
Source: http://wikibon.org/blog/big-data-statistics/
Why Hadoop?
● Scales to thousands of nodes
  ● Combines data from multiple sources
  ● Handles unstructured data
  ● Runs queries against all of the data
● Runs on commodity servers
  ● Easy to set up
  ● Affordable
● Fault-tolerant
  ● File block replication
  ● Self-healing
● Map/Reduce
  ● Distributed processing model
  ● Good for large data sets
Example Use-Case: On-Line Retail
[Diagram: data flows in an on-line retail site. Labels: Browsing; Recommendations; Updates; Preferences; Brands “Liked”; Web Logs, Page Views, Comments; Customers; Purchase History; Purchases]
Big Data Lifecycle
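[Diagram: the big data lifecycle drawn as a cycle with the stages Acquire, Organize, Analyze, and Decide; the Applier is marked on the cycle]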
Hadoop Tools: In the Lifecycle
[Diagram: Hadoop tools placed on the lifecycle. Labels: Apache Sqoop, MySQL Applier for Hadoop, Apache Flume, Apache Drill, Apache Hive, Apache Pig]
Hadoop Tools: Apache Sqoop
● Apache top-level project
  ● Part of Hadoop project
  ● Developed by Cloudera
● Bulk data import and export
  ● Between Hadoop HDFS and external data stores
● Supports a JDBC connector architecture
  ● Supports plug-ins for specific functionality
  ● “Fast-path” connector for MySQL
[Diagram: five Sqoop jobs feeding a Hadoop cluster]
Hadoop Tools: Apache Flume
● Apache top-level project
  ● Part of Hadoop project
● Collecting log data
  ● Various sources: Avro, Thrift, Syslog, Netcat
  ● Can aggregate and consolidate data
● Data typically sent to HDFS
  ● Can store data in other “sinks” as well
[Diagram: a Flume agent passing data from a Source through a Channel to a Sink, which writes to HDFS]
New Tool: MySQL Applier for Hadoop
● Using Binlog API
  ● Proof of concept
● Replication from MySQL to HDFS
  ● Exploits the replication protocol
  ● Reads the server binary log
● Fetches changes from MySQL
  ● Using Binary Log API
  ● Row-based replication
  ● Caveat: DDL not handled
● Stores changes into HDFS
  ● Consumable by other tools
  ● Caveat: only row inserts
  ● Considering update/delete
[Diagram: the MySQL Applier for Hadoop reads binary log events through the Binlog API, decodes each row (timestamp, primary key, data), and writes it to HDFS through libhdfs]
MySQL Applier for Hadoop: Requirements
● MySQL 5.6 or later
  ● Available at http://dev.mysql.com/downloads/mysql
● MySQL Applier for Hadoop
  ● Available at http://labs.mysql.com
● Apache Hadoop 1.0.4 or later
  ● Available at http://hadoop.apache.org/releases.html
● Apache Hive or other Hadoop Tool for analysis
  ● Available at http://hive.apache.org/releases.html
MySQL Applier for Hadoop: Mapping Rows
● Timestamp column is added first in table
● Timestamp is taken from the binary log
MySQL:
INSERT INTO test.tbl VALUES
  (23456,'Sanjai','Feldhoffer'),
  (23457,'Manohar','Kakkar'),
  (23458,'Christ','Kalefeld'),
  (23459,'Gretta','Varker'),
  (23460,'Masato','Steinauer'),
  (23461,'Baruch','Uchoa');
HDFS:
1379361681,23456,Sanjai,Feldhoffer
1379361685,23457,Manohar,Kakkar
1379361692,23458,Christ,Kalefeld
1379361693,23459,Gretta,Varker
1379361699,23460,Masato,Steinauer
1379361703,23461,Baruch,Uchoa
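The row mapping above can be sketched in a few lines of Python. This is an illustration only, not the Applier's actual code: the function name `format_row` and its parameters are hypothetical. Only the output layout (timestamp first, then the column values, joined by the field delimiter) is taken from the slide.

```python
def format_row(timestamp, values, field_delimiter=","):
    """Render one inserted row in the layout shown on the slide.

    The binary-log timestamp becomes the first field, followed by the
    column values in table order. (Hypothetical helper for illustration.)
    """
    fields = [str(timestamp)] + [str(value) for value in values]
    return field_delimiter.join(fields)


# First row of the INSERT statement from the slide.
line = format_row(1379361681, (23456, "Sanjai", "Feldhoffer"))
print(line)  # -> 1379361681,23456,Sanjai,Feldhoffer
```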
MySQL Applier for Hadoop: Using Hive
● Does not handle DDL
  ● Create the Hive table manually, as in the Hive statement below
● MySQL Applier field and row delimiters can be controlled
  ● fielddelimiter
  ● rowdelimiter
MySQL (SQL):
CREATE TABLE tbl (
  user_id INT PRIMARY KEY,
  first CHAR(60),
  last CHAR(60)
);
Hive (HDFS):
CREATE TABLE tbl (
  ts INT,
  user_id INT,
  first STRING,
  last STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
MySQL Applier for Hadoop: Running the Applier
● Start MySQL Applier for Hadoop
  happlier fielddelimiter=, \
    mysql://[email protected] hdfs://example.com:9000
● Inserts are written to files in the warehouse directory
  ● Default: /user/hive/warehouse
● Example
  ● MySQL table: test.tbl
  ● HDFS: /user/hive/warehouse/test.db/tbl/datafile1.txt
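The table-to-file mapping in the example can be expressed as a small path helper. This is a sketch under assumptions: the function name `hdfs_data_path` is hypothetical, and the data file name `datafile1.txt` is simply copied from the slide, which does not say how file names are chosen.

```python
import posixpath


def hdfs_data_path(database, table,
                   warehouse_dir="/user/hive/warehouse",
                   datafile="datafile1.txt"):
    """Build the HDFS path for a MySQL table, following the slide's example.

    Layout assumed from the slide: <warehouse>/<database>.db/<table>/<file>.
    The default file name is taken from the slide, not from the Applier.
    """
    return posixpath.join(warehouse_dir, f"{database}.db", table, datafile)


print(hdfs_data_path("test", "tbl"))
# -> /user/hive/warehouse/test.db/tbl/datafile1.txt
```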
MySQL Applier for Hadoop: Update and Delete?
● Batch import using Sqoop
  ● Transfer all data each time
  ● If changes are small, bandwidth is wasted
● Incremental import using Applier
  ● Only changes imported
  ● Bandwidth is used efficiently
  ● … but what about updates and deletes?
[Diagram: the Applier feeding a Hadoop rack]
MySQL Applier for Hadoop: Update and Delete?
● Problem:
  ● HDFS is append-only
  ● Rows inserted are appended to file
  ● How can rows be updated or deleted?
● Idea:
  ● Updated rows are appended to the file
  ● Rows have a primary key
  ● Each row contains the after-image and the timestamp of the update
  ● For each primary key, pick the row with the latest timestamp
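The idea can be sketched as a single pass over the appended rows that keeps, for each primary key, the row with the latest timestamp. This is a minimal illustration, not the Applier's implementation: the function name `latest_rows` and the tuple layout `(timestamp, key, data...)` are assumptions, and the updated last name in the sample data is invented for the example.

```python
def latest_rows(rows):
    """Return the current version of each row from an append-only log.

    `rows` is an iterable of tuples (timestamp, primary_key, *data) in
    append order. For each primary key the row with the latest timestamp
    wins; on equal timestamps the row appended later wins.
    """
    latest = {}
    for row in rows:
        timestamp, key = row[0], row[1]
        if key not in latest or timestamp >= latest[key][0]:
            latest[key] = row
    # Sort by primary key so the result is deterministic.
    return sorted(latest.values(), key=lambda row: row[1])


log = [
    (1379361681, 23456, "Sanjai", "Feldhoffer"),  # original insert
    (1379361685, 23457, "Manohar", "Kakkar"),     # original insert
    (1379361700, 23456, "Sanjai", "Feldman"),     # hypothetical later update
]
for row in latest_rows(log):
    print(row)
```

Here the first row for key 23456 is superseded by the later after-image, while key 23457 keeps its only version.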
MySQL Applier for Hadoop: Update and Delete?
[Diagram: the Applier feeding a Hadoop rack]
● Timestamped rows to HDFS
  ● After-image for updates
  ● Flag deletes
● Customized HiveQL queries
SELECT … FROM tbl
WHERE ts = MAX(ts)
GROUP BY key
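The query on the slide reads as shorthand: an aggregate such as MAX(ts) cannot appear directly in a WHERE clause. One common way to express "latest row per key" is to join the table against a subquery that computes the maximum timestamp per key. The sketch below runs that pattern with Python's built-in `sqlite3` module as a stand-in for Hive; the column names follow the Hive table from the earlier slide, the sample update row is invented, and whether the same statement suits a given Hive version is an assumption to verify.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (ts INTEGER, user_id INTEGER, first TEXT, last TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1379361681, 23456, "Sanjai", "Feldhoffer"),
        (1379361685, 23457, "Manohar", "Kakkar"),
        (1379361700, 23456, "Sanjai", "Feldman"),  # hypothetical update
    ],
)

# Latest row per primary key: join each row against the maximum
# timestamp recorded for its key.
LATEST_PER_KEY = """
SELECT t.ts, t.user_id, t.first, t.last
FROM tbl t
JOIN (SELECT user_id, MAX(ts) AS max_ts
      FROM tbl
      GROUP BY user_id) m
  ON t.user_id = m.user_id AND t.ts = m.max_ts
ORDER BY t.user_id
"""
current = conn.execute(LATEST_PER_KEY).fetchall()
for row in current:
    print(row)
```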
MySQL Applier for Hadoop: Update and Delete?
[Diagram: inside the Hadoop rack, the Applier writes dirty files and a cleaning job turns them into clean files]
● Timestamped rows to HDFS
  ● After-image for updates
  ● Flag deletes
● Special “cleaning” job
  ● Read dirty files
  ● Write clean files
  ● Moving data inside the rack uses bandwidth efficiently
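A cleaning pass of this kind could look roughly like the sketch below. The slides do not specify how deletes are flagged or what the file layout is, so everything here is an assumption for illustration: the function name `clean`, the record layout `timestamp,key,deleted_flag,data...`, and the flag value `1` for a deleted row.

```python
def clean(dirty_lines, field_delimiter=","):
    """Compact dirty records into clean ones.

    Assumed dirty layout: timestamp, key, deleted_flag, data fields.
    Keeps the latest record per key, drops keys whose latest record is
    flagged as deleted, and strips the flag from the output.
    """
    latest = {}
    for line in dirty_lines:
        fields = line.rstrip("\n").split(field_delimiter)
        timestamp, key = int(fields[0]), fields[1]
        if key not in latest or timestamp >= latest[key][0]:
            latest[key] = (timestamp, fields)

    clean_lines = []
    for key in sorted(latest):
        _, fields = latest[key]
        if fields[2] == "1":  # latest version is a delete marker
            continue
        clean_lines.append(field_delimiter.join(fields[:2] + fields[3:]))
    return clean_lines


dirty = [
    "100,1,0,Ann",   # insert of key 1
    "101,2,0,Bob",   # insert of key 2
    "105,1,0,Anna",  # update of key 1 (after-image)
    "107,2,1,",      # delete of key 2
]
print(clean(dirty))  # -> ['105,1,Anna']
```

Because the job only reads and writes files that already live in the cluster, no data has to cross the link to the MySQL server again.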
MySQL and Hadoop: Resources and Information
● MySQL and Hadoop: Guide to Big Data Integration
  http://www.mysql.com/why-mysql/white-papers/mysql-and-hadoop-guide-to-big-data-integration
● MySQL Applier for Hadoop
  http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html
● Developer Blogs
  ● Mats Kindahl: http://mysqlmusings.blogspot.com
  ● Shubhangi Garg: http://innovating-technology.blogspot.in
  ● Neha Kumari: http://nehakumari19.blogspot.in
Thank you!