©Continuent 2014.
From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten
MC Brown, Director of Documentation
Linas Virbalas, Senior Software Engineer
About Tungsten Replicator
• Open source drop-in replacement for MySQL replication, providing:
  • Global transaction ID
  • Multiple masters
  • Multiple sources
  • Flexible topologies
  • Heterogeneous replication
  • Parallel replication
Tungsten Replicator
[Diagram: on the master, a Replicator reads the DBMS logs and writes transactions plus metadata to the THL. On the slave, a Replicator downloads transactions via the network into its own THL (transactions plus metadata) and applies them using JDBC.]
How Tungsten Replicator Works
[Diagram: a pipeline made of three stages, each performing Extract, Filter, and Apply. Data flows from the Master DBMS through the Transaction History Log and an In-Memory Queue to the Slave DBMS.]
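The stage pipeline above can be sketched in a few lines of Python. This is an illustrative model only, not Tungsten Replicator code: the `Event` class, the two filters, and the use of plain lists and a deque for the Transaction History Log and In-Memory Queue are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Event:
    """One replicated transaction (hypothetical structure)."""
    seqno: int                      # stands in for the global transaction ID
    statement: str
    metadata: dict = field(default_factory=dict)


Filter = Callable[[Event], Optional[Event]]


def run_stage(source: Iterable[Event], filters: List[Filter],
              sink: Callable[[Event], None]) -> None:
    """One stage: extract from source, run the filters, apply to sink."""
    for event in source:
        for f in filters:
            event = f(event)
            if event is None:       # a filter may drop the event entirely
                break
        if event is not None:
            sink(event)


def tag_source(event: Event) -> Event:
    """Example filter: attach metadata to the transaction."""
    event.metadata["source"] = "master-db"
    return event


def drop_comments(event: Event) -> Optional[Event]:
    """Example filter: discard events that are only SQL comments."""
    return None if event.statement.startswith("--") else event


def drain(queue: deque) -> Iterable[Event]:
    """Yield queued events in arrival order, emptying the queue."""
    while queue:
        yield queue.popleft()


def replicate(master_log: List[Event]):
    thl: List[Event] = []           # Transaction History Log
    queue: deque = deque()          # In-Memory Queue
    slave: List[Event] = []         # stands in for the Slave DBMS
    run_stage(master_log, [tag_source], thl.append)
    run_stage(thl, [drop_comments], queue.append)
    run_stage(drain(queue), [], slave.append)
    return thl, slave
```

Each stage has the same shape, which is why filters can be inserted at any point between the master and the slave.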
Where we replicate
[Diagram: supported topologies: master-slave, fan-in slave, all-masters, star-schema, direct slave, and heterogeneous. Database labels shown: MySQL, Oracle, regular MySQL.]
Why Hadoop
• Customer driven
• Change in the air
• Environments are becoming heterogeneous
• NoSQL was the first
• We already support MongoDB
• Hadoop used for big analytics
• More frequently a live resource
• Big datasets require Map/Reduce
Tungsten Replicator and Hadoop
• Extract from MySQL or Oracle
• Works with base Hadoop and commercial distributions: compatible with Cloudera, Hortonworks, Amazon Elastic MapReduce (EMR), and IBM InfoSphere BigInsights
• Automatic replication of incremental changes
• Customizable formatting
• Hive Schema generation
• Materialized views in Hive for carbon-copy tables
• Sqoop and parallel extractor compatibility for provisioning
Applying Data into Hadoop
[Diagram: a Replicator extracts transactions from the DBMS logs into the THL; a second Replicator reads the THL and writes CSV files into Hadoop.]
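As a rough illustration of the CSV step, the sketch below turns a list of row changes into staging CSV text. The column layout (operation code, sequence number, then the row's ID and Message values) and the choice to write an update as a delete followed by an insert are assumptions made for this example; the replicator's actual file layout may differ.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple

# (opcode, seqno, id, message); opcode is "I" (insert), "U" (update) or "D" (delete).
Change = Tuple[str, int, int, str]


def to_staging_rows(changes: Iterable[Change]) -> Iterator[Tuple[str, int, int, str]]:
    """Expand row changes into insert/delete staging rows."""
    for opcode, seqno, row_id, message in changes:
        if opcode == "U":
            # Assumption: an update is staged as delete-then-insert.
            yield ("D", seqno, row_id, "")
            yield ("I", seqno, row_id, message)
        elif opcode == "I":
            yield ("I", seqno, row_id, message)
        elif opcode == "D":
            yield ("D", seqno, row_id, "")
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")


def write_staging_csv(changes: Iterable[Change]) -> str:
    """Serialize the staging rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(to_staging_rows(changes))
    return buffer.getvalue()


if __name__ == "__main__":
    print(write_staging_csv([("I", 1, 1, "hello"), ("U", 2, 1, "hi")]), end="")
```

Because every staged row is either an insert or a delete, the files are append-only, which suits HDFS.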
[Diagram: inside Hadoop, staging CSV files (example columns ID and Message) feed a Hive table, from which materialized views are built.]
MySQL Configuration
• Use Row-based replication
• Every table must have a primary key
• Replicator configured with:
• Filters for metadata and primary key optimisation
• Extracts to standard THL
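The first two requirements can be checked before replication is set up. The sketch below is a hypothetical pre-flight check, not part of Tungsten Replicator: it reads `my.cnf` text with Python's `configparser` and looks for `binlog_format = ROW` in the `[mysqld]` section, then checks a caller-supplied mapping of table names to primary key columns. It assumes a simple `my.cnf` without `!include` directives.

```python
import configparser
from typing import Dict, List


def check_mysql_config(my_cnf_text: str, tables: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # my.cnf allows bare options such as skip-name-resolve
        strict=False,          # tolerate repeated options
        interpolation=None,    # treat '%' literally
    )
    parser.read_string(my_cnf_text)

    problems: List[str] = []
    mysqld = parser["mysqld"] if parser.has_section("mysqld") else {}
    # MySQL accepts both underscore and dash spellings of option names.
    fmt = mysqld.get("binlog_format") or mysqld.get("binlog-format") or ""
    if fmt.strip().upper() != "ROW":
        problems.append("binlog_format is not ROW")

    for name, primary_key in sorted(tables.items()):
        if not primary_key:
            problems.append(f"table {name} has no primary key")
    return problems


if __name__ == "__main__":
    cnf = "[mysqld]\nlog-bin=mysql-bin\nbinlog_format = ROW\n"
    print(check_mysql_config(cnf, {"orders": ["id"], "audit": []}))
```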
Configure Hadoop
• Data is stored in CSV format on HDFS
• Compatible with Cloudera, Hortonworks, Amazon Elastic MapReduce (EMR), and IBM InfoSphere BigInsights
• Compatible with Hive, HBase, and others
• Staging DDL can be automatically generated
• Live Table DDL can be automatically generated
DDL Generation
• Built-in Tool, part of Tungsten Replicator
• Handles staging and live table DDL generation
• Default mode applies the default mapping from source types to Hive types
• Customizable for your needs
  • For example, BigInts as Strings
• Data transformations possible through filters
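To make the idea concrete, here is a minimal sketch of type mapping and DDL generation. It is not the built-in tool: the type table is a small illustrative subset, and the staging table name prefix, the metadata column names, and the HDFS location are hypothetical.

```python
from typing import List, Tuple

# Illustrative subset of a MySQL-to-Hive type mapping (assumption).
MYSQL_TO_HIVE = {
    "tinyint": "TINYINT",
    "int": "INT",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "varchar": "STRING",
    "text": "STRING",
    "datetime": "TIMESTAMP",
}

Column = Tuple[str, str]  # (column name, MySQL type such as "varchar(255)")


def hive_type(mysql_type: str, bigint_as_string: bool = False) -> str:
    """Map one MySQL column type to a Hive type, defaulting to STRING."""
    base = mysql_type.lower().split("(")[0].strip()
    if base == "bigint" and bigint_as_string:
        return "STRING"
    return MYSQL_TO_HIVE.get(base, "STRING")


def _column_lines(columns: List[Column], bigint_as_string: bool) -> str:
    return ",\n".join(
        f"  `{name}` {hive_type(mysql_type, bigint_as_string)}"
        for name, mysql_type in columns
    )


def live_table_ddl(table: str, columns: List[Column],
                   bigint_as_string: bool = False) -> str:
    """DDL for the live (carbon-copy) Hive table."""
    return f"CREATE TABLE `{table}` (\n{_column_lines(columns, bigint_as_string)}\n);"


def staging_table_ddl(table: str, columns: List[Column], location: str,
                      bigint_as_string: bool = False) -> str:
    """DDL for an external staging table over the CSV files."""
    # Hypothetical metadata columns written ahead of the row data.
    meta: List[Column] = [("op", "varchar(1)"), ("seqno", "int")]
    body = _column_lines(meta + columns, bigint_as_string)
    return (
        f"CREATE EXTERNAL TABLE `stage_{table}` (\n{body}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"STORED AS TEXTFILE LOCATION '{location}';"
    )


if __name__ == "__main__":
    cols = [("id", "bigint(20)"), ("message", "varchar(255)")]
    print(live_table_ddl("messages", cols, bigint_as_string=True))
    print(staging_table_ddl("messages", cols, "/user/tungsten/staging/messages"))
```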
Replicator Hadoop Configuration
• Batch commit interval
  • By row count
  • By time interval
• CSV format
  • Predefined formats
  • Customizable field and row characters
• Parallelization Supported
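The two batch commit triggers can be modelled as below. This is a sketch of the general technique, not the replicator's implementation; the class name, parameter names, and default values are hypothetical. Note that in this simplified version the time interval is only checked when a new row arrives.

```python
import time
from typing import Any, Callable, List


class BatchCommitter:
    """Collect rows and flush them when a row count or time interval is reached."""

    def __init__(self, flush: Callable[[List[Any]], None],
                 max_rows: int = 1000, max_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self._clock = clock
        self._rows: List[Any] = []
        self._started = clock()

    def add(self, row: Any) -> None:
        """Buffer one row, committing if either limit has been reached."""
        self._rows.append(row)
        too_many = len(self._rows) >= self.max_rows
        too_old = self._clock() - self._started >= self.max_seconds
        if too_many or too_old:
            self.commit()

    def commit(self) -> None:
        """Flush buffered rows (if any) and restart the interval timer."""
        if self._rows:
            self._flush(self._rows)
            self._rows = []
        self._started = self._clock()


if __name__ == "__main__":
    committer = BatchCommitter(lambda rows: print(f"commit {len(rows)} rows"), max_rows=3)
    for i in range(7):
        committer.add(i)
    committer.commit()  # flush the remainder
```

Larger batches mean fewer, bigger CSV files in HDFS; smaller batches reduce the delay before changes become visible.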
Materialized Views
• Merges Data from Staging CSV into Hive Tables
• Processing separate from Replicator
• Allows individual table views to be generated independently
• Allows for custom materialization intervals
• Views based on 'live' data, or by point-in-time from CSV staging
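The merge step can be illustrated with an in-memory model. In practice the merge runs inside Hadoop; the function below only shows the logic, and it assumes staging rows of the form (opcode, seqno, id, message) with "I" for insert and "D" for delete, which is a layout chosen for this example.

```python
from typing import Dict, Iterable, Optional, Tuple

StagingRow = Tuple[str, int, int, str]  # (opcode, seqno, id, message)


def materialize(live: Dict[int, str], staging: Iterable[StagingRow],
                up_to_seqno: Optional[int] = None) -> Dict[int, str]:
    """Merge staged changes into a copy of the live table image.

    Pass up_to_seqno to build a point-in-time view instead of the latest one.
    """
    result = dict(live)
    # sorted() is stable, so rows sharing a seqno keep their staged order
    # (for example a delete followed by the insert that replaces it).
    for opcode, seqno, row_id, message in sorted(staging, key=lambda row: row[1]):
        if up_to_seqno is not None and seqno > up_to_seqno:
            break
        if opcode == "I":
            result[row_id] = message
        elif opcode == "D":
            result.pop(row_id, None)
        else:
            raise ValueError(f"unknown opcode: {opcode!r}")
    return result


if __name__ == "__main__":
    staged = [("I", 1, 1, "hello"), ("I", 2, 2, "world"),
              ("D", 3, 1, ""), ("I", 3, 1, "hi"), ("D", 4, 2, "")]
    print(materialize({}, staged))                 # latest view
    print(materialize({}, staged, up_to_seqno=2))  # point-in-time view
```

Replaying the same staged rows over an already merged table gives the same result, which is what makes it safe for staged changes to overlap with data that is already loaded.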
Demo
Provisioning Data
• Sqoop
  • Start the replicator
  • Sqoop the data
  • Materialized views are idempotent
  • DDL generation is Hive compatible
• Parallel Extractor
  • Currently Oracle only
  • Will extract data in parallel and insert into THL
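Parallel extraction in general can be sketched as splitting a table's key range into chunks and reading the chunks concurrently. The example below is a generic illustration using an in-memory list as the table; it makes no claim about how the Parallel Extractor itself divides the work, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Tuple

Row = Tuple[int, str]  # (primary key, message)


def key_ranges(min_id: int, max_id: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Split [min_id, max_id] into inclusive ranges of at most chunk_size keys."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield (start, end)
        start = end + 1


def extract_chunk(table: List[Row], key_range: Tuple[int, int]) -> List[Row]:
    """Stand-in for a range query such as WHERE id BETWEEN low AND high."""
    low, high = key_range
    return [row for row in table if low <= row[0] <= high]


def parallel_extract(table: List[Row], chunk_size: int, workers: int = 4) -> List[Row]:
    """Extract all rows chunk by chunk, using a pool of worker threads."""
    if not table:
        return []
    ids = [row[0] for row in table]
    ranges = list(key_ranges(min(ids), max(ids), chunk_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the ranges, not completion order.
        chunks = list(pool.map(lambda r: extract_chunk(table, r), ranges))
    return [row for chunk in chunks for row in chunk]


if __name__ == "__main__":
    source = [(i, f"message {i}") for i in range(1, 11)]
    print(len(parallel_extract(source, chunk_size=3)))
```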
Replication Management
• Replication can be stopped, started, restarted at any time
• Enables MySQL or Hadoop maintenance windows
• DDL customizable
• Views regenerated at any time
• Schema changes can be handled by re-Sqooping and dematerialising views
Continuent Web Page: http://www.continuent.com
Tungsten Replicator 2.2 and 3.0 Preview: http://code.google.com/p/tungsten-replicator
Our Blogs: http://scale-out-blog.blogspot.com http://mcslp.wordpress.com http://flyingclusters.blogspot.com http://www.continuent.com/news/blogs
560 S. Winchester Blvd., Suite 500 San Jose, CA 95128 Tel +1 (866) 998-3642 Fax +1 (408) 668-1009 e-mail: [email protected]