New Database Replication and Data Integration with Hadoop and BI
Jeffrey Surretsky
NYOUG
December 2013
Big Data – Hadoop®
Terabyte, Petabyte, Exabyte, Zettabyte
The explosion of data continues to burden the data tool chain
Transactional Data
Traditionally, only transactional data was generated and stored in databases
• Structured
• Measured growth
Human Files
But over time, we started creating unstructured data
• Docs, Images, Video
• Multiple formats
• Fast growth
Social & Machines
Social and machine sources have added data exponentially
• Likes, tweets, relationships (social)
• Log files (machine)
• Exponential growth
Timeline: mainframe, PC, internet, mobile, machine
• Proliferation of new technologies for creating and capturing user-generated data
• Increased “interconnectedness” drives consumption (creating more data)
• Inexpensive storage makes it possible to keep more data longer
• Need to extract actionable insights from all data assets to gain competitive edge
*Source: IDC 2011
Big data market drivers
3Vs
• Velocity: Batch, Near time, Real time, Streams
• Volume: Petabytes, Records, Transactions, Tables, files
• Variety: Structured, Unstructured, Semi-structured, All the above
Big data – Scaling up on RDBMS
• Partitioning
• Materialized Views
• In memory cache
• …and who are we kidding here!
RDBMS Yodabytes handle cannot!
Big data – RDBMS Cluster
[Diagram labels: Controller; SQL nodes labeled by month: Jan 1990, Feb 1990, Mar 1990, Apr 1990, May 1990, Jun 1990, Jul 1990, Aug 1990, …, Jun 2013]
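The cluster slide pairs a Controller with one SQL node per month of data. A minimal sketch of that idea follows, assuming the controller routes each row to a node keyed by (year, month) and fans a date-range query out only to the months it overlaps. The `Controller` class, its methods, and the in-memory lists standing in for database nodes are all hypothetical and are not taken from the slides.

```python
from datetime import date


def month_key(d):
    """Return the (year, month) partition key for a date."""
    return (d.year, d.month)


class Controller:
    """Hypothetical controller: one in-memory 'node' per month of data."""

    def __init__(self):
        # (year, month) -> list of (date, row); each list stands in for a SQL node.
        self.nodes = {}

    def insert(self, d, row):
        """Route a row to the node that owns its month."""
        self.nodes.setdefault(month_key(d), []).append((d, row))

    def query(self, start, end):
        """Return rows dated within [start, end] and the number of nodes visited."""
        hits = []
        touched = 0
        for key in sorted(self.nodes):
            if month_key(start) <= key <= month_key(end):
                touched += 1
                hits.extend(row for d, row in self.nodes[key] if start <= d <= end)
        return hits, touched
```

In this sketch a query for January and February 1990 visits two nodes no matter how many months the cluster holds, which is the benefit of partitioning by month. The cost is that every added month is another database node for the controller to manage.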
Big data - Hadoop
Big data – Hadoop benefits
Scalable storage
Massive parallel processing
Cost effective
Hadoop operational use cases
1. Staging
2. Warehousing
3. Archiving
Not glamorous, but highly effective.
Today’s solutions
Analytics
OLTP
Data Warehouse
Log-based CDC Replication
• Near real-time log-based CDC from Oracle
• Applying Changes to Hadoop
Log-based CDC from Oracle-to-Oracle Architecture
[Diagram labels. Source: Redo/Archive logs, Capture, Capture queue, Read, Export queue, Export. Target: Import, Post queue, Post, SQL]
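The flow suggested by the diagram labels can be sketched as a chain of queues. This is a minimal illustration, assuming the stages run in the order Capture, Read, Export/Import, Post and that Post replays each change as SQL on the target. The redo-log tuples, the `users` table, and the function names are hypothetical. SQLite stands in for the target database, and a plain function call stands in for the network hop.

```python
import sqlite3
from collections import deque

# Hypothetical redo-log entries: (operation, primary key, value).
REDO_LOG = [
    ("INSERT", 1, "alice"),
    ("INSERT", 2, "bob"),
    ("UPDATE", 1, "alicia"),
    ("DELETE", 2, None),
]


def capture(redo_log, capture_queue):
    """Source side: copy changes from the log into the capture queue."""
    for entry in redo_log:
        capture_queue.append(entry)


def read(capture_queue, export_queue, route="target-1"):
    """Source side: attach a routing label and move changes to the export queue."""
    while capture_queue:
        export_queue.append((route, capture_queue.popleft()))


def export_import(export_queue, post_queue):
    """Stand-in for the network hop: Export sends, Import receives, order preserved."""
    while export_queue:
        _route, change = export_queue.popleft()
        post_queue.append(change)


def post(post_queue, conn):
    """Target side: replay each change as SQL against the target database."""
    while post_queue:
        op, key, value = post_queue.popleft()
        if op == "INSERT":
            conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (key, value))
        elif op == "UPDATE":
            conn.execute("UPDATE users SET name = ? WHERE id = ?", (value, key))
        elif op == "DELETE":
            conn.execute("DELETE FROM users WHERE id = ?", (key,))
    conn.commit()


def replicate(redo_log):
    """Run all stages once and return the resulting target rows."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    capture_q, export_q, post_q = deque(), deque(), deque()
    capture(redo_log, capture_q)
    read(capture_q, export_q)
    export_import(export_q, post_q)
    post(post_q, target)
    return target.execute("SELECT id, name FROM users ORDER BY id").fetchall()
```

Because each stage only reads from one queue and writes to the next, the source database is touched only through its logs, and changes reach the target in the order they were logged.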
Log-based CDC Replication – impact-free and limitless!
Log-based CDC Data Integration Architecture
[Diagram labels: Oracle source, Redo/Archive logs, Capture, Capture queue, Read, Post queue, JMS post, JMS queue (twice), Target(s): Custom App, Dell App, …And more]
Combined source & target process implementation
Near real-time data integration
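To illustrate the JMS post step, here is a minimal sketch in which each change record is serialized and placed on a queue that a custom application then consumes. Python has no JMS client in its standard library, so `queue.Queue` is used purely as a stand-in for a JMS queue. The JSON message layout (`table`, `op`, `key`, `row`) and the function names are assumptions for illustration, not the product's message format.

```python
import json
import queue


def jms_post(change, jms_queue):
    """Stand-in for a JMS post: serialize one change record and enqueue it."""
    jms_queue.put(json.dumps(change, sort_keys=True))


def custom_app(jms_queue):
    """Hypothetical consumer: drain the queue and keep the latest row per key."""
    latest = {}
    while not jms_queue.empty():
        msg = json.loads(jms_queue.get())
        key = (msg["table"], msg["key"])
        if msg["op"] == "DELETE":
            latest.pop(key, None)
        else:
            latest[key] = msg["row"]
    return latest


def demo():
    """Post two changes for the same order and return the consumer's view."""
    q = queue.Queue()
    jms_post({"table": "ORDERS", "op": "INSERT", "key": 7, "row": {"status": "NEW"}}, q)
    jms_post({"table": "ORDERS", "op": "UPDATE", "key": 7, "row": {"status": "SHIPPED"}}, q)
    return custom_app(q)
```

Publishing changes as self-describing messages lets any consumer that can read the queue integrate with the source without connecting to the Oracle database itself.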
Log-based CDC Database Replication & Near Real-time Data Integration Summary
[Diagram labels: Source, Target(s), JMS queue, Custom app, …And more]
• Database replication
• Near real-time data integration
Connector for Hadoop
• Provides near real-time data replication from Oracle to Hadoop environments. The solution enables organizations to affordably replicate live data from Oracle tables:
– In near real time to HDFS and Hive environments
– In real time to HBase
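The difference between the two delivery modes can be sketched with two sinks: one applies every change as it arrives, and the other buffers changes and writes them out in batches. This is an illustration only. It assumes that "near real time" to HDFS and Hive means micro-batching, which the slide does not state. `HBaseLikeSink` and `HdfsLikeSink` are hypothetical in-memory stand-ins and do not use any real HBase or HDFS client API.

```python
class HBaseLikeSink:
    """Stand-in for a key-value store: each change is visible immediately."""

    def __init__(self):
        self.rows = {}

    def apply(self, change):
        op, key, row = change
        if op == "DELETE":
            self.rows.pop(key, None)
        else:  # INSERT and UPDATE both overwrite the row for this key
            self.rows[key] = row


class HdfsLikeSink:
    """Stand-in for an append-only file store: changes land in batches."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.buffer = []  # changes not yet written
        self.files = []   # each entry models one written file of changes

    def apply(self, change):
        self.buffer.append(change)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write buffered changes as one file, if there are any."""
        if self.buffer:
            self.files.append(list(self.buffer))
            self.buffer.clear()


def replicate(changes, batch_size=2):
    """Feed the same change stream to both sinks and return them."""
    hbase, hdfs = HBaseLikeSink(), HdfsLikeSink(batch_size=batch_size)
    for change in changes:
        hbase.apply(change)
        hdfs.apply(change)
    return hbase, hdfs
```

With three changes and a batch size of two, the key-value sink already reflects all three, while the file sink has written one batch and still holds one change in its buffer until the next flush.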
SharePlex Connector for Hadoop architecture
[Diagram built up over successive slides. Labels: Log-based CDC (shown as SharePlex for Oracle on some slides), SQOOP, Connector for Hadoop, JMS, HBase, HDFS]
Siebel CRM
PeopleSoft HR
SAP Manufacturing
Oracle Financials
Data warehouse, stage and archive
Reporting Dashboards
Analytics
SharePlex Connector for Hadoop – use case
...
Questions