
Real time ETL processing using Spark streaming

Date post: 16-Apr-2017
Upload: datamantra
Real Time ETL processing By Veeramani Moorthy
Transcript
Page 1: Real time ETL processing using Spark streaming

Real Time ETL processing

By Veeramani Moorthy

Page 2

Agenda

Real time ETL Architecture

Why Reconciler?

Requirements for Reconciler

Reconciler Data model

Q & A?

Page 3

Real-Time ETL Architecture

[Architecture diagram. Labels recovered from the slide:]

Components: Source DB; GoldenGate; Trail Files; Adapter; Schema Registry; broker topics (Companies Topic, Addresses Topic, Reconciled Companies Topic, Reconciled Addresses Topic); Streaming Reconciler job (Spark Reconciler); Streaming Joiner/Transformer Job (Spark Joiner); Mapping service

Numbered flows:
[1.0] Data Extract
[1.1] Data Pump
[1.2] Get/Create/Update Schema
[1.2.1] JDBC Fetch Table Schema
[2.1] CDC Events to broker
[3.1] CDC Events to broker

Other flow labels: Read; Get Table Schema; Write output; Read/Write for Reconcile Companies; Read/Write for Reconcile Addresses; Get Mapping; fn

• Schema Registry is a repository of ALL schemas, which are versioned.
• GoldenGate captures the table change events.
• Kafka: distributed messaging system.
• CDC: Change Data Capture.

Page 4

Requirements for Reconciler

Support for Idempotency

Support for immutability

Support for Schema evolution

Support to handle out of order CDC events

Page 5

Challenges in Spark streaming

Page 6

Out of sequence

UPDATE comes first, INSERT comes later
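
To make the problem concrete, here is a minimal plain-Python sketch (not Spark code) of what can go wrong when CDC events are applied in place, row by row. The event layout (`op`, `key`, `ts`, `values`) and the function `apply_in_place` are hypothetical and exist only for this illustration; the sample values are borrowed from the data model slide.

```python
# Naive in-place apply of CDC events, to show why arrival order matters.
# The event layout (op, key, ts, values) is a hypothetical one for this sketch.

def apply_in_place(table, event):
    """Apply one CDC event directly to a dict keyed by primary key."""
    op, key, values = event["op"], event["key"], event["values"]
    if op == "INSERT":
        table[key] = dict(values)
    elif op == "UPDATE":
        if key in table:  # an UPDATE for a row we have not seen yet is lost
            table[key].update(values)
    return table

insert = {"op": "INSERT", "key": 10201, "ts": 12345677,
          "values": {"company_name": "ABC Inc", "company_addr": "EGL, BLR"}}
update = {"op": "UPDATE", "key": 10201, "ts": 22345677,
          "values": {"company_addr": "Ecospace, BLR"}}

# Events arrive in source order: the final address is the updated one.
in_order = {}
for event in (insert, update):
    apply_in_place(in_order, event)

# UPDATE arrives before INSERT: the update is dropped and the row stays stale.
out_of_order = {}
for event in (update, insert):
    apply_in_place(out_of_order, event)

print(in_order[10201]["company_addr"])      # Ecospace, BLR
print(out_of_order[10201]["company_addr"])  # EGL, BLR
```

The two runs see the same events but end in different states, which is the behaviour the reconciler has to avoid.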

Page 7

Challenges in Spark streaming …

Page 8

Data model

Instead of

Company_id   Company_name   Company_addr
10201        ABC Inc        EGL, BLR
….

Go with

Tuple Id   Source DB Timestamp   Attribute Name   Attribute value   isDelete?
10201      12345677              company_id       10201             false
10201      12345677              company_name     ABC Inc           false
10201      12345677              company_addr     EGL, BLR          false
10201      22345677              company_addr     Ecospace, BLR     false
….
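
As a rough illustration of the "go with" layout, the plain-Python sketch below turns one wide row into one record per attribute. The names `AttributeRecord` and `to_attribute_records` are hypothetical, and storing every value as a string is an assumption of this sketch, not something the slides specify.

```python
from typing import NamedTuple

class AttributeRecord(NamedTuple):
    """One row of the attribute-level model (field names are illustrative)."""
    tuple_id: int
    source_db_timestamp: int
    attribute_name: str
    attribute_value: str  # assumption: values are kept as strings
    is_delete: bool

def to_attribute_records(tuple_id, source_db_timestamp, row):
    """Explode a wide row (column name -> value) into one record per column."""
    return [
        AttributeRecord(tuple_id, source_db_timestamp, name, str(value), False)
        for name, value in row.items()
    ]

records = to_attribute_records(
    10201,
    12345677,
    {"company_id": 10201, "company_name": "ABC Inc", "company_addr": "EGL, BLR"},
)

for record in records:
    print(record)
```

Each column of the wide row becomes its own record carrying the source timestamp, so a later change to a single column only adds one new record.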

Page 9

How does it solve?

Immutability?

Idempotency?

Out of sequence events?
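
One way to read these three questions against the attribute-level data model is sketched below in plain Python (not Spark code). The helpers `append` and `current_view` are hypothetical. Records are only ever added to a set (immutability), adding the same record twice changes nothing (idempotency), and the current value of each attribute is picked by source timestamp rather than by arrival order (out of sequence events). The sketch assumes no two different values share the same timestamp for one attribute.

```python
from itertools import permutations

# Record layout, following the data model slide:
# (tuple_id, source_db_timestamp, attribute_name, attribute_value, is_delete)

def append(log, record):
    """Return a new log containing the record; existing records are never changed."""
    return log | {record}

def current_view(log, tuple_id):
    """Pick, per attribute, the value with the highest source timestamp."""
    latest = {}
    for tid, ts, name, value, is_delete in log:
        if tid != tuple_id:
            continue
        if name not in latest or ts > latest[name][0]:
            latest[name] = (ts, value, is_delete)
    return {name: value
            for name, (ts, value, is_delete) in latest.items()
            if not is_delete}

events = [
    (10201, 12345677, "company_id", "10201", False),
    (10201, 12345677, "company_name", "ABC Inc", False),
    (10201, 12345677, "company_addr", "EGL, BLR", False),
    (10201, 22345677, "company_addr", "Ecospace, BLR", False),
]

# Try every arrival order, and redeliver the first two events each time.
views = []
for order in permutations(events):
    log = frozenset()
    for record in order + order[:2]:
        log = append(log, record)
    views.append(current_view(log, 10201))

print(views[0])
print(all(view == views[0] for view in views))
```

Every arrival order, with or without redelivery, yields the same view, in which `company_addr` is the later value `Ecospace, BLR`.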

Page 10

Schema Evolution

Tuple Id   Source DB Timestamp   Attribute Name    Attribute value     isDelete?
10201      12345677              company_id        10201               false
10201      12345677              company_name      ABC Inc             false
10201      12345677              company_addr      EGL, BLR            false
10201      22345677              company_addr      Ecospace, BLR       false
10201      22345900              Registered_name   ABC India Pvt Ltd   false
….

Do I have to change the destination schema?

Page 11

Schema Evolution

Addition of new column

Deletion of an existing column

Data Type change
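
The plain-Python sketch below shows how the first two cases could look under the attribute-level model: a new column is just a record with a new attribute name, and a removed column is modelled here as a record with the delete flag set. Using the `isDelete?` flag for column removal, the timestamp `32345677`, and the helper `current_view` are all assumptions of this sketch. A data type change is not modelled; if values are stored as strings, as assumed here, interpreting the type would fall to the reader of the data.

```python
# Record layout, following the schema evolution slide:
# (tuple_id, source_db_timestamp, attribute_name, attribute_value, is_delete)

def current_view(records):
    """Latest non-deleted value per attribute, chosen by source timestamp."""
    latest = {}
    for _tid, ts, name, value, is_delete in records:
        if name not in latest or ts > latest[name][0]:
            latest[name] = (ts, value, is_delete)
    return {name: value
            for name, (ts, value, is_delete) in latest.items()
            if not is_delete}

base = [
    (10201, 12345677, "company_id", "10201", False),
    (10201, 12345677, "company_name", "ABC Inc", False),
    (10201, 12345677, "company_addr", "EGL, BLR", False),
    (10201, 22345677, "company_addr", "Ecospace, BLR", False),
]

# Addition of a new column: one more record, same record layout.
with_new_column = base + [
    (10201, 22345900, "Registered_name", "ABC India Pvt Ltd", False),
]

# Deletion of an existing column, modelled as a delete marker (hypothetical timestamp).
with_dropped_column = with_new_column + [
    (10201, 32345677, "company_addr", "", True),
]

print(current_view(with_new_column))
print(current_view(with_dropped_column))
```

In both cases the record layout itself stays the same; only the set of attribute names seen for the tuple changes.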

Page 12
Page 13
