AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)


© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Greg Brandt, Liyin Tang (Airbnb)

December 2, 2016

Streaming ETL for Amazon RDS and Amazon DynamoDB

DAT315

What to Expect from the Session

• Database Change Data Capture (CDC)

• Improving ETL to Data Warehouse

SpinalTap (CDC)

Architectural Evolution

From a monolithic Rails app to many specialized services/data stores

New Challenges

• Co-processing logic breaks down outside of the process/transaction context

• Primary tables/indices live on many machines, not a single RDBMS

• Specialized systems needed for certain use cases (analytics, search, etc.)

Architectural Tenets

• Build for production

• Plan for the future, build for today

• Prefer existing solutions and patterns that we have experience with in production

• Services should own their data and not share their storage

• Mutations to data should be propagated via standardized events

Change Data Capture (CDC)

Goal: Provide streams of data mutations

• In near real time

• With timeline consistency

To keep all these systems in sync

Option 1: Application-Driven Dual Writes

• Consistency: hard (2PC/consensus needed)

• Data model: easy (schema controlled by the application)

• Development: easy

• Uses a queue (e.g. Kafka, RabbitMQ) in addition to the RDBMS

Option 2: Database Log Mining

• Consistency: easy (leverage commit log semantics)

• Parsing/data model: hard (the database’s internal commit log format)

We Chose Database Log Mining

• Parsing is easier than consensus

• Many libraries/APIs exist to make parsing easy

• Consuming the stream of commits gives timeline consistency by default

Data Ecosystem

Requirements

• Timeline consistency with at-least-once message delivery

• Easily add new sources to consume (new machines if necessary)

• Support low-latency and high-throughput use cases

• High availability with automatic failover

• Heterogeneous data sources (MySQL, Amazon DynamoDB)

MySQL Commit Log

• Java library for binary log parsing
  • https://github.com/shyiko/mysql-binlog-connector-java/

• Emit mutation events (Write_rows, Update_rows, Delete_rows)

• Logical clock determined from binlog file/offset (single-master, Multi-AZ setup)

• Leverage XidEvent for transaction boundary metadata/checkpointing (InnoDB implementation detail)
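To ground this, here is a minimal sketch of tailing the binlog with mysql-binlog-connector-java and reacting to row mutations and transaction boundaries. The host, port, and credentials are placeholders, and checkpoint persistence and error handling are omitted.

```java
import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.DeleteRowsEventData;
import com.github.shyiko.mysql.binlog.event.Event;
import com.github.shyiko.mysql.binlog.event.EventType;
import com.github.shyiko.mysql.binlog.event.UpdateRowsEventData;
import com.github.shyiko.mysql.binlog.event.WriteRowsEventData;
import com.github.shyiko.mysql.binlog.event.XidEventData;

public class BinlogTailer {
    public static void main(String[] args) throws Exception {
        // Placeholder connection parameters; point these at a MySQL replica with row-based binlog.
        BinaryLogClient client = new BinaryLogClient("localhost", 3306, "repl_user", "repl_pass");

        client.registerEventListener((Event event) -> {
            EventType type = event.getHeader().getEventType();
            if (EventType.isWrite(type)) {
                WriteRowsEventData data = event.getData();
                System.out.println("INSERT rows: " + data.getRows());
            } else if (EventType.isUpdate(type)) {
                UpdateRowsEventData data = event.getData();
                System.out.println("UPDATE rows: " + data.getRows());
            } else if (EventType.isDelete(type)) {
                DeleteRowsEventData data = event.getData();
                System.out.println("DELETE rows: " + data.getRows());
            } else if (type == EventType.XID) {
                // Transaction boundary: a natural place to checkpoint binlog file/offset.
                XidEventData xid = event.getData();
                System.out.println("COMMIT xid=" + xid.getXid()
                        + " at " + client.getBinlogFilename() + ":" + client.getBinlogPosition());
            }
        });

        client.connect(); // blocks and streams events until disconnected
    }
}
```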

DynamoDB Streams

• Using the DynamoDB Streams Kinesis Adapter

• Guarantees:
  • Each stream record appears exactly once in the stream
  • Stream records appear in the same sequence as the actual modifications to the item

• A monotonically increasing logical clock is hard:
  • Need to incorporate shard ID and parent/child shard splitting semantics
  • SequenceNumber is not global

Abstract Mutation

• Provide a monotonically increasing* id from the logical clock

• Source-specific metadata (e.g. MySQL binlog filename/offset)

• The beforeImage of the row in the DB (possibly null)

• The afterImage of the row in the DB (possibly null)

• Encode this using a source-agnostic format (e.g. Thrift)

• Write this object to a message bus (e.g. Kafka)

{
  id: Long,
  opCode: [INSERT, UPDATE, DELETE],
  metadata: Map<String, String>,
  beforeImage: Record,
  afterImage: Record
}
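As a rough illustration of that envelope, here is a hypothetical Java equivalent of the mutation schema above; field names follow the slide, while the Record type is an assumption standing in for a generic row image. The real wire format would be the source-agnostic schema (e.g. Thrift), not this class.

```java
import java.util.Map;

// Hypothetical in-memory form of the abstract mutation shown above.
public class Mutation {
    public enum OpCode { INSERT, UPDATE, DELETE }

    private final long id;                        // monotonically increasing id from the logical clock
    private final OpCode opCode;                  // type of row mutation
    private final Map<String, String> metadata;   // source-specific, e.g. binlog filename/offset
    private final Record beforeImage;             // row state before the change (null for INSERT)
    private final Record afterImage;              // row state after the change (null for DELETE)

    public Mutation(long id, OpCode opCode, Map<String, String> metadata,
                    Record beforeImage, Record afterImage) {
        this.id = id;
        this.opCode = opCode;
        this.metadata = metadata;
        this.beforeImage = beforeImage;
        this.afterImage = afterImage;
    }

    // Record is assumed to be a generic, schema-aware row image.
    public interface Record {
        Object get(String column);
    }
}
```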

Clustering/Configuration

• LEADER/STANDBY state model

• Each machine is LEADER for a subset of sources

• Workload distributed evenly

• Use the ZooKeeper-based Apache Helix framework for cluster management
  • http://helix.apache.org/

• Dynamic source configuration changes

• Helix instance group tags to separate MySQL/DynamoDB nodes
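To make the state model concrete, here is a hedged sketch of a Helix LEADER/STANDBY state model in which each partition corresponds to one CDC source and the transition callbacks start or stop streaming for that source. Only the Helix base class and annotations are real API; the class name and source handling are illustrative.

```java
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Illustrative Helix state model: one partition per CDC source.
@StateModelInfo(initialState = "OFFLINE", states = {"LEADER", "STANDBY", "OFFLINE"})
public class SourceStateModel extends StateModel {

    private final String sourceName;

    public SourceStateModel(String sourceName) {
        this.sourceName = sourceName;
    }

    @Transition(from = "OFFLINE", to = "STANDBY")
    public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // Load configuration for the source but do not stream yet.
    }

    @Transition(from = "STANDBY", to = "LEADER")
    public void onBecomeLeaderFromStandby(Message message, NotificationContext context) {
        // This node now owns the source: resume from the last checkpoint and start streaming.
        System.out.println("Starting stream for " + sourceName);
    }

    @Transition(from = "LEADER", to = "STANDBY")
    public void onBecomeStandbyFromLeader(Message message, NotificationContext context) {
        // Relinquish ownership: checkpoint the current position and stop streaming.
        System.out.println("Stopping stream for " + sourceName);
    }
}
```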

Fault Tolerance

• Controller handles node failure / elects a new LEADER for sources

• Maintain a leader_epoch counter in the Helix ZooKeeper property store

• Prefix generated ids with leader_epoch for monotonicity
  • E.g. (leader_epoch, binlog_file, binlog_pos)
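A sketch of what such an id could look like: the tuple (leader_epoch, binlog_file, binlog_pos) compared lexicographically, so ids produced by a newer leader always sort after those of its predecessor. Class and field names are illustrative.

```java
// Illustrative comparable id: leader_epoch first, then binlog file index, then offset.
public final class MutationId implements Comparable<MutationId> {
    private final long leaderEpoch;
    private final long binlogFileIndex; // numeric suffix of e.g. mysql-bin.000007
    private final long binlogPosition;

    public MutationId(long leaderEpoch, long binlogFileIndex, long binlogPosition) {
        this.leaderEpoch = leaderEpoch;
        this.binlogFileIndex = binlogFileIndex;
        this.binlogPosition = binlogPosition;
    }

    @Override
    public int compareTo(MutationId other) {
        int cmp = Long.compare(leaderEpoch, other.leaderEpoch);
        if (cmp != 0) return cmp;
        cmp = Long.compare(binlogFileIndex, other.binlogFileIndex);
        if (cmp != 0) return cmp;
        return Long.compare(binlogPosition, other.binlogPosition);
    }

    @Override
    public String toString() {
        return leaderEpoch + ":" + binlogFileIndex + ":" + binlogPosition;
    }
}
```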

Pub/Sub

• Produce mutations to Kafka with a durable configuration*

• Async coprocessors consume messages and produce new streams

• Model streaming library allows encapsulation of the DB table schema
  • The service controls both the API endpoint and the streaming view of its data

• Keep 24 hours of MySQL binlog
  • Alert / rewind on failures in this tier
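For reference, a minimal sketch of a Kafka producer with durability-oriented settings in the spirit of the "durable configuration" above. The topic name, key choice, and serialization are assumptions; the slides use the older request.required.acks name for what newer clients configure as acks.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MutationProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        // Durability-oriented settings: wait for all in-sync replicas, retry on transient errors.
        props.put("acks", "all");
        props.put("retries", Integer.MAX_VALUE);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] encodedMutation = new byte[0]; // placeholder for the Thrift-encoded mutation
            // Keying by source table keeps one table's mutations in a single partition, preserving order.
            producer.send(new ProducerRecord<>("db-mutations", "my_db.my_table", encodedMutation));
            producer.flush();
        }
    }
}
```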

Online Validation

• Download the binlog after it is flushed/immutable

• Check for holes/ordering violations by consuming the stream from Kafka

• Allows us to maintain low latency with confidence in the consistency of the stream

• Auto-healing: reset the binlog position to an earlier point if there are too many failures
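A simplified sketch of one check the validator might run while consuming from Kafka: per-partition monotonicity and gap detection on the mutation id, assuming for simplicity that ids are dense. The topic name and payload decoding are placeholders; the real validator also compares against the downloaded binlog.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamValidator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "spinaltap-validator");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("db-mutations"));
            long lastId = -1;
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    long id = decodeMutationId(record.value());
                    if (lastId >= 0 && id <= lastId) {
                        System.err.println("Ordering violation: " + id + " after " + lastId);
                    } else if (lastId >= 0 && id > lastId + 1) {
                        System.err.println("Possible hole between " + lastId + " and " + id);
                    }
                    lastId = id;
                }
            }
        }
    }

    // Hypothetical helper; the real payload would be decoded with the Thrift schema.
    private static long decodeMutationId(byte[] payload) {
        return 0L;
    }
}
```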

Production Lessons

• Need a schema history store for regions of the commit log to support rewind
  • E.g. write DDL to the commit log and apply it to a local MySQL while processing the stream to obtain the range/schema mapping

• Be careful about table encodings! (latin1, utf8, ...)

• request.required.acks = all can potentially hit every broker…
  • (Group produce requests by broker to avoid hitting too many)

• Per-source produce buffer size
  • (Tune for throughput/latency)
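On the encodings point, one concrete pitfall: string columns arrive from the binlog as raw bytes and must be decoded with the table's actual character set, and MySQL's latin1 behaves like Windows-1252 rather than strict ISO-8859-1. A small illustrative sketch covering only two charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ColumnDecoder {
    // Decode a raw string column from the binlog using the table's charset.
    // This mapping is an illustrative subset; a real implementation would cover
    // every charset used by the onboarded tables.
    public static String decode(byte[] raw, String mysqlCharset) {
        switch (mysqlCharset) {
            case "utf8":
            case "utf8mb4":
                return new String(raw, StandardCharsets.UTF_8);
            case "latin1":
                // MySQL "latin1" is effectively cp1252, not strict ISO-8859-1.
                return new String(raw, Charset.forName("windows-1252"));
            default:
                throw new IllegalArgumentException("Unhandled charset: " + mysqlCharset);
        }
    }
}
```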

Data Ecosystem

Streaming DB Exports

Batch Infrastructure

[Diagram: Airflow-scheduled batch ingestion of event logs and DB mutations from RDS/EC2 into Gold and Silver warehouse tables, queried with Hive/Presto/Spark]

Growing Pain

[Diagram: the same batch ingestion pipeline as above, with the DB export path from RDS highlighted as the growing pain]

Point-in-Time Restore based DB Export

• Pros:
  • Simple, especially for schema changes
  • Consistent

• Cons:
  • No SLA for RDS PITR restoration time
  • No near real time ad hoc query
  • No hourly snapshot
  • High storage cost

Overview

Real-Time Ingestion on HBase

[Diagram: SpinalTap streams RDS mutations through Spark Streaming into HBase; HBase serves real-time queries, and snapshots exported to HDFS serve batch queries via Hive/Presto/Spark]

Access Data in HBase

[Diagram: a unified view over real-time data in HBase and snapshots on HDFS, accessed by Spark streaming jobs, Presto for interactive queries, and Hive/Spark for batch jobs]

Snapshot & Reseed

[Diagram: HBase snapshots are exported to HDFS via HFile links; bulk upload back into HBase is used to reseed tables]

Onboard New Tables

[Diagram: a new table is reseeded from RDS into HBase and HDFS, then kept up to date by ingesting the stream of mutations from SpinalTap]

Disaster Recovery - Checkpoint

Disaster Recovery - Rewind

Disaster Recovery - Reseed

[Diagrams: each step shows the same pipeline of RDS reseeds into HBase/HDFS alongside ingestion of the SpinalTap mutation stream, illustrating checkpointing the stream position, rewinding to an earlier position, and reseeding from RDS]

HBase Schema

Key Space Design

• Multiplex all DB tables onto a single HBase table

• Fast point lookups based on primary keys

• Efficient sequential scans for one table

• Load balancing

HBase Row Keys – Primary Keys

• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)

• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2

• Fast point lookup based on primary keys

• Efficient sequential scan for all the keys in the same DB/table

• Balanced based on the hash key

[ Hash | DB_TABLE | PK1=v1 | PK2=v2 ]
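A rough sketch of how such a row key could be assembled, assuming UTF-8 encoded components and a fixed-length MD5 prefix; the delimiter and helper names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {
    // Build a row key of the form: md5(table + key pairs) + table + key pairs.
    // The fixed-length hash prefix spreads rows evenly across regions; the readable
    // suffix identifies the source table and key values.
    public static byte[] primaryKeyRowKey(String dbTable, String... keyPairs) {
        StringBuilder suffix = new StringBuilder(dbTable);
        for (String pair : keyPairs) {
            suffix.append('\u0001').append(pair); // e.g. "PK1=v1"; the delimiter is an assumption
        }
        byte[] suffixBytes = suffix.toString().getBytes(StandardCharsets.UTF_8);
        byte[] hash = md5(suffixBytes);

        byte[] rowKey = new byte[hash.length + suffixBytes.length];
        System.arraycopy(hash, 0, rowKey, 0, hash.length);
        System.arraycopy(suffixBytes, 0, rowKey, hash.length, suffixBytes.length);
        return rowKey;
    }

    private static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```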

HBase Row Keys – Secondary Keys

• Hash Key = md5(DB_TABLE, Index_1=v1)

• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1

• Prefix scan for a given secondary index

[ Hash | DB_TABLE | Index=v1 | PK1=vpk1 ]
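With this layout, all primary keys matching an index value share a row-key prefix, so a secondary-index lookup becomes an HBase prefix scan. A hedged sketch using the standard HBase client; the table name and the RowKeys helper from the previous sketch are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("db_mutations"))) {

            // Prefix = hash + table + secondary index value; matching rows carry the primary keys.
            byte[] prefix = RowKeys.primaryKeyRowKey("my_db.my_table", "Index_1=v1");
            Scan scan = new Scan();
            scan.setRowPrefixFilter(prefix);

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println("Matched row: " + Bytes.toString(result.getRow()));
                }
            }
        }
    }
}
```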

HBase Versioning

Rows                               CF: Column   Version                    Value
<ShardKey><DB_TABLE_#1><PK_a=A>    id           Fri May 19 00:33:19 2016   101
<ShardKey><DB_TABLE_#1><PK_a=A>    city         Fri May 19 00:33:19 2016   San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A>    city         Fri May 10 00:34:19 2016   New York
<ShardKey><DB_TABLE_#2><PK_a=A’>   id           Fri May 19 00:33:19 2016   1

Version by Timestamp

[Diagram: transactions TXN 1 through TXN N in binlog order, with commit timestamps 101, 102, 103, ...]

Version by Timestamp

[Diagram: the same transactions in binlog order at offsets mysql-bin.00000:100, 101, 102, ..., but with commit timestamps T1, T3, T2 out of order because of NTP clock adjustments]

HBase Versioning

Rows                              CF: Column   Version               Commit TS
<ShardKey><DB_TABLE_#1><PK_a=A>   id           mysql-bin.00000:100   T0
<ShardKey><DB_TABLE_#1><PK_a=A>   id           mysql-bin.00000:101   T1
<ShardKey><DB_TABLE_#1><PK_a=A>   id           mysql-bin.00000:102   T3
<ShardKey><DB_TABLE_#1><PK_a=A>   id           mysql-bin.00000:103   T2
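The table above versions cells by binlog position rather than wall-clock time, so versions follow binlog order even when commit timestamps go backwards under NTP. A sketch of writing a cell with an explicit version, under the assumption that the binlog file index and offset are packed into the long HBase uses for cell versions; the column family and bit layout are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

public class VersionedWriter {
    private static final byte[] CF = "d".getBytes(StandardCharsets.UTF_8); // column family is an assumption

    // Assumed packing: binlog file index in the high bits, offset in the low 40 bits.
    static long binlogVersion(long binlogFileIndex, long binlogOffset) {
        return (binlogFileIndex << 40) | (binlogOffset & ((1L << 40) - 1));
    }

    // Write one column with the binlog-derived version instead of the current time.
    static void writeCell(Table table, byte[] rowKey, String column, byte[] value,
                          long binlogFileIndex, long binlogOffset) throws IOException {
        long version = binlogVersion(binlogFileIndex, binlogOffset);
        Put put = new Put(rowKey);
        put.addColumn(CF, column.getBytes(StandardCharsets.UTF_8), version, value);
        table.put(put);
    }
}
```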

PITR Semantics

[Diagram: transactions TXN 1 through TXN N in binlog order with commit timestamps 101, 103, 102, i.e. out of order relative to wall-clock time under NTP]

PITR Semantics: Binlog Commit Time Index

Rows                                          Version (Logical Offset)   Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100>   100                        mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101>   101                        mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103>   103                        mysql-bin.00000:103   <- the last mutation before the PITR cut
<ShardKey><DB_TABLE_#1><2016-05-24 00><102>   102                        mysql-bin.00000:102   <- the first mutation across the PITR cut
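A sketch of how such an index might be consulted to find the binlog position for a point-in-time cut, assuming row keys sort as <ShardKey><DB_TABLE><commit hour><offset> as in the table above; the table name, prefix construction, and helper names are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PitrLookup {
    // Scan the commit-time index for one shard/table up to the PITR hour and remember
    // the last binlog position seen: that is the cut point for the export.
    static String lastOffsetBefore(Table indexTable, byte[] shardTablePrefix, String pitrHour)
            throws IOException {
        // Rows committed at or after the PITR hour are excluded by the stop row.
        byte[] stopRow = Bytes.add(shardTablePrefix, pitrHour.getBytes(StandardCharsets.UTF_8));

        Scan scan = new Scan();
        scan.withStartRow(shardTablePrefix);
        scan.withStopRow(stopRow); // exclusive
        String lastOffset = null;
        try (ResultScanner scanner = indexTable.getScanner(scan)) {
            for (Result result : scanner) {
                // The cell value holds the binlog position, e.g. "mysql-bin.00000:103".
                lastOffset = Bytes.toString(result.value());
            }
        }
        return lastOffset;
    }
}
```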

Streaming DB Export

• Pros:
  • Consistent, with PITR semantics
  • High SLA for the daily snapshot
  • Near real time ad hoc query (Hive/Spark compatible)
  • Hourly snapshot view
  • Low storage cost

• Cons:
  • Schema changes

Thank you!

Remember to complete your evaluations!