Distributed Data Systems 1 ©2016 LinkedIn Corporation. All Rights Reserved.
ESPRESSO Database Replication with Kafka
Tom Quiggle, Principal Staff Software Engineer
[email protected] · linkedin.com/in/tquiggle · @TomQuiggle
Agenda
– ESPRESSO Overview
  – Architecture
  – GTIDs and SCNs
  – Per-instance replication (0.8)
  – Per-partition replication (1.0)
– Kafka Per-Partition Replication
  – Requirements
  – Kafka Configuration
  – Message Protocol
  – Producer
  – Consumer
– Q&A
ESPRESSO Overview
ESPRESSO¹: Hosted, Scalable Data as a Service (DaaS) for LinkedIn's Online Structured Data Needs
– Databases are partitioned
– Partitions are distributed across the available hardware
– An HTTP proxy routes requests to the appropriate database node
– Apache Helix provides centralized cluster management

1. Elastic, Scalable, Performant, Reliable, Extensible, Stable, Speedy and Operational
ESPRESSO Architecture
GTIDs and SCNs
MySQL 5.6 Global Transaction Identifiers provide a unique, monotonically increasing identifier for each committed transaction:
GTID :== source_id:transaction_id
ESPRESSO conventions:
– source_id encodes the database name and partition number
– transaction_id is a 64-bit numeric value: the high-order 32 bits are a generation count, the low-order 32 bits are a sequence within the generation
– The generation increments with every change in mastership
– The sequence increases with each transaction
– We refer to the transaction_id component as a Sequence Commit Number (SCN)
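The generation/sequence packing can be illustrated with a short sketch (the helper names are hypothetical, not ESPRESSO's actual code):

```python
# Pack a 64-bit transaction_id: high 32 bits = generation, low 32 = sequence.
def make_scn(generation: int, sequence: int) -> int:
    assert 0 <= generation < 2**32 and 0 <= sequence < 2**32
    return (generation << 32) | sequence

# Recover (generation, sequence) from a transaction_id.
def split_scn(scn: int) -> tuple[int, int]:
    return scn >> 32, scn & 0xFFFFFFFF

# SCNs order correctly across mastership changes: every transaction in a
# later generation compares greater than any transaction in an earlier one.
assert make_scn(4, 0) > make_scn(3, 2**32 - 1)
assert split_scn(make_scn(3, 104)) == (3, 104)
```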
Example binlog transaction:

SET @@SESSION.GTID_NEXT= 'hash(db_part):(gen<<32 + seq)';
SET TIMESTAMP=<seconds_since_Unix_epoch>
BEGIN
Table_map: `db_part`.`table1` mapped to number 1234
Update_rows: table id 1234
BINLOG '...'
BINLOG '...'
Table_map: `db_part`.`table2` mapped to number 5678
Update_rows: table id 5678
BINLOG '...'
COMMIT
ESPRESSO: 0.8 Per-Instance Replication

[Diagram: two three-node clusters; partitions P1–P3 and P4–P6 are replicated on every node of their cluster, one master copy plus slaves. Legend: Master / Slave / Offline.]
Issues with Per-Instance Replication
– Poor resource utilization: only 1/3 of nodes service application requests
– Partitions unnecessarily share fate
– Cluster expansion is an arduous process
– Upon node failure, 100% of the failed node's traffic is redirected to a single node
ESPRESSO: 1.0 Per-Partition Replication
Per-instance MySQL replication is replaced with per-partition replication through Kafka.

[Diagram: Helix tracks LIVEINSTANCES and publishes an EXTERNALVIEW (e.g. P4 → Master: Node 1, Slave: Node 3). Partitions P1–P12 are spread across Nodes 1–3 with master and slave replicas interleaved; Kafka carries the replication stream between them.]
Cluster Expansion
Initial state: 12 partitions, 3 storage nodes, replication factor r=2.

[Diagram: masters and slaves for P1–P12 interleaved across Nodes 1–3; EXTERNALVIEW shows P4 → Master: Node 1, Slave: Node 3. Legend: Master / Slave / Offline.]
Adding a node: Helix sends OfflineToSlave transitions for the new node's partitions.

[Diagram: Node 4 joins with new replicas of P1, P4, P7, P8, P9 and P12; EXTERNALVIEW shows P4 → Master: Node 1, Slave: Node 3, Offline: Node 4.]
Once a new replica is ready, ownership is transferred and the old replica is dropped.

[Diagram: P4's mastership moves to Node 4 (EXTERNALVIEW: P4 → Master: Node 4, Slave: Node 3); Node 1 drops its P4 replica.]
Migration of master and slave partitions continues.

[Diagram: mid-migration state; EXTERNALVIEW shows P9 → Master: Node 3, Slave: Node 1, Offline: Node 4.]
Rebalancing is complete after the last partition migration.

[Diagram: P1–P12 evenly redistributed across Nodes 1–4; EXTERNALVIEW shows P9 → Master: Node 4, Slave: Node 3.]
Node Failover
During a failure or planned maintenance, slaves are promoted to master.

[Diagram: the four-node cluster in its steady state before the failure.]
[Diagram: Node 3 fails and disappears from LIVEINSTANCES; each partition it mastered is taken over by a slave on a surviving node. EXTERNALVIEW shows P9 → Master: Node 4, Offline: Node 3.]
Advantages of Per-Partition Replication
– Better hardware utilization: all nodes service application requests
– Mastership hand-off is done in parallel
– After a node failure, the full replication factor can be restored in parallel
– Cluster expansion is as easy as: add node(s) to the cluster, then rebalance
– A single platform serves all Change Data Capture: internal replication, cross-colo replication, and application CDC consumers
Kafka Per-Partition Replication

Kafka for Internal Replication

Requirements
Delivery must be:
– Guaranteed
– In order
– Exactly once (sort of)
Broker Configuration
– Replication factor = 3 (most LinkedIn clusters use 2)
– min.insync.replicas = 2
– Unclean leader elections disabled
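In current Kafka releases these map onto the following broker properties (a sketch of a server.properties fragment; "min.isr" on the slide is shorthand for `min.insync.replicas`):

```properties
# Durable-replication broker settings (sketch)
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
```

Combined with `acks=all` on the producer, `min.insync.replicas=2` guarantees every acknowledged write is on at least two replicas.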
Message Protocol
Legend: B = begin txn, E = end txn, C = control.

[Diagram: the master's producer publishes its MySQL binlog to the partition's Kafka topic as an SCN-tagged stream (3:100 B,E · 3:101 B,E · 3:102 B … 3:102 E · 3:103 B,E · 3:104 B … 3:104 E); the slave's consumer applies the stream to its local MySQL.]
Message Protocol – Mastership Handoff
The promoted slave writes a control message for the new generation (4:0 C) to the partition's Kafka topic.

[Diagram: the old master's stream ends at 3:104 E; the promoted slave appends 4:0 C.]
The promoted slave keeps consuming until it has consumed its own control message, so it is guaranteed to have applied every message the old master published before it.

[Diagram: the promoted slave's consumer reaches 4:0 C.]
The new master then enables writes with the new generation (4:0 B …).

[Diagram: the old master is idle; the new master's producer begins publishing generation-4 transactions.]
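The handoff barrier can be sketched as follows (a minimal illustration with hypothetical names, not ESPRESSO's actual code): the promoted slave applies messages until it encounters the control message it wrote itself.

```python
# Drain the Kafka partition until we consume our own control message;
# everything before it was published by the previous generation's master
# and must be applied before the new master may accept writes.
def drain_until_control(messages, my_control_id, apply):
    for msg in messages:
        if msg.get("control") == my_control_id:
            return True          # caught up: safe to enable writes
        apply(msg)               # apply the old generation's message
    return False                 # control message not yet seen: keep polling

applied = []
stream = [{"scn": "3:104"}, {"control": "4:0"}]
assert drain_until_control(stream, "4:0", applied.append) is True
assert applied == [{"scn": "3:104"}]
```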
Kafka Producer Configuration
– acks = "all"
– retries = Integer.MAX_VALUE
– block.on.buffer.full = true
– max.in.flight.requests.per.connection = 1
– linger.ms = 0
On a non-retryable exception:
– destroy the producer
– create a new producer
– resume from the last checkpoint
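The recovery rule can be sketched as a send loop (hypothetical helpers, not ESPRESSO code): on a non-retryable send failure, destroy the producer, create a new one, and resume publishing the binlog from the last checkpoint.

```python
class NonRetryableError(Exception):
    pass

def replicate(create_producer, read_binlog_from, load_checkpoint):
    producer = create_producer()
    while True:
        checkpoint = load_checkpoint()      # (SCN, Kafka offset) from MySQL
        try:
            for event in read_binlog_from(checkpoint):
                producer.send(event)        # acks=all, one in flight
            return                          # binlog drained (for the sketch)
        except NonRetryableError:
            producer.close()                # destroy the failed producer...
            producer = create_producer()    # ...recreate, and loop: resume
                                            # from the checkpoint, replaying
                                            # any unacknowledged messages

sent = []
state = {"failed": False}

def create_producer():
    class P:
        def send(self, e):
            if not state["failed"] and e == 2:
                state["failed"] = True
                raise NonRetryableError()
            sent.append(e)
        def close(self):
            pass
    return P()

replicate(create_producer, lambda cp: [1, 2, 3], lambda: (0, 0))
# Event 1 was published twice: once before the failure, once on replay.
assert sent == [1, 1, 2, 3]
```

The replay after recovery is exactly how duplicate and partial transactions end up in the Kafka stream, which the consumer protocol later filters out.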
Kafka Producer Checkpointing
The producer periodically writes (SCN, Kafka offset) to a MySQL table. It may only checkpoint an offset at the end of a complete transaction!

[Diagram: the stream ends mid-transaction at 3:104 B … 3:104 — the producer can't checkpoint there.]
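The transaction-boundary rule can be sketched as follows (hypothetical class, not ESPRESSO code): only the offset of the last acknowledged end-of-transaction (E) message is safe to persist.

```python
# Track the only offset that is safe to checkpoint: the offset of the last
# acknowledged message carrying the end-of-transaction (E) flag.
class CheckpointTracker:
    def __init__(self):
        self.safe = None                    # (SCN, offset) of last complete txn

    def on_ack(self, scn, offset, is_txn_end):
        # Invoked from the producer's send callback, which supplies the offset.
        if is_txn_end:
            self.safe = (scn, offset)

    def checkpoint(self):
        # Periodically written to the (SCN, Kafka offset) MySQL table.
        return self.safe

t = CheckpointTracker()
t.on_ack(scn=(3 << 32) | 103, offset=5, is_txn_end=True)    # 3:103 E
t.on_ack(scn=(3 << 32) | 104, offset=6, is_txn_end=False)   # 3:104 B, mid-txn
assert t.checkpoint() == ((3 << 32) | 103, 5)               # still at 3:103
```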
The producer's checkpoint lags its current Kafka offset; the offset is obtained from the send callback.

[Diagram: the last checkpoint sits at the end of an earlier transaction while the producer keeps publishing.]
A send() fails with a non-retryable error partway through transaction 3:104.

[Diagram: the failure occurs after 3:104 B, with the last checkpoint at an earlier transaction boundary.]
The producer is recreated and resumes from the last checkpoint; messages after the checkpoint are replayed.

[Diagram: the new producer re-reads the binlog from the checkpoint and republishes starting at 3:102 B.]
The Kafka stream now contains replayed transactions, possibly including partial transactions. The producer can checkpoint again at the next complete transaction boundary.

[Diagram: … 3:104 B · 3:104 | replay: 3:102 B … 3:102 E · 3:103 B,E · 3:104 B … — the first 3:104 is left incomplete and 3:102–3:103 appear twice.]
Kafka Consumer
– Uses the low-level consumer
– Consumes the Kafka partitions that are slaved on this node

[Diagram: a consumer thread polls partitions P1–P3 from Kafka brokers A and B through EspressoKafkaConsumer; per-partition applier threads in EspressoReplicationApplier apply the messages to local MySQL.]
The slave updates its (SCN, Kafka offset) row for every committed transaction.

[Diagram: after applying 3:101 the slave's checkpoint row reads 3:101@2 — SCN 3:101 at Kafka offset 2.]
The client only applies messages whose SCN is greater than the last committed SCN.

[Diagram: with checkpoint 3:103@6 the consumer reaches the replayed messages and begins transaction 3:104.]
The incomplete transaction is rolled back: when the replayed messages arrive before 3:104's end marker, the consumer issues ROLLBACK for 3:104.

[Diagram: checkpoint remains 3:103@6.]
Per the SCN rule, the consumer skips the replayed messages 3:102..3:103 — their SCNs are not greater than the last committed SCN.

[Diagram: SKIP 3:102..3:103; checkpoint remains 3:103@6.]
Transaction 3:104 is begun again from the replayed messages; this time its end marker (3:104 E) arrives and the transaction commits.

[Diagram: BEGIN 3:104 (again) … 3:104 E.]
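The consumer-side replay handling on the preceding slides — skip transactions at or below the last committed SCN, roll back a transaction left open when the replay restarts — can be sketched as (hypothetical code, flags per the B/E message protocol; SCNs simplified to plain integers):

```python
# Apply an SCN-tagged stream containing replayed messages: commit a
# transaction only when its end marker arrives and its SCN exceeds the last
# committed SCN; roll back an open transaction when the stream jumps back.
def apply_stream(messages, last_committed):
    committed, open_txn = [], None
    for scn, begin, end in messages:        # (SCN, has B flag, has E flag)
        if open_txn is not None and scn < open_txn:
            open_txn = None                 # ROLLBACK: replay restarted
        if begin:
            open_txn = scn                  # BEGIN (possibly "again")
        if end and open_txn == scn:
            if scn > last_committed:        # SKIP already-committed SCNs
                committed.append(scn)
                last_committed = scn
            open_txn = None
    return committed

# The stream from the slides: 3:101..3:104 with 3:104 cut short, then the
# replay of 3:102..3:104, this time including 3:104's end marker.
stream = [(101, 1, 1), (102, 1, 0), (102, 0, 1), (103, 1, 1), (104, 1, 0),
          (102, 1, 0), (102, 0, 1), (103, 1, 1), (104, 1, 0), (104, 0, 1)]
assert apply_stream(stream, 100) == [101, 102, 103, 104]
```

Each committed transaction is applied exactly once even though 3:102 and 3:103 appear twice in the stream.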
Zombie Write Filtering
What if a stalled master continues writing after the transition?
The master stalls while publishing transaction 3:104.

[Diagram: the stream ends at 3:104 B … 3:104, with no end marker.]
Helix sends a SlaveToMaster transition to one of the slaves, which writes the control message 4:0 C.

[Diagram: the stalled master is silent; the promoted slave appends 4:0 C to the Kafka partition.]
The promoted slave becomes master and starts taking writes in generation 4 (4:1 B,E · 4:2 B … 4:2 E).

[Diagram: the stalled master's last message remains 3:104.]
The stalled master resumes and sends its remaining binlog entries (3:104 E, 3:105 B,E) to Kafka.

[Diagram: generation-3 messages now appear after generation-4 messages in the partition.]
The former master goes into the ERROR state. Its zombie writes are filtered by all consumers using the increasing-SCN rule: a generation-3 SCN is never greater than a generation-4 SCN already seen.

[Diagram: the new master continues with 4:3 B,E.]
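The filtering rule can be sketched as follows (hypothetical code, not ESPRESSO's): consumers drop any message whose SCN is not greater than the highest SCN already seen, so a stalled master's late generation-3 writes are discarded once generation 4 has begun.

```python
# Filter zombie writes by the increasing-SCN rule; the generation packs
# above the sequence, so any generation-4 SCN dominates all of generation 3.
def filter_zombies(messages):
    high = -1
    for msg in messages:
        scn = (msg["gen"] << 32) | msg["seq"]
        if scn <= high:
            continue                        # zombie (or replayed) write
        high = scn
        yield msg

msgs = [{"gen": 3, "seq": 104},             # old master, before the stall
        {"gen": 4, "seq": 1},               # new master
        {"gen": 3, "seq": 105},             # stalled master resumes: dropped
        {"gen": 4, "seq": 2}]
out = [(m["gen"], m["seq"]) for m in filter_zombies(msgs)]
assert out == [(3, 104), (4, 1), (4, 2)]
```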
Current Status

ESPRESSO Kafka Replication: Current Status
– The pre-production integration environment has been migrated to Kafka replication
– 8 production clusters migrated (as of 4/11)
– Migration will continue through Q3 of 2016
– Average replication latency < 90 ms
Conclusions
– Configure Kafka for reliable, at-least-once delivery. See:
  http://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
– Carefully control producer and consumer checkpoints along transaction boundaries
– Embed sequence information in the message stream to implement exactly-once application of messages
Even our workspace is Horizontally Scalable!