Distributed Data Systems 1 ©2016 LinkedIn Corporation. All Rights Reserved.
ESPRESSO Database Replication with Kafka
Tom Quiggle, Principal Staff Software Engineer
[email protected] · linkedin.com/in/tquiggle · @TomQuiggle
Agenda
– ESPRESSO Overview
  – Architecture
  – GTIDs and SCNs
  – Per-instance replication (0.8)
  – Per-partition replication (1.0)
– Kafka Per-Partition Replication
  – Requirements
  – Kafka Configuration
  – Message Protocol
  – Producer
  – Consumer
– Q&A
ESPRESSO Overview
ESPRESSO¹: Hosted, Scalable Data as a Service (DaaS) for LinkedIn's Online Structured Data Needs
– Databases are partitioned
– Partitions are distributed across the available hardware
– An HTTP proxy routes requests to the appropriate database node
– Apache Helix provides centralized cluster management

1. Elastic, Scalable, Performant, Reliable, Extensible, Stable, Speedy and Operational
ESPRESSO Architecture
GTIDs and SCNs
MySQL 5.6 Global Transaction Identifiers provide a unique, monotonically increasing identifier for each committed transaction:
GTID :== source_id:transaction_id
ESPRESSO conventions:
– source_id encodes the database name and partition number
– transaction_id is a 64-bit numeric value: the high-order 32 bits are a generation count, the low-order 32 bits are a sequence within the generation
– The generation increments with every change in mastership
– The sequence increases with each transaction
– We refer to the transaction_id component as a Sequence Commit Number (SCN)
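The generation/sequence packing can be illustrated with a short sketch (the helper names are hypothetical, not ESPRESSO's actual code):

```python
# Pack a 64-bit transaction_id: high 32 bits = generation, low 32 = sequence.
def make_scn(generation: int, sequence: int) -> int:
    assert 0 <= generation < 2**32 and 0 <= sequence < 2**32
    return (generation << 32) | sequence

# Recover (generation, sequence) from a transaction_id.
def split_scn(scn: int) -> tuple[int, int]:
    return scn >> 32, scn & 0xFFFFFFFF

# SCNs order correctly across mastership changes: every transaction in a
# later generation compares greater than any transaction in an earlier one.
assert make_scn(4, 0) > make_scn(3, 2**32 - 1)
assert split_scn(make_scn(3, 104)) == (3, 104)
```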
Example binlog transaction:

SET @@SESSION.GTID_NEXT= 'hash(db_part):(gen<<32 + seq)';
SET TIMESTAMP=<seconds_since_Unix_epoch>
BEGIN
Table_map: `db_part`.`table1` mapped to number 1234
Update_rows: table id 1234
BINLOG '...'
BINLOG '...'
Table_map: `db_part`.`table2` mapped to number 5678
Update_rows: table id 5678
BINLOG '...'
COMMIT
ESPRESSO: 0.8 Per-Instance Replication

[Diagram: two three-node clusters; partitions P1–P3 and P4–P6 are replicated on every node of their cluster, one master copy plus slaves. Legend: Master / Slave / Offline.]
Issues with Per-Instance Replication
– Poor resource utilization: only 1/3 of nodes service application requests
– Partitions unnecessarily share fate
– Cluster expansion is an arduous process
– Upon node failure, 100% of the failed node's traffic is redirected to a single node
ESPRESSO: 1.0 Per-Partition Replication
Per-instance MySQL replication is replaced with per-partition replication through Kafka.

[Diagram: Helix tracks LIVEINSTANCES and publishes an EXTERNALVIEW (e.g. P4 → Master: Node 1, Slave: Node 3). Partitions P1–P12 are spread across Nodes 1–3 with master and slave replicas interleaved; Kafka carries the replication stream between them.]
Cluster Expansion
Initial state: 12 partitions, 3 storage nodes, replication factor r=2.

[Diagram: masters and slaves for P1–P12 interleaved across Nodes 1–3; EXTERNALVIEW shows P4 → Master: Node 1, Slave: Node 3. Legend: Master / Slave / Offline.]
Adding a node: Helix sends OfflineToSlave transitions for the new node's partitions.

[Diagram: Node 4 joins with new replicas of P1, P4, P7, P8, P9 and P12; EXTERNALVIEW shows P4 → Master: Node 1, Slave: Node 3, Offline: Node 4.]
Once a new replica is ready, ownership is transferred and the old replica is dropped.

[Diagram: P4's mastership moves to Node 4 (EXTERNALVIEW: P4 → Master: Node 4, Slave: Node 3); Node 1 drops its P4 replica.]
Migration of master and slave partitions continues.

[Diagram: mid-migration state; EXTERNALVIEW shows P9 → Master: Node 3, Slave: Node 1, Offline: Node 4.]
Rebalancing is complete after the last partition migration.

[Diagram: P1–P12 evenly redistributed across Nodes 1–4; EXTERNALVIEW shows P9 → Master: Node 4, Slave: Node 3.]
Node Failover
During a failure or planned maintenance, slaves are promoted to master.

[Diagram: the four-node cluster in its steady state before the failure.]
[Diagram: Node 3 fails and disappears from LIVEINSTANCES; each partition it mastered is taken over by a slave on a surviving node. EXTERNALVIEW shows P9 → Master: Node 4, Offline: Node 3.]
Advantages of Per-Partition Replication
– Better hardware utilization: all nodes service application requests
– Mastership hand-off is done in parallel
– After a node failure, the full replication factor can be restored in parallel
– Cluster expansion is as easy as: add node(s) to the cluster, then rebalance
– A single platform serves all Change Data Capture: internal replication, cross-colo replication, and application CDC consumers
Kafka Per-Partition Replication

Kafka for Internal Replication

Requirements
Delivery must be:
– Guaranteed
– In order
– Exactly once (sort of)
Broker Configuration
– Replication factor = 3 (most LinkedIn clusters use 2)
– min.insync.replicas = 2
– Unclean leader elections disabled
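In current Kafka releases these map onto the following broker properties (a sketch of a server.properties fragment; "min.isr" on the slide is shorthand for `min.insync.replicas`):

```properties
# Durable-replication broker settings (sketch)
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
```

Combined with `acks=all` on the producer, `min.insync.replicas=2` guarantees every acknowledged write is on at least two replicas.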
Message Protocol
Legend: B = begin txn, E = end txn, C = control.

[Diagram: the master's producer publishes its MySQL binlog to the partition's Kafka topic as an SCN-tagged stream (3:100 B,E · 3:101 B,E · 3:102 B … 3:102 E · 3:103 B,E · 3:104 B … 3:104 E); the slave's consumer applies the stream to its local MySQL.]
Message Protocol – Mastership Handoff
The promoted slave writes a control message for the new generation (4:0 C) to the partition's Kafka topic.

[Diagram: the old master's stream ends at 3:104 E; the promoted slave appends 4:0 C.]
The promoted slave keeps consuming until it has consumed its own control message, so it is guaranteed to have applied every message the old master published before it.

[Diagram: the promoted slave's consumer reaches 4:0 C.]
The new master then enables writes with the new generation (4:0 B …).

[Diagram: the old master is idle; the new master's producer begins publishing generation-4 transactions.]
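The handoff barrier can be sketched as follows (a minimal illustration with hypothetical names, not ESPRESSO's actual code): the promoted slave applies messages until it encounters the control message it wrote itself.

```python
# Drain the Kafka partition until we consume our own control message;
# everything before it was published by the previous generation's master
# and must be applied before the new master may accept writes.
def drain_until_control(messages, my_control_id, apply):
    for msg in messages:
        if msg.get("control") == my_control_id:
            return True          # caught up: safe to enable writes
        apply(msg)               # apply the old generation's message
    return False                 # control message not yet seen: keep polling

applied = []
stream = [{"scn": "3:104"}, {"control": "4:0"}]
assert drain_until_control(stream, "4:0", applied.append) is True
assert applied == [{"scn": "3:104"}]
```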
Kafka Producer Configuration
– acks = "all"
– retries = Integer.MAX_VALUE
– block.on.buffer.full = true
– max.in.flight.requests.per.connection = 1
– linger.ms = 0
On a non-retryable exception:
– destroy the producer
– create a new producer
– resume from the last checkpoint
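The recovery rule can be sketched as a send loop (hypothetical helpers, not ESPRESSO code): on a non-retryable send failure, destroy the producer, create a new one, and resume publishing the binlog from the last checkpoint.

```python
class NonRetryableError(Exception):
    pass

def replicate(create_producer, read_binlog_from, load_checkpoint):
    producer = create_producer()
    while True:
        checkpoint = load_checkpoint()      # (SCN, Kafka offset) from MySQL
        try:
            for event in read_binlog_from(checkpoint):
                producer.send(event)        # acks=all, one in flight
            return                          # binlog drained (for the sketch)
        except NonRetryableError:
            producer.close()                # destroy the failed producer...
            producer = create_producer()    # ...recreate, and loop: resume
                                            # from the checkpoint, replaying
                                            # any unacknowledged messages

sent = []
state = {"failed": False}

def create_producer():
    class P:
        def send(self, e):
            if not state["failed"] and e == 2:
                state["failed"] = True
                raise NonRetryableError()
            sent.append(e)
        def close(self):
            pass
    return P()

replicate(create_producer, lambda cp: [1, 2, 3], lambda: (0, 0))
# Event 1 was published twice: once before the failure, once on replay.
assert sent == [1, 1, 2, 3]
```

The replay after recovery is exactly how duplicate and partial transactions end up in the Kafka stream, which the consumer protocol later filters out.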
Kafka Producer Checkpointing
The producer periodically writes (SCN, Kafka offset) to a MySQL table. It may only checkpoint an offset at the end of a complete transaction!

[Diagram: the stream ends mid-transaction at 3:104 B … 3:104 — the producer can't checkpoint there.]
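The transaction-boundary rule can be sketched as follows (hypothetical class, not ESPRESSO code): only the offset of the last acknowledged end-of-transaction (E) message is safe to persist.

```python
# Track the only offset that is safe to checkpoint: the offset of the last
# acknowledged message carrying the end-of-transaction (E) flag.
class CheckpointTracker:
    def __init__(self):
        self.safe = None                    # (SCN, offset) of last complete txn

    def on_ack(self, scn, offset, is_txn_end):
        # Invoked from the producer's send callback, which supplies the offset.
        if is_txn_end:
            self.safe = (scn, offset)

    def checkpoint(self):
        # Periodically written to the (SCN, Kafka offset) MySQL table.
        return self.safe

t = CheckpointTracker()
t.on_ack(scn=(3 << 32) | 103, offset=5, is_txn_end=True)    # 3:103 E
t.on_ack(scn=(3 << 32) | 104, offset=6, is_txn_end=False)   # 3:104 B, mid-txn
assert t.checkpoint() == ((3 << 32) | 103, 5)               # still at 3:103
```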
The producer's checkpoint lags its current Kafka offset; the offset is obtained from the send callback.

[Diagram: the last checkpoint sits at the end of an earlier transaction while the producer keeps publishing.]
A send() fails with a non-retryable error partway through transaction 3:104.

[Diagram: the failure occurs after 3:104 B, with the last checkpoint at an earlier transaction boundary.]
The producer is recreated and resumes from the last checkpoint; messages after the checkpoint are replayed.

[Diagram: the new producer re-reads the binlog from the checkpoint and republishes starting at 3:102 B.]
The Kafka stream now contains replayed transactions, possibly including partial transactions. The producer can checkpoint again at the next complete transaction boundary.

[Diagram: … 3:104 B · 3:104 | replay: 3:102 B … 3:102 E · 3:103 B,E · 3:104 B … — the first 3:104 is left incomplete and 3:102–3:103 appear twice.]
Kafka Consumer
– Uses the low-level consumer
– Consumes the Kafka partitions that are slaved on this node

[Diagram: a consumer thread polls partitions P1–P3 from Kafka brokers A and B through EspressoKafkaConsumer; per-partition applier threads in EspressoReplicationApplier apply the messages to local MySQL.]
The slave updates its (SCN, Kafka offset) row for every committed transaction.

[Diagram: after applying 3:101 the slave's checkpoint row reads 3:101@2 — SCN 3:101 at Kafka offset 2.]
The client only applies messages whose SCN is greater than the last committed SCN.

[Diagram: with checkpoint 3:103@6 the consumer reaches the replayed messages and begins transaction 3:104.]
The incomplete transaction is rolled back: when the replayed messages arrive before 3:104's end marker, the consumer issues ROLLBACK for 3:104.

[Diagram: checkpoint remains 3:103@6.]
Per the SCN rule, the consumer skips the replayed messages 3:102..3:103 — their SCNs are not greater than the last committed SCN.

[Diagram: SKIP 3:102..3:103; checkpoint remains 3:103@6.]
Transaction 3:104 is begun again from the replayed messages; this time its end marker (3:104 E) arrives and the transaction commits.

[Diagram: BEGIN 3:104 (again) … 3:104 E.]
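The consumer-side replay handling on the preceding slides — skip transactions at or below the last committed SCN, roll back a transaction left open when the replay restarts — can be sketched as (hypothetical code, flags per the B/E message protocol; SCNs simplified to plain integers):

```python
# Apply an SCN-tagged stream containing replayed messages: commit a
# transaction only when its end marker arrives and its SCN exceeds the last
# committed SCN; roll back an open transaction when the stream jumps back.
def apply_stream(messages, last_committed):
    committed, open_txn = [], None
    for scn, begin, end in messages:        # (SCN, has B flag, has E flag)
        if open_txn is not None and scn < open_txn:
            open_txn = None                 # ROLLBACK: replay restarted
        if begin:
            open_txn = scn                  # BEGIN (possibly "again")
        if end and open_txn == scn:
            if scn > last_committed:        # SKIP already-committed SCNs
                committed.append(scn)
                last_committed = scn
            open_txn = None
    return committed

# The stream from the slides: 3:101..3:104 with 3:104 cut short, then the
# replay of 3:102..3:104, this time including 3:104's end marker.
stream = [(101, 1, 1), (102, 1, 0), (102, 0, 1), (103, 1, 1), (104, 1, 0),
          (102, 1, 0), (102, 0, 1), (103, 1, 1), (104, 1, 0), (104, 0, 1)]
assert apply_stream(stream, 100) == [101, 102, 103, 104]
```

Each committed transaction is applied exactly once even though 3:102 and 3:103 appear twice in the stream.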
Zombie Write Filtering
What if a stalled master continues writing after the transition?
The master stalls while publishing transaction 3:104.

[Diagram: the stream ends at 3:104 B … 3:104, with no end marker.]
Helix sends a SlaveToMaster transition to one of the slaves, which writes the control message 4:0 C.

[Diagram: the stalled master is silent; the promoted slave appends 4:0 C to the Kafka partition.]
The promoted slave becomes master and starts taking writes in generation 4 (4:1 B,E · 4:2 B … 4:2 E).

[Diagram: the stalled master's last message remains 3:104.]
The stalled master resumes and sends its remaining binlog entries (3:104 E, 3:105 B,E) to Kafka.

[Diagram: generation-3 messages now appear after generation-4 messages in the partition.]
The former master goes into the ERROR state. Its zombie writes are filtered by all consumers using the increasing-SCN rule: a generation-3 SCN is never greater than a generation-4 SCN already seen.

[Diagram: the new master continues with 4:3 B,E.]
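The filtering rule can be sketched as follows (hypothetical code, not ESPRESSO's): consumers drop any message whose SCN is not greater than the highest SCN already seen, so a stalled master's late generation-3 writes are discarded once generation 4 has begun.

```python
# Filter zombie writes by the increasing-SCN rule; the generation packs
# above the sequence, so any generation-4 SCN dominates all of generation 3.
def filter_zombies(messages):
    high = -1
    for msg in messages:
        scn = (msg["gen"] << 32) | msg["seq"]
        if scn <= high:
            continue                        # zombie (or replayed) write
        high = scn
        yield msg

msgs = [{"gen": 3, "seq": 104},             # old master, before the stall
        {"gen": 4, "seq": 1},               # new master
        {"gen": 3, "seq": 105},             # stalled master resumes: dropped
        {"gen": 4, "seq": 2}]
out = [(m["gen"], m["seq"]) for m in filter_zombies(msgs)]
assert out == [(3, 104), (4, 1), (4, 2)]
```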
Current Status

ESPRESSO Kafka Replication: Current Status
– The pre-production integration environment has been migrated to Kafka replication
– 8 production clusters migrated (as of 4/11)
– Migration will continue through Q3 of 2016
– Average replication latency < 90 ms
Conclusions
– Configure Kafka for reliable, at-least-once delivery. See:
  http://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
– Carefully control producer and consumer checkpoints along transaction boundaries
– Embed sequence information in the message stream to implement exactly-once application of messages
Even our workspace is Horizontally Scalable!