Post on 28-Jul-2015
@crichardson
Distributed system goodies: Zookeeper, Helix and Kafka
Chris Richardson
Author of POJOs in Action
Founder of the original CloudFoundry.com
@crichardson chris@chrisrichardson.net http://plainoldobjects.com http://microservices.io
Presentation goal
Talk about a collection of interesting technologies for building distributed
systems
About Chris
Founder of a startup that’s creating a platform for developing
event-driven microservices (http://bit.ly/trialeventuate)
For more information
https://github.com/cer/event-sourcing-examples
https://github.com/cer/microservices-examples
http://microservices.io
http://plainoldobjects.com/
https://twitter.com/crichardson
Agenda
Zookeeper
Helix
Kafka
Apache ZooKeeper is an open source distributed configuration service, synchronization service, and naming registry for large distributed systems
https://zookeeper.apache.org/
Distributed system use cases…
Name service
Lookup by name, e.g. service discovery: name => [host, port]*
Group membership
E.g. distributed cache
Cluster members need to talk amongst themselves
Clients need to discover the group members
…Use cases
Leader election
N servers, one of which needs to be the master
e.g. master/slave replication
Distributed locking and latches
e.g. cluster wide singleton
Queues
…
[Diagram: a client connects to an ensemble of three Zookeeper servers, one leader and two followers. Each server has an in-memory DB and a datadir holding snapshots and txn logs. The servers use ZAB, which is majority-based.]
Zookeeper clients
Languages:
Ships with Java, C, Perl, and Python
Community: Scala, NodeJS, Go, Lua, …
Client connects to one of a list of servers
Client establishes a session
Survives TCP disconnects
Client-specified session timeout
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZKClientBindings
Zookeeper data model
Hierarchical tree of named znodes
Znodes have binary data and children
Znodes can be ephemeral - live for as long as the client session
Clients can watch a node - get notified of changes
Zookeeper operations
create(path, data, mode)
Persistent or ephemeral?
Sequential: append parent’s counter value to name?
delete(path)
exists(path)
readData(path, watch?) : Object
writeData(path, data)
getChildren(path, watch?) : List[String]
Znode watches
readData/getChildren can establish a watch
client gets a one-time notification when changed
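The one-time behaviour of a watch can be illustrated with a small in-memory model. This is only a sketch in plain Java, not the Zookeeper client API: `WatchedNode`, `read` and `write` are hypothetical names, and the watcher is modelled as a `Runnable` that is told a change happened, not what the new data is.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a single znode; not the Zookeeper client API.
class WatchedNode {
    private String data;
    private final List<Runnable> watchers = new ArrayList<>();

    WatchedNode(String data) {
        this.data = data;
    }

    // Returns the current data; a non-null watcher is registered for the next change only.
    String read(Runnable watcher) {
        if (watcher != null) {
            watchers.add(watcher);
        }
        return data;
    }

    // Stores new data, then notifies and discards every registered watcher.
    void write(String newData) {
        data = newData;
        List<Runnable> toNotify = new ArrayList<>(watchers);
        watchers.clear(); // one-time: a client must read again to re-register
        for (Runnable watcher : toNotify) {
            watcher.run();
        }
    }
}
```

A client that wants to follow every change reads the node again inside its watcher, which both fetches the new data and sets a fresh watch.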
Using the zkCli
$ bin/zkCli.sh -server $DOCKER_HOST_IP
[zk] create /cer x
Created /cer
[zk] create /cer/foo y
Created /cer/foo
[zk] get /cer/foo watch
y
[zk] set /cer/foo z
WatchedEvent state:SyncConnected type:NodeDataChanged path:/cer/foo
Creating an ephemeral sequential node
[zk] create -s -e /cer/baz aa
Created /cer/baz0000000001
[zk] ls /cer watch
[baz0000000001, foo]
[zk] exit
WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/cer
[zk] ls /cer watch
[foo]
Leader election example
[Diagram: under /myAppElection, servers A, B and C each own an ephemeral sequential znode (guidA_0, guidB_1, guidC_2) that stores its host and port. The znode with the lowest value (Server A, val = 0) identifies the leader; the other servers set watches.]
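The "lowest value wins" rule can be sketched in plain Java over the list of child names. This is an illustration, not a Zookeeper or Curator API: `LeaderElection` and its methods are hypothetical, the names are assumed to end in `_<sequence>` as in the example, and having each server watch its immediate predecessor is the common Zookeeper recipe rather than something the slide spells out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper operating on child names such as "guidA_0"; not a Zookeeper API.
class LeaderElection {
    // Extracts the numeric sequence suffix that follows the last underscore.
    static int sequenceOf(String znodeName) {
        return Integer.parseInt(znodeName.substring(znodeName.lastIndexOf('_') + 1));
    }

    // Returns a copy of the children ordered by ascending sequence number.
    static List<String> sorted(List<String> children) {
        List<String> copy = new ArrayList<>(children);
        copy.sort(Comparator.comparingInt(LeaderElection::sequenceOf));
        return copy;
    }

    // The leader is the child with the lowest sequence number.
    static String leaderOf(List<String> children) {
        return sorted(children).get(0);
    }

    // Returns the znode that `self` should watch: its immediate predecessor,
    // or null when `self` is already the leader.
    static String nodeToWatch(List<String> children, String self) {
        List<String> ordered = sorted(children);
        int index = ordered.indexOf(self);
        if (index < 0) {
            throw new IllegalArgumentException(self + " is not a member of the election");
        }
        return index == 0 ? null : ordered.get(index - 1);
    }
}
```

Because the znodes are ephemeral, a crashed leader's znode disappears with its session, and re-running `leaderOf` on the remaining children yields the next leader.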
Apache Curator
Open source library developed by Netflix
Simplifies connection management
Simplifies error handling
Implements recipes
Three projects: client, framework, and recipes
http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html
Netflix Exhibitor
Supervisory process for managing a Zookeeper instance
Watches a ZK instance and makes sure it is running
Performs periodic backups
Performs periodic cleaning of ZK log directory
A GUI explorer for viewing ZK nodes
A rich REST API
https://github.com/Netflix/exhibitor/wiki
Agenda
Zookeeper
Helix
Kafka
About Helix
http://helix.apache.org/
Built on Zookeeper
Typical distributed systems
Partitioning - e.g. use a PK (or other attribute) to choose server
Replication - for availability
State machines, e.g. master/slave replication
One replica is the master
Other replica is the slave
Use cases - master/slave replication
MySQL master/slave replication or MongoDB replica sets
N machines
1 master, N - 1 slaves
If the master dies then elect a new master
Use cases - Cassandra
Cluster consists of N nodes
Data consists of M partitions (aka vnodes)
Each partition has R replicas
Client can read/write any replica - no master/slave concept
Dynamic assignment of M*R partition replicas to N nodes
Use case - abstractly
Cluster:
Set of N nodes (machines)
One or more resources
A resource is partitioned and replicated
Resource has a state machine
e.g. offline/online, master/slave
State machine has constraints: 1 master replica, other replicas are slaves
Helix:
Dynamically assigns partitions to nodes
Manages state transitions and notifies nodes
Leader/standby state machine
[State diagram with four states: Offline, Standby, Leader and Dropped]
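A state machine like this can be sketched as a table of allowed transitions. The sketch below is plain Java and does not use Helix. Only Offline to Standby and Standby to Leader are named in the talk (as callbacks); the reverse transitions and Offline to Dropped are assumptions made for illustration, and `ReplicaState` and `LeaderStandbyMachine` are hypothetical names.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum ReplicaState { OFFLINE, STANDBY, LEADER, DROPPED }

// Hypothetical per-replica state machine; not part of Helix.
class LeaderStandbyMachine {
    // Assumed transition table: a replica must pass through STANDBY to reach or leave LEADER.
    private static final Map<ReplicaState, Set<ReplicaState>> ALLOWED =
            new EnumMap<>(ReplicaState.class);
    static {
        ALLOWED.put(ReplicaState.OFFLINE, EnumSet.of(ReplicaState.STANDBY, ReplicaState.DROPPED));
        ALLOWED.put(ReplicaState.STANDBY, EnumSet.of(ReplicaState.LEADER, ReplicaState.OFFLINE));
        ALLOWED.put(ReplicaState.LEADER, EnumSet.of(ReplicaState.STANDBY));
        ALLOWED.put(ReplicaState.DROPPED, EnumSet.noneOf(ReplicaState.class));
    }

    private ReplicaState current = ReplicaState.OFFLINE;

    ReplicaState current() {
        return current;
    }

    // Moves to the target state, or throws if the table does not allow it.
    void transitionTo(ReplicaState target) {
        if (!ALLOWED.get(current).contains(target)) {
            throw new IllegalStateException(current + " -> " + target + " is not allowed");
        }
        current = target;
    }
}
```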
Example assignment
Node 1 Node 2 Node 3
Partition 1 LEADER
Partition 1 STANDBY
Partition 3 LEADER
Partition 2 LEADER
Partition 2 STANDBY
Partition 3 STANDBY
Partition 2 OFFLINE
Partition 1 OFFLINE
Partition 3 OFFLINE
Post-failure assignment
Node 1 Node 2 Node 3
Partition 1 LEADER
Partition 1 STANDBY
Partition 3 LEADER
Partition 2 LEADER
Partition 2 LEADER
Partition 3 STANDBY
Partition 2 STANDBY
Partition 1 STANDBY
Partition 3 OFFLINE
X (marks the failed node)
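The effect of a node failure on one partition can be sketched as follows. This is not Helix's rebalancing algorithm, just a minimal plain-Java illustration under stated assumptions: each partition has a hypothetical preference list of nodes, the first live node becomes LEADER, further live nodes up to `activeReplicas` become STANDBY, and the rest stay OFFLINE.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of leader/standby failover; not Helix's algorithm.
class FailoverSketch {
    // Computes the replica state on each live node for a single partition.
    static Map<String, String> statesFor(List<String> preferenceList,
                                         Set<String> liveNodes,
                                         int activeReplicas) {
        Map<String, String> states = new LinkedHashMap<>();
        int rank = 0; // position among live nodes only
        for (String node : preferenceList) {
            if (!liveNodes.contains(node)) {
                continue; // a failed node holds no replica state
            }
            String state;
            if (rank == 0) {
                state = "LEADER";
            } else if (rank < activeReplicas) {
                state = "STANDBY";
            } else {
                state = "OFFLINE";
            }
            states.put(node, state);
            rank++;
        }
        return states;
    }
}
```

With a preference list of node2, node1, node3 and two active replicas, losing node2 promotes node1 from STANDBY to LEADER and node3 from OFFLINE to STANDBY, which is the kind of change shown between the two assignment slides.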
Helix cluster setup
// Connect to Zookeeper and define the leader/standby state model
val admin = new ZKHelixAdmin(ZK_ADDRESS)
admin.addStateModelDef(clusterName, STATE_MODEL_NAME, new StateModelDefinition(StateModelConfigGenerator.generateConfigForLeaderStandby()))
// Register the partitioned resource
admin.addResource(clusterName, RESOURCE_NAME, NUM_PARTITIONS, STATE_MODEL_NAME, "AUTO")
// Start a standalone Helix controller for the cluster
HelixControllerMain.startHelixController(ZK_ADDRESS, clusterName, nodeInfo.nodeId.id, HelixControllerMain.STANDALONE)
Adding an instance to the cluster
val ic = new InstanceConfig(nodeInfo.nodeId.id)
ic.setHostName(nodeInfo.host)
ic.setPort("" + nodeInfo.port)
ic.setInstanceEnabled(true)
admin.addInstance(clusterName, ic)
// Assign to newly added nodes
admin.rebalance(clusterName, RESOURCE_NAME, NUM_REPLICAS)
Helix - connecting to the cluster
// Connect as a participant
manager = HelixManagerFactory.getZKHelixManager(clusterName, instanceName, InstanceType.PARTICIPANT, ZK_ADDRESS)
// Supply factory to create callbacks for state transitions
val stateModelFactory = new MyStateModelFactory
val stateMach = manager.getStateMachineEngine
stateMach.registerStateModelFactory(STATE_MODEL_NAME, stateModelFactory)
manager.connect()
State transition callbacks
class MyStateModel(partitionName: String) extends StateModel {
  // The callbacks below are invoked by Helix
  def onBecomeStandbyFromOffline(message: Message, context: NotificationContext) { … }
  def onBecomeLeaderFromStandby(message: Message, context: NotificationContext) { … }
  …
}
class MyStateModelFactory extends StateModelFactory[StateModel] {
  // partitionName has the form <resourceName>_<partitionNumber>
  def createNewStateModel(partitionName: String) = new MyStateModel(partitionName)
}
More about Helix
Spectators
Non-participants - don’t have resources/partitions assigned to them
Get notified of changes to cluster
Property store
Write through cache of properties in Zookeeper
Messaging
Intra-cluster communication
…
Agenda
Zookeeper
Helix
Kafka
Kafka concepts - topic
Clients publish messages to a topic
A topic has a name
A topic is a partitioned log
Topics live on disk
Messages have an offset within a partition
Messages are kept for a retention period
Kafka is clustered
Kafka cluster consists of N machines
Each topic partition has R replicas
1 machine is the leader (think master) for the topic partition
Clients publish/consume to/from leader
R - 1 machines are followers (think slaves)
Followers consume messages from the leader
Messages are committed when all in-sync replicas have written to the log
Producers can optionally wait for a message to be committed
Consumers only ever see committed messages
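The commit rule can be sketched as a minimum over the replicas' log positions. This is a plain-Java illustration, not Kafka code: `CommitSketch` and its methods are hypothetical, and the caller decides which replicas are counted.

```java
import java.util.Collection;
import java.util.Collections;

// Hypothetical illustration of the commit rule; not Kafka code.
class CommitSketch {
    // logEndOffsets holds, for each counted replica, how many messages it has written.
    // Returns how many messages every one of those replicas has written.
    static long committedCount(Collection<Long> logEndOffsets) {
        if (logEndOffsets.isEmpty()) {
            throw new IllegalArgumentException("at least one replica is required");
        }
        return Collections.min(logEndOffsets);
    }

    // A message at the given 0-based offset is visible to consumers only once committed.
    static boolean isVisible(long offset, Collection<Long> logEndOffsets) {
        return offset < committedCount(logEndOffsets);
    }
}
```

With replicas at 10, 8 and 9 messages, the first 8 messages are committed; offsets 8 and 9 exist on the leader but are not yet visible to consumers.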
Kafka producers
Publish message to a topic
Message = (key, body)
Hash of key determines topic partition
Carefully choose key to preserve ordering, e.g. stock ticker symbol => all prices for same symbol end up in same partition
Makes request to topic partition’s leader
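Choosing a partition from a key can be sketched in a few lines. This is a minimal illustration, not Kafka's partitioner: `KeyPartitioner` is a hypothetical name and `String.hashCode` stands in for whatever hash function the producer actually uses.

```java
// Hypothetical illustration of key-based partitioning; not Kafka's partitioner.
class KeyPartitioner {
    // Maps a key to a partition number in the range [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the same key always maps to the same partition, all prices for one ticker symbol land in one partition and keep their relative order. Changing the number of partitions changes the mapping.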
Kafka consumer
Consumes the messages from the partitions of one or more topics
Makes a fetch request to a topic partition’s leader
specifies the partition offset in each request
gets back a chunk of messages
Scale by having N topic partitions, N consumers
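The offset-based fetch protocol can be modelled with an in-memory log. This is a sketch in plain Java, not the Kafka client API: `PartitionLog`, `append` and `fetch` are hypothetical names, and a real fetch request is bounded by bytes, not by a message count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one topic partition; not the Kafka client API.
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns the offset assigned to it.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns up to maxMessages messages starting at the given offset.
    // An empty list means the consumer has caught up with the log.
    List<String> fetch(long offset, int maxMessages) {
        if (offset < 0 || offset > messages.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        if (maxMessages < 0) {
            throw new IllegalArgumentException("maxMessages must not be negative");
        }
        int from = (int) offset;
        int to = Math.min(messages.size(), from + maxMessages);
        return new ArrayList<>(messages.subList(from, to));
    }
}
```

The log itself keeps no per-consumer state: the consumer remembers its own offset and advances it by the size of each chunk it receives.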
Kafka consumers - between a rock and a hard place
Simple Kafka consumer
Very flexible
BUT you are responsible for contacting the leader of each topic’s partitions and storing offsets
High level consumer
Does a lot: stores offsets in Zookeeper, deals with leaders, ….
BUT it assumes that if you read a message it has been processed
A more flexible consumer is on the way
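The gap between "read" and "processed" can be made concrete with a consumer that advances its stored offset only after the handler succeeds. This is a plain-Java sketch of that idea, not either Kafka consumer API: `ProcessThenCommit` is a hypothetical name, and the log is just a list.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration of committing after processing; not a Kafka consumer API.
class ProcessThenCommit {
    private long committedOffset = 0; // next offset to process after a restart

    long committedOffset() {
        return committedOffset;
    }

    // Processes messages from the committed offset onward. The offset advances
    // only after the handler returns normally, so a failure leaves it unchanged.
    void run(List<String> log, Consumer<String> handler) {
        while (committedOffset < log.size()) {
            handler.accept(log.get((int) committedOffset)); // may throw
            committedOffset++; // commit only after successful processing
        }
    }
}
```

If the handler fails on a message, a later `run` starts again at that same message, so a message may be handled more than once but is never skipped.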
High-level consumer API
interface ConsumerConnector {
  static create(…. Zookeeper configuration…);
  public <K,V> Map<String, List<KafkaStream<K,V>>> createMessageStreams(Map<String, Integer> topicCountMap, Decoder<K> keyDecoder, Decoder<V> valueDecoder);
  public void commitOffsets();
}
class KafkaStream<K, V> {
  ConsumerIterator<K,V> iterator()
}
interface ConsumerIterator<K,V> {
  MessageAndMetadata<K, V> next()
  boolean hasNext()
}
Kafka at LinkedIn
1100 Kafka brokers organized into more than 60 clusters.
Writes:
Over 800 billion messages per day
Over 175 terabytes of data
Over 650 terabytes of messages are consumed daily
Peak
13 million messages per second
2.75 gigabytes of data per second
https://engineering.linkedin.com/kafka/running-kafka-scale
Summary
Zookeeper, Helix and Kafka are excellent building blocks for distributed systems