Overview of Zookeeper, Helix and Kafka (Oakjug)

@crichardson

Distributed system goodies: Zookeeper, Helix and Kafka

Chris Richardson

Author of POJOs in Action

Founder of the original CloudFoundry.com

@crichardson chris@chrisrichardson.net http://plainoldobjects.com http://microservices.io

Presentation goal

Talk about a collection of interesting technologies for building distributed systems

About Chris

Founder of a startup that’s creating a platform for developing event-driven microservices (http://bit.ly/trialeventuate)

For more information

https://github.com/cer/event-sourcing-examples

https://github.com/cer/microservices-examples

http://microservices.io

http://plainoldobjects.com/

https://twitter.com/crichardson

Agenda

Zookeeper

Apache ZooKeeper is an open source distributed configuration service, synchronization service, and naming registry for large distributed systems

https://zookeeper.apache.org/

Distributed system use cases…

Name service

Lookup by name, e.g. service discovery: name => [host, port]*

Group membership

E.g. distributed cache

Cluster members need to talk amongst themselves

Clients need to discover the group members

…Use cases

Leader election

N servers, one of which needs to be the master

e.g. master/slave replication

Distributed locking and latches

e.g. cluster wide singleton

Queues

[Diagram: a client connected to an ensemble of three Zookeeper servers, one leader and two followers. Each server has an in-memory DB and txn logs; the datadir holds snapshots and logs. The servers replicate via ZAB, which is majority-based.]

Zookeeper clients

Languages:

Ships with Java, C, Perl, and Python

Community: Scala, NodeJS, Go, Lua, …

Client connects to one of a list of servers

Client establishes a session

Survives TCP disconnects

Client-specified session timeout

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZKClientBindings

Zookeeper data model

Hierarchical tree of named znodes

Znodes have binary data and children

Znodes can be ephemeral - live for as long as the client session

Clients can watch a node - get notified of changes

Zookeeper operations

create(path, data, mode)

Persistent or ephemeral?

Sequential: append parent’s counter value to name?

delete(path)

exists(path)

readData(path, watch?) : Object

writeData(path, data)

getChildren(path, watch?) : List[String]

Znode watches

readData/getChildren can establish a watch

client gets a one-time notification when changed
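
The one-time nature of a watch is easy to miss, so here is a minimal in-memory sketch of it. `OneShotWatchSketch` is a hypothetical stand-in, not the ZooKeeper client API; only the names `readData` and `writeData` are borrowed from the operations slide. The assumption being illustrated is that a watch is discarded as soon as it fires, so a client that wants further notifications has to read again and set a new watch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** In-memory stand-in for a znode store (hypothetical; NOT the ZooKeeper client API). */
final class OneShotWatchSketch {
    private final Map<String, String> data = new HashMap<>();
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    /** Reads a znode's data and, if a watcher is given, registers a one-time watch. */
    String readData(String path, Consumer<String> watcher) {
        if (watcher != null) {
            watches.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
        }
        return data.get(path);
    }

    /** Writes data, then fires and discards every watch registered on the path. */
    void writeData(String path, String value) {
        data.put(path, value);
        List<Consumer<String>> fired = watches.remove(path); // removed before firing: one-shot
        if (fired != null) {
            fired.forEach(w -> w.accept(path));
        }
    }

    public static void main(String[] args) {
        OneShotWatchSketch zk = new OneShotWatchSketch();
        List<String> events = new ArrayList<>();
        zk.writeData("/cer/foo", "y");
        zk.readData("/cer/foo", events::add); // sets the watch
        zk.writeData("/cer/foo", "z");        // fires the watch
        zk.writeData("/cer/foo", "w");        // no watch left, so no event
        System.out.println(events);           // prints [/cer/foo]
    }
}
```

The second write produces no event because nothing re-registered the watch after the first one fired.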

Using the zkCli

$ bin/zkCli.sh -server $DOCKER_HOST_IP
[zk] create /cer x
Created /cer
[zk] create /cer/foo y
Created /cer/foo
[zk] get /cer/foo watch
y
[zk] set /cer/foo z

WatchedEvent state:SyncConnected type:NodeDataChanged path:/cer/foo

Creating an ephemeral sequential node

[zk] create -s -e /cer/baz aa
Created /cer/baz0000000001
[zk] ls /cer watch
[baz0000000001, foo]
[zk] exit

WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/cer
[zk] ls /cer watch
[foo]

Leader election example

/myAppElection (ephemeral/sequential znodes)

guidA_0: hostA, portA (created by Server A, guidA, val = 0)

guidB_1: hostB, portB (created by Server B, guidB, val = 1)

guidC_2: hostC, portC (created by Server C, guidC, val = 2)

Leader: lowest value

[Diagram: "watches" arrows between the servers' znodes]
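
The election rule in the diagram can be sketched without a live Zookeeper connection. The code below works on a plain list of child names under the election znode; the class and method names are made up for illustration, and the `_<n>` suffix follows the diagram rather than Zookeeper's real zero-padded sequence numbers. It also assumes the arrangement described in ZooKeeper's documented election recipe, where each non-leader watches the znode with the next-lower sequence number rather than the leader itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: decides leadership from the children of the election znode. */
final class LeaderElectionSketch {
    /** Extracts the numeric suffix after the last '_' (e.g. "guidB_1" gives 1). */
    static long sequenceOf(String znodeName) {
        return Long.parseLong(znodeName.substring(znodeName.lastIndexOf('_') + 1));
    }

    /** The leader is the child with the lowest sequence number. */
    static String leader(List<String> children) {
        return children.stream()
                .min(Comparator.comparingLong(LeaderElectionSketch::sequenceOf))
                .orElseThrow(() -> new IllegalStateException("no candidates"));
    }

    /** The znode a candidate should watch: the next-lower sequence; empty for the leader. */
    static Optional<String> nodeToWatch(List<String> children, String self) {
        long mine = sequenceOf(self);
        return children.stream()
                .filter(c -> sequenceOf(c) < mine)
                .max(Comparator.comparingLong(LeaderElectionSketch::sequenceOf));
    }

    public static void main(String[] args) {
        List<String> children =
                new ArrayList<>(Arrays.asList("guidC_2", "guidA_0", "guidB_1"));
        System.out.println(leader(children));                 // guidA_0
        System.out.println(nodeToWatch(children, "guidC_2")); // Optional[guidB_1]
        children.remove("guidA_0"); // Server A's session ends, its ephemeral znode disappears
        System.out.println(leader(children));                 // guidB_1
    }
}
```

Because the znodes are ephemeral, a failed leader's entry vanishes with its session, and the candidate watching it is notified and re-runs the check.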

Apache Curator

Open source library developed by Netflix

Simplifies connection management

Simplifies error handling

Implements recipes

Three projects: client, framework, and recipes

http://techblog.netflix.com/2011/11/introducing-curator-netflix-zookeeper.html

Netflix Exhibitor

Supervisory process for managing a Zookeeper instance

Watches a ZK instance and makes sure it is running

Performs periodic backups

Performs periodic cleaning of the ZK log directory

A GUI explorer for viewing ZK nodes

A rich REST API

https://github.com/Netflix/exhibitor/wiki

Agenda

Helix

About Helix

http://helix.apache.org/

Built on Zookeeper

Typical distributed systems

Partitioning - e.g. use a PK (or other attribute) to choose server

Replication - for availability

State machines, e.g. master/slave replication

One replica is the master

Other replica is the slave

Use cases - master/slave replication

MySQL master/slave replication or MongoDB replica sets

N machines

1 master, N - 1 slaves

If the master dies then elect a new master

Use cases - Cassandra

Cluster consists of N nodes

Data consists of M partitions (aka vnodes)

Each partition has R replicas

Client can read/write any replica - no master/slave concept

Dynamic assignment of M*R partition replicas to N nodes

Use case - abstractly

Cluster:

Set of N nodes (machines)

One or more resources

A resource is partitioned and replicated

Resource has a state machine

e.g. offline/online, master/slave

State machine has constraints: 1 master replica, other replicas are slaves

Helix dynamically assigns partitions to nodes

Helix manages state transitions and notifies nodes
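
To make "assigns partitions to nodes" concrete, here is a toy placement under loudly stated assumptions: replicas are placed round-robin over the live nodes, and the first replica of each partition is the LEADER while the rest are STANDBY. This is not Helix's rebalancing algorithm, and `AssignmentSketch` and its method are hypothetical; the point is only that the assignment is recomputed from the set of live nodes while the "one leader per partition" constraint keeps holding.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy round-robin placement (an assumption for illustration, not Helix's algorithm). */
final class AssignmentSketch {
    /** Returns partition index -> (node -> state); replica 0 is LEADER, the rest STANDBY. */
    static Map<Integer, Map<String, String>> assign(
            List<String> liveNodes, int partitions, int replicas) {
        int n = liveNodes.size();
        int perPartition = Math.min(replicas, n); // at most one replica per node
        Map<Integer, Map<String, String>> result = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            Map<String, String> states = new LinkedHashMap<>();
            for (int r = 0; r < perPartition; r++) {
                String node = liveNodes.get((p + r) % n);
                states.put(node, r == 0 ? "LEADER" : "STANDBY");
            }
            result.put(p, states);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three live nodes, three partitions, two replicas each.
        System.out.println(assign(Arrays.asList("node1", "node2", "node3"), 3, 2));
        // node3 fails: recompute over the survivors.
        System.out.println(assign(Arrays.asList("node1", "node2"), 3, 2));
    }
}
```

A real controller would also try to minimize how many replicas move when membership changes; this sketch ignores that.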

Leader/standby state machine

Standby

Leader

Dropped

Offline
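
The arrows of the state diagram did not survive in this transcript, so the sketch below assumes a plausible set of transitions for a leader/standby model: Offline to Standby, Standby to Leader and back, Standby to Offline, and Offline to Dropped. Treat the transition table as an assumption rather than as Helix's definition; `LeaderStandbySketch` is a hypothetical name.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Leader/standby state machine with an ASSUMED transition table. */
final class LeaderStandbySketch {
    enum State { OFFLINE, STANDBY, LEADER, DROPPED }

    // Assumed legal transitions; the slide's arrows are not recoverable from the transcript.
    private static final Map<State, Set<State>> NEXT = new EnumMap<>(State.class);
    static {
        NEXT.put(State.OFFLINE, EnumSet.of(State.STANDBY, State.DROPPED));
        NEXT.put(State.STANDBY, EnumSet.of(State.LEADER, State.OFFLINE));
        NEXT.put(State.LEADER, EnumSet.of(State.STANDBY));
        NEXT.put(State.DROPPED, EnumSet.noneOf(State.class));
    }

    /** True if a replica may move directly from one state to the other. */
    static boolean canTransition(State from, State to) {
        return NEXT.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.OFFLINE, State.STANDBY)); // true
        System.out.println(canTransition(State.OFFLINE, State.LEADER));  // false
    }
}
```

Under these assumptions a replica can never jump straight from Offline to Leader; it has to pass through Standby first.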

Example assignment

[Diagram: Partitions 1, 2 and 3 spread across Node 1, Node 2 and Node 3. Each partition has one LEADER, one STANDBY and one OFFLINE replica.]

Post-failure assignment

[Diagram: replica states on Node 1, Node 2 and Node 3 after one node fails (marked X). Each of Partitions 1, 2 and 3 still has a LEADER and a STANDBY replica.]

Helix cluster setup

val admin = new ZKHelixAdmin(ZK_ADDRESS)

admin.addStateModelDef(clusterName, STATE_MODEL_NAME, new StateModelDefinition(StateModelConfigGenerator.generateConfigForLeaderStandby()));

admin.addResource(clusterName, RESOURCE_NAME, NUM_PARTITIONS, STATE_MODEL_NAME, "AUTO")

HelixControllerMain.startHelixController(ZK_ADDRESS, clusterName, nodeInfo.nodeId.id, HelixControllerMain.STANDALONE)

Adding an instance to the cluster

val ic = new InstanceConfig(nodeInfo.nodeId.id)
ic.setHostName(nodeInfo.host)
ic.setPort("" + nodeInfo.port)
ic.setInstanceEnabled(true)

admin.addInstance(clusterName, ic)

admin.rebalance(clusterName, RESOURCE_NAME, NUM_REPLICAS)

Assign to newly added nodes

Helix - connecting to the cluster

manager = HelixManagerFactory.getZKHelixManager(clusterName, instanceName, InstanceType.PARTICIPANT, ZK_ADDRESS)

val stateModelFactory = new MyStateModelFactory
val stateMach = manager.getStateMachineEngine
stateMach.registerStateModelFactory(STATE_MODEL_NAME, stateModelFactory)

manager.connect()

Connect as a participant

Supply factory to create callbacks for state transitions

State transition callbacks

// Callbacks invoked by Helix on state transitions
class MyStateModel(partitionName: String) extends StateModel {

  def onBecomeStandbyFromOffline(message: Message, context: NotificationContext) { … }

  def onBecomeLeaderFromStandby(message: Message, context: NotificationContext) { … }
}

class MyStateModelFactory extends StateModelFactory[StateModel] {
  // partitionName has the form <resourceName>_<partitionNumber>
  def createNewStateModel(partitionName: String) = new MyStateModel(partitionName)
}

More about Helix

Spectators

Non-participants - don’t have resources/partitions assigned to them

Get notified of changes to cluster

Property store

Write through cache of properties in Zookeeper

Messaging

Intra-cluster communication

Agenda

Kafka

Kafka concepts - topic

Clients publish messages to a topic

A topic has a name

A topic is a partitioned log

Topics live on disk

Messages have an offset within a partition

Messages are kept for a retention period

Kafka is clustered

Kafka cluster consists of N machines

Each topic partition has R replicas

1 machine is the leader (think master) for the topic partition

Clients publish/consume to/from leader

R - 1 machines are followers (think slaves)

Followers consume messages from the leader

Messages are committed when all in-sync replicas have written them to the log

Producers can optionally wait for a message to be committed

Consumers only ever see committed messages
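
The commit rule can be sketched as a small calculation. Assuming "all replicas" means the replicas currently in sync with the leader, the committed boundary (often called the high watermark) is the smallest log-end offset among them, and consumers see only messages below it. `HighWatermarkSketch` and its methods are hypothetical names for illustration, not Kafka classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the committed boundary of one topic partition. */
final class HighWatermarkSketch {
    /** Offsets below the returned value are written on every in-sync replica. */
    static long highWatermark(long[] inSyncLogEndOffsets) {
        return Arrays.stream(inSyncLogEndOffsets).min()
                .orElseThrow(() -> new IllegalArgumentException("need at least one replica"));
    }

    /** Consumers may only read messages whose offset is below the high watermark. */
    static List<String> visibleToConsumers(List<String> leaderLog, long highWatermark) {
        int end = (int) Math.max(0, Math.min(highWatermark, leaderLog.size()));
        return new ArrayList<>(leaderLog.subList(0, end));
    }

    public static void main(String[] args) {
        List<String> leaderLog = Arrays.asList("m0", "m1", "m2", "m3");
        // Leader has written 4 messages; one follower has 2, another has 3.
        long hw = highWatermark(new long[] {4, 2, 3});
        System.out.println(hw);                                // 2
        System.out.println(visibleToConsumers(leaderLog, hw)); // [m0, m1]
    }
}
```

A producer that waits for the commit is, in this model, waiting for the high watermark to pass its message's offset.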

Kafka producers

Publish message to a topic

Message = (key, body)

Hash of key determines topic partition

Carefully choose key to preserve ordering, e.g. stock ticker symbol => all prices for same symbol end up in same partition

Makes request to topic partition’s leader
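
Key-based partitioning can be sketched in a few lines. The example assumes Java's `String.hashCode()` as the hash function and a fixed partition count; Kafka's own partitioner may use a different hash, and `KeyPartitionerSketch` is a made-up name. What matters for ordering is only that equal keys always map to the same partition.

```java
/** Hypothetical key partitioner; the hash function is an assumption. */
final class KeyPartitionerSketch {
    /** Maps a key to a partition in [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 8;
        // Every price update for the same ticker symbol lands in the same partition.
        System.out.println(partitionFor("AAPL", partitions) == partitionFor("AAPL", partitions));
    }
}
```

Changing the number of partitions changes the mapping, so per-key ordering only holds while the partition count stays fixed.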

Kafka consumer

Consumes the messages from the partitions of one or more topics

Makes a fetch request to a topic partition’s leader

specifies the partition offset in each request

gets back a chunk of messages

Scale by having N topic partitions, N consumers
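
The fetch protocol described above can be modelled with an in-memory log for a single partition. `PartitionLogSketch` is a hypothetical stand-in, not a Kafka class: the consumer, not the log, remembers the offset, passes it with every fetch, and advances it by the size of the chunk it gets back.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory model of one topic partition (hypothetical, for illustration only). */
final class PartitionLogSketch {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message and returns its offset within the partition. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns up to maxMessages starting at the given offset; empty if caught up. */
    List<String> fetch(long offset, int maxMessages) {
        if (offset < 0 || maxMessages < 0) {
            throw new IllegalArgumentException("offset and maxMessages must be >= 0");
        }
        int from = (int) Math.min(offset, messages.size());
        int to = Math.min(from + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }

    public static void main(String[] args) {
        PartitionLogSketch log = new PartitionLogSketch();
        for (int i = 0; i < 5; i++) {
            log.append("msg-" + i);
        }
        long offset = 0; // tracked by the consumer
        while (true) {
            List<String> chunk = log.fetch(offset, 2);
            if (chunk.isEmpty()) {
                break; // caught up with the log
            }
            System.out.println(offset + ": " + chunk);
            offset += chunk.size();
        }
    }
}
```

Since the position lives with the consumer, rewinding is just a matter of fetching from an older offset, as long as the messages are still within the retention period.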

Kafka consumers - between a rock and a hard place

Simple Kafka consumer

Very flexible

BUT you are responsible for contacting the leader of each topic partition and for storing offsets

High level consumer

Does a lot: stores offsets in Zookeeper, deals with leaders, …

BUT it assumes that if you read a message it has been processed

More flexible consumer is on the way
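
The "read means processed" assumption matters when a consumer crashes. The sketch below simulates the alternative of committing an offset only after the message has been handled, using a plain list as the log; `CommitAfterProcessingSketch` and the crash parameter are hypothetical devices for illustration. Under this scheme a crash between processing and commit causes a message to be handled twice after restart, but never skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simulation of commit-after-processing (at-least-once); not Kafka consumer code. */
final class CommitAfterProcessingSketch {
    /**
     * Processes messages starting at the committed offset and returns the new
     * committed offset. crashAfterProcessing simulates a crash right after the
     * message at that offset was handled but before its commit (-1 = no crash).
     */
    static long run(List<String> log, long committed, long crashAfterProcessing,
                    List<String> processed) {
        for (long offset = committed; offset < log.size(); offset++) {
            processed.add(log.get((int) offset)); // handle the message first
            if (offset == crashAfterProcessing) {
                return committed;                 // crashed before the commit was recorded
            }
            committed = offset + 1;               // commit only after processing
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c");
        List<String> processed = new ArrayList<>();
        long committed = run(log, 0, 1, processed);  // crash after handling "b"
        committed = run(log, committed, -1, processed); // restart from the committed offset
        System.out.println(committed);  // 3
        System.out.println(processed);  // [a, b, b, c]
    }
}
```

Committing before processing would flip the trade-off: no duplicates, but a crash could silently drop a message, which is why message handlers under this scheme should tolerate redelivery.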

High-level consumer API

interface ConsumerConnector {
  static create(…. Zookeeper configuration…);
  public <K,V> Map<String, List<KafkaStream<K,V>>> createMessageStreams(Map<String, Integer> topicCountMap, Decoder<K> keyDecoder, Decoder<V> valueDecoder);
  public void commitOffsets();
}

class KafkaStream<K, V> {
  ConsumerIterator<K,V> iterator()
}

interface ConsumerIterator<K,V> {
  MessageAndMetadata<K, V> next()
  boolean hasNext()
}

Kafka at LinkedIn

1100 Kafka brokers organized into more than 60 clusters.

Writes:

Over 800 billion messages per day

Over 175 terabytes of data

Over 650 terabytes of messages are consumed daily

13 million messages per second

2.75 gigabytes of data per second

https://engineering.linkedin.com/kafka/running-kafka-scale

Summary

Zookeeper, Helix and Kafka are excellent building blocks for distributed systems
