Apache Helix: Simplifying Distributed Systems
Kanak Biscuitwala and Jason Zhang
helix.incubator.apache.org | @apachehelix
Outline
• Background
• Resource Assignment Problem
• Helix Concepts
• Putting Concepts to Work
• Getting Started
• Plugins
• Current Status
Building distributed systems is hard.
• Responding to node entry and exit
• Alerting based on metrics
• Managing data replicas
• Supporting event listeners
• Load balancing
Helix abstracts away problems distributed systems need to solve.
System Lifecycle
• Single Node
• Multi-Node: partitioning, discovery, co-location
• Fault Tolerance: replication, fault detection, recovery
• Cluster Expansion: throttle movement, redistribute data
Resource Assignment Problem
[Figure: a set of resources must be assigned across a set of nodes.]

Sample Allocation: with four nodes, each node takes 25% of the resources.

Failure Handling: when a node fails, the remaining three nodes take over the load, serving 34%, 33%, and 33%.
Resource Assignment Problem - Making it Work: Take 1 (ZooKeeper)
ZooKeeper provides low-level primitives; we need high-level primitives.
• ZooKeeper offers the application a file system, locks, and ephemeral nodes.
• Helix, layered on a consensus system, offers the application nodes, partitions, replicas, states, and transitions.
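To make the contrast concrete, here is a minimal sketch (using the standard org.apache.zookeeper client; the paths, names, and timeout are illustrative) of the low-level bookkeeping an application must otherwise do itself:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodeRegistration {
  public static void main(String[] args) throws Exception {
    // Connect with a 30s session timeout; the watcher is a no-op here.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

    // An ephemeral znode acts as a liveness marker: ZooKeeper deletes it
    // automatically when this node's session expires. (Assumes the parent
    // path /mycluster/live already exists.)
    zk.create("/mycluster/live/node1", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Everything else -- watching this path, deciding who takes over
    // node1's partitions, throttling the resulting transitions -- is still
    // the application's problem. Helix provides those higher-level
    // primitives on top of the same consensus system.
  }
}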
Resource Assignment Problem - Making it Work: Take 2 (Decisions by Nodes)
Each node runs a service (S) that watches the consensus system for config changes and node changes, and writes node updates back.
Problems: multiple brains, app-specific logic baked into every node, and unscalable traffic against the consensus system.
Resource Assignment Problem - Making it Work: Take 3 (Single Brain)
A single controller watches the consensus system for config changes and node changes; the services (S) only report node updates.
Node logic is drastically simplified!
Resource Assignment Problem - Helix View
[Figure: a Helix controller manages the resources hosted on the nodes (participants), while spectators observe the cluster.]
Question: How do we make this controller generic enough to work for different resources?
Helix Concepts
Helix Concepts - Resources
A resource is made up of partitions. All partitions can be replicated.
Helix Concepts - Declarative State Model
Each replica is in one of the states the app declares, e.g. Offline, Slave, Master.
Helix Concepts - Constraints: Augmenting the State Model

State Constraints:
• MASTER: [1, 1]
• SLAVE: [0, R]

Transition Constraints:
• Scope: Cluster — OFFLINE-SLAVE: 3 concurrent
• Scope: Resource R1 — SLAVE-MASTER: 1 concurrent
• Scope: Participant P4 — OFFLINE-SLAVE: 2 concurrent

Special Constraint Values: R = replica count per partition; N = number of participants.

States and transitions are ordered by priority in computing replica states. Transition constraints can be restricted to cluster, resource, and participant scopes; the most restrictive constraint is used.
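As a rough sketch of how one of these transition constraints might be installed programmatically (the ZooKeeper address, cluster name, and constraint id are placeholders; the attribute names follow Helix's throttling tutorial, but verify them against your Helix version):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class TransitionThrottle {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // Cluster scope: allow at most 3 concurrent OFFLINE-SLAVE transitions.
    ConstraintItemBuilder builder = new ConstraintItemBuilder();
    builder.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
           .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
           .addConstraintAttribute("CONSTRAINT_VALUE", "3");
    admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
        "OfflineSlaveThrottle", builder.build());
  }
}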
Helix Concepts - Resources and the Augmented State Model
A resource is made up of partitions. All partitions can be replicated, and each replica is in a state (e.g. master, slave, offline) governed by the augmented state model.
Helix Concepts - Objectives
• Partition Placement: a distribution policy for partitions and replicas, making effective use of the cluster and the resource
• Failure and Expansion Semantics: changing existing replica states; creating new replicas and assigning states
Putting Concepts to Work
Rebalancing Strategies - Meeting Objectives within Constraints

                   Full-Auto  Semi-Auto  Customized  User-Defined
Replica Placement  Helix      App        App         App*
Replica State      Helix      Helix      App         App*

*App code plugged into the Helix controller.
Rebalancing Strategies - Full-Auto
Node 1: P1: M, P2: S | Node 2: P3: M, P1: S | Node 3: P2: M, P3: S
By default, Helix optimizes for minimal movement and even distribution of partitions and states.
Rebalancing Strategies - Semi-Auto
Node 1: P1: M, P2: S | Node 2: P3: M, P1: S | Node 3: P2: M, P3: S
Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints. If a node hosting a master fails, a surviving slave of that partition (e.g. P3: S) is promoted to master in place.
This is ideal for resources that are expensive to move.
Rebalancing Strategies - Customized
The app specifies the location and state of each replica. Helix still ensures that transitions are fired according to constraints.
Need to respond to node changes? Use the Helix custom code invoker to run on one participant (a sketch follows), or use the user-defined rebalancing strategy described next.
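A sketch of the custom code invoker route, based on Helix's HelixCustomCodeRunner; the resource name and callback body are hypothetical. The runner uses a LeaderStandby resource internally so that exactly one participant executes the callback at a time:

import org.apache.helix.HelixConstants.ChangeType;
import org.apache.helix.HelixManager;
import org.apache.helix.participant.CustomCodeCallbackHandler;
import org.apache.helix.participant.HelixCustomCodeRunner;

public class NodeChangeWatcher {
  public static void start(HelixManager manager, String zkAddr) throws Exception {
    // Invoked on one elected participant whenever live-instance membership changes.
    CustomCodeCallbackHandler callback = context ->
        System.out.println("Cluster membership changed: " + context.getChangeType());

    new HelixCustomCodeRunner(manager, zkAddr)
        .invoke(callback)
        .on(ChangeType.LIVE_INSTANCE)
        .usingLeaderStandbyModel("nodeChangeWatcher") // hypothetical resource name
        .start();
  }
}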
Rebalancing Strategies - User-Defined
When a node joins or leaves the cluster, the Helix controller invokes code plugged in by the app: a rebalancer implemented by the app computes replica placement and state, and Helix fires the resulting transitions without violating constraints.
The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix's own rebalancers implement the same interface.
Rebalancing Strategies - User-Defined: Distributed Lock Manager
State model: Offline, Locked, Released. Each lock is a partition! As nodes (Node 1, Node 2, Node 3) join or leave, the user-defined rebalancer redistributes the locks across the live nodes.
Rebalancing Strategies - User-Defined: Distributed Lock Manager

public ResourceAssignment computeResourceMapping(
    Resource resource, IdealState currentIdealState,
    CurrentStateOutput currentStateOutput, ClusterDataCache clusterData) {
  ...
  int i = 0;
  for (Partition partition : resource.getPartitions()) {
    Map<String, String> replicaMap = new HashMap<String, String>();
    // Assign each lock (partition) to a live participant, round-robin.
    int participantIndex = i % liveParticipants.size();
    String participant = liveParticipants.get(participantIndex);
    replicaMap.put(participant, "LOCKED");
    assignment.addReplicaMap(partition, replicaMap);
    i++;
  }
  return assignment;
}
Controller - Fault Tolerance
Controllers follow a LeaderStandby state model (Offline, Standby, Leader): the augmented state model concept applies to controllers too!
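A sketch of what this looks like operationally: several controller processes are started for the same cluster, and Helix elects one Leader while the rest wait in Standby. Names and addresses are placeholders; a DISTRIBUTED mode also exists for the multi-cluster setup shown next.

import org.apache.helix.HelixManager;
import org.apache.helix.controller.HelixControllerMain;

public class ControllerLauncher {
  public static void main(String[] args) throws Exception {
    // Start one controller process; running this on several machines gives
    // automatic failover, since only the elected Leader drives the cluster.
    HelixManager controller = HelixControllerMain.startHelixController(
        "localhost:2181",               // ZooKeeper address
        "MyCluster",                    // cluster to manage
        "controller_1",                 // unique controller name
        HelixControllerMain.STANDALONE);
  }
}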
Controller - Scalability
[Figure: three controllers (Controller 1-3) sharing responsibility for six clusters (Cluster 1-6).]
ZooKeeper View - Ideal State

{
  "id" : "SampleResource",
  "simpleFields" : {
    "REBALANCE_MODE" : "USER_DEFINED",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "2",
    "STATE_MODEL_DEF_REF" : "MasterSlave",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  },
  "mapFields" : {
    "SampleResource_0" : {
      "node1_12918" : "MASTER",
      "node2_12918" : "SLAVE"
    }
    ...
  },
  "listFields" : {}
}

Replica placement and state:

      P1      P2
      N1: M   N2: M
      N2: S   N1: S
ZooKeeper View - Current State and External View

Current State (per node):
N1: P1: MASTER, P2: MASTER
N2: P1: OFFLINE, P2: OFFLINE

External View (per resource):

      P1      P2
      N1: M   N1: M
      N2: O   N2: O

Helix's responsibility is to make the external view match the ideal state as closely as possible.
Logical Deployment
[Figure: the Helix controller, a spectator, and three participants all connect to ZooKeeper; the spectator and each participant embed a Helix agent. The participants host replicas, e.g. P1: M and P2: S on the first node, P2: M and P3: S on the second, and P3: M and P1: S on the third.]
Getting Started
Example: Distributed Data Store
[Figure: 12 partitions, each replicated with one master and slaves, distributed across three nodes.]

Partition Management
• multiple replicas
• 1 master
• even distribution

Fault Tolerance
• fault detection
• promote slave to master
• even distribution
• no SPOF

Elasticity
• minimize downtime
• minimize data movement
• throttle movement
Helix-Based Solution - Example: Distributed Data Store
1. Define: state model, state transitions
2. Configure: create cluster, add nodes, add resource, configure rebalancer
3. Run: start controller, start participants
Example: Distributed Data Store - State Model Definition: Master-Slave
• States: all possible states, each with a priority
• Transitions: legal transitions, each with a priority
• Applicable to each partition of a resource
[Figure: state diagram with Offline, Slave, and Master.]
Example: Distributed Data Store - State Model Definition: Master-Slave

builder = new StateModelDefinition.Builder("MasterSlave");

// add states and their ranks to indicate priority
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// set the initial state when the participant starts
builder.initialState(OFFLINE);

// add transitions
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Example: Distributed Data Store - Defining Constraints
Constraints can be specified for states and transitions at several scopes:

           State  Transition
Partition  Y      Y
Resource   -      Y
Node       Y      Y
Cluster    -      Y

For the Master-Slave model: MASTER StateCount = 1, SLAVE StateCount = 2.
Example: Distributed Data Store - Defining Constraints: Code

// static constraints
builder.upperBound(MASTER, 1);

// dynamic constraints
builder.dynamicUpperBound(SLAVE, "R");

// unconstrained
builder.upperBound(OFFLINE, -1);
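Once states, transitions, and constraints are in place, the definition would be built and registered with the cluster. A short sketch continuing the snippet above; the cluster name and ZooKeeper address are hypothetical:

// Build the definition and make it available to resources in the cluster.
StateModelDefinition masterSlave = builder.build();
HelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // hypothetical ZK address
admin.addStateModelDef("MyCluster", "MasterSlave", masterSlave);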
Example: Distributed Data Store - Participant Plug-In Code

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
class DistributedDataStoreModel extends StateModel {
  @Transition(from = "OFFLINE", to = "SLAVE")
  public void fromOfflineToSlave(Message m, NotificationContext ctx) {
    // bootstrap data, setup replication, etc.
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void fromSlaveToMaster(Message m, NotificationContext ctx) {
    // catch up previous master, enable writes, etc.
  }
  ...
}
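Helix instantiates this plug-in through a state model factory, one state model per replica hosted by the participant. A minimal sketch; note that the exact createNewStateModel signature varies across Helix versions:

import org.apache.helix.participant.statemachine.StateModelFactory;

class DistributedDataStoreModelFactory extends StateModelFactory<DistributedDataStoreModel> {
  @Override
  public DistributedDataStoreModel createNewStateModel(String resourceName, String partitionName) {
    // One state model instance per partition replica on this participant.
    return new DistributedDataStoreModel();
  }
}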
Example: Distributed Data Store - Configure and Run

HelixAdmin --zkSvr <zk-address>

Create Cluster:        --addCluster MyCluster
Add Participants:      --addNode MyCluster localhost_12000 ...
Add Resource:          --addResource MyDB 16 MasterSlave SEMI_AUTO
Configure Rebalancer:  --rebalance MyDB 3
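The Run step can also be done from Java. A sketch of starting a participant and plugging in the state model factory above; the instance and cluster names mirror the CLI example:

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class ParticipantLauncher {
  public static void main(String[] args) throws Exception {
    // Join the cluster as a participant.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");

    // Register the plug-in code for the MasterSlave model, then connect;
    // the controller can now fire transitions at this node.
    manager.getStateMachineEngine().registerStateModelFactory(
        "MasterSlave", new DistributedDataStoreModelFactory());
    manager.connect();
  }
}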
Example: Distributed Data Store - Spectator Plug-In Code

class RoutingLogic {
  public void write(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstances(partition, "MASTER");
    nodes.get(0).write(request);
  }

  public void read(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstances(partition);
    random(nodes).read(request);
  }
}
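For the routing table itself, Helix ships a RoutingTableProvider that caches the external view. A sketch of wiring it up on a spectator; the names are placeholders:

import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

public class RouterLauncher {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "router_1", InstanceType.SPECTATOR, "localhost:2181");
    manager.connect();

    // The provider listens for external-view changes and keeps a local
    // routing table current; lookups never hit ZooKeeper on the hot path.
    RoutingTableProvider routingTableProvider = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routingTableProvider);

    List<InstanceConfig> masters =
        routingTableProvider.getInstances("MyDB", "MyDB_0", "MASTER");
  }
}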
Example: Distributed Data Store - Where is the Code?
[Figure: the controller watches the consensus system for config changes and node changes, while participants write node updates. Participant plug-in code runs on each participant; spectator plug-in code runs on each spectator.]
Example: Distributed Search
[Figure: replicated index shards distributed across three nodes.]

Partition Management
• multiple replicas (index shards)
• rack-aware placement
• even distribution

Fault Tolerance
• fault detection
• auto create replicas
• controlled creation of replicas

Elasticity
• redistribute partitions
• minimize data movement
• throttle movement
Example: Distributed Search - State Model Definition: Bootstrap
[Figure: state model with states Offline, Bootstrap, Online, Idle, and Error (two states carry StateCount = 5 and StateCount = 3). Transitions are labeled: setup node, consume data to build index, stop consume data, can serve requests, stop indexing and serving, recover, cleanup.]
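A participant plug-in for this model would follow the same pattern as the data store example. A sketch covering two of the labeled transitions; the mapping of labels to transitions here is our reading of the diagram:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

@StateModelInfo(initialState = "OFFLINE",
    states = {"OFFLINE", "BOOTSTRAP", "ONLINE", "IDLE", "ERROR"})
class SearchIndexModel extends StateModel {
  @Transition(from = "OFFLINE", to = "BOOTSTRAP")
  public void fromOfflineToBootstrap(Message m, NotificationContext ctx) {
    // setup node: start consuming data to build the index shard
  }

  @Transition(from = "BOOTSTRAP", to = "ONLINE")
  public void fromBootstrapToOnline(Message m, NotificationContext ctx) {
    // stop consuming data: the shard can now serve requests
  }
}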
Example: Distributed Search - Configure and Run

Create Cluster:        --addCluster MyCluster
Add Participants:      --addNode MyCluster localhost_12000 ...
Add Resource:          --addResource MyIndex 16 Bootstrap CUSTOMIZED
Configure Rebalancer:  --rebalance MyIndex 8
Example: Message Consumers
[Figure: partitioned queues assigned to consumers C1-C3, illustrating assignment, scaling, and fault tolerance.]

Partition Management
• one consumer per queue
• even distribution

Elasticity
• redistribute queues among consumers
• minimize movement

Fault Tolerance
• redistribute
• minimize data movement
• limit max queues per consumer
Example: Message Consumers - State Model Definition: Online-Offline
[Figure: two states, Offline and Online (Online: StateCount = 1), with transitions "start consumption" and "stop consumption". Max 10 queues per consumer.]
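The "max 10 queues per consumer" cap can be expressed on the resource's ideal state. A sketch with hypothetical cluster and resource names, using IdealState's setMaxPartitionsPerInstance:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class ConsumerCap {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // Fetch the resource's ideal state, cap queues per consumer, write it back.
    IdealState idealState = admin.getResourceIdealState("MyCluster", "MyQueueGroup");
    idealState.setMaxPartitionsPerInstance(10);
    admin.setResourceIdealState("MyCluster", "MyQueueGroup", idealState);
  }
}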
Example: Message Consumers - Participant Plug-In Code

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "ONLINE"})
class MessageConsumerModel extends StateModel {
  @Transition(from = "OFFLINE", to = "ONLINE")
  public void fromOfflineToOnline(Message m, NotificationContext ctx) {
    // register listener
  }

  @Transition(from = "ONLINE", to = "OFFLINE")
  public void fromOnlineToOffline(Message m, NotificationContext ctx) {
    // unregister listener
  }
}
Plugins
Plugins - Overview
• Chaos Monkey
• Data-Driven Testing and Debugging
• Intra-Cluster Messaging
• Rolling Upgrade
• On-Demand Task Scheduling
• Health Monitoring
Plugins - Data-Driven Testing and Debugging
• Instrument ZK, controller, and participant logs
• Simulate execution with Chaos Monkey
• Analyze invariants like state and transition constraints
The exact sequence of events can be replayed: debugging made easy!
Plugins - Data-Driven Testing and Debugging: Sample Log File

timestamp    partition   participantName    sessionId                             state
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
Plugins - Data-Driven Testing and Debugging: Count Aggregation

Time   State    Slave Count  Participant
42632  OFFLINE  0            10.117.58.247_12918
42796  SLAVE    1            10.117.58.247_12918
43124  OFFLINE  1            10.202.187.155_12918
43131  OFFLINE  1            10.220.225.153_12918
43275  SLAVE    2            10.220.225.153_12918
43323  SLAVE    3            10.202.187.155_12918
85795  MASTER   2            10.220.225.153_12918
Error! The state constraint for SLAVE has an upper bound of 2.
Plugins - Data-Driven Testing and Debugging: Time Aggregation

Slave Count  Time       Percentage
0            1082319    0.5
1            35578388   16.46
2            179417802  82.99
3            118863     0.05

83% of the time, there were 2 slaves to a partition.
93% of the time, there was 1 master to a partition.
We can see exactly how long the cluster was out of whack.
Current Status
Helix at LinkedIn
[Figure: the Oracle primary DB and Espresso publish data change events into Databus, which feeds updates to standardization, the search index, the graph index, and read replicas.]
Coming Up Next
• Automatic scaling with YARN
• New APIs
• Non-JVM participants
Summary
• Helix: a generic framework for building distributed systems
• Abstraction and modularity allow for modifying and enhancing system behavior
• Simple programming model: declarative state machine
Questions?
• website: helix.incubator.apache.org
• dev mailing list
• user mailing list
• @apachehelix