Apache Helix: Simplifying Distributed Systems
Kanak Biscuitwala and Jason Zhang
helix.incubator.apache.org | @apachehelix
Outline
• Background
• Resource Assignment Problem
• Helix Concepts
• Putting Concepts to Work
• Getting Started
• Plugins
• Current Status
Building distributed systems is hard.
• Responding to node entry and exit
• Alerting based on metrics
• Managing data replicas
• Supporting event listeners
• Load balancing
Helix abstracts away problems distributed systems need to solve.
System Lifecycle
• Single Node
• Multi-Node: partitioning, discovery, co-location
• Fault Tolerance: replication, fault detection, recovery
• Cluster Expansion: throttle movement, redistribute data
Resource Assignment Problem
[Figure: a set of resources must be assigned across a set of nodes.]

Sample Allocation: with four nodes, each node takes 25% of the resources.

Failure Handling: when a node fails, the remaining three nodes take over the load, serving 34%, 33%, and 33%.
Resource Assignment Problem - Making it Work: Take 1 (ZooKeeper)
ZooKeeper provides low-level primitives; we need high-level primitives.
• ZooKeeper offers the application a file system, locks, and ephemeral nodes.
• Helix, layered on a consensus system, offers the application nodes, partitions, replicas, states, and transitions.
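To make the contrast concrete, here is a minimal sketch (using the standard org.apache.zookeeper client; the paths, names, and timeout are illustrative) of the low-level bookkeeping an application must otherwise do itself:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodeRegistration {
  public static void main(String[] args) throws Exception {
    // Connect with a 30s session timeout; the watcher is a no-op here.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

    // An ephemeral znode acts as a liveness marker: ZooKeeper deletes it
    // automatically when this node's session expires. (Assumes the parent
    // path /mycluster/live already exists.)
    zk.create("/mycluster/live/node1", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Everything else -- watching this path, deciding who takes over
    // node1's partitions, throttling the resulting transitions -- is still
    // the application's problem. Helix provides those higher-level
    // primitives on top of the same consensus system.
  }
}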
Resource Assignment Problem - Making it Work: Take 2 (Decisions by Nodes)
Each node runs a service (S) that watches the consensus system for config changes and node changes, and writes node updates back.
Problems: multiple brains, app-specific logic baked into every node, and unscalable traffic against the consensus system.
Resource Assignment Problem - Making it Work: Take 3 (Single Brain)
A single controller watches the consensus system for config changes and node changes; the services (S) only report node updates.
Node logic is drastically simplified!
Resource Assignment Problem - Helix View
[Figure: a Helix controller manages the resources hosted on the nodes (participants), while spectators observe the cluster.]
Question: How do we make this controller generic enough to work for different resources?
Helix Concepts
Helix Concepts - Resources
A resource is made up of partitions. All partitions can be replicated.
Helix Concepts - Declarative State Model
Each replica is in one of the states the app declares, e.g. Offline, Slave, Master.
Helix Concepts - Constraints: Augmenting the State Model

State Constraints:
• MASTER: [1, 1]
• SLAVE: [0, R]

Transition Constraints:
• Scope: Cluster — OFFLINE-SLAVE: 3 concurrent
• Scope: Resource R1 — SLAVE-MASTER: 1 concurrent
• Scope: Participant P4 — OFFLINE-SLAVE: 2 concurrent

Special Constraint Values: R = replica count per partition; N = number of participants.

States and transitions are ordered by priority in computing replica states. Transition constraints can be restricted to cluster, resource, and participant scopes; the most restrictive constraint is used.
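As a rough sketch of how one of these transition constraints might be installed programmatically (the ZooKeeper address, cluster name, and constraint id are placeholders; the attribute names follow Helix's throttling tutorial, but verify them against your Helix version):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class TransitionThrottle {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // Cluster scope: allow at most 3 concurrent OFFLINE-SLAVE transitions.
    ConstraintItemBuilder builder = new ConstraintItemBuilder();
    builder.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
           .addConstraintAttribute("TRANSITION", "OFFLINE-SLAVE")
           .addConstraintAttribute("CONSTRAINT_VALUE", "3");
    admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
        "OfflineSlaveThrottle", builder.build());
  }
}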
Helix Concepts - Resources and the Augmented State Model
A resource is made up of partitions. All partitions can be replicated, and each replica is in a state (e.g. master, slave, offline) governed by the augmented state model.
Helix Concepts - Objectives
• Partition Placement: a distribution policy for partitions and replicas, making effective use of the cluster and the resource
• Failure and Expansion Semantics: changing existing replica states; creating new replicas and assigning states
Putting Concepts to Work
Rebalancing Strategies - Meeting Objectives within Constraints

                   Full-Auto  Semi-Auto  Customized  User-Defined
Replica Placement  Helix      App        App         App*
Replica State      Helix      Helix      App         App*

*App code plugged into the Helix controller.
Rebalancing Strategies - Full-Auto
Node 1: P1: M, P2: S | Node 2: P3: M, P1: S | Node 3: P2: M, P3: S
By default, Helix optimizes for minimal movement and even distribution of partitions and states.
Rebalancing Strategies - Semi-Auto
Node 1: P1: M, P2: S | Node 2: P3: M, P1: S | Node 3: P2: M, P3: S
Semi-Auto mode maintains the location of the replicas, but allows Helix to adjust the states to follow the state constraints. If a node hosting a master fails, a surviving slave of that partition (e.g. P3: S) is promoted to master in place.
This is ideal for resources that are expensive to move.
Rebalancing Strategies - Customized
The app specifies the location and state of each replica. Helix still ensures that transitions are fired according to constraints.
Need to respond to node changes? Use the Helix custom code invoker to run on one participant (a sketch follows), or use the user-defined rebalancing strategy described next.
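A sketch of the custom code invoker route, based on Helix's HelixCustomCodeRunner; the resource name and callback body are hypothetical. The runner uses a LeaderStandby resource internally so that exactly one participant executes the callback at a time:

import org.apache.helix.HelixConstants.ChangeType;
import org.apache.helix.HelixManager;
import org.apache.helix.participant.CustomCodeCallbackHandler;
import org.apache.helix.participant.HelixCustomCodeRunner;

public class NodeChangeWatcher {
  public static void start(HelixManager manager, String zkAddr) throws Exception {
    // Invoked on one elected participant whenever live-instance membership changes.
    CustomCodeCallbackHandler callback = context ->
        System.out.println("Cluster membership changed: " + context.getChangeType());

    new HelixCustomCodeRunner(manager, zkAddr)
        .invoke(callback)
        .on(ChangeType.LIVE_INSTANCE)
        .usingLeaderStandbyModel("nodeChangeWatcher") // hypothetical resource name
        .start();
  }
}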
Rebalancing Strategies - User-Defined
When a node joins or leaves the cluster, the Helix controller invokes code plugged in by the app: a rebalancer implemented by the app computes replica placement and state, and Helix fires the resulting transitions without violating constraints.
The rebalancer receives a full snapshot of the current cluster state, as well as access to the backing data store. Helix's own rebalancers implement the same interface.
Rebalancing Strategies - User-Defined: Distributed Lock Manager
State model: Offline, Locked, Released. Each lock is a partition! As nodes (Node 1, Node 2, Node 3) join or leave, the user-defined rebalancer redistributes the locks across the live nodes.
Rebalancing Strategies - User-Defined: Distributed Lock Manager

public ResourceAssignment computeResourceMapping(
    Resource resource, IdealState currentIdealState,
    CurrentStateOutput currentStateOutput, ClusterDataCache clusterData) {
  ...
  int i = 0;
  for (Partition partition : resource.getPartitions()) {
    Map<String, String> replicaMap = new HashMap<String, String>();
    // Assign each lock (partition) to a live participant, round-robin.
    int participantIndex = i % liveParticipants.size();
    String participant = liveParticipants.get(participantIndex);
    replicaMap.put(participant, "LOCKED");
    assignment.addReplicaMap(partition, replicaMap);
    i++;
  }
  return assignment;
}
Controller - Fault Tolerance
Controllers follow a LeaderStandby state model (Offline, Standby, Leader): the augmented state model concept applies to controllers too!
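A sketch of what this looks like operationally: several controller processes are started for the same cluster, and Helix elects one Leader while the rest wait in Standby. Names and addresses are placeholders; a DISTRIBUTED mode also exists for the multi-cluster setup shown next.

import org.apache.helix.HelixManager;
import org.apache.helix.controller.HelixControllerMain;

public class ControllerLauncher {
  public static void main(String[] args) throws Exception {
    // Start one controller process; running this on several machines gives
    // automatic failover, since only the elected Leader drives the cluster.
    HelixManager controller = HelixControllerMain.startHelixController(
        "localhost:2181",               // ZooKeeper address
        "MyCluster",                    // cluster to manage
        "controller_1",                 // unique controller name
        HelixControllerMain.STANDALONE);
  }
}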
Controller - Scalability
[Figure: three controllers (Controller 1-3) sharing responsibility for six clusters (Cluster 1-6).]
ZooKeeper View - Ideal State

{
  "id" : "SampleResource",
  "simpleFields" : {
    "REBALANCE_MODE" : "USER_DEFINED",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "2",
    "STATE_MODEL_DEF_REF" : "MasterSlave",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
  },
  "mapFields" : {
    "SampleResource_0" : {
      "node1_12918" : "MASTER",
      "node2_12918" : "SLAVE"
    }
    ...
  },
  "listFields" : {}
}

Replica placement and state:

      P1      P2
      N1: M   N2: M
      N2: S   N1: S
ZooKeeper View - Current State and External View

Current State (per node):
N1: P1: MASTER, P2: MASTER
N2: P1: OFFLINE, P2: OFFLINE

External View (per resource):

      P1      P2
      N1: M   N1: M
      N2: O   N2: O

Helix's responsibility is to make the external view match the ideal state as closely as possible.
Logical Deployment
[Figure: the Helix controller, a spectator, and three participants all connect to ZooKeeper; the spectator and each participant embed a Helix agent. The participants host replicas, e.g. P1: M and P2: S on the first node, P2: M and P3: S on the second, and P3: M and P1: S on the third.]
Getting Started
Example: Distributed Data Store
[Figure: 12 partitions, each replicated with one master and slaves, distributed across three nodes.]

Partition Management
• multiple replicas
• 1 master
• even distribution

Fault Tolerance
• fault detection
• promote slave to master
• even distribution
• no SPOF

Elasticity
• minimize downtime
• minimize data movement
• throttle movement
Helix-Based Solution - Example: Distributed Data Store
1. Define: state model, state transitions
2. Configure: create cluster, add nodes, add resource, configure rebalancer
3. Run: start controller, start participants
Example: Distributed Data Store - State Model Definition: Master-Slave
• States: all possible states, each with a priority
• Transitions: legal transitions, each with a priority
• Applicable to each partition of a resource
[Figure: state diagram with Offline, Slave, and Master.]
Example: Distributed Data Store - State Model Definition: Master-Slave

builder = new StateModelDefinition.Builder("MasterSlave");

// add states and their ranks to indicate priority
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// set the initial state when the participant starts
builder.initialState(OFFLINE);

// add transitions
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
Example: Distributed Data Store - Defining Constraints
Constraints can be specified for states and transitions at several scopes:

           State  Transition
Partition  Y      Y
Resource   -      Y
Node       Y      Y
Cluster    -      Y

For the Master-Slave model: MASTER StateCount = 1, SLAVE StateCount = 2.
Example: Distributed Data Store - Defining Constraints: Code

// static constraints
builder.upperBound(MASTER, 1);

// dynamic constraints
builder.dynamicUpperBound(SLAVE, "R");

// unconstrained
builder.upperBound(OFFLINE, -1);
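Once states, transitions, and constraints are in place, the definition would be built and registered with the cluster. A short sketch continuing the snippet above; the cluster name and ZooKeeper address are hypothetical:

// Build the definition and make it available to resources in the cluster.
StateModelDefinition masterSlave = builder.build();
HelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // hypothetical ZK address
admin.addStateModelDef("MyCluster", "MasterSlave", masterSlave);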
Example: Distributed Data Store - Participant Plug-In Code

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
class DistributedDataStoreModel extends StateModel {
  @Transition(from = "OFFLINE", to = "SLAVE")
  public void fromOfflineToSlave(Message m, NotificationContext ctx) {
    // bootstrap data, setup replication, etc.
  }

  @Transition(from = "SLAVE", to = "MASTER")
  public void fromSlaveToMaster(Message m, NotificationContext ctx) {
    // catch up previous master, enable writes, etc.
  }
  ...
}
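Helix instantiates this plug-in through a state model factory, one state model per replica hosted by the participant. A minimal sketch; note that the exact createNewStateModel signature varies across Helix versions:

import org.apache.helix.participant.statemachine.StateModelFactory;

class DistributedDataStoreModelFactory extends StateModelFactory<DistributedDataStoreModel> {
  @Override
  public DistributedDataStoreModel createNewStateModel(String resourceName, String partitionName) {
    // One state model instance per partition replica on this participant.
    return new DistributedDataStoreModel();
  }
}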
Example: Distributed Data Store - Configure and Run

HelixAdmin --zkSvr <zk-address>

Create Cluster:        --addCluster MyCluster
Add Participants:      --addNode MyCluster localhost_12000 ...
Add Resource:          --addResource MyDB 16 MasterSlave SEMI_AUTO
Configure Rebalancer:  --rebalance MyDB 3
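The Run step can also be done from Java. A sketch of starting a participant and plugging in the state model factory above; the instance and cluster names mirror the CLI example:

import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;

public class ParticipantLauncher {
  public static void main(String[] args) throws Exception {
    // Join the cluster as a participant.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "localhost_12000", InstanceType.PARTICIPANT, "localhost:2181");

    // Register the plug-in code for the MasterSlave model, then connect;
    // the controller can now fire transitions at this node.
    manager.getStateMachineEngine().registerStateModelFactory(
        "MasterSlave", new DistributedDataStoreModelFactory());
    manager.connect();
  }
}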
Example: Distributed Data Store - Spectator Plug-In Code

class RoutingLogic {
  public void write(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstances(partition, "MASTER");
    nodes.get(0).write(request);
  }

  public void read(Request request) {
    partition = getPartition(request.key);
    List<Node> nodes = routingTableProvider.getInstances(partition);
    random(nodes).read(request);
  }
}
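For the routing table itself, Helix ships a RoutingTableProvider that caches the external view. A sketch of wiring it up on a spectator; the names are placeholders:

import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.model.InstanceConfig;
import org.apache.helix.spectator.RoutingTableProvider;

public class RouterLauncher {
  public static void main(String[] args) throws Exception {
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        "MyCluster", "router_1", InstanceType.SPECTATOR, "localhost:2181");
    manager.connect();

    // The provider listens for external-view changes and keeps a local
    // routing table current; lookups never hit ZooKeeper on the hot path.
    RoutingTableProvider routingTableProvider = new RoutingTableProvider();
    manager.addExternalViewChangeListener(routingTableProvider);

    List<InstanceConfig> masters =
        routingTableProvider.getInstances("MyDB", "MyDB_0", "MASTER");
  }
}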
Example: Distributed Data Store - Where is the Code?
[Figure: the controller watches the consensus system for config changes and node changes, while participants write node updates. Participant plug-in code runs on each participant; spectator plug-in code runs on each spectator.]
Example: Distributed Search
[Figure: replicated index shards distributed across three nodes.]

Partition Management
• multiple replicas (index shards)
• rack-aware placement
• even distribution

Fault Tolerance
• fault detection
• auto create replicas
• controlled creation of replicas

Elasticity
• redistribute partitions
• minimize data movement
• throttle movement
Example: Distributed Search - State Model Definition: Bootstrap
[Figure: state model with states Offline, Bootstrap, Online, Idle, and Error (two states carry StateCount = 5 and StateCount = 3). Transitions are labeled: setup node, consume data to build index, stop consume data, can serve requests, stop indexing and serving, recover, cleanup.]
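A participant plug-in for this model would follow the same pattern as the data store example. A sketch covering two of the labeled transitions; the mapping of labels to transitions here is our reading of the diagram:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

@StateModelInfo(initialState = "OFFLINE",
    states = {"OFFLINE", "BOOTSTRAP", "ONLINE", "IDLE", "ERROR"})
class SearchIndexModel extends StateModel {
  @Transition(from = "OFFLINE", to = "BOOTSTRAP")
  public void fromOfflineToBootstrap(Message m, NotificationContext ctx) {
    // setup node: start consuming data to build the index shard
  }

  @Transition(from = "BOOTSTRAP", to = "ONLINE")
  public void fromBootstrapToOnline(Message m, NotificationContext ctx) {
    // stop consuming data: the shard can now serve requests
  }
}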
Example: Distributed Search - Configure and Run

Create Cluster:        --addCluster MyCluster
Add Participants:      --addNode MyCluster localhost_12000 ...
Add Resource:          --addResource MyIndex 16 Bootstrap CUSTOMIZED
Configure Rebalancer:  --rebalance MyIndex 8
Example: Message Consumers
[Figure: partitioned queues assigned to consumers C1-C3, illustrating assignment, scaling, and fault tolerance.]

Partition Management
• one consumer per queue
• even distribution

Elasticity
• redistribute queues among consumers
• minimize movement

Fault Tolerance
• redistribute
• minimize data movement
• limit max queues per consumer
Example: Message Consumers - State Model Definition: Online-Offline
[Figure: two states, Offline and Online (Online: StateCount = 1), with transitions "start consumption" and "stop consumption". Max 10 queues per consumer.]
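The "max 10 queues per consumer" cap can be expressed on the resource's ideal state. A sketch with hypothetical cluster and resource names, using IdealState's setMaxPartitionsPerInstance:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class ConsumerCap {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // Fetch the resource's ideal state, cap queues per consumer, write it back.
    IdealState idealState = admin.getResourceIdealState("MyCluster", "MyQueueGroup");
    idealState.setMaxPartitionsPerInstance(10);
    admin.setResourceIdealState("MyCluster", "MyQueueGroup", idealState);
  }
}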
Example: Message Consumers - Participant Plug-In Code

@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "ONLINE"})
class MessageConsumerModel extends StateModel {
  @Transition(from = "OFFLINE", to = "ONLINE")
  public void fromOfflineToOnline(Message m, NotificationContext ctx) {
    // register listener
  }

  @Transition(from = "ONLINE", to = "OFFLINE")
  public void fromOnlineToOffline(Message m, NotificationContext ctx) {
    // unregister listener
  }
}
Plugins
Plugins - Overview
• Chaos Monkey
• Data-Driven Testing and Debugging
• Intra-Cluster Messaging
• Rolling Upgrade
• On-Demand Task Scheduling
• Health Monitoring
Plugins - Data-Driven Testing and Debugging
• Instrument ZK, controller, and participant logs
• Simulate execution with Chaos Monkey
• Analyze invariants like state and transition constraints
The exact sequence of events can be replayed: debugging made easy!
Plugins - Data-Driven Testing and Debugging: Sample Log File

timestamp    partition   participantName    sessionId                             state
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_91   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
1.32331E+12  TestDB_60   express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  OFFLINE
1.32331E+12  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77-b05e-15a414478ccc  SLAVE
Plugins - Data-Driven Testing and Debugging: Count Aggregation

Time   State    Slave Count  Participant
42632  OFFLINE  0            10.117.58.247_12918
42796  SLAVE    1            10.117.58.247_12918
43124  OFFLINE  1            10.202.187.155_12918
43131  OFFLINE  1            10.220.225.153_12918
43275  SLAVE    2            10.220.225.153_12918
43323  SLAVE    3            10.202.187.155_12918
85795  MASTER   2            10.220.225.153_12918
Error! The state constraint for SLAVE has an upper bound of 2.
Plugins - Data-Driven Testing and Debugging: Time Aggregation

Slave Count  Time       Percentage
0            1082319    0.5
1            35578388   16.46
2            179417802  82.99
3            118863     0.05

83% of the time, there were 2 slaves to a partition.
93% of the time, there was 1 master to a partition.
We can see exactly how long the cluster was out of whack.
Current Status
Helix at LinkedIn
[Figure: the Oracle primary DB and Espresso publish data change events into Databus, which feeds updates to standardization, the search index, the graph index, and read replicas.]
Coming Up Next
• Automatic scaling with YARN
• New APIs
• Non-JVM participants
Summary
• Helix: a generic framework for building distributed systems
• Abstraction and modularity allow for modifying and enhancing system behavior
• Simple programming model: declarative state machine
Questions?
• website: helix.incubator.apache.org
• dev mailing list
• user mailing list
• @apachehelix