Dynamic Reconfiguration of Apache Zookeeper

Page 1: Dynamic Reconfiguration of  Apache Zookeeper

Dynamic Reconfiguration of Apache Zookeeper

USENIX ATC 2012, Hadoop Summit 2012

In collaboration with:
• Ben Reed (was @ Yahoo!, now @ Osmeta)
• Dahlia Malkhi (MSR)
• Flavio Junqueira (Yahoo!)

Alex Shraer, Yahoo! Research

Page 2: Dynamic Reconfiguration of  Apache Zookeeper

Talk Overview

• Short intro to Zookeeper
• What is a configuration?
• Why change it dynamically?
• How is Zookeeper reconfigured currently?
• Our solution:
  – Server side
  – Client side

Page 3: Dynamic Reconfiguration of  Apache Zookeeper

Modern Distributed Computing

• Lots of servers
• Lots of processes
• High volumes of data
• Highly complex software systems
• … mere mortal developers

Page 4: Dynamic Reconfiguration of  Apache Zookeeper

Coordination is important


Page 5: Dynamic Reconfiguration of  Apache Zookeeper

Coordination primitives

• Semaphores
• Queues
• Locks
• Leader Election
• Group Membership
• Barriers
• Configuration

Page 6: Dynamic Reconfiguration of  Apache Zookeeper

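Coordination is difficult even if we have just 2 replicas…

[Figure: starting from x = 0, two operations are issued: x++ and x = x * 2]
• Replica 1 applies: x++, then x = x * 2
• Replica 2 applies: x = x * 2, then x++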
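
The two orderings on this slide can be replayed in a few lines. This is only an illustrative sketch; the helper names (`increment`, `double`, `apply_in_order`) are made up for the example and are not part of Zookeeper.

```python
def increment(x):
    return x + 1


def double(x):
    return x * 2


def apply_in_order(initial, operations):
    """Apply each operation to the running value, in the given order."""
    value = initial
    for operation in operations:
        value = operation(value)
    return value


# Both replicas start from x = 0 and receive the same two operations,
# but in a different order.
replica_1 = apply_in_order(0, [increment, double])  # (0 + 1) * 2
replica_2 = apply_in_order(0, [double, increment])  # (0 * 2) + 1

print(replica_1, replica_2)  # prints "2 1": the replicas have diverged
```

Without an agreed order of operations, the two replicas end in different states even though neither of them failed.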

Page 7: Dynamic Reconfiguration of  Apache Zookeeper

Common approach: use a coordination service

• Leading coordination services:
  • Google: Chubby
  • Apache Zookeeper
    – Yahoo!, LinkedIn, Twitter, Facebook, VMware, UBS, Goldman Sachs, Netflix, Box, Cloudera, MapR, Nicira, …

Page 8: Dynamic Reconfiguration of  Apache Zookeeper

Zookeeper data model

• A tree of data nodes (znodes)
• Hierarchical namespace (like in a file system)
• Znode = <data, version, creation flags, children>

[Figure: example znode tree rooted at /, containing the nodes services, apps, users, workers, locks, worker1, worker2, x-1, x-2]

Page 9: Dynamic Reconfiguration of  Apache Zookeeper

ZooKeeper: API

• create
  – sequential, ephemeral
• setData
  – can be conditional on current version
• getData
  – can optionally set a “watch” on the znode
• getChildren
• exists
• delete
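
To make the "conditional on current version" behaviour of setData concrete, here is a toy in-memory model. It is a sketch under stated assumptions: `ToyZnodeStore` and `BadVersionError` are hypothetical names invented for this example, not the Zookeeper client API, and the model ignores watches, children, and ephemeral nodes. It keeps only the rule that a write carrying an expected version succeeds when that version matches the stored one, with -1 meaning "any version".

```python
class BadVersionError(Exception):
    """Raised when a conditional write carries a stale version."""


class ToyZnodeStore:
    """Hypothetical single-process stand-in for a znode tree."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (data, 0)

    def get_data(self, path):
        """Return (data, version) for the znode at path."""
        return self._nodes[path]

    def set_data(self, path, data, expected_version=-1):
        """Write data; fail if expected_version is given and does not match."""
        _, version = self._nodes[path]
        if expected_version != -1 and expected_version != version:
            raise BadVersionError(f"{path}: expected {expected_version}, found {version}")
        self._nodes[path] = (data, version + 1)
        return version + 1


store = ToyZnodeStore()
store.create("/x", 0)
data, version = store.get_data("/x")          # two clients both read version 0
store.set_data("/x", 5, expected_version=version)  # first writer wins
try:
    store.set_data("/x", 7, expected_version=version)  # second writer is stale
except BadVersionError as error:
    print("rejected:", error)
```

A client that is rejected this way would typically re-read the znode and retry, which gives a compare-and-set style update.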

Page 10: Dynamic Reconfiguration of  Apache Zookeeper

Example: Distributed Lock / Leader election

1- create “/lock”, ephemeral
2- if success, return with lock
3- getData /lock, watch=true
4- wait for /lock to go away, then goto 1

[Figure: Clients A, B and C; the tree holds / and /lock, labeled B]

Page 11: Dynamic Reconfiguration of  Apache Zookeeper

Example: Distributed Lock / Leader election

1- create “/lock”, ephemeral
2- if success, return with lock
3- getData /lock, watch=true
4- wait for /lock to go away, then goto 1

[Figure: Clients A, B and C; the tree holds only /; two notifications are delivered]

Page 12: Dynamic Reconfiguration of  Apache Zookeeper

Example: Distributed Lock / Leader election

1- create “/lock”, ephemeral
2- if success, return with lock
3- getData /lock, watch=true
4- wait for /lock to go away, then goto 1

[Figure: Clients A, B and C; the tree holds / and /lock, labeled C]

Page 13: Dynamic Reconfiguration of  Apache Zookeeper

Problem?

• Herd effect
  – Large number of clients wake up simultaneously
  – Load spike
• Some client may never get the lock

Page 14: Dynamic Reconfiguration of  Apache Zookeeper

Example: Distributed Lock / Leader election

1- z = create “/C-”, sequential, ephemeral
2- getChildren “/”
3- if z has lowest id return with lock
4- getData prev_node, watch=true
5- wait for prev_node to go away, then goto 2

[Figure: Clients A, B, …, Z; the tree holds / with children /C-1 (A), /C-2 (B), …, /C-n (Z)]
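
The decision each client makes in steps 2 to 4 of this recipe can be sketched as a pure function. The names `sequence_number` and `next_action` are hypothetical, and the sketch assumes child names end in a numeric suffix after the last "-", as with sequential znodes; it performs no network calls.

```python
def sequence_number(name):
    """Extract the numeric suffix of a sequential child name such as 'C-0000000002'."""
    return int(name.rsplit("-", 1)[1])


def next_action(my_node, children):
    """Decide what a client should do after listing the children.

    Returns ("lock acquired", None) if my_node has the lowest id,
    otherwise ("watch", predecessor), where predecessor is the child
    immediately before my_node in sequence order.
    """
    ordered = sorted(children, key=sequence_number)
    index = ordered.index(my_node)
    if index == 0:
        return ("lock acquired", None)
    return ("watch", ordered[index - 1])


children = ["C-0000000003", "C-0000000001", "C-0000000002"]
for node in sorted(children):
    print(node, next_action(node, children))
```

Each waiting client watches a different znode, so when the lock holder's znode goes away only one client is notified. This is what avoids the herd effect of the single-znode recipe.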

Page 15: Dynamic Reconfiguration of  Apache Zookeeper

Example: Distributed Lock / Leader election

1- z = create “/C-”, sequential, ephemeral
2- getChildren “/”
3- if z has lowest id return with lock
4- getData prev_node, watch=true
5- wait for prev_node to go away, then goto 2

[Figure: Clients A, B, …, Z; /C-1 is gone and the tree holds / with children /C-2 (B), …, /C-n (Z)]

Page 16: Dynamic Reconfiguration of  Apache Zookeeper

Example: Distributed Lock / Leader election

1- z = create “/C-”, sequential, ephemeral
2- getChildren “/”
3- if z has lowest id return with lock
4- getData prev_node, watch=true
5- wait for prev_node to go away, then goto 2

[Figure: Clients A, B, …, Z; the tree holds / with children /C-2 (B), …, /C-n (Z); a notification reaches Client B]

Page 17: Dynamic Reconfiguration of  Apache Zookeeper

Zookeeper - distributed and replicated

[Figure: the ZooKeeper Service consists of several servers, one of which is the Leader; many clients connect to the servers]

• All servers store a copy of the data (in memory)
• A leader is elected at startup
• Reads served by followers, all updates go through leader
• Update acked when a quorum of servers have persisted the change (on disk)
• Zookeeper uses ZAB - its own atomic broadcast protocol
  – Borrows a lot from Paxos, but conceptually different
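
As a small illustration of "acked when a quorum of servers have persisted the change", here is a sketch that assumes plain majority quorums. The function names are made up for the example; hierarchical quorums with weighted votes are not modeled.

```python
def is_majority_quorum(acks, ensemble):
    """True if the servers that acked form a strict majority of the ensemble.

    Acks from servers outside the ensemble are ignored.
    """
    acked = set(acks) & set(ensemble)
    return len(acked) > len(ensemble) // 2


def max_tolerated_failures(ensemble_size):
    """Largest number of failed servers that still leaves a majority alive."""
    return (ensemble_size - 1) // 2


ensemble = {"s1", "s2", "s3", "s4", "s5"}
print(is_majority_quorum({"s1", "s2"}, ensemble))        # False: 2 of 5
print(is_majority_quorum({"s1", "s2", "s3"}, ensemble))  # True: 3 of 5
print(max_tolerated_failures(len(ensemble)))             # 2
```

Any two majorities of the same ensemble share at least one server, which is the property the rest of the talk relies on when it discusses what breaks once the ensemble itself changes.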

Page 18: Dynamic Reconfiguration of  Apache Zookeeper

ZooKeeper: Overview

[Figure: three client apps, each using the ZooKeeper client lib, hold sessions with servers of the ensemble (one leader, four followers). The ensemble is a replicated system; the leader atomically broadcasts updates]

Page 19: Dynamic Reconfiguration of  Apache Zookeeper

ZooKeeper: Overview

[Figure: animation on the same ensemble. One client sends setData /x, 5 and receives ok while the replicas move from /X = 0 to /X = 5; a getData /x from another client returns 0; after a sync (ok), getData /x returns 5]

Page 20: Dynamic Reconfiguration of  Apache Zookeeper

FIFO order of client’s commands

• A client invokes many operations asynchronously
  – All commands have a blocking variant, rarely used in practice
• Zookeeper guarantees that a prefix of these operations completes in invocation order

Page 21: Dynamic Reconfiguration of  Apache Zookeeper

Zookeeper is a Primary/Backup system

• Important subclass of State-Machine Replication
• Many (most?) Primary/Backup systems work as follows:
  • Primary speculatively executes operations, doesn’t wait for previous ops to commit
  • Sends idempotent state updates to backups
    – “makes sense” only in the context of […]
    – Primary speculatively executes and sends out […], but it will only appear in a backup’s log after […]
    – In general State Machine Replication (Paxos), a backup’s log may become […]
• Primary order: each primary commits a consecutive segment in the log

Page 22: Dynamic Reconfiguration of  Apache Zookeeper

Configuration of a Distributed Replicated System

• Membership

• Role of each server
  – E.g., deciding on changes (acceptor/follower) or learning the changes (learner/observer)

• Quorum System spec
  – Zookeeper: majorities or hierarchical (server votes have different weight)

• Network addresses & ports

• Timeouts, directory paths, etc.

Page 23: Dynamic Reconfiguration of  Apache Zookeeper

Dynamic Membership Changes

• Necessary in every long-lived system!
• Examples:
  – Cloud computing: adapt to changing load, don’t pre-allocate!
  – Failures: replacing failed nodes with healthy ones
  – Upgrades: replacing out-of-date nodes with up-to-date ones
  – Free up storage space: decreasing the number of replicas
  – Moving nodes: within the network or the data center
  – Increase resilience by changing the set of servers
    Example: asynch. replication works as long as > #servers/2 operate

Page 24: Dynamic Reconfiguration of  Apache Zookeeper

Other Dynamic Configuration Changes

• Changing server addresses/ports
• Changing server roles:
  [Figure: servers move between the roles leader & followers (acceptors) and observers (learners), in both directions]
• Changing the Quorum System
  – E.g., if a new powerful & well-connected server is added

Page 25: Dynamic Reconfiguration of  Apache Zookeeper

Reconfiguring Zookeeper

• Not supported
• All config settings are static – loaded during boot
• Zookeeper users repeatedly asking for reconfig. since 2008
  – Several attempts found incorrect and rejected

Page 26: Dynamic Reconfiguration of  Apache Zookeeper

Manual Reconfiguration

• Bring the service down, change configuration files, bring it back up
• Wrong reconfiguration caused split-brain & inconsistency in production
• Questions about manual reconfig are asked several times each week
• Admins prefer to over-provision rather than reconfigure [LinkedIn talk @Yahoo, 2012]
  – Doesn’t help with many reconfiguration use-cases
  – Wastes resources, adds management overhead
  – Can hurt Zookeeper throughput (we show)
• Configuration errors primary cause of failures in production systems [Yin et al., SOSP’11]

Page 27: Dynamic Reconfiguration of  Apache Zookeeper

Hazards of Manual Reconfiguration

[Figure: servers A, B, C, D, E during the restart; some servers hold config {A, B, C} while others already hold {A, B, C, D, E}]

• Goal: add servers E and D
• Change configuration files
• Restart all servers
• We lost […] and […]!!
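
One way to see the hazard is to check whether a quorum of the old configuration {A, B, C} and a quorum of the new configuration {A, B, C, D, E} can be disjoint while servers are restarted one by one. The sketch below assumes plain majority quorums; the function names are invented for the example.

```python
from itertools import combinations


def majority_quorums(config):
    """All subsets of config that contain a strict majority of its members."""
    members = sorted(config)
    smallest = len(members) // 2 + 1
    return [
        frozenset(subset)
        for size in range(smallest, len(members) + 1)
        for subset in combinations(members, size)
    ]


def disjoint_quorum_pairs(old_config, new_config):
    """Pairs (old quorum, new quorum) that have no server in common."""
    return [
        (old_quorum, new_quorum)
        for old_quorum in majority_quorums(old_config)
        for new_quorum in majority_quorums(new_config)
        if not (old_quorum & new_quorum)
    ]


old_config = {"A", "B", "C"}
new_config = {"A", "B", "C", "D", "E"}
for old_quorum, new_quorum in disjoint_quorum_pairs(old_config, new_config):
    print(sorted(old_quorum), "vs", sorted(new_quorum))
```

For example, if A and B still run with {A, B, C} while C, D and E already run with {A, B, C, D, E}, both groups hold a majority of the configuration they believe in, and nothing forces them to agree.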

Page 28: Dynamic Reconfiguration of  Apache Zookeeper

Just use a coordination service!

• Zookeeper is the coordination service
  – Don’t want to deploy another system to coordinate it!
• Who will reconfigure that system?
  – GFS has 3 levels of coordination services
• More system components -> more management overhead
• Use Zookeeper to reconfigure itself!
  – Other systems store configuration information in Zookeeper
  – Can we do the same??
• Only if there are no failures

Page 29: Dynamic Reconfiguration of  Apache Zookeeper

Recovery in Zookeeper

• Leader failure activates leader election & recovery

[Figure: servers A, B, C, D, E; a setData(/x, 5) operation is in flight when the leader fails]

Page 30: Dynamic Reconfiguration of  Apache Zookeeper

This doesn’t work for reconfigurations!

[Figure: servers A, B, C, D, E all hold config {A, B, C, D, E}; the operation setData(/zookeeper/config, {A, B, F}) (remove C, D, E; add F) is issued; two servers, including the new server F, hold {A, B, F}]

• Must persist the decision to reconfigure in the old config before activating the new config!
• Once such decision is reached, must not allow further ops to be committed in old config

Page 31: Dynamic Reconfiguration of  Apache Zookeeper

Our Solution

• Correct
• Fully automatic
• No external services or additional components
• Minimal changes to Zookeeper
• Usually unnoticeable to clients
  – Pause operations only in rare circumstances
  – Clients work with a single configuration
• Rebalances clients across servers in new configuration
• Reconfigures immediately
• Speculative Reconfiguration
  – Reconfiguration (and commands that follow it) speculatively sent out by the primary, similarly to all other updates

Page 32: Dynamic Reconfiguration of  Apache Zookeeper

Principles of Reconfiguration

• A reconfiguration C1 -> C2 should do the following:
  1. Consensus decision in the current config (C1) on a new config (C2)
  2. Transfer of control from C1 to C2:
     a) Suspend C1
     b) Collect a snapshot S of completed/potentially completed ops in C1
     c) Transfer S to a quorum of C2
  3. A consensus decision in C2 to activate C2 with the “start state” S

Page 33: Dynamic Reconfiguration of  Apache Zookeeper

Easy in a Primary/Backup system!

1. Decision on the next config is like decision on any other operation
2. a. Instead of suspending, we guarantee that further ops can only be committed in C2 unless reconfiguration fails (primary order!)
   b. The primary is the only one executing & proposing ops; S = the primary’s log
   c. Transfer ahead of time, here only make sure the transfer is complete
3. If leader stays the same, no need to run full consensus

Page 34: Dynamic Reconfiguration of  Apache Zookeeper

Failure-Free Flow


Page 35: Dynamic Reconfiguration of  Apache Zookeeper

Protocol Features

• After reconfiguration is proposed, leader schedules & executes operations as usual
  – Leader of the new configuration is responsible for committing these
• If leader of old config is in new config and “able to lead”, it remains the leader
• Otherwise, old leader nominates new leader (saves leader election time)
• We support multiple concurrent reconfigurations
  – Activate only the “last” config, not intermediate ones
  – In the paper, not in production

Page 36: Dynamic Reconfiguration of  Apache Zookeeper

Recovery – Discovering Decisions

[Figure: servers A, B, C, D, E; some hold config {A, B, C}, others hold {A, D, E}]

• […]: replace B, C with E, D
• C must
  1) discover possible decisions in {A, B, C} (find out about {A, D, E})
  2) discover possible activation decision in {A, D, E}
     - If {A, D, E} is active, C mustn’t attempt to transfer state
     - Otherwise, C should transfer state & activate {A, D, E}

Page 37: Dynamic Reconfiguration of  Apache Zookeeper

Recovery – Challenges of Configuration Overlap

[Figure: servers A, B, C hold config {A, B, C}; server D is being added; one server holds {A, B, D}]

• […]: replace C with D
• Only B can be the leader of {A, B, C}, but it won’t -> No leader will be elected in {A, B, C}
• D will not connect to B (doesn’t know who B is) -> B will not be able to lead {A, B, D}

Page 38: Dynamic Reconfiguration of  Apache Zookeeper

Recovering Operations submitted while reconfig is pending

[Figure: servers A, B, C, D, E; three hold config {A, B, C}, one holds {A, D, E}]

• Old leader may execute and submit them
• New leader is responsible for committing them, once persisted on quorum of NEW config
• However, because of the way recovery in ZAB works, the protocol is easier to implement if such operations are persisted also on quorum of old config, in addition to new config, before they are committed
• Leader commits […] and […], but no one else knows
• In ZAB C can’t fetch history from E/D. Thus, C must already have […]

Page 39: Dynamic Reconfiguration of  Apache Zookeeper

The “client side” of reconfiguration

• When system changes, clients need to stay connected
  – The usual solution: directory service (e.g., DNS)
• Re-balancing load during reconfiguration is also important!
• Goal: uniform #clients per server with minimal client migration
  – Migration should be proportional to change in membership

[Figure: three servers with 10 clients each]

Page 40: Dynamic Reconfiguration of  Apache Zookeeper

Consistent Hashing?

• Client and server ids randomly mapped to a circular m-bit space (0 follows 2^m – 1)
• Each client then connects to the server that immediately follows it in the circle
• In order to improve load balancing each server is hashed k times (usually k ~ log(#servers))

Load balancing is uniform W.H.P. depending on #servers and k.
In Zookeeper #servers < 10

[Figure: two panels, k = 5 and k = 20, over the reconfiguration steps remove 1, remove 2, add 3, remove 1, add 1]
• 9 servers, 1000 clients
• Each client initially connects to a random server
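
The scheme described on this slide can be sketched as follows. This is a minimal illustration, not the code used for the talk's experiment: SHA-256 stands in for the random mapping, m = 32 and k = 5 are arbitrary choices, the names `ring_position`, `build_ring` and `assign` are hypothetical, and, unlike the experiment, clients are placed by the ring from the start rather than connecting to a random server.

```python
import bisect
import hashlib


def ring_position(key, m_bits=32):
    """Map a string id to a point in the circular m-bit space."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << m_bits)


def build_ring(servers, k):
    """Hash every server k times; return the sorted list of (position, server)."""
    return sorted(
        (ring_position(f"{server}#{replica}"), server)
        for server in servers
        for replica in range(k)
    )


def assign(client, ring):
    """Return the server whose point immediately follows the client on the circle."""
    positions = [position for position, _ in ring]
    index = bisect.bisect_right(positions, ring_position(client))
    return ring[index % len(ring)][1]  # wrap around: 0 follows 2^m - 1


servers = [f"server{i}" for i in range(1, 10)]      # 9 servers
clients = [f"client{i}" for i in range(1000)]       # 1000 clients

ring_before = build_ring(servers, k=5)
ring_after = build_ring(servers[:-1], k=5)          # remove server9

before = {client: assign(client, ring_before) for client in clients}
after = {client: assign(client, ring_after) for client in clients}
moved = [client for client in clients if before[client] != after[client]]

# Only clients that were connected to the removed server change servers.
print(len(moved), "clients moved;", "all came from server9:",
      all(before[client] == "server9" for client in moved))
```

Migration is minimal by construction, but nothing here forces the per-server load to be even. The slide's point is that with fewer than 10 servers the uniformity guarantee depends on choosing k large enough.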

Page 41: Dynamic Reconfiguration of  Apache Zookeeper

Our approach - Probabilistic Load Balancing

• Example 1: [Figure: 3 servers with 10 clients each become 5 servers with 6 clients each]
  – Each client moves to a random new server with probability 0.4
    • 1 – 3/5 = 0.4
  – Exp. 40% clients will move off of each server
• Example 2: [Figure: servers A, B, C, D, E with 6 clients each become servers D, E, F with 10 clients each]
  – Clients connected to D and E don’t move
  – Clients connected to A, B, C move to D, E with probability 4/9
    • |S ∩ S’|(|S|-|S’|) / (|S’||S \ S’|) = 2(5-3)/(3*3) = 4/9
  – Exp. 8 clients will move from A, B, C to D, E and 10 to F
  – [Figure labels: of the 18 moving clients, 4/18 go to D, 4/18 go to E and 10/18 go to F]
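
The arithmetic of Example 2 can be checked with exact fractions. The sketch below covers only the case shown there (the new configuration is smaller, and a client's server is being removed) and assumes every old server starts with the same number of clients; the function names are made up for the illustration.

```python
from fractions import Fraction


def probability_move_to_kept(old, new):
    """Pr that a client of a removed server reconnects to a server kept from old.

    Uses the expression from the slide:
    |S ∩ S'| (|S| - |S'|) / (|S'| |S \\ S'|), valid here for |S'| < |S|.
    """
    kept = old & new
    removed = old - new
    return Fraction(len(kept) * (len(old) - len(new)), len(new) * len(removed))


def expected_loads(old, new, clients_per_server):
    """Expected #clients on each server of new, starting from a uniform load on old."""
    kept, removed, added = old & new, old - new, new - old
    moving = clients_per_server * len(removed)
    p_kept = probability_move_to_kept(old, new)
    loads = {}
    for server in kept:    # clients already on kept servers stay put
        loads[server] = clients_per_server + moving * p_kept / len(kept)
    for server in added:   # newly added servers start empty
        loads[server] = moving * (1 - p_kept) / len(added)
    return loads


old = {"A", "B", "C", "D", "E"}
new = {"D", "E", "F"}
print(probability_move_to_kept(old, new))   # 4/9
print(expected_loads(old, new, 6))          # D, E and F each expect 10 clients
```

With 6 clients per server, 18 clients leave A, B and C; 8 of them are expected on D and E and 10 on F, matching the numbers on the slide.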

Page 42: Dynamic Reconfiguration of  Apache Zookeeper

Probabilistic Load Balancing

When moving from config. S to S’:

$$E(\mathit{load}(i, S')) = \mathit{load}(i, S) + \sum_{j \in S,\ j \neq i} \mathit{load}(j, S) \cdot \Pr(j \to i) - \mathit{load}(i, S) \cdot \sum_{j \in S',\ j \neq i} \Pr(i \to j)$$

• E(load(i, S’)): expected #clients connected to i in S’ (10 in last example)
• load(i, S): #clients connected to i in S
• Second term: #clients moving to i from other servers in S
• Third term: #clients moving from i to other servers in S’

Solving for Pr we get case-specific probabilities.

Input: each client answers locally
  Question 1: Are there more servers now or less?
  Question 2: Is my server being removed?
Output:
  1) disconnect or stay connected to my server
  2) if disconnect: Pr(connect to one of the old servers) and Pr(connect to newly added server)

Page 43: Dynamic Reconfiguration of  Apache Zookeeper

Implementation

• Implemented in Zookeeper (Java & C), integration ongoing
  – 3 new Zookeeper API calls: reconfig, getConfig, updateServerList
  – feature requested since 2008, expected in 3.5.0 release (July 2012)
• Dynamic changes to:
  – Membership
  – Quorum System
  – Server roles
  – Addresses & ports
• Reconfiguration modes:
  – Incremental (add servers E and D, remove server B)
  – Non-incremental (new config = {A, C, D, E})
  – Blind or conditioned (reconfig only if current config is #5)
• Subscriptions to config changes
  – Client can invoke client-side re-balancing upon change
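
The reconfiguration modes listed above can be illustrated with a toy model of the membership set. This is not the Zookeeper `reconfig` call: `apply_reconfig`, its parameters, and `StaleConfigError` are hypothetical, and the config version is modeled as a simple counter, which is an assumption of the sketch.

```python
class StaleConfigError(Exception):
    """Raised when a conditioned reconfig names a version that is no longer current."""


def apply_reconfig(current, version, joining=(), leaving=(),
                   new_members=None, from_version=None):
    """Return (new membership, new version) for one reconfiguration request.

    Incremental mode:      pass joining and/or leaving.
    Non-incremental mode:  pass new_members (the full new membership).
    Conditioned mode:      also pass from_version; blind if it is None.
    """
    if from_version is not None and from_version != version:
        raise StaleConfigError(f"current config is #{version}, not #{from_version}")
    if new_members is not None:
        updated = frozenset(new_members)
    else:
        updated = (frozenset(current) | frozenset(joining)) - frozenset(leaving)
    return updated, version + 1


config, version = frozenset({"A", "B", "C"}), 5

# Incremental: add servers E and D, remove server B.
incremental, _ = apply_reconfig(config, version, joining={"E", "D"}, leaving={"B"})

# Non-incremental: state the whole new config, conditioned on config #5.
non_incremental, next_version = apply_reconfig(
    config, version, new_members={"A", "C", "D", "E"}, from_version=5)

print(sorted(incremental), sorted(non_incremental), next_version)
```

Both requests lead to the same membership {A, C, D, E}. The conditioned form additionally fails if another reconfiguration slipped in first, which protects an administrator from acting on an outdated view.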

Page 44: Dynamic Reconfiguration of  Apache Zookeeper

Evaluation

[Figure: evaluation plots, annotated with the reconfiguration events: remove 4 servers, remove 2 servers, remove, add, remove-leader, add, remove, add]

Page 45: Dynamic Reconfiguration of  Apache Zookeeper

Summary

• Design and implementation of reconfiguration for Apache Zookeeper
  – being contributed into Zookeeper codebase
• Much simpler than state of the art, using properties already provided by Zookeeper
• Many nice features:
  – Doesn’t limit concurrency
  – Reconfigures immediately
  – Preserves primary order
  – Doesn’t stop client ops
    • Zookeeper used by online systems, any delay must be avoided
  – Clients work with a single configuration at a time
  – No external services
  – Includes client-side rebalancing

Page 46: Dynamic Reconfiguration of  Apache Zookeeper

Questions?
