ZooKeeper
CSCE 678
Weak vs Strong Consistency
• In distributed systems, consistency is often the target of weakening
• Examples of weak consistency:
  • NoSQL servers – Dynamo
  • DataNodes in HDFS
• But sometimes, strong consistency is needed
Use Cases of Strong Consistency
• Configuration management
• Message Queues
• Group membership
• Synchronization:
  • Mutexes and read/write locks
  • Barriers: process joins
(Will talk about them one by one)
Use Case: Configuration & Group Membership
[Figure: a primary node pushes a configuration znode containing Primary: …, Secondaries: …, Port IDs: …, Created by: …]
Use Case: Message Queues
[Figure: Producer A and Producer B enqueue messages into a queue; a Consumer dequeues them.]
Use Case: Synchronization
• Mutexes

    lock();
    last_x = read(x);
    write(y, last_x);
    write(x, last_x + 1);
    unlock();

• Read/write locks (shared among readers, exclusive for one writer)

    rd_lock();
    if (queue.size > 0) {
      wr_lock();
      x = queue.dequeue();
      wr_unlock();
    }
    rd_unlock();
How to Define Strong Consistency?
• Linearizability
  • Also called atomic consistency or immediate consistency
  • As soon as a write operation finishes, the whole system should see the latest data.
Serializability vs Linearizability
• For distributed systems, these two words are used in very specific contexts
• Serializability: for databases
  • Equivalent to Serializable isolation (the "I" in ACID)
  • No ordering constraint between concurrent transactions
• Linearizability: for strongly-consistent reads/writes
  • With respect to operations, not transactions
  • Considers the global ordering of reads and writes
Serializability vs Linearizability
• Serializability

[Figure: Client A submits the transaction read(x), write(y), read(y); Client B submits the transactions read(y), write(x), write(y) and read(y), read(x), write(y). The system executes the three transactions in some serial order over time, e.g., A's, then B's first, then B's second.]

No constraint of ordering between concurrent transactions.
Serializability vs Linearizability
• In fact, serializability can be implemented on top of linearizable operations (e.g., a lock service)

[Figure: Server 1 runs read(x), write(y), read(y), taking a read (shared) lock on x and a write (exclusive) lock on y; Server 2 runs read(y), write(x), write(y) and is blocked until those locks are released, then proceeds: ok, ok.]

Two-phase locking (2PL); not to be confused with two-phase commit (2PC).
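As a sketch of how a lock service enables two-phase locking, here is a minimal in-memory shared/exclusive lock manager in Python. The class and method names are illustrative, not part of any real lock service; a 2PL transaction would acquire every lock it needs (growing phase) before releasing any (shrinking phase):

```python
import threading

class RWLockManager:
    """Toy lock manager with per-key shared (read) and exclusive (write)
    locks. Illustrative only: no deadlock detection, no fairness."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = {}   # key -> number of shared holders
        self._writer = {}    # key -> True if exclusively held

    def lock(self, key, exclusive=False):
        with self._cond:
            if exclusive:
                # exclusive: wait until no readers and no writer hold the key
                while self._readers.get(key, 0) > 0 or self._writer.get(key):
                    self._cond.wait()
                self._writer[key] = True
            else:
                # shared: wait only while a writer holds the key
                while self._writer.get(key):
                    self._cond.wait()
                self._readers[key] = self._readers.get(key, 0) + 1

    def unlock(self, key, exclusive=False):
        with self._cond:
            if exclusive:
                self._writer[key] = False
            else:
                self._readers[key] -= 1
            self._cond.notify_all()
```

Under 2PL, the transaction read(x), write(y), read(y) would call lock("x"), lock("y", exclusive=True), do its reads and writes, and only then release both locks.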
Serializability vs Linearizability
• Linearizability: as soon as a write operation finishes, the whole system should see the latest data.
[Figure: Client A performs write(x, 1) => ok; the write is inserted on Server 1 and Server 2; Client B then performs read(x) => 1.]
More on Linearizability
• Reads concurrent with a write
[Figure: Client C performs write(x, 1) => ok; Client A and Client B issue reads that overlap and follow the write, returning read(x) => 0 or read(x) => 1.]

A read that completes before the write begins must see the old value.
A read that overlaps the write may return either the old or the new value; but as soon as one client sees the new value, all subsequent reads must return it too.
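These rules can be checked mechanically. Below is a toy brute-force linearizability checker for a single register (the tuple format and function name are my own, for illustration): it accepts a history in which a read overlapping a write returns either value, and rejects one in which a stale value is read after the write has completed:

```python
from itertools import permutations

def linearizable(history, init=0):
    """Brute-force linearizability check for one register. Each history
    entry is (start, end, kind, value) with wall-clock start/end times.
    We search for a total order that (1) respects real time: if op A ends
    before op B starts, A must come first; and (2) makes every read return
    the latest preceding write. Exponential-time, illustration only."""
    for order in permutations(history):
        # (1) real-time constraint
        if any(order[j][1] < order[i][0]
               for i in range(len(order)) for j in range(i + 1, len(order))):
            continue
        # (2) register semantics
        value, ok = init, True
        for _, _, kind, v in order:
            if kind == "write":
                value = v
            elif v != value:
                ok = False
                break
        if ok:
            return True
    return False
```

For example, [(2, 6, "write", 1), (1, 3, "read", 0), (4, 8, "read", 1)] is linearizable (the first read overlaps the write and may still see 0), while [(2, 6, "write", 1), (4, 5, "read", 1), (7, 9, "read", 0)] is not: the last read starts after the write finished, yet returns the old value.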
More on Linearizability
• Write concurrent with another write
[Figure: Client A performs write(x, 1) => ok and Client C performs write(x, 2) => ok; Client B's subsequent reads return read(x) => 1 and then read(x) => 2.]

A write that completes before another write begins must take effect first; once the later write finishes, reads return its value.
Linearizability vs Causality
• Linearizability implies causal consistency

(1) Causality is based on the happens-before relationship within a client.
If a client sets x = 1 and then sets y = 0, the two writes have a causal relationship.

[Figure: Client A performs write(x, 1) and then write(y, 0). Client B reads y => 0 but then reads x => 0, observing the later write without the earlier one: this violates the causal relationship!]
Linearizability vs Causality
• Linearizability implies causal consistency

(2) Linearizability implies a single total order over the operations, so it automatically preserves happens-before.
A linearizable system doesn't have to do anything extra to preserve causal consistency.

[Figure: Client A performs write(x, 1) and then write(y, 0); under the total order, any client that reads y => 0 must subsequently read x => 1.]
How to Implement Linearizability?
• Read/write from a single leader
  • Failover to a replica may lose linearizability
• Consensus algorithms
  • Using two-phase commit (2PC) or Paxos
  • Prevent stale replicas
• Most likely unlinearizable: multi-leader replication
Q: How does linearizability relate to the CAP theorem?
Linearizability in CAP Theorem
• Linearizability = strong C
  • Read/write through a single leader: lose A & P
  • Read through replicas, write through the leader: lose P
• With consensus, linearizability can be:
  • Partition-tolerant when a majority of the replicas stay connected
  • Fairly available, with wait-free operations and fast, lossless leader recovery
Apache ZooKeeper
• A coordination service for all use cases of strong consistency in a distributed system
• Wait-free: no operation blocks on other slow or failed clients
• ZooKeeper has no API for locking, but can be used to implement any locking mechanism
System Overview
[Figure: a ZooKeeper service of five servers, one leader and four followers; clients connect to any server, and followers forward operations to the leader.]
Namespace
• ZooKeeper uses a filesystem-like namespace
[Figure: a tree of znodes rooted at /, with children /App1 and /App2, and /App1/p_1, /App1/p_2, /App1/p_3 under /App1. Each znode holds data: not large data files, more like metadata.]
Namespace
• ZooKeeper has two types of znodes (paths)
  • Permanent (regular): clients explicitly create and delete the znodes
  • Ephemeral: clients create the znodes, and either delete them explicitly or let the system delete them automatically when the client session times out
API
• create(path, data, flags)
  • flags: regular/ephemeral, sequential (appends a sequence number)
• getData(path, watch) -> (data, version)
• setData(path, data, version)
• delete(path, version)
• exists(path, watch) -> true/false
• getChildren(path, watch) -> [paths]
• sync(path) (the path argument is currently ignored)
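To make the API concrete, here is a toy in-memory model of these calls in Python. It is a sketch under heavy simplifying assumptions: no networking, sessions, or watches, and ephemeral-node cleanup is omitted. Names mirror the slide, but the semantics are not ZooKeeper's actual implementation:

```python
class ZNodeStore:
    """Toy in-memory znode store mirroring the slide's API."""
    def __init__(self):
        self._nodes = {}      # path -> [data, version]
        self._seq = 0

    def create(self, path, data, ephemeral=False, sequential=False):
        # ephemeral is accepted but session cleanup is omitted in this sketch
        if sequential:
            path = "%s%010d" % (path, self._seq)  # append a sequence number
            self._seq += 1
        if path in self._nodes:
            return None                           # error: node already exists
        self._nodes[path] = [data, 0]
        return path

    def getData(self, path):
        data, version = self._nodes[path]
        return data, version

    def setData(self, path, data, version):
        node = self._nodes[path]
        if version != -1 and node[1] != version:
            return False      # version check failed (optimistic concurrency)
        node[0] = data
        node[1] += 1
        return True

    def delete(self, path, version):
        if version != -1 and self._nodes[path][1] != version:
            return False
        del self._nodes[path]
        return True

    def exists(self, path):
        return path in self._nodes

    def getChildren(self, path):
        prefix = path.rstrip("/") + "/"
        return [p for p in self._nodes
                if p.startswith(prefix) and "/" not in p[len(prefix):]]
```

The version arguments model ZooKeeper's conditional updates: setData and delete succeed only if the caller's expected version matches (or is -1, meaning "any version").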
Asynchronous Operations
• Operations can be synchronous or asynchronous
• Client can queue up multiple asynchronous operations
• Server responds by invoking callbacks
[Figure: a client queues getData(x), create(y), … to a ZooKeeper server; the server responds by invoking the callback for getData(x), the callback for create(y), and so on.]
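A minimal sketch of this pipelining pattern, with plain Python callables standing in for ZooKeeper operations and callbacks (all names here are illustrative, not a real client API): submit() returns immediately, and flush() plays the server's role, executing queued operations FIFO and invoking each callback with its result:

```python
from collections import deque

class AsyncClient:
    """Toy model of pipelined asynchronous operations."""
    def __init__(self):
        self._pending = deque()

    def submit(self, op, callback):
        # queue the operation and return without blocking
        self._pending.append((op, callback))

    def flush(self):
        # the "server": process ops FIFO, responding via callbacks
        while self._pending:
            op, cb = self._pending.popleft()
            cb(op())
```

Because the queue is FIFO, callbacks fire in submission order, matching the per-client ordering guarantee discussed below.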
Event Notification
• All operations except sync are wait-free
• No locking API but clients can implement locks using watch events
Lock (very naïve version):

    1  l = "/my-lock";
    2  if exists(l, watch=true) then wait for watch event;
    3  n = create(l, EPHEMERAL);
    4  if n is error then goto 2;

Unlock:

    1  delete(l);

The server will push events to the client when the path is updated.
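The recipe above can be simulated in a single process, with a threading.Event standing in for the watch notification the server pushes. This is an illustrative analogy, not ZooKeeper's client library:

```python
import threading

class WatchLock:
    """Sketch of the naive lock recipe: retry create, wait on a 'watch'."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._held = False
        self._watch = threading.Event()
        self._watch.set()

    def lock(self):
        while True:
            with self._mutex:
                if not self._held:      # create("/my-lock") succeeds
                    self._held = True
                    self._watch.clear()
                    return
            self._watch.wait()          # wait for the watch event, then retry

    def unlock(self):
        with self._mutex:
            self._held = False
            self._watch.set()           # "server" pushes the event to watchers
```

Note the recipe's weakness carries over: every waiter wakes on each unlock and races to re-acquire (the "herd effect"), which is why real lock recipes use sequential ephemeral znodes instead.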
A(asynchronous)-Linearizability
• Local order: all operations from the same client are processed FIFO (first-in-first-out)
• Global order: Linearizable writes
[Figure: Client A and Client B each send write Op 1, write Op 2, … to their servers; a consensus protocol ensures the write ops are applied exactly in their global order.]
A(asynchronous)-Linearizability
• Reads are not linearized (they may not see the latest state)
  • Reads are served directly from the connected server (replicas are identical)
  • Other servers may still have pending writes in their queues
  • Solution: sync after writes
[Figure: Client A sends write Op 1, write Op 2, sync to its server; Client B sends read Op 1, read Op 2, … to another server. After the sync, all clients will see the changes of the prior writes from Client A.]
Zab (Atomic Broadcast)
• All servers forward messages to a single leader
  • The leader plays an important role as a sequencer
  • The leader can change if partitioned or failed
• The leader broadcasts (proposes) the messages to be delivered to all followers
Zab (Atomic Broadcast)
• Two-phase commit (2PC), for a request such as write(x, 1):
  • Step 1. The leader assigns a monotonically increasing id (zxid) to the request.
  • Step 2. The leader proposes the change, with its zxid, to the followers.
  • Step 3. The followers acknowledge the proposal.
    (Q: In what situation may a follower reject the proposal?)
  • Step 4. The leader commits the change if it receives more than ½ of the acks (votes).
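The four steps can be sketched as follows. Followers are modeled as callables that ack or reject a proposal, and the quorum is simplified to a majority of the followers; in real ZooKeeper the quorum is counted over the whole ensemble, including the leader. All names are illustrative:

```python
class ZabLeader:
    """Sketch of one Zab broadcast round: assign zxid, propose, commit."""
    def __init__(self, followers):
        self.followers = followers   # callables: proposal -> True (ack) / False
        self.zxid = 0
        self.committed = []

    def broadcast(self, request):
        self.zxid += 1                          # Step 1: monotonically increasing id
        proposal = (self.zxid, request)
        acks = sum(1 for f in self.followers    # Step 2: propose to followers
                   if f(proposal))              # Step 3: followers ack or reject
        quorum = len(self.followers) // 2 + 1
        if acks >= quorum:                      # Step 4: commit on more than 1/2 acks
            self.committed.append(proposal)
            return True
        return False
```

With three followers, two acks suffice to commit; a proposal with only one ack is not committed.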
Leader Election
• If a server finds the leader disconnected or failed, it tries to become the leader (same 2PC protocol).
[Figure: a candidate server proposes itself to the followers after the leader fails.]

• The candidate becomes the new leader when it receives more than ½ of the votes.
• If followers receive multiple proposals, they vote for the candidate with the highest zxid.
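The voting rule can be sketched in a few lines. Breaking ties by server id is an assumption of this sketch (the slide does not specify a tie-breaker):

```python
def elect(candidates):
    """Pick the election winner among (server_id, last_zxid) pairs:
    highest zxid wins; ties broken by higher server id (an assumption)."""
    return max(candidates, key=lambda c: (c[1], c[0]))[0]
```

For example, among servers with zxids 40, 42, and 42, the vote goes to a server holding zxid 42, since it has seen the most recent committed history.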
References
• “ZooKeeper: Wait-free coordination for Internet-scale systems,” USENIX ATC ‘10 (by Hunt et al.)
• Zab protocol: “A simple totally ordered broadcast protocol”, LADIS ’08 (by Reed and Junqueira)
• “Linearizability: A Correctness Condition for Concurrent Objects”, TOPLAS 1990 (by Herlihy and Wing)
• “Designing Data-Intensive Applications”, O’Reilly 2017 (by Martin Kleppmann)