Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore
Jun Rao, Eugene J. Shekita, Sandeep Tata
IBM Almaden Research Center
PVLDB, Jan. 2011, Vol. 4, No. 4
2011-03-25
Presented by Yongjin Kwon
Copyright 2011 by CEBT
Outline
Introduction
Spinnaker
Data Model and API
Architecture
Replication Protocol
Leader Election
Recovery
Follower Recovery
Leader Takeover
Experiments
Conclusion
Introduction
Cloud computing applications have aggressive requirements.
Scalability
High and continuous availability
Fault Tolerance
The CAP Theorem [Brewer 2000] states that of Consistency, Availability, and Partition tolerance, a system can provide only two out of three.
Recent distributed systems such as Dynamo or Cassandra provide high availability and partition tolerance by sacrificing consistency.
They guarantee only eventual consistency.
Replicas may temporarily diverge into different versions.
Introduction (Cont’d)
Most applications will desire stronger consistency guarantees.
e.g. within a single datacenter, where network partitions are rare
How to preserve consistency?
Two-Phase Commit
– Blocks when the coordinator fails
Three-Phase Commit [Skeen 1981]
– Seldom used because of poor performance
Paxos Algorithm
– Generally perceived as too complex and slow
Introduction (Cont’d)
Timeline Consistency [Cooper 2008]
Stops short of full serializability.
All replicas of a record apply all updates in the same order.
At any moment, a replica holds some (possibly stale) version from the record's timeline.
With some modifications to Paxos, it is possible to provide high availability while ensuring at least timeline consistency, with a very small loss of performance.
[Diagram: a record's timeline of operations: Insert → Update → Update → Update → Delete]
Spinnaker
Experimental datastore
Designed to run on a large cluster of commodity servers in a single datacenter
Key-based range partitioning
3-way replication
Strong or timeline consistency
Paxos-based protocol for replication
Example of a CA system
Data Model and API
Data Model
Similar to Bigtable and Cassandra
Data is organized into rows and tables.
– Each row in a table can be uniquely identified by its key.
A row may contain any number of columns with corresponding values and version numbers.
API
get(key, colname, consistent)
put(key, colname, colvalue)
delete(key, colname)
conditionPut(key, colname, value, version)
conditionDelete(key, colname, version)
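The five calls above can be exercised against a toy in-memory mock (a sketch with invented semantics: versions start at 1 and the conditional calls implement optimistic concurrency; the real system distributes and replicates these operations):

```python
# Minimal in-memory mock of the five API calls listed above. The 'consistent'
# flag is a no-op here; in Spinnaker it controls read routing.
class MockStore:
    def __init__(self):
        self.rows = {}  # key -> {colname: (colvalue, version)}

    def get(self, key, colname, consistent=True):
        # Returns (value, version) or None if absent.
        return self.rows.get(key, {}).get(colname)

    def put(self, key, colname, colvalue):
        cols = self.rows.setdefault(key, {})
        _, old_ver = cols.get(colname, (None, 0))
        cols[colname] = (colvalue, old_ver + 1)

    def delete(self, key, colname):
        self.rows.get(key, {}).pop(colname, None)

    def conditionPut(self, key, colname, value, version):
        # Succeeds only if the caller saw the current version.
        cur = self.get(key, colname)
        if cur is None or cur[1] != version:
            return False
        self.rows[key][colname] = (value, version + 1)
        return True

    def conditionDelete(self, key, colname, version):
        cur = self.get(key, colname)
        if cur is None or cur[1] != version:
            return False
        self.delete(key, colname)
        return True
```

For example, a `conditionPut` with a stale version number fails, which is how lost updates are avoided without locks.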
Architecture
System Architecture
Data (or rows in a table) are distributed across a cluster using (key-)range partitioning.
Each group of nodes in a key range is called a cohort.
– Cohort for [0, 199] : { A, B, C }
– Cohort for [200, 399] : { B, C, D }
Architecture (Cont’d)
Node Architecture
All the components are thread safe.
For logging
– A shared write-ahead log is used for performance.
– Each log record is uniquely identified by an LSN (log sequence number).
– Each cohort on a node uses its own logical LSNs.
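The recovery slides later write LSNs such as 1.20 and 2.30. Reading these as (epoch, sequence) pairs is an assumption made here, but it makes per-cohort logical LSNs easy to model:

```python
# LSNs in the later examples look like "epoch.sequence" (e.g. 1.20, 2.30).
# Modeling them as tuples (an assumption) gives the right ordering for free,
# since Python tuples compare lexicographically.
from itertools import count

def parse_lsn(s):
    epoch, seq = s.split(".")
    return (int(epoch), int(seq))

class CohortLog:
    """Each cohort on a node allocates its own logical LSNs."""
    def __init__(self, epoch=1):
        self.epoch = epoch
        self.seq = count(1)

    def next_lsn(self):
        return (self.epoch, next(self.seq))
```

Under this reading, any LSN from a newer epoch dominates every LSN from an older one, e.g. `parse_lsn("2.5") > parse_lsn("1.99")`.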
– Logging and Local Recovery
– Commit Queue, Memtables, SSTables
– Replication and Remote Recovery
– Failure Detection, Group Membership, and Leader Selection
Replication Protocol
Each cohort consists of an elected leader and two followers.
Spinnaker’s Replication Protocol
Modification of the basic Multi-Paxos protocol
– Shared write-ahead log; no log entries are missed
– Reliable in-order messages based on TCP sockets
– Distributed coordination service for leader election (Zookeeper)
Two Phases of Replication Protocol
Leader Election Phase
– A leader is chosen among the nodes in a cohort.
Quorum Phase
– The leader proposes a write.
– The followers accept it.
Replication Protocol (Cont’d)
Quorum Phase
Client submits a write W.
The leader, in parallel,
– appends a log record for W, and forces it to disk,
– appends W to its commit queue, and
– sends a propose message for W to its followers.
[Diagram: the client's write W reaches the cohort's leader, which sends "propose W" to both followers]
Replication Protocol (Cont’d)
Quorum Phase
After receiving the propose message, each follower
– appends a log record for W and forces it to disk,
– appends W to its commit queue, and
– sends an ACK to the leader.
[Diagram: both followers send ACKs back to the leader]
Replication Protocol (Cont’d)
Quorum Phase
After the leader gets an ACK from “at least one” follower, the leader
– applies W to its memtable, effectively committing W, and
– sends a response to the client.
There is no separate commit record that needs to be logged.
[Diagram: W is committed at the leader, which responds to the client]
Replication Protocol (Cont’d)
Quorum Phase
Periodically, the leader sends an asynchronous commit message to the followers with a certain LSN, asking them to apply all pending writes up to that LSN to their memtables.
For recovery, the leader and followers save this LSN, referred to as the last committed LSN.
[Diagram: the leader sends "commit LSN" to both followers]
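The whole quorum phase can be simulated in a single process (a sketch: message passing, disk forcing, and failure handling are elided, and all names here are invented):

```python
# Single-process sketch of the quorum phase. Each node keeps a write-ahead
# log, a commit queue, and a memtable, mirroring the slides above.
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []            # append-only write-ahead log: (lsn, write)
        self.commit_queue = []   # writes logged but not yet applied
        self.memtable = {}       # committed key -> value
        self.last_committed = (0, 0)

    def append(self, lsn, write):
        self.log.append((lsn, write))          # "force to disk" is elided
        self.commit_queue.append((lsn, write))

    def apply_up_to(self, lsn):
        # The periodic commit message: apply all pending writes <= lsn.
        remaining = []
        for entry_lsn, (key, value) in self.commit_queue:
            if entry_lsn <= lsn:
                self.memtable[key] = value
            else:
                remaining.append((entry_lsn, (key, value)))
        self.commit_queue = remaining
        self.last_committed = max(self.last_committed, lsn)

def replicate(leader, followers, lsn, write):
    # Leader, in parallel: log W, queue W, and propose W to both followers.
    leader.append(lsn, write)
    acks = 0
    for f in followers:
        f.append(lsn, write)   # follower logs W, queues W, and ACKs
        acks += 1
    if acks >= 1:              # leader + one follower = majority of 3 nodes
        leader.apply_up_to(lsn)  # commit at the leader; respond to the client
        return True
    return False
```

Note that the followers' memtables stay behind the leader's until the asynchronous commit message (`apply_up_to`) arrives, which is exactly why timeline reads at a follower can be stale.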
Replication Protocol (Cont’d)
For strong consistency,
Reads are always routed to the cohort’s leader.
Reads are guaranteed to see the latest value.
For timeline consistency,
Reads can be routed to any node in the cohort.
Reads may see a stale value.
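The routing rule above is a one-liner (a sketch; the leader-first ordering and node names are invented for illustration):

```python
# Read routing for the two consistency levels: strong reads always go to
# the cohort's leader; timeline reads may go to any replica, so they can
# return a stale value from the record's timeline.
import random

def route_read(cohort, consistent, rng=random):
    # In this sketch, cohort[0] is the leader by convention.
    return cohort[0] if consistent else rng.choice(cohort)

cohort = ["leader", "follower1", "follower2"]
assert route_read(cohort, consistent=True) == "leader"
assert route_read(cohort, consistent=False) in cohort
```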
Leader Election
The leader election protocol has to guarantee that
a majority of the cohort (i.e., at least two nodes) participates, and
the new leader is chosen in a way that no committed writes are lost.
With the aid of Zookeeper, this task can be simplified.
Each node includes a Zookeeper client.
Zookeeper [Hunt 2010]
Fault tolerant, distributed coordination service
It is only used to exchange messages between nodes.
Ref : http://hadoop.apache.org/zookeeper/
Leader Election (Cont’d)
Zookeeper’s Data Model
Resembles a directory tree in a file system.
Each node, called a znode, is identified by its path from the root.
– e.g. /a/b/c
A znode can include a sequential attribute.
Persistent znode vs. Ephemeral znode
[Diagram: a tree of znodes, e.g. root a with child b, whose children include c]
Leader Election (Cont’d)
Note that information needed for leader election is stored in Zookeeper under “/r”.
Leader Election Phase
One of the cohort’s nodes cleans up any state under /r.
Each node of the cohort adds a sequential ephemeral znode to /r/candidates with value “last LSN.”
After a majority appears under /r/candidates, the new leader is chosen as the candidate with the max “last LSN.”
The leader adds an ephemeral znode under /r/leader with value “its hostname,” and executes leader takeover.
The followers learn about the new leader by reading /r/leader.
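The decision rule in these steps can be sketched without a real ZooKeeper (an in-memory stand-in; the znode mechanics are elided and the candidate map is invented):

```python
# Stand-in for the election decision: candidates register their last LSN
# (here a dict playing the role of /r/candidates), a majority of the
# 3-node cohort must appear, and the max last LSN wins.
def elect_leader(candidates):
    """candidates: {hostname: last_lsn}. Returns the winner or None if a
    majority (>= 2 of 3) has not yet appeared."""
    if len(candidates) < 2:
        return None
    return max(candidates, key=lambda host: candidates[host])
```

Choosing the max last LSN is what makes the safety argument on the next slide go through: the winner's log contains every committed write.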
Leader Election (Cont’d)
Verification of the guarantee that no committed writes are lost
A committed write is forced to the logs of at least 2 nodes.
At least 2 nodes have to participate in leader election.
Hence, at least one of the nodes participating in leader election will have the last committed write in its log.
Choosing the node with max “last LSN” ensures that the new leader will have this committed write in its log.
If a committed write is still pending on the other nodes, leader takeover will make sure that it is re-proposed.
Recovery
When a cohort’s leader or followers fail, recovery must be performed, using log records, after they come back up.
Two Recovery Processes
Follower Recovery
– When a follower (or even the leader) fails, how can the node be recovered after it comes back up?
Leader Takeover
– When a leader has failed, what should the new leader perform after leader election?
Follower Recovery
Follower recovery is executed whenever a node comes back up after a failure.
Two Phases of Follower Recovery
Local Recovery Phase
– Re-apply log records from its most recent checkpoint through its last committed LSN.
– If the follower has lost all its data due to a disk failure, then it moves to the catch up phase immediately.
Catch Up Phase
– Send its last committed LSN to the leader.
– The leader responds by sending all committed writes after the follower’s last committed LSN.
[Diagram: log timeline — local recovery replays from the checkpoint to the last committed LSN; catch up covers records from there to the last LSN]
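Both phases amount to range filters over a log ordered by LSN (a sketch; all LSNs and record contents below are invented):

```python
# Sketch of the two follower-recovery phases over a log of
# (lsn, key, value) records, sorted by LSN.
def local_recovery(log, checkpoint_lsn, last_committed_lsn):
    # Phase 1: re-apply records between the checkpoint and the
    # follower's own last committed LSN.
    return [(k, v) for lsn, k, v in log
            if checkpoint_lsn < lsn <= last_committed_lsn]

def catch_up(leader_committed_log, follower_last_committed):
    # Phase 2: the leader ships every committed write after the
    # follower's last committed LSN.
    return [(lsn, k, v) for lsn, k, v in leader_committed_log
            if lsn > follower_last_committed]
```

A follower that lost its disk simply calls `catch_up` with LSN (0, 0), i.e. it skips phase 1 and streams everything from the leader.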
Follower Recovery (Cont’d)
If a leader went down and a new leader was elected, the follower’s log may contain records after the last committed LSN that the new leader never committed.
These discarded log records must never be re-applied by future recovery.
Logical Truncation of the follower’s log
The LSNs of the discarded log records are stored in a skipped LSN list.
Before processing a log record during recovery, the skipped LSN list is checked to see whether the record should be discarded.
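A minimal sketch of the skipped-LSN check (the list layout and example LSNs are assumptions):

```python
# Logical truncation: instead of physically rewriting the shared log,
# record the LSNs that must never be re-applied and filter them out
# during recovery.
def recoverable_records(log, skipped_lsns):
    skipped = set(skipped_lsns)
    return [(lsn, w) for lsn, w in log if lsn not in skipped]

# e.g. the old leader logged LSN (1, 21) but a new leader was elected
# before it committed, so (1, 21) goes on the skipped list:
log = [((1, 20), "w1"), ((1, 21), "w2")]
assert recoverable_records(log, [(1, 21)]) == [((1, 20), "w1")]
```

Logical truncation is needed precisely because the log is shared by all cohorts on a node: physically truncating it would discard other cohorts' records too.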
Leader Takeover
When a leader fails, the corresponding cohort becomes unavailable for writes.
Execute the leader election to choose a new leader!
After a new leader is elected, leader takeover occurs.
Leader Takeover
Catch up each follower to the new leader’s last committed LSN.
– A follower that is already up to date may skip this step.
Re-propose the writes between the leader’s last committed LSN and its last LSN, and commit them using the normal replication protocol.
[Diagram: log timeline — catch up spans from the follower’s last committed LSN to the leader’s last committed LSN; re-proposal spans from there to the leader’s last LSN]
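Both takeover steps can be expressed as ranges over the new leader's log (a sketch with invented LSNs):

```python
# Sketch of leader takeover: catch a follower up to the new leader's last
# committed LSN, then re-propose everything between that LSN and the
# leader's last LSN via the normal quorum protocol.
def takeover(leader_log, leader_last_committed, follower_last_committed):
    catch_up = [e for e in leader_log
                if follower_last_committed < e[0] <= leader_last_committed]
    re_propose = [e for e in leader_log if e[0] > leader_last_committed]
    return catch_up, re_propose

log = [((1, 19), "a"), ((1, 20), "b"), ((1, 21), "c")]
catch, redo = takeover(log, (1, 20), (1, 19))
assert catch == [((1, 20), "b")]   # follower catches up through 1.20
assert redo == [((1, 21), "c")]    # 1.21 is re-proposed and committed normally
```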
Recovery (Cont’d)
Follower Recovery
Follower goes down while the others are still alive.
The cohort accepts new writes.
When the follower comes back up, it is recovered.
[Diagram: a follower fails while the leader (cmt 1.20, lst 1.21) and the other follower keep accepting writes; after the failed follower comes back up and catches up, the cohort converges to cmt/lst 1.25]
Recovery (Cont’d)
Leader Takeover
Leader goes down while the others are still alive.
The new leader is elected, and leader takeover is executed.
The cohort accepts new writes.
When the old leader comes back up, it is recovered.
[Diagram: the leader (cmt 1.20, lst 1.21) fails; of the followers (lst 1.19 and lst 1.20), the one with the max last LSN becomes the new leader and starts epoch 2; the cohort accepts new writes and reaches cmt/lst 2.30; when the old leader comes back up as a follower, it logically truncates its uncommitted record at LSN 1.21 and catches up to 2.30]
Experiments
Experimental Setup
Two clusters of 10 nodes each (one for the datastore, the other for clients), where each node consists of
– Two quad-core 2.1 GHz AMD processors
– 16GB memory
– 5 SATA disks, with 1 disk for logging (without write-back cache)
– Rack-level 1Gbit Ethernet switch
Cassandra trunk as of October 2009
Zookeeper version 3.20
Experiments (Cont’d)
In these experiments, Spinnaker was compared with Cassandra.
Common things
– Implementation of SSTables, memtables, log manager
– 3-way replication
Different things
– Replication protocol, recovery algorithms, commit queue
Cassandra’s weak/quorum reads
– Weak read accesses just 1 replica.
– Quorum read accesses 2 replicas to check for conflicts.
Cassandra’s weak/quorum writes
– Both are sent to all 3 replicas.
– Weak write waits for an ACK from just 1 replica.
– Quorum write waits for ACKs from any 2 replicas.
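The weak/quorum distinction above is the standard quorum-overlap count (a sketch of the arithmetic):

```python
# With N = 3 replicas, a quorum write (W = 2 ACKs) and a quorum read
# (R = 2 replicas contacted) must share at least one replica, because
# W + R > N. That shared replica is how the read sees the latest write.
N, W, R = 3, 2, 2
assert W + R > N          # quorum read/write quorums always intersect

# Weak mode gives up this guarantee: W = 1 and R = 1 means
# W + R = 2 <= N, so a weak read may land on a replica that has not
# yet received the latest weak write.
assert 1 + 1 <= N
```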
Experiments (Cont’d)
[Figures: experimental results]
Conclusion
Spinnaker
Paxos-based replication protocol
Scalable, consistent, and highly available datastore
Future Work
Support for multi-operation transactions
Load balancing
Detailed comparison to other datastores
References
[Brewer 2000] E. A. Brewer, “Towards Robust Distributed Systems,” In PODC, pp. 7-7, 2000.
[Cooper 2008] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, R. Yerneni, “PNUTS: Yahoo!’s Hosted Data Serving Platform,” In PVLDB, 1(2), pp. 1277-1288, 2008.
[Hunt 2010] P. Hunt, M. Konar, F. P. Junqueira, B. Reed, “ZooKeeper: Wait-Free Coordination for Internet-scale Systems,” In USENIX ATC, 2010.
[Skeen 1981] D. Skeen, “Nonblocking Commit Protocols,” In SIGMOD, pp. 133-142, 1981.