FAULT‐TOLERANCE IN LARGE DATA SYSTEMS (CONTRASTING DYNAMO’S AND BIGTABLE’S APPROACHES)
NYU Distributed Systems Class, Invited Lecture, Dec 2009
Key‐value Stores
• Bigtable and Dynamo (Cassandra) represent a class of systems that sits in between distributed file systems and database management systems
• They provide more functionality than a file system in that they export a more powerful API (data model) than reads, writes, and seeks
• This data model is rich enough to build applications directly on top of it, but has fewer querying capabilities than a full-fledged database with a query language
• They have been deployed in very large data management scenarios
Intro
The 10,000-foot view of a server: a server transforms requests against an abstract key-value model into file reads and writes. A system is a group of servers cooperating to support what looks like a unified instance of the data model.
[Figure: the server translates a client's Gets, Puts, and Scans against the key-value model into file Reads, Writes, Seeks, and Appends. In the data model, all rows are keyed (e.g., row k1), columns are namespaces (col-famA: c1, c2, …), and cells are time series.]
Intro
Dynamo uses consistent hashing to distribute rows. The logic of key location can be pushed to clients, or the client can ask a node to redirect requests.
[Figure: a hash ring with nodes A and B, each holding user data for its ranges plus system data. 1) Client c asks node B: give me the key 'd' (orange hash); 2) B fetches key 'd' from node A, which owns the range that 'd' hashes into.]
Intro
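To make the ring concrete, here is a minimal sketch of the lookup in Python (the Ring class, the node names, and the use of MD5 are illustrative assumptions, not Dynamo's actual code):

    import bisect
    import hashlib

    def ring_hash(key: str) -> int:
        # Map a key onto the ring [0, 2^128); MD5 is an assumption here.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        """Toy consistent-hash ring: a key belongs to the first server
        point found clockwise from the key's own hash."""

        def __init__(self, servers):
            self.points = sorted((ring_hash(s), s) for s in servers)

        def owner(self, key: str) -> str:
            i = bisect.bisect(self.points, (ring_hash(key),))
            return self.points[i % len(self.points)][1]

    ring = Ring(["A", "B"])
    print(ring.owner("d"))  # a client can compute this locally, or ask any node

Adding or removing a server only remaps the keys on the arcs adjacent to it, which is the point of consistent hashing.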
Bigtable uses key ranges. A lookup consists of traversing the metadata (partitions) table.
[Figure: client c resolves key 'd' of user table 'T' in three steps: 1) which partition would know about user table 'T' and key 'd'?; 2) where's key 'd'?; 3) what is 'd''s value? System-data (metadata) partitions map table T's partitions (P1 covering keys a..d up to k, and P2) to servers A and B; the Master tracks these assignments (e.g., T:P1 -> A, T:P2 -> B).]
Intro
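The metadata traversal is just a lookup in a sorted map of end keys. A hedged sketch (the RangeDirectory class and its entries are made up for illustration; Bigtable's real scheme additionally bootstraps via a root partition):

    import bisect

    class RangeDirectory:
        """Toy metadata table: sorted partition end-keys -> (partition, server)."""

        def __init__(self, entries):
            # entries: (end_key, partition, server), e.g. ("k", "P1", "A")
            self.entries = sorted(entries)
            self.end_keys = [e[0] for e in self.entries]

        def locate(self, key):
            # First partition whose end key is >= the lookup key.
            _, partition, server = self.entries[bisect.bisect_left(self.end_keys, key)]
            return partition, server

    meta = RangeDirectory([("k", "P1", "A"), ("z", "P2", "B")])
    print(meta.locate("d"))  # ("P1", "A"): step 2 of the figure; step 3 reads 'd' from A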
Designing large systems. You've seen a lot of useful techniques during this course.
• Why build such systems? In short, scalability and availability at the level these companies needed was not available off-the-shelf
• There is also the strategic argument of not depending on third-party software for core functionality
• How to go about the choice of algorithms in the design? Start with what is important for users: e.g., an "always writable" system for Amazon
• Leverage existing technology: e.g., the use of SSTables at Google
• That will tie down the relevant algorithms; in turn, the design space gets reduced
Intro
Fault-tolerance features. Dynamo and Bigtable made quite different choices; among them, the consistency and availability decisions may have been the most important ones.
[Figure: the four features compared in the rest of the talk: Replicate, Detect Faults, Failover, Resync.]
Intro
Dynamo allows partition replicas to be "far" apart…
…tolerating inconsistencies during failover…
…that have to be merged after the failure is circumvented.
Bigtable partition replicas must be "close"…
…so failover may mean short unavailability…
…but data is never out of sync. Off-site replication, though, needs a different mechanism.
Starting with copies of data: Dynamo first. Effectively, every server takes care of the W ranges before it on the ring; since these ranges overlap, there is data redundancy.
[Figure: the hash ring (0..1) with servers A, B, C and key 'k'. 1) Client c writes to 'k'; 2) the write is forwarded to the "preference list" (a sort of write quorum), in this case the two succeeding servers on the ring.]
Replicate
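Building on the Ring sketch from the consistent-hashing slide, the preference list is just the next n points clockwise from the key. The write below is a hedged sketch (send_put and the quorum loop are assumptions for illustration, not Dynamo's code):

    import bisect  # plus ring_hash and Ring from the earlier sketch

    def send_put(server, key, value):
        # Stand-in for an RPC to a storage node; always succeeds here.
        print(f"put {key}={value} -> {server}")
        return True

    def preference_list(ring, key, n=3):
        # The coordinator for 'key' plus the n-1 servers succeeding it.
        i = bisect.bisect(ring.points, (ring_hash(key),))
        return [ring.points[(i + j) % len(ring.points)][1] for j in range(n)]

    def put(ring, key, value, w=2):
        # The write succeeds once w members of the preference list ack it.
        acks = 0
        for server in preference_list(ring, key):
            acks += 1 if send_put(server, key, value) else 0
            if acks >= w:
                return True
        return False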
Bigtable delegates to GFS. The difference here is that a server stores its data in a networked file system.
[Figure: 1) client c writes to 'k' at a Bigtable server; 2) the write is done by GFS, which effectively makes N copies of it.]
Replicate
Data consistency. The write quorum can be relaxed in Dynamo; normally, not so in GFS. So a Dynamo client accepts reading inconsistent copies (reconciled with vector clocks), while a Bigtable client retries writes on failures (writes are made idempotent).
[Figure: left, the Dynamo ring (0..1, servers A, B, C) replicating key 'k'; right, a Bigtable server writing through GFS on behalf of client c.]
Replicate
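The tunable-consistency rule behind this: with N replicas, a read quorum R and a write quorum W are guaranteed to overlap only when R + W > N. A one-line sanity check (parameter names follow the Dynamo paper; the function itself is illustrative):

    def quorum_overlap(n: int, r: int, w: int) -> bool:
        # True iff every read set must intersect every write set.
        return r + w > n

    print(quorum_overlap(3, 2, 2))  # True: strict quorum, reads see the latest write
    print(quorum_overlap(3, 1, 1))  # False: relaxed quorum, may read stale copies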
In Dynamo, how is a faulty server detected? A gossip-based protocol (see the sketch after this list):
• Every t time units, each server reaches out to a random peer
• During this exchange, the pair trade information and update their view of the servers in the system
• Assuming N servers, in O(log N) rounds each server would know about every other
• It would also learn the most recent time that each server was successfully contacted
• The longer it takes for a server to be contacted, the larger the probability of it being down or unreachable
• A server is considered to be out of the ring, roughly, if enough peers haven't managed to contact it for a period of time
Fault Det.
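A minimal round of that exchange, as promised above (the data structures and the suspicion rule are simplified assumptions):

    import random
    import time

    class GossipNode:
        """Keeps the freshest 'last heard alive' timestamp per server and
        trades that view with one random peer every round."""

        def __init__(self, name, all_names):
            self.name = name
            self.last_heard = {n: 0.0 for n in all_names}

        def gossip_with(self, peer):
            self.last_heard[self.name] = time.time()
            peer.last_heard[peer.name] = time.time()
            # Merge: both sides keep the max timestamp known for each server.
            merged = {n: max(self.last_heard[n], peer.last_heard[n])
                      for n in self.last_heard}
            self.last_heard.update(merged)
            peer.last_heard.update(merged)

        def suspected_down(self, timeout):
            now = time.time()
            return [n for n, t in self.last_heard.items()
                    if n != self.name and now - t > timeout]

    nodes = [GossipNode(n, ["A", "B", "C"]) for n in ("A", "B", "C")]
    nodes[0].gossip_with(random.choice(nodes[1:]))  # one round; O(log N) rounds spread news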
Group membership can be done through Paxos. Chubby runs Paxos internally for its replicated state machine scheme; it exports abstractions such as locks and watchers over files.
[Figure: servers A and B, the Master, and Chubby. 1) B locks the "Mycell/B" file; 2) the Master sets a watcher on the "Mycell/" directory; 3) if B crashes, its lock is gone (if B loses touch with Chubby for too long, ditto); 4) the watcher on "Mycell/B" fires and the Master assumes B is out of the group, reassigning its partition P1 (keys a..d).]
Fault Det.
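The lock-plus-watcher pattern from the figure, sketched against a hypothetical in-process lock service (this is NOT Chubby's real API; the whole interface below is invented for illustration):

    class FakeLockService:
        """Hypothetical stand-in for a Chubby-like service: session-bound
        lock files plus callbacks fired when a lock disappears."""

        def __init__(self):
            self.locks = {}      # path -> owner
            self.watchers = []   # callbacks fired when a lock vanishes

        def lock(self, path, owner):
            self.locks.setdefault(path, owner)
            return self.locks[path] == owner

        def watch(self, callback):
            self.watchers.append(callback)

        def session_expired(self, owner):
            # Owner crashed, or lost touch for too long: its locks vanish.
            for path in [p for p, o in self.locks.items() if o == owner]:
                del self.locks[path]
                for cb in self.watchers:
                    cb(path)

    svc = FakeLockService()
    svc.lock("Mycell/B", owner="B")                          # 1) B announces itself
    svc.watch(lambda p: print(f"{p} gone; Master drops B"))  # 2) Master watches
    svc.session_expired("B")                                 # 3)+4) B is out of the group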
Dynamo allows writing through a failover. If one wishes to set a low write quorum, one can.
[Figure: the ring (0..1) with servers A, B, C during a failure. 1) Client c's write of 'k -> v1' now only goes to A, which is reachable; 2) a different client writing 'k -> v2' through B would not reach A; 3) during the failure, the replicas will be inconsistent: A holds v1 while B and C hold v2.]
Failover
Bigtable reassigns partitions to other servers. Is it really just reopening the files that make up P1?
[Figure: the Master tells server B to load partition P1 (keys a..d) from GFS after its previous server fails.]
Failover
Vector clocks help resyncing after a failover.
Resync
[Figure: replicas A, B, C all start at v0; during a failure they diverge to v1 / v2 / v2; after resync all converge to v3.]
• A writes v0 at (A,t1): v0 carries clock (A,t1)
• A writes v1 at (A,t2): v1 carries clock (A,t1 / A,t2)
• B writes v2 at (B,t1): v2 carries clock (A,t1 / B,t1)
• A reconciles and writes v3: v3 carries clock (A,t1 / A,t2 / B,t1)
If the clocks show conflicts, then it is up to the application to resolve them. "Hinted handoff" also does resync work.
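A toy vector clock that reproduces the example above (a sketch; Dynamo attaches one such clock to each object version):

    class VectorClock:
        """One counter per server. A clock dominates another iff it is >=
        in every entry; incomparable clocks mean a conflict for the app."""

        def __init__(self, counts=None):
            self.counts = dict(counts or {})

        def tick(self, server):
            self.counts[server] = self.counts.get(server, 0) + 1
            return self

        def descends_from(self, other):
            return all(self.counts.get(s, 0) >= c for s, c in other.counts.items())

        def conflicts_with(self, other):
            return not self.descends_from(other) and not other.descends_from(self)

    v0 = VectorClock().tick("A")           # (A:1)
    v1 = VectorClock(v0.counts).tick("A")  # (A:2), descends from v0
    v2 = VectorClock(v0.counts).tick("B")  # (A:1, B:1), also descends from v0
    print(v1.conflicts_with(v2))           # True: divergent writes during the failure
    v3 = VectorClock({"A": 2, "B": 1})     # the reconciled write subsumes both
    print(v3.descends_from(v1), v3.descends_from(v2))  # True True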
No resync in Bigtable, but log sorting for quick failover.
• A "mutation" gets written to a commit log first
• There is one log PER SERVER; that is, writes to several partitions are interleaved in this log
• To load a partition, the new server has to access the log records for that partition…
• Why? Some changes may not yet be reflected in the data files
• This has to be done in a failover situation, that is, as quickly as possible
• Solution: Bigtable is a distributed sorting system as well. Log partitions are sorted to speed up moving "tablets" (see the sketch below)
Resync
It could be seen as a “pre‐sync” cost.
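The log-sorting trick in miniature: a per-server commit log interleaves mutations for many partitions, so sorting it by (partition, sequence number) lets a recovering server replay one contiguous run instead of scanning the whole log (the keys and records below are made up):

    # Interleaved per-server commit log: (partition, seqno, mutation)
    log = [
        ("P2", 1, "put x=1"),
        ("P1", 2, "put a=5"),
        ("P2", 3, "del y"),
        ("P1", 4, "put d=7"),
    ]

    # Sort once (Bigtable parallelizes this), then replaying P1 is a single scan.
    log.sort(key=lambda rec: (rec[0], rec[1]))
    p1_mutations = [m for part, _, m in log if part == "P1"]
    print(p1_mutations)  # ['put a=5', 'put d=7']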
Comparing the approaches
Wrap‐up
Bigtable:
• Relies on GFS for data copies
• Centralized group membership and fault detection
• Failover means having another server pick up the work
• No resync necessary (but sorts the log to move partitions fast)
Dynamo:
• Manages data copies itself
• Local fault detection
• Failover means the client will write to another server (as long as a minimum quorum is met)
• Pays not to move data (resync with vector clocks)
References
• Bigtable: A Distributed Storage System for Structured Data, Chang et al., OSDI '06
• The Chubby Lock Service for Loosely-Coupled Distributed Systems, Burrows, OSDI '06
• Paxos Made Simple, Lamport
• Dynamo: Amazon's Highly Available Key-value Store, DeCandia et al., SOSP '07
• Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the WWW, Karger et al., STOC '97
• Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications, Stoica et al., SIGCOMM '01
Wrap‐up