Case Studies: Chubby and BigTable
Yao Liu
The Chubby lock service for loosely-coupled distributed systems
Mike Burrows (Google), OSDI 2006
Slides ack to: Shimin Chen: http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/slides/shimin-chubby.ppt
Introduction • What is Chubby?
• Lock service in a loosely-coupled distributed system (e.g., 10K 4-processor machines connected by 1Gbps Ethernet)
• Client interface similar to whole-file advisory locks with notification of various events (e.g., file modifications)
• Primary goals: reliability, availability, easy-to-understand semantics
• How is it used? • Used within Google by GFS, Bigtable, etc. • To elect leaders and to store small amounts of metadata, serving as the root
of distributed data structures
CS 457/557 Introduction to Distributed Systems 3
System architecture
• A chubby cell consists of a small set of servers (replicas) • A master is elected from the replicas via a consensus protocol
• Master lease: several seconds • If a master fails, a new one will be elected when the master leases expire
• Client talks to the master via chubby library • All replicas are listed in DNS; clients discover the master by talking to any
replica
System architecture (2)
• Replicas maintain copies of a simple database • Clients send read/write requests only to the master • For a write:
• The master propagates it to replicas via the consensus protocol • Replies after the write reaches a majority of replicas
• For a read: • The master satisfies the read alone
System architecture (3)
• If a replica fails and does not recover for a long time (a few hours) • A fresh machine is selected to be a new replica, replacing the failed one • It updates the DNS • Obtains a recent copy of the database • The current master polls DNS periodically to discover new replicas
Master election • At any point in time, there must be at most one master • No two nodes may believe they are master at the same time
• Example: • Suppose A is master and it gets disconnected from B • B times out trying to talk to A, concludes that A is dead, and
proposes itself as the new master • If other nodes agree and A doesn’t hear about the new
master, then A will continue to act as master for a while and may, for example, serve read requests with stale data
Master election in Chubby • When a master dies, a node proposes a master change through Paxos
• Nodes that receive the proposal accept it only if the old master’s lease has expired
• A node becomes the master once a majority of nodes have accepted its proposal
• Once a node becomes master, it knows that it will remain so for at least the lease period • It can extend the lease by getting acceptance again from a
majority of the nodes
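The lease rule above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: it is not Paxos, the names `Replica` and `propose` are hypothetical, time is a plain number passed in by the caller, and a replica records its acceptance even when the candidate fails to reach a majority.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Replica:
    """One voter; remembers which master it last accepted and until when."""
    name: str
    master: Optional[str] = None
    lease_expiry: float = 0.0

    def accept(self, candidate: str, now: float, lease: float) -> bool:
        # Refuse if a different node still holds an unexpired lease here.
        if self.master not in (None, candidate) and now < self.lease_expiry:
            return False
        # Otherwise grant (or extend) the lease for this candidate.
        self.master = candidate
        self.lease_expiry = now + lease
        return True


def propose(candidate, replicas, now, lease=5.0):
    """Return True if a strict majority of replicas accept `candidate`."""
    votes = sum(r.accept(candidate, now, lease) for r in replicas)
    return votes > len(replicas) // 2


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
a_elected = propose("A", replicas, now=0.0)    # no prior lease: accepted
b_too_early = propose("B", replicas, now=2.0)  # A's lease still valid
a_extended = propose("A", replicas, now=4.0)   # master renews its own lease
b_after_expiry = propose("B", replicas, now=9.0)  # A's lease ran out at 9.0
```

Because replicas refuse a new master until the old lease expires, the old master can safely serve reads for the remainder of its lease even if it is partitioned from the others.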
Chubby interface: UNIX-like file system interface
• Chubby supports a strict tree of files and directories • No symbolic links, no hard links • /ls/foo/wombat/pouch
• 1st component (ls): lock service (common to all names) • 2nd component (foo): the chubby cell (used in DNS lookup to find the
cell master) • The rest: name inside the cell
• Can be accessed via Chubby’s specialized API / other file system interface (e.g., GFS)
• Support most normal operations (create, delete, open, write, …)
• Support advisory reader/writer lock on a node
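To make the naming scheme concrete, the following sketch splits a Chubby name into its cell and the path inside the cell. The function name `parse_chubby_name` is hypothetical and is not part of Chubby's API; it only mirrors the three-part structure described above.

```python
def parse_chubby_name(name):
    """Split '/ls/<cell>/<path...>' into (cell, path_inside_cell)."""
    parts = name.split("/")  # a leading "/" yields an empty first element
    if len(parts) < 3 or parts[0] != "" or parts[1] != "ls" or not parts[2]:
        raise ValueError(f"not a Chubby name: {name!r}")
    cell = parts[2]                       # used in the DNS lookup for the cell
    path = "/" + "/".join(parts[3:])      # name inside the cell
    return cell, path


cell, path = parse_chubby_name("/ls/foo/wombat/pouch")
```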
Chubby events • Clients can subscribe to events (up-calls from Chubby library) • File contents modified: if the file contains the location of
a service, this event can be used to monitor the service location
• Master failed over • Child node added, removed, modified • Handle becomes invalid: probably communication
problem • Lock acquired (rarely used) • Locks are conflicting (rarely used)
APIs • Open()
• Mode: read/write/change ACL; Events; Lock-delay • Create new file or directory?
• Close() • GetContentsAndStat(), GetStat(), ReadDir() • SetContents(): set all contents; SetACL() • Delete() • Locks: Acquire(), TryAcquire(), Release() • Sequencers: GetSequencer(), SetSequencer(),
CheckSequencer()
Example: Primary Election
Open("/ls/foo/OurServicePrimary", "write mode");
if (successful) {
  // primary
  SetContents("identity");
} else {
  // replica
  Open("read mode", "file-modification event");
  // when notified of file modification:
  primary = GetContentsAndStat();
}
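The election pattern can be tried out against a toy, in-memory stand-in for a Chubby cell. Everything here is an assumption made for illustration: `ToyCell` and `run_election` are hypothetical names, the exclusive lock stands in for the successful write-mode open, and there are no sessions, failures, or networking.

```python
class ToyCell:
    """In-memory stand-in for a Chubby cell (illustration only)."""

    def __init__(self):
        self.lock_holder = {}  # path -> client holding the exclusive lock
        self.contents = {}     # path -> file contents
        self.watchers = {}     # path -> callbacks for modification events

    def try_acquire(self, path, client):
        if self.lock_holder.get(path) is None:
            self.lock_holder[path] = client
            return True
        return False

    def set_contents(self, path, data):
        self.contents[path] = data
        for callback in self.watchers.get(path, []):
            callback(data)  # deliver the file-modification event

    def subscribe(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)


def run_election(cell, path, identity, on_primary_change):
    """Become primary if the lock is free, else watch the file as a replica."""
    if cell.try_acquire(path, identity):
        cell.set_contents(path, identity)  # advertise who the primary is
        return "primary"
    cell.subscribe(path, on_primary_change)
    on_primary_change(cell.contents.get(path))  # read the current primary
    return "replica"


cell = ToyCell()
seen_by_b = []
role_a = run_election(cell, "/ls/foo/OurServicePrimary", "server-a", lambda p: None)
role_b = run_election(cell, "/ls/foo/OurServicePrimary", "server-b", seen_by_b.append)
```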
Caching • Sequential consistency: easy to understand
• Lease based • The master invalidates cached copies upon a write
request • Write-through caches
Sessions, keep-alives, master fail-overs • Session:
• A client sends keep-alive requests to the master • The master responds with a keep-alive response • Immediately after getting the keep-alive response, the client
sends another request for extension • The master blocks each keep-alive until close to the expiration
of the session • The default extension is 12s
• Clients maintain a local timer to estimate the session timeout (clocks are not perfectly synchronized)
• If the local timer runs out, the client waits for a 45s grace period before ending the session • This happens when a master fails over
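The client-side timers can be sketched as a three-state machine. This is a simplified model, not the Chubby library: `ClientSession` and `SessionState` are hypothetical names, time is passed in explicitly, and measuring the new lease from the moment the request was sent is one conservative choice assumed here.

```python
from enum import Enum


class SessionState(Enum):
    SAFE = "safe"          # local lease still valid
    JEOPARDY = "jeopardy"  # lease ran out, grace period in progress
    EXPIRED = "expired"    # grace period over, session abandoned


class ClientSession:
    def __init__(self, now, lease=12.0, grace=45.0):
        self.lease = lease
        self.grace = grace
        self.lease_expiry = now + lease

    def state(self, now):
        if now < self.lease_expiry:
            return SessionState.SAFE
        if now < self.lease_expiry + self.grace:
            return SessionState.JEOPARDY
        return SessionState.EXPIRED

    def on_keepalive_reply(self, sent_at, now):
        """Extend the local lease; an expired session cannot be revived."""
        if self.state(now) is SessionState.EXPIRED:
            return False
        # Conservative estimate: count the lease from when the request left.
        self.lease_expiry = sent_at + self.lease
        return True


session = ClientSession(now=0.0)
during_lease = session.state(5.0)
after_lease = session.state(12.0)
rescued = session.on_keepalive_reply(sent_at=30.0, now=31.0)
after_rescue = session.state(40.0)
```

With the defaults, a master fail-over that completes within 12s + 45s of the last extension is visible to the application only as a delay.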
Master fail-over
[Figure 2: The role of the grace period in master fail-over. Timeline showing the old master (leases M1, M2), an interval with no master, the new master (lease M3), and the client (leases C1, C2, C3). Numbered KeepAlive messages 1-8 pass between the client and the masters; the client raises a jeopardy event when its grace period begins and a safe event once its lease is extended by the new master.]
... before the end of the client’s grace period, the client enables its cache once more. Otherwise, the client assumes that the session has expired. This is done so that Chubby API calls do not block indefinitely when a Chubby cell becomes inaccessible; calls return with an error if the grace period ends before communication is re-established.

The Chubby library can inform the application when the grace period begins via a jeopardy event. When the session is known to have survived the communications problem, a safe event tells the client to proceed; if the session times out instead, an expired event is sent. This information allows the application to quiesce itself when it is unsure of the status of its session, and to recover without restarting if the problem proves to be transient. This can be important in avoiding outages in services with large startup overhead.

If a client holds a handle H on a node and any operation on H fails because the associated session has expired, all subsequent operations on H (except Close() and Poison()) will fail in the same way. Clients can use this to guarantee that network and server outages cause only a suffix of a sequence of operations to be lost, rather than an arbitrary subsequence, thus allowing complex changes to be marked as committed with a final write.

2.9 Fail-overs

When a master fails or otherwise loses mastership, it discards its in-memory state about sessions, handles, and locks. The authoritative timer for session leases runs at the master, so until a new master is elected the session lease timer is stopped; this is legal because it is equivalent to extending the client’s lease. If a master election occurs quickly, clients can contact the new master before their local (approximate) lease timers expire. If the election takes a long time, clients flush their caches and wait for the grace period while trying to find the new master. Thus the grace period allows sessions to be maintained across fail-overs that exceed the normal lease timeout.

Figure 2 shows the sequence of events in a lengthy master fail-over event in which the client must use its grace period to preserve its session. Time increases from left to right, but times are not to scale. Client session leases are shown as thick arrows, as viewed by both the old and new masters (M1-3, above) and the client (C1-3, below). Upward angled arrows indicate KeepAlive requests, and downward angled arrows their replies. The original master has session lease M1 for the client, while the client has a conservative approximation C1. The master commits to lease M2 before informing the client via KeepAlive reply 2; the client is able to extend its view of the lease C2. The master dies before replying to the next KeepAlive, and some time elapses before another master is elected. Eventually the client’s approximation of its lease (C2) expires. The client then flushes its cache and starts a timer for the grace period.

During this period, the client cannot be sure whether its lease has expired at the master. It does not tear down its session, but it blocks all application calls on its API to prevent the application from observing inconsistent data. At the start of the grace period, the Chubby library sends a jeopardy event to the application to allow it to quiesce itself until it can be sure of the status of its session.

Eventually a new master election succeeds. The master initially uses a conservative approximation M3 of the session lease that its predecessor may have had for the client. The first KeepAlive request (4) from the client to the new master is rejected because it has the wrong master epoch number (described in detail below). The retried request (6) succeeds but typically does not extend the master lease further because M3 was conservative. However the reply (7) allows the client to extend its lease (C3) once more, and optionally inform the application that its session is no longer in jeopardy. Because the grace period was long enough to cover the interval between the end of lease C2 and the beginning of lease C3, the client saw nothing but a delay. Had the grace period been less than that interval, the client would have abandoned the session and reported the failure to the application.

Once a client has contacted the new master, the client library and master co-operate to provide the illusion to the application that no failure has occurred. To achieve this, the new master must reconstruct a conservative approximation of the in-memory state that the previous master had. It does this partly by reading data stored stably on disc (replicated via the normal database replication ...
Use as a name service • Chubby’s most popular use • For availability, DNS TTLs must be kept short, but short TTLs often overwhelm the DNS servers • Each client must poll each DNS entry: O(N^2) lookups for N servers that all talk to each other • Chubby is invalidation based • Besides KeepAlives, no need to poll
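The quadratic polling cost is easy to quantify. The sketch below assumes N processes that each re-resolve the names of all N processes once per TTL; the function name and the sample values (3000 processes, 60-second TTL) are illustrative assumptions.

```python
def dns_lookups_per_second(n_processes, ttl_seconds):
    """Each of N processes refreshes N entries once per TTL."""
    return n_processes * n_processes / ttl_seconds


load = dns_lookups_per_second(3000, 60)  # 9,000,000 lookups per minute
```

Under these assumptions the DNS servers would see 150,000 lookups per second, whereas an invalidation-based cache only generates traffic when an entry actually changes (plus KeepAlives).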
BigTable: A distributed storage system for structured data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Slides ack to: Mohsen Taheriyan
http://www-scf.usc.edu/~csci572/2011Spring/presentations/Taheriyan.pptx
Motivation • Lots of (semi-)structured data at Google
• URLs: • Contents, crawl metadata, links, anchors, pagerank, …
• Per-user data: • User preference settings, recent queries/search results, …
• Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image
data, user annotations, …
• Scale is large • Billions of URLs, many versions/page (~20K/version) • Hundreds of millions of users, thousands of q/sec • 100TB+ of satellite image data
Goals • Want asynchronous processes to be continuously updating different pieces of data • Want access to most current data at any time
• Need to support: • Very high read/write rates (millions of ops per second) • Efficient scans over all or interesting subsets of data • Efficient joins of large one-to-one and one-to-many
datasets • Often want to examine data changes over time
• E.g., contents of a web page over multiple crawls
Why not commercial DB? • Scale is too large for most commercial databases • Cost would be very high
• Building internally means system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly • Much harder to do when running on top of a database
layer
BigTable • A distributed storage system for managing structured data.
• Scalable • Thousands of servers • Terabytes of in-memory data • Petabyte of disk-based data • Millions of reads/writes per second, efficient scans
• Self-managing • Servers can be added/removed dynamically • Servers adjust to load imbalance
• Used for many Google projects • Web indexing, Personalized Search, Google Earth, Google
Analytics, Google Finance, …
Basic data model • “A BigTable is a sparse, distributed, persistent multidimensional sorted map.” (row:string, column:string, time:int64) → string
[Figure: the Webtable example. Rows are sorted by reversed URL; columns are grouped into column families; row keys are up to 64KB (10-100B typical); each cell holds timestamped versions, subject to garbage collection.]
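The "sorted map" view of the data model can be sketched in a few lines. This is a single-process toy, not Bigtable's implementation: `ToyTable` and its methods are hypothetical, and timestamps are negated inside the key so that newer versions of a cell sort first.

```python
import bisect


class ToyTable:
    """Sorted map from (row, column, timestamp) to value (illustration only)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the most recent value of a cell, or None if absent."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (end_row,))
        for row, column, neg_ts in self._keys[lo:hi]:
            yield row, column, -neg_ts, self._values[(row, column, neg_ts)]


table = ToyTable()
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("org.example", "contents:", 1, "<html>x")
latest = table.get("com.cnn.www", "contents:")
com_rows = list(table.scan_rows("com", "d"))
```

Keeping the keys sorted is what makes range scans over a prefix of reversed URLs cheap: all pages of one domain sit next to each other.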
Rows • Name is an arbitrary string
• Access to data in a row is atomic • Row creation is implicit upon storing data
• Rows ordered lexicographically • Rows close together lexicographically are usually on one or
a small number of machines • Does not support the relational model
• No table-wide integrity constraints • No multi-row transactions
Columns • Columns have two-level name structure:
• family:optional_qualifier • Column family
• Unit of access control • Has associated type information
• Qualifier gives unbounded columns • Additional levels of indexing, if desired
[Figure: Webtable row “cnn.com” with columns “contents:”, “anchor:cnnsi.com”, and “anchor:stanford.edu”; the cell values shown include “…”, “CNN”, and “CNN homepage”.]
Timestamps • Used to store different versions of data in a cell
• 64-bit integers • New writes default to current time, but timestamps for
writes can also be set explicitly by clients • Lookup options
• “Return most recent K values” • “Return all values in timestamp range (or all values)”
• Column families can be marked w/ attributes: • “Only retain most recent K values in a cell” • “Keep values until they are older than K seconds”
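The two retention attributes amount to simple filters over a cell's version list. The sketch below assumes a cell is represented as a list of `(timestamp, value)` pairs with timestamps in seconds; the function names are hypothetical.

```python
def keep_most_recent(versions, k):
    """'Only retain most recent K values in a cell' (newest first)."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:k]


def keep_newer_than(versions, now, max_age_seconds):
    """'Keep values until they are older than K seconds'."""
    return [(ts, v) for ts, v in versions if now - ts <= max_age_seconds]


versions = [(10, "a"), (30, "c"), (20, "b")]
by_count = keep_most_recent(versions, 2)
by_age = keep_newer_than(versions, now=35, max_age_seconds=15)
```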
API • Metadata operations
• Create/delete tables, column families, change metadata • Writes (atomic)
• Set(): write cells in a row • DeleteCells(): delete cells in a row • DeleteRow(): delete all cells in a row
• Reads • Scanner: read arbitrary cells in a bigtable
• Each row read is atomic • Can restrict returned rows to a particular range • Can ask for just data from 1 row, all rows, etc. • Can ask for all columns, just certain column families, or specific
columns
Tablet & Splitting • A Bigtable table is partitioned into many tablets based on
row keys • Tablets (100-200MB each) are stored in a particular structure in GFS
• Each tablet is served by one tablet server
[Figure: a table partitioned into tablets. Rows, in sorted order, include “com.aaa”, “com.cnn”, “com.cnn/sports.html”, “com.website”, “com.yahoo/kids.html”, “com.yahoo/kids.html?d”, and “com.zuppa/menu.html”; the columns shown are “contents:” (“<html>…”) and “language:” (EN). One tablet spans Start: com.aaa to End: com.cnn.]
Tablet structure • Uses Google SSTables, a key building block • An SSTable:
• Is an immutable, sorted file of key-value pairs • SSTable files are stored in GFS • Keys are: <row, column, timestamp>
[Figure: an SSTable is a sequence of 64KB blocks plus an index of block ranges.]
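How a block index speeds up lookups in an immutable sorted file can be sketched as follows. This is a toy model under stated assumptions: `ToySSTable` is a hypothetical name, blocks hold a fixed number of entries rather than 64KB of bytes, and everything lives in memory instead of GFS.

```python
import bisect


class ToySSTable:
    """Immutable sorted key-value file with a block index (illustration only)."""

    def __init__(self, sorted_items, entries_per_block=2):
        # Split the sorted (key, value) pairs into fixed-size blocks.
        self.blocks = [
            sorted_items[i:i + entries_per_block]
            for i in range(0, len(sorted_items), entries_per_block)
        ]
        # Index: the last key stored in each block.
        self.index = [block[-1][0] for block in self.blocks]

    def get(self, key):
        # Step 1: use the index to pick the only block that can hold the key.
        b = bisect.bisect_left(self.index, key)
        if b == len(self.blocks):
            return None
        # Step 2: binary search inside that single block.
        block = self.blocks[b]
        keys = [k for k, _ in block]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return block[i][1]
        return None


sstable = ToySSTable([("a", 1), ("c", 2), ("e", 3), ("g", 4), ("i", 5)])
hit = sstable.get("e")
miss = sstable.get("d")
```

If the index is kept in memory, a lookup needs to read at most one block from storage.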
A tablet stores a range of rows from a table using SSTables
[Figure: a tablet (Start: aardvark, End: apple) built from multiple SSTables, each consisting of 64KB blocks plus an index of block ranges.]
A Table consists of a sequence of tablets.
[Figure: a table made of two tablets (Start: aardvark, End: apple; Start: apple_2e, End: boat), each built from SSTables of 64KB blocks plus an index of block ranges. SSTables may be shared between tablets; tablet row ranges do not overlap.]
Implementation: supporting services • GFS
• For storing log and data files • Chubby
• Ensure there is only one active master • Store bootstrap location of Bigtable data • Discover tablet servers • Store Bigtable schema information • Store access control lists
Implementation • Library linked into every client • One master server
• Assigns/load-balances tablets to tablet servers • Detects addition and expiration of tablet servers • Schema changes (add/remove column families) • Garbage collection • Does NOT provide tablet location
• Many tablet servers • Each tablet server handles R/W requests to its tablets • Splits tablets that have grown too large
BigTable System Architecture
[Figure: a Bigtable cell. The Bigtable master performs metadata ops and load balancing; the Bigtable tablet servers serve data; GFS holds tablet data and logs; the lock service holds metadata and handles master election. Bigtable clients go through the Bigtable client library for Open(), read/write, and metadata ops.]
Locating tablets • Since tablets move around from server to server, given a row, how do clients find the right machine? • Tablet property – startRowIndex and endRowIndex • Need to find tablet whose row range covers the target
row • One approach: could use the BigTable master
• Central server almost certainly would be bottleneck in large system
• Instead: store special tables containing tablet location information in BigTable cell itself
Tablets are located using a B+ tree-like structure.
[Figure: three-level hierarchy. A Chubby lock file points to the root tablet (the 1st METADATA tablet, which is never split); the root tablet points to the other METADATA tablets; these point to the user tablets (UserTable_1 ... UserTable_N).]
<table_id, end_row> → location
Each METADATA record ~1KB Max METADATA tablet = 128MB
Addressable memory (3 tiers) = 2^21 TB
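The capacity figure follows from the record and tablet sizes, and the location lookup is a search for the first end row at or after the target row. The sketch below assumes user tablets are also 128MB and that a tablet's end row is inclusive; `locate` and the sample metadata are hypothetical.

```python
import bisect

METADATA_TABLET_BYTES = 128 * 2**20  # 128MB limit per METADATA tablet
RECORD_BYTES = 2**10                 # ~1KB per METADATA record

records_per_tablet = METADATA_TABLET_BYTES // RECORD_BYTES  # 2**17
addressable_tablets = records_per_tablet ** 2                # 2**34
addressable_tb = addressable_tablets * METADATA_TABLET_BYTES // 2**40


def locate(metadata, table_id, row):
    """metadata: sorted list of ((table_id, end_row), location)."""
    keys = [key for key, _ in metadata]
    # First tablet whose end_row is >= the target row.
    i = bisect.bisect_left(keys, (table_id, row))
    if i < len(metadata) and keys[i][0] == table_id:
        return metadata[i][1]
    return None


metadata = [
    (("users", "g"), "tabletserver-1"),
    (("users", "p"), "tabletserver-2"),
    (("users", "\uffff"), "tabletserver-3"),
]
server_for_harry = locate(metadata, "users", "harry")
```

The root tablet indexes up to 2^17 METADATA tablets, each of which indexes up to 2^17 user tablets, which gives 2^34 tablets, or 2^21 TB at 128MB per tablet.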
Tablet assignment • 1 tablet → 1 tablet server • Master
• Keeps track of the set of live tablet servers and of the unassigned tablets. • Sends a tablet load request for an unassigned tablet to a tablet
server.
• BigTable uses Chubby to keep track of tablet servers. • On startup, a tablet server:
• Creates and acquires an exclusive lock on a uniquely named file in a Chubby directory.
• The master monitors this directory to discover tablet servers.
• A tablet server stops serving tablets if it loses its exclusive lock. • It tries to reacquire the lock on its file as long as the file still exists.
Tablet assignment • If the file no longer exists, the tablet server can never serve again, so it kills itself.
• The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible.
• The master detects this by periodically asking each tablet server for the status of its lock. • If a tablet server reports the loss of its lock • Or if the master cannot reach a tablet server after several
attempts.
Tablet assignment • The master tries to acquire an exclusive lock on the server’s file. • If the master is able to acquire the lock, then Chubby is alive
and the tablet server is either dead or having trouble reaching Chubby.
• If so, the master makes sure that the tablet server can never serve again by deleting its server file.
• The master moves all the tablets assigned to that server into the set of unassigned tablets.
• If its Chubby session expires, the master kills itself. • When a master is started, it needs to discover the current tablet assignment.
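The master's reaction to a suspect tablet server can be sketched against a toy lock service. All names here (`ToyChubby`, `reclaim_if_dead`, the file path) are assumptions for illustration; real Chubby sessions, retries, and RPCs are left out.

```python
class ToyChubby:
    """Maps each server file to its current lock holder (None = unlocked)."""

    def __init__(self):
        self.files = {}

    def create_and_lock(self, path, owner):
        self.files[path] = owner

    def lose_lock(self, path):
        # Simulates the tablet server's Chubby session expiring.
        if path in self.files:
            self.files[path] = None

    def try_acquire(self, path, owner):
        if path in self.files and self.files[path] is None:
            self.files[path] = owner
            return True
        return False

    def delete(self, path):
        self.files.pop(path, None)

    def exists(self, path):
        return path in self.files


def reclaim_if_dead(chubby, server_file, assigned, unassigned):
    """Called when a server reports a lost lock or cannot be reached."""
    if not chubby.try_acquire(server_file, "master"):
        return False  # the server still holds its lock; leave it alone
    chubby.delete(server_file)  # the server can never serve again
    unassigned.update(assigned.pop(server_file, set()))
    return True


chubby = ToyChubby()
chubby.create_and_lock("/servers/ts1", "ts1")
assigned = {"/servers/ts1": {"tablet-1", "tablet-2"}}
unassigned = set()
first_try = reclaim_if_dead(chubby, "/servers/ts1", assigned, unassigned)
chubby.lose_lock("/servers/ts1")
second_try = reclaim_if_dead(chubby, "/servers/ts1", assigned, unassigned)
```

Deleting the file is what makes the decision final: a tablet server that later reconnects finds its file gone and shuts itself down instead of serving stale tablets.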
Master startup operation • Grabs unique master lock in Chubby
• Prevents concurrent master instantiations
• Scans directory in Chubby for live servers • Communicates with every live tablet server
• Discovers all tablets that are already assigned
• Scans METADATA table to learn the set of tablets • Unassigned tablets are marked for assignment
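The last two startup steps reduce to a set difference: every tablet listed in METADATA that no live server reports is unassigned. The function name and sample data below are hypothetical.

```python
def find_unassigned(metadata_tablets, server_reports):
    """server_reports: mapping of live tablet server -> tablets it serves."""
    assigned = set()
    for tablets in server_reports.values():
        assigned |= set(tablets)
    return set(metadata_tablets) - assigned


to_assign = find_unassigned(
    ["t1", "t2", "t3", "t4"],
    {"ts1": ["t1"], "ts2": ["t3"]},
)
```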
Sample applications • Google Analytics
• Raw Click Table (~200 TB) • Row for each end-user session • Row name: {website name and time of session} • Sessions that visit the same web site are sorted & contiguous
• Summary Table (~20 TB) • Contains various summaries for each crawled website • Generated from the Raw Click table via periodic MapReduce
jobs
Sample applications • Personalized Search
• One Bigtable row per user (unique user ID) • Column family per type of action
• E.g., column family for web queries (your entire search history!)
• Bigtable timestamp for each element identifies when the event occurred
• Uses MapReduce over Bigtable to personalize live search results
BigTable replication • Each table can be configured for replication to multiple Bigtable clusters in different data centers
• Eventual consistency model