
Case Studies: Chubby and BigTable

Yao Liu

The Chubby lock service for loosely-coupled distributed systems

Mike Burrows (Google), OSDI 2006

Slides ack to: Shimin Chen: http://www.cs.cmu.edu/~chensm/Big_Data_reading_group/slides/shimin-chubby.ppt

Introduction
• What is Chubby?
  • Lock service in a loosely-coupled distributed system (e.g., 10K 4-processor machines connected by 1Gbps Ethernet)
  • Client interface similar to whole-file advisory locks, with notification of various events (e.g., file modifications)
  • Primary goals: reliability, availability, easy-to-understand semantics
• How is it used?
  • Used in Google: GFS, Bigtable, etc.
  • To elect leaders and to store small amounts of metadata, serving as the root of the distributed data structures

CS 457/557 Introduction to Distributed Systems 3

System architecture

• A Chubby cell consists of a small set of servers (replicas)
• A master is elected from the replicas via a consensus protocol
  • Master lease: several seconds
  • If a master fails, a new one will be elected when the master lease expires
• Clients talk to the master via the Chubby library
  • All replicas are listed in DNS; clients discover the master by talking to any replica


System architecture (2)

• Replicas maintain copies of a simple database
• Clients send read/write requests only to the master
• For a write:
  • The master propagates it to the replicas via the consensus protocol
  • The master replies after the write reaches a majority of replicas
• For a read:
  • The master satisfies the read alone
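The write and read paths above can be sketched as follows. This is a toy illustration only: the `Replica` and `Master` classes are hypothetical, and a simple acknowledgement count stands in for the real consensus protocol (which, unlike this sketch, also deals with partially applied writes).

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One replica holding a copy of the simple database (hypothetical)."""
    up: bool = True
    db: dict = field(default_factory=dict)

    def apply(self, key, value):
        if not self.up:
            return False              # an unreachable replica cannot acknowledge
        self.db[key] = value
        return True


class Master:
    def __init__(self, replicas):
        self.replicas = replicas      # index 0 is the master's own copy

    def write(self, key, value):
        acks = sum(r.apply(key, value) for r in self.replicas)
        return acks > len(self.replicas) // 2    # reply only after a majority

    def read(self, key):
        return self.replicas[0].db.get(key)      # the master answers alone


cell = Master([Replica() for _ in range(5)])
cell.replicas[3].up = False
cell.replicas[4].up = False
ok = cell.write("/ls/foo/x", "v1")    # 3 of 5 replicas acknowledge
```

With two of five replicas down the write still succeeds; a third failure would leave the master without a majority.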


System architecture (3)

• If a replica fails and does not recover for a long time (a few hours):
  • A fresh machine is selected to be a new replica, replacing the failed one
  • DNS is updated to list the new replica in place of the failed one
  • The new replica obtains a recent copy of the database
  • The current master polls DNS periodically to discover new replicas


Master election
• At any point in time, there must be at most one master
  • No two nodes may think they are the master at the same time
• Example:
  • Suppose A is master and it gets disconnected from B
  • B times out trying to talk to A, thinks A is dead, and proposes that it be the master
  • If other nodes agree and A doesn’t hear about the new master, then A will continue to act as master for a while, for example accepting read requests for what could be stale data


Master election in Chubby
• When a master dies, a node proposes a master change through Paxos
• When nodes receive the proposal, they accept it only if the old master’s lease has expired
• A node becomes the master if a majority of nodes have accepted its proposal
• Once a node becomes the master, it knows that it will remain so for at least the lease period
  • It can extend the lease by getting acceptance from a majority of the nodes
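A minimal sketch of the lease rule described above, assuming a 5-second lease and a plain vote count in place of Paxos. The `Node` class and `try_become_master` function are hypothetical names used only for illustration.

```python
class Node:
    """A replica that votes in master elections (simplified, not Paxos)."""

    def __init__(self):
        self.master = None
        self.lease_expiry = 0.0       # time until which the current master is honoured

    def accept(self, candidate, now, lease_len):
        # A different candidate is accepted only after the old lease expires;
        # the current master may always renew its own lease.
        if candidate != self.master and now < self.lease_expiry:
            return False
        self.master = candidate
        self.lease_expiry = now + lease_len
        return True


def try_become_master(candidate, nodes, now, lease_len=5.0):
    votes = sum(n.accept(candidate, now, lease_len) for n in nodes)
    return votes > len(nodes) // 2    # a majority must accept


nodes = [Node() for _ in range(5)]
a_elected = try_become_master("A", nodes, now=0.0)    # no lease outstanding
b_too_early = try_become_master("B", nodes, now=2.0)  # A's lease still valid
a_renewed = try_become_master("A", nodes, now=4.0)    # lease now runs to 9.0
b_later = try_become_master("B", nodes, now=10.0)     # A's lease has expired
```

Because every node refuses a new candidate while the old lease is valid, a disconnected old master can keep serving only until its lease runs out.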


Chubby interface: UNIX-like file system interface

• Chubby supports a strict tree of files and directories
  • No symbolic links, no hard links
  • Example name: /ls/foo/wombat/pouch
    • 1st component (ls): lock service (common to all names)
    • 2nd component (foo): the Chubby cell (used in DNS lookup to find the cell master)
    • The rest: name inside the cell
• Can be accessed via Chubby’s specialized API or via other file system interfaces (e.g., GFS)
• Supports most normal operations (create, delete, open, write, …)
• Supports advisory reader/writer locks on a node
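As a small illustration of the naming scheme, the following sketch splits a Chubby name into its cell and the name inside the cell. The function name `parse_chubby_name` is hypothetical and not part of any Chubby API.

```python
def parse_chubby_name(path):
    """Split '/ls/<cell>/<rest>' into (cell, name inside the cell)."""
    parts = path.split("/")
    # A valid name starts with '/', so parts[0] is the empty string.
    if len(parts) < 3 or parts[0] != "" or parts[1] != "ls" or not parts[2]:
        raise ValueError(f"not a Chubby name: {path!r}")
    cell = parts[2]                     # used in the DNS lookup for the cell
    rest = "/" + "/".join(parts[3:])    # interpreted within the cell
    return cell, rest


cell, rest = parse_chubby_name("/ls/foo/wombat/pouch")
```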


Chubby events
• Clients can subscribe to events (up-calls from the Chubby library)
  • File contents modified: if the file contains the location of a service, this event can be used to monitor the service location
  • Master failed over
  • Child node added, removed, modified
  • Handle becomes invalid: probably a communication problem
  • Lock acquired (rarely used)
  • Conflicting lock request (rarely used)


APIs
• Open()
  • Mode: read/write/change ACL; events; lock-delay
  • Create new file or directory?
• Close()
• GetContentsAndStat(), GetStat(), ReadDir()
• SetContents(): set all contents; SetACL()
• Delete()
• Locks: Acquire(), TryAcquire(), Release()
• Sequencers: GetSequencer(), SetSequencer(), CheckSequencer()


Example: Primary Election
Open("/ls/foo/OurServicePrimary", "write mode");
if (successful) {
  // primary
  SetContents("identity");
} else {
  // replica
  Open("/ls/foo/OurServicePrimary", "read mode", "file-modification event");
  when notified of file modification:
    primary = GetContentsAndStat();
}
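The election pattern above can be exercised against an in-memory stand-in. Everything here is an assumption made for illustration: `FakeCell`, its methods, and `run_election` are hypothetical and do not mirror the real Chubby client library.

```python
class FakeCell:
    """In-memory stand-in for a Chubby cell (hypothetical, single process)."""

    def __init__(self):
        self.lock_holder = {}   # path -> client holding the write lock
        self.contents = {}      # path -> file contents
        self.watchers = {}      # path -> callbacks run on modification

    def try_acquire(self, path, client):
        # The first caller gets the lock; later callers see the existing holder.
        return self.lock_holder.setdefault(path, client) == client

    def set_contents(self, path, data):
        self.contents[path] = data
        for callback in self.watchers.get(path, []):
            callback(data)      # file-modification event

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)


def run_election(cell, path, me, on_primary_change):
    if cell.try_acquire(path, me):
        cell.set_contents(path, me)        # primary advertises its identity
        return "primary"
    cell.watch(path, on_primary_change)    # replica waits for modifications
    return "replica"


cell = FakeCell()
path = "/ls/foo/OurServicePrimary"
seen = []
role_a = run_election(cell, path, "server-a", seen.append)
role_b = run_election(cell, path, "server-b", seen.append)
current_primary = cell.contents[path]
```

A replica reads the file once to learn the current primary and then relies on the modification event instead of polling.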


Caching
• Sequential consistency: easy to understand
• Lease based
  • The master invalidates cached copies upon a write request
• Write-through caches


Sessions, keep-alives, master fail-overs
• Session:
  • A client sends keep-alive requests to the master
  • The master responds with a keep-alive response
  • Immediately after getting the keep-alive response, the client sends another request for extension
  • The master blocks each keep-alive until close to the expiration of the session
  • The default extension is 12s
• Clients maintain a local timer for estimating the session timeout (time is not perfectly synchronized)
• If the local timer runs out, the client waits for a 45s grace period before ending the session
  • Happens when a master fails over
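The client-side timing rules above can be sketched as a small state machine. This is a simplified illustration: the `ClientSession` class and its state names are hypothetical, explicit `now` values replace real clocks, and the conservative adjustment a real client applies to its lease estimate is omitted.

```python
LEASE_SECONDS = 12.0    # default extension, from the slide
GRACE_SECONDS = 45.0    # grace period, from the slide


class ClientSession:
    """Client's local view of its session: safe, jeopardy, or expired."""

    def __init__(self, now):
        self.lease_expiry = now + LEASE_SECONDS
        self.state = "safe"

    def tick(self, now):
        """Re-evaluate the state against the local timer."""
        if self.state == "expired":
            return self.state                   # expiry is final
        if now >= self.lease_expiry + GRACE_SECONDS:
            self.state = "expired"              # grace period ran out
        elif now >= self.lease_expiry:
            self.state = "jeopardy"             # local lease ran out
        return self.state

    def on_keepalive_reply(self, now):
        """A keep-alive response extends the lease unless already expired."""
        if self.tick(now) != "expired":
            self.lease_expiry = now + LEASE_SECONDS
            self.state = "safe"


session = ClientSession(now=0.0)
during_lease = session.tick(5.0)       # still inside the lease
after_lease = session.tick(12.0)       # lease ran out, grace period starts
session.on_keepalive_reply(30.0)       # new master answered within the grace period
recovered = session.state
```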


Master fail-over

[Figure 2: The role of the grace period in master fail-over. The timeline shows the old master (leases M1, M2), an interval with no master after the old master dies, and the new master (lease M3); the client’s leases C1, C2, C3; numbered KeepAlive exchanges 1-8; and the jeopardy and safe events around the grace period.]

... before the end of the client’s grace period, the client enables its cache once more. Otherwise, the client assumes that the session has expired. This is done so that Chubby API calls do not block indefinitely when a Chubby cell becomes inaccessible; calls return with an error if the grace period ends before communication is re-established.

The Chubby library can inform the application when the grace period begins via a jeopardy event. When the session is known to have survived the communications problem, a safe event tells the client to proceed; if the session times out instead, an expired event is sent. This information allows the application to quiesce itself when it is unsure of the status of its session, and to recover without restarting if the problem proves to be transient. This can be important in avoiding outages in services with large startup overhead.

If a client holds a handle H on a node and any operation on H fails because the associated session has expired, all subsequent operations on H (except Close() and Poison()) will fail in the same way. Clients can use this to guarantee that network and server outages cause only a suffix of a sequence of operations to be lost, rather than an arbitrary subsequence, thus allowing complex changes to be marked as committed with a final write.

2.9 Fail-overs

When a master fails or otherwise loses mastership, it discards its in-memory state about sessions, handles, and locks. The authoritative timer for session leases runs at the master, so until a new master is elected the session lease timer is stopped; this is legal because it is equivalent to extending the client’s lease. If a master election occurs quickly, clients can contact the new master before their local (approximate) lease timers expire. If the election takes a long time, clients flush their caches and wait for the grace period while trying to find the new master. Thus the grace period allows sessions to be maintained across fail-overs that exceed the normal lease timeout.

Figure 2 shows the sequence of events in a lengthy master fail-over event in which the client must use its grace period to preserve its session. Time increases from left to right, but times are not to scale. Client session leases are shown as thick arrows, as viewed by both the old and new masters (M1-3, above) and the client (C1-3, below). Upward angled arrows indicate KeepAlive requests, and downward angled arrows their replies. The original master has session lease M1 for the client, while the client has a conservative approximation C1. The master commits to lease M2 before informing the client via KeepAlive reply 2; the client is able to extend its view of the lease C2. The master dies before replying to the next KeepAlive, and some time elapses before another master is elected. Eventually the client’s approximation of its lease (C2) expires. The client then flushes its cache and starts a timer for the grace period.

During this period, the client cannot be sure whether its lease has expired at the master. It does not tear down its session, but it blocks all application calls on its API to prevent the application from observing inconsistent data. At the start of the grace period, the Chubby library sends a jeopardy event to the application to allow it to quiesce itself until it can be sure of the status of its session.

Eventually a new master election succeeds. The master initially uses a conservative approximation M3 of the session lease that its predecessor may have had for the client. The first KeepAlive request (4) from the client to the new master is rejected because it has the wrong master epoch number (described in detail below). The retried request (6) succeeds but typically does not extend the master lease further because M3 was conservative. However the reply (7) allows the client to extend its lease (C3) once more, and optionally inform the application that its session is no longer in jeopardy. Because the grace period was long enough to cover the interval between the end of lease C2 and the beginning of lease C3, the client saw nothing but a delay. Had the grace period been less than that interval, the client would have abandoned the session and reported the failure to the application.

Once a client has contacted the new master, the client library and master co-operate to provide the illusion to the application that no failure has occurred. To achieve this, the new master must reconstruct a conservative approximation of the in-memory state that the previous master had. It does this partly by reading data stored stably on disc (replicated via the normal database replication ...


Use as a name service
• Chubby’s most popular use
• With DNS, availability requires a short TTL, but short TTLs often overwhelm the DNS servers
  • Clients must poll each DNS entry: O(N^2) lookups
• Chubby is invalidation based
  • Besides KeepAlives, no need to poll
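To make the O(N^2) polling cost concrete, here is a back-of-the-envelope calculation. The function name and the sample values (3000 clients, 60-second TTL) are illustrative assumptions, and the model assumes every client re-resolves every other client once per TTL.

```python
def dns_lookups_per_second(n_clients, ttl_seconds):
    """Steady-state lookup rate when N clients each re-resolve N names per TTL."""
    return n_clients * n_clients / ttl_seconds


rate = dns_lookups_per_second(3000, 60)   # 150000.0 lookups per second
```

Halving the TTL for faster fail-over doubles this rate, which is why an invalidation-based scheme scales better here.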


BigTable: A distributed storage system for structured data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

Slides ack to: Mohsen Taheriyan

http://www-scf.usc.edu/~csci572/2011Spring/presentations/Taheriyan.pptx

Motivation
• Lots of (semi-)structured data at Google
  • URLs:
    • Contents, crawl metadata, links, anchors, pagerank, …
  • Per-user data:
    • User preference settings, recent queries/search results, …
  • Geographic locations:
    • Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
• Scale is large
  • Billions of URLs, many versions/page (~20K/version)
  • Hundreds of millions of users, thousands of queries/sec
  • 100TB+ of satellite image data


Goals
• Want asynchronous processes to be continuously updating different pieces of data
  • Want access to the most current data at any time
• Need to support:
  • Very high read/write rates (millions of ops per second)
  • Efficient scans over all or interesting subsets of data
  • Efficient joins of large one-to-one and one-to-many datasets
• Often want to examine data changes over time
  • E.g., contents of a web page over multiple crawls


Why not commercial DB?
• Scale is too large for most commercial databases
• Cost would be very high
  • Building internally means the system can be applied across many projects for low incremental cost
• Low-level storage optimizations help performance significantly
  • Much harder to do when running on top of a database layer


BigTable
• A distributed storage system for managing structured data
• Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabytes of disk-based data
  • Millions of reads/writes per second, efficient scans
• Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance
• Used for many Google projects
  • Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, …


Basic data model
• “A BigTable is a sparse, distributed, persistent multidimensional sorted map.”
  (row:string, column:string, time:int64) → string
[Figure: Webtable example. Rows are sorted by reverse URL; columns are grouped into column families; each cell holds timestamped versions, subject to garbage collection. Row keys are up to 64KB (10-100B typical).]
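The map signature above can be mimicked with a toy in-memory structure. This sketch is only an illustration of the data model, not of how Bigtable stores data; the `TinyTable` class and its methods are hypothetical.

```python
class TinyTable:
    """Toy map (row, column, timestamp) -> value; absent cells cost nothing."""

    def __init__(self):
        self.cells = {}     # (row, column) -> {timestamp: value}

    def set(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self.cells.get((row, column), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def rows(self):
        return sorted({row for row, _ in self.cells})   # lexicographic row order


table = TinyTable()
table.set("com.cnn.www", "contents:", 3, "<html>v3")
table.set("com.cnn.www", "contents:", 5, "<html>v5")
table.set("com.aaa", "language:", 1, "EN")
latest = table.get("com.cnn.www", "contents:")
as_of_4 = table.get("com.cnn.www", "contents:", timestamp=4)
```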

Rows
• Name is an arbitrary string
  • Access to data in a row is atomic
  • Row creation is implicit upon storing data
• Rows are ordered lexicographically
  • Rows close together lexicographically are usually on one or a small number of machines
• Does not support the relational model
  • No table-wide integrity constraints
  • No multi-row transactions


Columns
• Columns have a two-level name structure: family:optional_qualifier
• Column family
  • Unit of access control
  • Has associated type information
• Qualifier gives unbounded columns
  • Additional levels of indexing, if desired
[Figure: example row “cnn.com” with columns “contents:”, “anchor:cnnsi.com”, and “anchor:stanford.edu”; the anchor cells hold link text such as “CNN” and “CNN homepage”.]


Timestamps
• Used to store different versions of data in a cell
  • 64-bit integers
  • New writes default to the current time, but timestamps for writes can also be set explicitly by clients
• Lookup options
  • “Return most recent K values”
  • “Return all values in timestamp range (or all values)”
• Column families can be marked w/ attributes:
  • “Only retain most recent K values in a cell”
  • “Keep values until they are older than K seconds”
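The two retention attributes above can be sketched as filters over one cell's versions. The function names are hypothetical, and timestamps are assumed to be in seconds for simplicity.

```python
def keep_last_k(versions, k):
    """Retain only the k most recent versions; versions maps timestamp -> value."""
    newest = sorted(versions, reverse=True)[:k]
    return {t: versions[t] for t in newest}


def keep_newer_than(versions, now, max_age_seconds):
    """Retain versions whose age does not exceed max_age_seconds."""
    return {t: v for t, v in versions.items() if now - t <= max_age_seconds}


cell_versions = {1: "a", 2: "b", 3: "c"}
last_two = keep_last_k(cell_versions, 2)
recent = keep_newer_than(cell_versions, now=10, max_age_seconds=8)
```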


API
• Metadata operations
  • Create/delete tables and column families, change metadata
• Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
• Reads
  • Scanner: read arbitrary cells in a bigtable
    • Each row read is atomic
    • Can restrict returned rows to a particular range
    • Can ask for just data from 1 row, all rows, etc.
    • Can ask for all columns, just certain column families, or specific columns
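A scanner restricted to a row range and to certain column families might look like the sketch below. It runs over a plain dictionary, not over Bigtable; the `scan` function and its parameters are hypothetical.

```python
import bisect


def scan(table, start_row, end_row, families=None):
    """Yield (row, columns) for rows in [start_row, end_row), in sorted order.

    table maps row -> {column: value}; families optionally limits the
    returned columns to those whose family (the part before ':') is listed.
    """
    rows = sorted(table)
    lo = bisect.bisect_left(rows, start_row)
    hi = bisect.bisect_left(rows, end_row)
    for row in rows[lo:hi]:
        columns = {c: v for c, v in table[row].items()
                   if families is None or c.split(":", 1)[0] in families}
        yield row, columns


webtable = {
    "com.cnn.www": {"contents:": "<html>", "anchor:cnnsi.com": "CNN"},
    "com.yahoo/kids.html": {"contents:": "<html>"},
    "org.example": {"contents:": "<html>"},
}
com_rows = [row for row, _ in scan(webtable, "com.", "com/")]
anchors = list(scan(webtable, "com.cnn", "com.cnn/", families={"anchor"}))
```

Because rows are sorted, a prefix such as "com." maps to one contiguous range, which is what makes range scans cheap.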


Tablet & Splitting
• A Bigtable table is partitioned into many tablets based on row keys
  • Tablets (100-200MB each) are stored in a particular structure in GFS
• Each tablet is served by one tablet server
[Figure: a table with columns “contents:” and “language:” whose sorted rows (“com.aaa”, “com.cnn”, “com.cnn/sports.html”, …, “com.website”, …, “com.yahoo/kids.html”, “com.yahoo/kids.html?d”, …, “com.zuppa/menu.html”) are divided into tablets; one tablet covers Start: com.aaa, End: com.cnn.]


Tablet structure
• Uses Google SSTables, a key building block
• An SSTable:
  • Is an immutable, sorted file of key-value pairs
  • SSTable files are stored in GFS
  • Keys are: <row, column, timestamp>
[Figure: an SSTable made of 64KB blocks plus an index of block ranges.]
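The block-plus-index layout can be sketched in memory as follows. This is a simplified illustration: the `SSTable` class here is hypothetical, blocks are sized by entry count instead of 64KB, and nothing is written to a file.

```python
import bisect


class SSTable:
    """Immutable sorted key-value pairs split into blocks, plus a block index."""

    def __init__(self, items, block_size=2):
        pairs = sorted(items.items())
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        self.index = [block[-1][0] for block in self.blocks]   # last key per block

    def get(self, key):
        # The index picks the single block that could hold the key.
        i = bisect.bisect_left(self.index, key)
        if i == len(self.blocks):
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None


data = {(row, "contents:", 1): f"v-{row}" for row in ["a", "b", "c", "d", "e"]}
sstable = SSTable(data)
hit = sstable.get(("c", "contents:", 1))
miss = sstable.get(("bb", "contents:", 1))
```

Only the small index needs to stay in memory; a lookup then touches at most one block.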

A tablet stores a range of rows from a table using SSTables.

[Figure: a tablet (Start: aardvark, End: apple) built from multiple SSTables, each consisting of 64KB blocks and an index.]
A table consists of a sequence of tablets.

[Figure: a table made of two tablets (Start: aardvark, End: apple; and Start: apple_2e, End: boat), each built from SSTables of 64KB blocks. SSTables may be shared between tablets; tablet row ranges do not overlap.]


Implementation: supporting services
• GFS
  • For storing log and data files
• Chubby
  • Ensure there is only one active master
  • Store bootstrap location of Bigtable data
  • Discover tablet servers
  • Store Bigtable schema information
  • Store access control lists


Implementation
• Library linked into every client
• One master server
  • Assigns tablets to tablet servers and balances load across them
  • Detects tablet servers coming up and going down (addition and expiration)
  • Handles schema changes (add/remove column families)
  • Garbage collection
  • Does NOT provide tablet location
• Many tablet servers
  • Each tablet server handles read/write requests to its tablets
  • Splits tablets that have grown too large


BigTable System Architecture

[Figure: a Bigtable cell contains the Bigtable master (performs metadata ops, load balancing) and several Bigtable tablet servers (each serves data). The cell relies on GFS (holds tablet data, logs) and the lock service (holds metadata, handles master election). A Bigtable client uses the Bigtable client library for Open(), read/write, and metadata ops.]


Locating tablets
• Since tablets move around from server to server, given a row, how do clients find the right machine?
  • Tablet property: startRowIndex and endRowIndex
  • Need to find the tablet whose row range covers the target row
• One approach: could use the BigTable master
  • A central server would almost certainly be a bottleneck in a large system
• Instead: store special tables containing tablet location information in the BigTable cell itself
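Finding the tablet whose row range covers a target row is a search over sorted end rows. The sketch below assumes each tablet is identified by its inclusive end row; the `TabletLocator` class and the server names are hypothetical.

```python
import bisect


class TabletLocator:
    """Maps a row key to the server of the tablet whose range covers it."""

    def __init__(self, tablets):
        # tablets: (end_row, server) pairs; each tablet covers the rows after
        # the previous tablet's end_row, up to and including its own end_row.
        self.tablets = sorted(tablets)
        self.end_rows = [end for end, _ in self.tablets]

    def locate(self, row):
        i = bisect.bisect_left(self.end_rows, row)   # first end_row >= row
        if i == len(self.tablets):
            raise KeyError(row)
        return self.tablets[i][1]


locator = TabletLocator([
    ("com.cnn", "tabletserver-1"),
    ("com.zuppa", "tabletserver-2"),
    ("\xff", "tabletserver-3"),          # last tablet covers the remaining rows
])
server = locator.locate("com.yahoo/kids.html")
```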


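Tablets are located using a B+ tree-like structure.

[Figure: a three-level hierarchy. A Chubby lock file points to the root tablet (the 1st METADATA tablet, which is never split); the root tablet points to the other METADATA tablets; these point to the tablets of UserTable_1 … UserTable_N.]

• METADATA maps <table_id, end_row> → location
• Each METADATA record is ~1KB; max METADATA tablet size = 128MB
• Addressable storage (3 tiers) = 2^21 TB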
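The 2^21 TB figure follows from the record and tablet sizes on the slide. The short calculation below reproduces it, assuming exactly 1KB per METADATA record and 128MB per tablet at every level.

```python
METADATA_RECORD_BYTES = 1024              # ~1KB per METADATA record
TABLET_BYTES = 128 * 1024 ** 2            # 128MB per tablet

# Each METADATA tablet can describe this many tablets.
entries_per_tablet = TABLET_BYTES // METADATA_RECORD_BYTES    # 2**17

# Root tablet -> METADATA tablets -> user tablets: two levels of fan-out.
addressable_tablets = entries_per_tablet ** 2                  # 2**34

addressable_bytes = addressable_tablets * TABLET_BYTES         # 2**61
addressable_tb = addressable_bytes // 1024 ** 4                # 2**21
```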


Tablet assignment
• 1 tablet → 1 tablet server
• Master
  • Keeps track of the set of live tablet servers and of unassigned tablets
  • Sends a tablet load request for an unassigned tablet to a tablet server
• BigTable uses Chubby to keep track of tablet servers
• On startup, a tablet server:
  • Creates and acquires an exclusive lock on a uniquely named file in a Chubby directory
  • The master monitors this directory to discover tablet servers
• A tablet server stops serving tablets if it loses its exclusive lock
  • It tries to reacquire the lock on its file as long as the file still exists


Tablet assignment
• If the file no longer exists, the tablet server will never be able to serve again, so it kills itself
• The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible
• The master detects this by periodically asking each tablet server for the status of its lock
  • If a tablet server reports the loss of its lock
  • Or if the master cannot reach a tablet server after several attempts


Tablet assignment
• The master then tries to acquire an exclusive lock on the server’s file
• If the master is able to acquire the lock, then Chubby is alive and the tablet server is either dead or having trouble reaching Chubby
  • If so, the master makes sure that the tablet server can never serve again by deleting its server file
  • The master moves all of that server’s assigned tablets into the set of unassigned tablets
• If its Chubby session expires, the master kills itself
• When a master is started, it needs to discover the current tablet assignment
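The master's handling of a suspect tablet server can be sketched as below. The `FakeChubby` class, the file paths, and `handle_suspect_server` are hypothetical stand-ins for illustration; real lock semantics and session handling are not modelled.

```python
class FakeChubby:
    """In-memory stand-in for the lock service (hypothetical API)."""

    def __init__(self):
        self.files = {}     # server file path -> current lock holder, or None

    def try_lock(self, path, who):
        # Succeeds only if the file exists and nobody holds its lock.
        if path in self.files and self.files[path] is None:
            self.files[path] = who
            return True
        return False

    def delete(self, path):
        self.files.pop(path, None)


def handle_suspect_server(chubby, path, assignments, unassigned):
    """Run when a server reports a lost lock or cannot be reached."""
    if not chubby.try_lock(path, "master"):
        return False                            # server still holds its lock
    chubby.delete(path)                         # server can never serve again
    unassigned.update(assignments.pop(path, set()))
    return True


chubby = FakeChubby()
chubby.files = {"servers/ts1": None, "servers/ts2": "ts2"}   # ts1 lost its lock
assignments = {"servers/ts1": {"t1", "t2"}, "servers/ts2": {"t3"}}
unassigned = set()
ts1_removed = handle_suspect_server(chubby, "servers/ts1", assignments, unassigned)
ts2_removed = handle_suspect_server(chubby, "servers/ts2", assignments, unassigned)
```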


Master startup operation
• Grabs the unique master lock in Chubby
  • Prevents concurrent master instantiations
• Scans the servers directory in Chubby for live servers
• Communicates with every live tablet server
  • Discovers which tablets are already assigned
• Scans the METADATA table to learn the full set of tablets
  • Unassigned tablets are marked for assignment
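The last two startup steps amount to a set difference: every tablet listed in METADATA that no live server reports serving is unassigned. A minimal sketch, with a hypothetical function name and made-up tablet ids:

```python
def discover_unassigned(metadata_tablets, live_servers):
    """Return tablets listed in METADATA that no live server is serving.

    metadata_tablets: iterable of tablet ids from the METADATA scan.
    live_servers: dict mapping server name -> set of tablet ids it serves.
    """
    assigned = set()
    for tablets in live_servers.values():
        assigned.update(tablets)
    return set(metadata_tablets) - assigned


to_assign = discover_unassigned(
    ["t1", "t2", "t3", "t4"],
    {"ts1": {"t1"}, "ts2": {"t3", "t4"}},
)
```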


Sample applications
• Google Analytics
  • Raw Click Table (~200 TB)
    • Row for each end-user session
    • Row name: {website name and time of session}
    • Sessions that visit the same web site are sorted & contiguous
  • Summary Table (~20 TB)
    • Contains various summaries for each crawled website
    • Generated from the Raw Click table via periodic MapReduce jobs


Sample applications
• Personalized Search
  • One Bigtable row per user (unique user ID)
  • Column family per type of action
    • E.g., column family for web queries (your entire search history!)
  • Bigtable timestamp for each element identifies when the event occurred
  • Uses MapReduce over Bigtable to personalize live search results


BigTable replication
• Each table can be configured for replication to multiple Bigtable clusters in different data centers
• Eventual consistency model

