Post on 27-Dec-2015
transcript
PNUTS: YAHOO!’S HOSTED DATA SERVING PLATFORM
BRIAN F. COOPER, RAGHU RAMAKRISHNAN, UTKARSH SRIVASTAVA, ADAM SILBERSTEIN, PHILIP BOHANNON, HANS-ARNO JACOBSEN, NICK PUZ, DANIEL WEAVER AND RAMANA YERNENI
YAHOO! RESEARCH
Presented by Team Silverlining:
Rakesh Nair, Navya Sruti Sirugudi, Shantanu Sardal, Smruti Aski, Chandra Sekhar
DISTRIBUTED DATABASES – OVERVIEW
Web applications need:
  Scalability, including the ability to scale linearly
  Geographic scope
  High availability and fault tolerance
Web applications typically have:
  Simplified query needs – no joins or aggregations
  Relaxed consistency needs – applications can tolerate stale or reordered data
AGENDA
  Introduction
  PNUTS Features
  Architecture
  PNUTS Applications
  Experimental Results
  Feature Enhancements
  Related Work
PNUTS
A massive-scale hosted database system
Focus on data serving for web applications
Provides data storage organized as hashed or ordered tables
Low latency for large numbers of concurrent requests
Novel per-record consistency guarantees
WHAT IS PNUTS?
CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)
Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
Hosted, managed infrastructure
Example rows (ID, StockNumber, Status), replicated across regions:
A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E
FEATURES
Data Model and Features
  Relational data model, scatter-gather operations, asynchronous notifications, bulk loading
Fault Tolerance
  Employs redundancy; supports low-latency reads and writes even after a failure
Pub-Sub Message System
  Asynchronous operations are carried out using YMB
Record-level Mastering
  All high-latency operations are asynchronous
Hosting
  Centrally managed database service shared by multiple applications
DESIGN DECISIONS Record-level, asynchronous geographic replication
Guaranteed message delivery service
Consistency model that is not fully serializable
Hashed and ordered table organizations, flexible schema
Data management as a hosted service
SCALABILITY
(Architecture figure.) Data-path components:
  Storage units
  Routers
  Tablet controller
  Message broker
Clients access the system through a REST API.
REPLICATION
(Figure: a local region and remote regions, each with storage units, routers, a tablet controller, and clients using the REST API, connected via YMB.)
DATA AND QUERY MODEL
Data is organized into tables of records with attributes.
The PNUTS query language supports selection and projection from a single table.
PNUTS allows applications to declare tables as hashed or ordered.
QUERY MODEL
Per-record operations: Get, Set, Delete
Multi-record operations: Multiget, Scan, Getrange
Web service (RESTful) API
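As a rough illustration of these call shapes, here is a minimal in-memory stand-in; the `PnutsTable` class and its method names are hypothetical, not the real PNUTS client API:

```python
# Toy in-memory model of PNUTS' per-record and multi-record calls.
# Class and method names are illustrative only.
class PnutsTable:
    def __init__(self):
        self._rows = {}                  # primary key -> record attributes

    # Per-record operations
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    # Multi-record operations
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def getrange(self, lo, hi):
        # Only meaningful for ordered tables: keys in [lo, hi)
        return {k: v for k, v in sorted(self._rows.items()) if lo <= k < hi}

parts = PnutsTable()
parts.set("A", {"stock": 42342, "status": "E"})
parts.set("B", {"stock": 42521, "status": "W"})
print(parts.get("A")["status"])              # -> E
print(sorted(parts.multiget(["A", "B"])))    # -> ['A', 'B']
```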
CONSISTENCY MODEL
Web applications typically manipulate one record at a time.
Per-record timeline consistency. Data in PNUTS is replicated across sites.
Each record contains:
  Sequence number – number of updates since the time of creation
  Version number – changes on each update to the record
A hidden field in each record stores which copy is the master copy. Updates can be submitted to any copy; they are forwarded to the master and applied in the order received by the master.
The record also stores the origin of the last few updates; the current master can change mastership based on this information. A mastership change is simply a record update.
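The ordering guarantee above can be sketched as follows; `MasterReplica`, `Replica`, and their fields are illustrative stand-ins, not PNUTS code:

```python
# Sketch of per-record timeline consistency: all writes for a record are
# funneled through that record's master, which assigns an increasing
# sequence number; replicas apply updates strictly in master order.
class MasterReplica:
    def __init__(self):
        self.seq = 0
        self.value = None
        self.log = []            # ordered update log, as YMB would deliver it

    def write(self, value):
        self.seq += 1
        self.value = value
        self.log.append((self.seq, value))
        return self.seq

class Replica:
    def __init__(self):
        self.seq = 0
        self.value = None

    def apply(self, seq, value):
        # Updates arrive in master order, so this replica's state is
        # always a (possibly stale) prefix of the master's timeline.
        assert seq == self.seq + 1
        self.seq, self.value = seq, value

master, remote = MasterReplica(), Replica()
for v in ["v1", "v2", "v3"]:
    master.write(v)
for seq, value in master.log[:2]:    # remote has seen only a prefix so far
    remote.apply(seq, value)
print(master.seq, remote.seq)        # -> 3 2 (remote is stale but consistent)
```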
CONSISTENCY MODEL
Goal: make it easier for applications to reason about updates and cope with asynchrony.
What happens to a record with primary key "Brian"?
(Timeline figure, Generation 1: the record is inserted (v.1), then successive updates produce versions v.2 through v.8, and eventually a delete ends the generation.)
CONSISTENCY MODEL (APIS)
All calls operate against a record's version timeline (Generation 1: v.1 through v.8, where v.8 is the current version and earlier versions are stale copies):
  Read – may return the current version or a stale version
  Read-up-to-date – always returns the current version (v.8)
  Read-critical(required version) – returns a version at least as new as the required one (e.g. "read ≥ v.6" may return v.6, v.7, or v.8)
  Write – installs a new current version
  Test-and-set-write(required version) – applies the write only if the current version equals the required one (e.g. "write if = v.7" returns ERROR once v.8 exists)
Mechanism: per-record mastership
SYSTEM ARCHITECTURE
The system is divided into regions, typically geographically distributed.
Each region contains a complete copy of each table.
A pub/sub mechanism, the Yahoo! Message Broker, is used for reliability and replication.
Data tables are horizontally partitioned into groups of records called tablets.
Each server might have hundreds or thousands of tablets.
TABLET SPLITTING AND BALANCING
Each storage unit has many tablets (horizontal partitions of the table).
Tablets may grow over time; overfull tablets split.
A storage unit may become a hotspot; load is shed by moving tablets to other servers.
READING DATA
Three components: Storage Unit (SU), Router, Tablet Controller.
Each router contains an interval mapping: each tablet boundary is mapped to the SU containing that tablet.
For ordered tables, the primary key space is divided into intervals; for hash tables, the hash space is divided into intervals, one per tablet.
TABLET CONTROLLER
Routers contain only a cached copy of the interval mapping; the mapping is owned by the tablet controller.
Routers get an updated mapping from the tablet controller when a read request fails.
This simplifies router failure recovery.
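The cached-mapping recovery path might look like this sketch; `TabletController`, `Router`, and the per-key mapping (the real system maps key intervals to tablets, not individual keys) are simplifications for illustration:

```python
# Sketch: a router serves requests from a cached mapping and refreshes it
# from the tablet controller only when a request fails. Names are illustrative.
class TabletController:
    def __init__(self, mapping):
        self.mapping = dict(mapping)     # authoritative key -> SU mapping

class Router:
    def __init__(self, controller):
        self.controller = controller
        self.cache = dict(controller.mapping)   # cached copy; may go stale

    def route(self, key, storage_units):
        su = self.cache.get(key)
        if su in storage_units and key in storage_units[su]:
            return storage_units[su][key]
        # Request failed (stale cache): refresh the mapping and retry.
        self.cache = dict(self.controller.mapping)
        su = self.cache[key]
        return storage_units[su][key]

controller = TabletController({"k": "SU1"})
router = Router(controller)
sus = {"SU1": {"k": "record"}, "SU2": {}}
print(router.route("k", sus))            # -> record (cache hit)

# The tablet holding "k" moves to SU2; only the controller learns at once.
sus = {"SU1": {}, "SU2": {"k": "record"}}
controller.mapping["k"] = "SU2"
print(router.route("k", sus))            # stale cache, refresh, retry -> record
```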
ACCESSING SINGLE RECORD
(Figure: 1. the client sends "Get key k" to a router; 2. the router forwards the request to the SU holding k; 3. the SU returns the record for key k; 4. the router returns it to the client.)
BULK READ
(Figure: 1. the client sends the key set {k1, k2, … kn} to a scatter/gather server; 2. the server issues Get k1, Get k2, Get k3, … to the SUs holding those keys.)
RANGE QUERIES
Router interval mapping:
  MIN–Canteloupe → SU1
  Canteloupe–Lime → SU3
  Lime–Strawberry → SU2
  Strawberry–MAX → SU1
Storage unit contents:
  Storage unit 1: Apple, Avocado, Banana, Blueberry, Strawberry, Tomato, Watermelon
  Storage unit 3: Canteloupe, Grape, Kiwi, Lemon
  Storage unit 2: Lime, Mango, Orange
The range query "Grapefruit…Pear?" is split by the router into "Grapefruit…Lime?" (sent to SU3) and "Lime…Pear?" (sent to SU2).
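The router's interval lookup and range splitting can be sketched with the mapping from this slide; this is illustrative code, not PNUTS internals:

```python
# Sketch: split a range query across tablet intervals using the slide's
# interval mapping. MIN and MAX boundaries are implicit.
import bisect

boundaries = ["Canteloupe", "Lime", "Strawberry"]   # interval split points
owners     = ["SU1", "SU3", "SU2", "SU1"]           # one SU per interval

def owner(key):
    # bisect_right finds which interval the key falls into.
    return owners[bisect.bisect_right(boundaries, key)]

def split_range(lo, hi):
    """Break [lo, hi) into per-tablet sub-ranges, each served by one SU."""
    pieces, start = [], lo
    for b in boundaries:
        if start < b < hi:
            pieces.append((start, b, owner(start)))
            start = b
    pieces.append((start, hi, owner(start)))
    return pieces

print(split_range("Grapefruit", "Pear"))
# -> [('Grapefruit', 'Lime', 'SU3'), ('Lime', 'Pear', 'SU2')]
```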
27
UPDATES
(Figure: 1. the client sends "Write key k" to a router; 2. the router forwards it to the SU holding k; 3. the SU publishes the write to the message broker; 4-5. the broker logs it and returns SUCCESS; 6. the broker propagates "Write key k" to the replicas; 7. the SU returns the sequence number for key k to the router; 8. the router returns it to the client.)
YAHOO MESSAGE BROKER
Distributed publish-subscribe service.
Guarantees delivery once a message is published: the message is logged at the site where it is published, and at other sites when received.
Guarantees that messages published to a particular cluster will be delivered in the same order at all other clusters.
Record updates are published to YMB by the master copy (record-level mastering); all replicas subscribe to the updates and receive them in the same order for a particular record.
ASYNCHRONOUS REPLICATION
OTHER FEATURES
Per-record transactions.
Copying a tablet (e.g., on failure): request the copy, publish a checkpoint message, get a copy of the tablet as of when the checkpoint is received, then apply the later updates.
Tablet splits have to be coordinated across all copies.
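The checkpoint-based tablet copy can be sketched as follows; `UpdateStream` is a hypothetical stand-in for the per-tablet YMB topic, not real PNUTS code:

```python
# Sketch of tablet copying: publish a checkpoint marker into the totally
# ordered update stream, snapshot the tablet as of the marker, then apply
# the updates that were published after it.
class UpdateStream:
    """Stand-in for a per-tablet YMB update topic (illustrative)."""
    def __init__(self):
        self.seq, self.log = 0, []

    def publish(self, key, value):
        self.seq += 1
        self.log.append((self.seq, key, value))

    def publish_checkpoint(self):
        # A checkpoint is just a marker in the ordered stream.
        self.seq += 1
        return self.seq

    def since(self, seq):
        return [u for u in self.log if u[0] > seq]

stream, tablet = UpdateStream(), {}

# The live tablet applies updates as they flow through the stream.
stream.publish("a", 1); tablet["a"] = 1

# Copy protocol: checkpoint, snapshot as of the checkpoint, then catch up.
checkpoint = stream.publish_checkpoint()
snapshot = dict(tablet)                       # copy as of the checkpoint
stream.publish("b", 2); tablet["b"] = 2       # update arriving later
for _, key, value in stream.since(checkpoint):
    snapshot[key] = value                     # apply the later updates

print(snapshot == tablet)                     # -> True
```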
QUERY PROCESSING
A range scan can span tablets; this is done by the scatter-gather engine (in the router). Only one tablet is scanned at a time, since the client may not need all results at once. A continuation object is returned to the client to indicate where the range scan should continue.
Notification: one pub-sub topic per tablet. The client knows about tables but not about tablets, and is automatically subscribed to all tablets, even as tablets are added or removed. The usual pub-sub problem of undelivered notifications is handled in the usual way.
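Client-side use of the continuation object might look like this sketch; `scan`, `TABLET_SIZE`, and `ROWS` are hypothetical names for illustration:

```python
# Sketch of paged range scanning: each call returns at most one tablet's
# worth of rows plus a continuation object saying where to resume.
TABLET_SIZE = 2
ROWS = {k: ord(k) for k in "abcdef"}          # ordered-table stand-in

def scan(start=None):
    keys = sorted(k for k in ROWS if start is None or k > start)
    page = keys[:TABLET_SIZE]
    # Continuation is the last key returned, or None when the scan is done.
    continuation = page[-1] if len(keys) > TABLET_SIZE else None
    return [(k, ROWS[k]) for k in page], continuation

results, cont = [], None
while True:
    page, cont = scan(cont)
    results.extend(page)
    if cont is None:
        break
print([k for k, _ in results])    # -> ['a', 'b', 'c', 'd', 'e', 'f']
```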
PNUTS APPLICATIONS
User Database
  Millions of active Yahoo! users – user profiles, IM buddy lists; record timeline – relaxed consistency; hosted DB – many apps sharing the same data
Social and Web 2.0 Apps
  Rapidly evolving and expanding – flexible schema; connections in a social graph – ordered table abstraction
Content Metadata
  Bulk data in a distributed FS, metadata in PNUTS; helps high-performance operations like file creation, deletion, renaming
Listings Management
  Comparison shopping (sorted by price, rating, etc.); ordered tables and views – data sorted by price, ratings, etc.
Session Data
  Large session-state storage; PNUTS as a service – easy access to the session store
EXPERIMENTAL SETUP
Production PNUTS code, enhanced with the ordered table type.
Three PNUTS regions: 2 west coast, 1 east coast; 5 storage units, 2 message brokers, 1 router per region.
West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array. East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk.
Workload: 1200-3600 requests/second, 0-50% writes, 80% locality.
INSERTS
Inserts required 75.6 ms per insert in West 1 (the tablet master), 131.5 ms per insert into the non-master West 2, and 315.5 ms per insert into the non-master East.
(Figure: average latency under varying load; 10% writes by default.)
SCALABILITY
(Figure: average latency (ms, 0-160) vs. number of storage units (1-6), for hash and ordered tables.)
REQUEST SKEW
(Figure: average latency (ms, 0-100) vs. Zipf parameter (0-1), for hash and ordered tables.)
SIZE OF RANGE SCANS
(Figure: average latency (ms, 0-8000) vs. fraction of table scanned (0-0.12), for 30 clients and 300 clients.)
RELATED WORK
Distributed and parallel databases, especially query processing and transactions: BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
Distributed filesystems: Ceph, Boxwood, Sinfonia
Distributed (P2P) hash tables: Chord, Pastry, …
Database replication: master-slave, epidemic/gossip, synchronous, …
CONCLUSIONS AND ONGOING WORK
PNUTS is both a research effort and a product.
  Research: consistency, performance, fault tolerance, rich functionality
  Product: make it work, keep it (relatively) simple, learn from experience and real applications
Ongoing work: indexes and materialized views, bundled updates, batch query processing.
SUMMARY
Aim of PNUTS: rich database functionality at low latency on a massive scale.
Tradeoffs between functionality, performance, and scalability:
  Asynchronous replication – low write latency
  Consistency model – useful guarantees without sacrificing scalability
  Hosted service – minimize operating costs for applications
  Limited features – preserving reliability and scale
Novel aspects:
  Per-record timeline consistency – asynchronous replication
  Message broker – replication mechanism and redo log
  Flexible mapping of tablets to storage units – automatic failover, load balancing
THANK YOU!
Questions??