
PNUTS: YAHOO!’S HOSTED DATA SERVING PLATFORM

BRIAN F. COOPER, RAGHU RAMAKRISHNAN, UTKARSH SRIVASTAVA, ADAM SILBERSTEIN, PHILIP BOHANNON, HANS-ARNO JACOBSEN, NICK PUZ, DANIEL WEAVER AND RAMANA YERNENI

YAHOO! RESEARCH

Presented by Team Silverlining:

Rakesh Nair, Navya Sruti Sirugudi, Shantanu Sardal, Smruti Aski, Chandra Sekhar


DISTRIBUTED DATABASES – OVERVIEW

Web applications need:
- Scalability, and the ability to scale linearly
- Geographic scope
- High availability and fault tolerance

Web applications typically have:
- Simplified query needs: no joins, no aggregations
- Relaxed consistency needs: applications can tolerate stale or reordered data

AGENDA

- Introduction
- PNUTS Features
- Architecture
- PNUTS Applications
- Experimental Results
- Feature Enhancements
- Related Work

PNUTS

A massive-scale hosted database system

Focus on data serving for web applications

Provides data storage organized as hashed or ordered tables

Low latency for large numbers of concurrent requests

Novel per-record consistency guarantees


WHAT IS PNUTS?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR,
  …
)

- Parallel database
- Geographic replication
- Indexes and views
- Structured, flexible schema
- Hosted, managed infrastructure

Example Parts records:

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

FEATURES

- Data Model and Features: relational data model, scatter-gather operations, asynchronous notifications, bulk loading
- Fault Tolerance: employs redundancy; supports low-latency reads and writes even after failures
- Pub-Sub Message System: asynchronous operations carried out using YMB
- Record-level Mastering: all high-latency operations are asynchronous
- Hosting: centrally managed database service shared by multiple applications

DESIGN DECISIONS

- Record-level, asynchronous geographic replication
- Guaranteed message delivery service
- Consistency model that is weaker than full serializability
- Hashed and ordered table organizations, flexible schema
- Data management as a hosted service

SCALABILITY

[Architecture diagram showing the data-path components: clients, REST API, routers, tablet controller, storage units, message broker]

REPLICATION

[Diagram: the local region and remote regions each contain clients, a REST API, routers, a tablet controller, and storage units; YMB connects the regions]

DATA AND QUERY MODEL

- Data is organized into tables of records with attributes
- The query language of PNUTS supports selection and projection from a single table
- PNUTS allows applications to declare tables to be hashed or ordered

QUERY MODEL

Per-record operations:
- Get
- Set
- Delete

Multi-record operations:
- Multiget
- Scan
- Getrange

Web service (RESTful) API
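The operation surface above can be sketched as a toy in-memory table. The class and method names below are illustrative stand-ins, not the actual PNUTS client API:

```python
# Hypothetical in-memory sketch of the PNUTS operation surface.
class PnutsTableSketch:
    def __init__(self):
        self._rows = {}          # primary key -> record (dict of attributes)

    # --- per-record operations ---
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    # --- multi-record operations ---
    def multiget(self, keys):
        # Retrieve several records in one call; missing keys yield None.
        return {k: self._rows.get(k) for k in keys}

    def getrange(self, low, high):
        # Range scan over an ordered table: keys in [low, high).
        return sorted((k, v) for k, v in self._rows.items() if low <= k < high)

t = PnutsTableSketch()
t.set("A", {"stock": 42342, "status": "E"})
t.set("B", {"stock": 42521, "status": "W"})
t.set("C", {"stock": 66354, "status": "W"})
print(t.multiget(["A", "C"]))
print(t.getrange("A", "C"))
```

In the real system these calls go over the RESTful web-service API rather than a local object.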

CONSISTENCY MODEL

Web applications typically manipulate one record at a time, so PNUTS offers per-record timeline consistency. Data in PNUTS is replicated across sites. Each record contains:
- Sequence number: number of updates since the record's creation
- Version number: changes on each update to the record

A hidden field in each record stores which copy is the master copy. Updates can be submitted to any copy; they are forwarded to the master and applied in the order received by the master. The record also stores the origin of the last few updates, and the current master can change mastership based on this information. A mastership change is simply a record update.

CONSISTENCY MODEL

Goal: make it easier for applications to reason about updates and cope with asynchrony.

What happens to a record with primary key "Brian"?

[Timeline diagram: an insert creates the record; successive updates advance it through versions v.1–v.8 of generation 1; a delete ends the generation]

CONSISTENCY MODEL (APIS)

Read: may return a stale version or the current version.

[Timeline: versions v.1–v.8 of generation 1; v.8 is the current version, earlier versions are stale]

CONSISTENCY MODEL (APIS)

Read up-to-date: always returns the current version.

[Timeline: versions v.1–v.8 of generation 1; the read returns the current version v.8]

CONSISTENCY MODEL (APIS)

Read-critical(required version): returns a version at least as new as the required one; e.g. "read ≥ v.6" may return v.6, v.7, or the current v.8, but never anything older.

[Timeline: versions v.1–v.8 of generation 1]

CONSISTENCY MODEL (APIS)

Write: applies a new update, producing the next current version of the record.

[Timeline: versions v.1–v.8 of generation 1]

CONSISTENCY MODEL (APIS)

Test-and-set-write(required version): the write succeeds only if the record is still at the required version; e.g. "write if = v.7" returns ERROR if another update has already advanced the record.

[Timeline: versions v.1–v.8 of generation 1]

CONSISTENCY MODEL (APIS)

Mechanism: per-record mastership
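These API calls can be modeled on a single record object; a minimal sketch, with illustrative method names (the real API operates over replicated copies, where a plain read may hit a stale replica):

```python
class TimelineRecord:
    """Toy model of per-record timeline consistency (names are assumptions)."""
    def __init__(self, value):
        self.version = 1          # v.1 is created by the insert
        self.value = value

    def read_any(self):
        # In PNUTS this may return a stale replica's version; here, current.
        return self.version, self.value

    def read_critical(self, required_version):
        # Must return a version at least as new as required_version.
        if self.version < required_version:
            raise RuntimeError("replica too stale; forward to a newer copy")
        return self.version, self.value

    def read_latest(self):
        return self.version, self.value   # always the current version

    def write(self, value):
        self.version += 1
        self.value = value
        return self.version

    def test_and_set_write(self, required_version, value):
        # Succeeds only if no other update has advanced the record.
        if self.version != required_version:
            raise RuntimeError("ERROR: current version is v.%d" % self.version)
        return self.write(value)
```

Test-and-set-write is what makes a safe read-modify-write possible: read the latest version, compute, then write only if the version is unchanged.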

SYSTEM ARCHITECTURE

- The system is divided into regions, typically geographically distributed
- Each region contains a complete copy of each table
- A pub/sub mechanism, the Yahoo! Message Broker, is used for reliability and replication
- Data tables are horizontally partitioned into groups of records called tablets
- Each server might have hundreds or thousands of tablets

TABLET SPLITTING AND BALANCING

- Each storage unit has many tablets (horizontal partitions of the table)
- Tablets may grow over time; overfull tablets split
- A storage unit may become a hotspot; load is shed by moving tablets to other storage units
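A toy sketch of the two mechanisms; the size threshold and function names are made up for illustration:

```python
# Illustrative sketch of tablet splitting and load shedding.
MAX_TABLET_SIZE = 4   # records per tablet before a split (made-up threshold)

def split_overfull(tablets):
    """tablets: list of sorted key lists; split any tablet over the size cap."""
    out = []
    for t in tablets:
        while len(t) > MAX_TABLET_SIZE:
            out.append(t[:MAX_TABLET_SIZE // 2])   # lower half becomes a tablet
            t = t[MAX_TABLET_SIZE // 2:]
        out.append(t)
    return out

def shed_load(assignment):
    """assignment: SU name -> list of tablets. Move one tablet off the hottest SU."""
    hot = max(assignment, key=lambda su: len(assignment[su]))
    cold = min(assignment, key=lambda su: len(assignment[su]))
    if len(assignment[hot]) - len(assignment[cold]) > 1:
        assignment[cold].append(assignment[hot].pop())
    return assignment
```

Because tablets, not tables, are the unit of placement, rebalancing is just moving tablets between storage units.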

READING DATA

Three components: Storage Unit (SU), Router, Tablet Controller.

Each router contains an interval mapping from tablet boundaries to the SU containing each tablet. For ordered tables, the primary key space is divided into intervals; for hash tables, the hash space is divided into intervals, one per tablet.

TABLET CONTROLLER

- Routers contain only a cached copy of the interval mapping
- The mapping is owned by the tablet controller
- Routers fetch an updated mapping from the tablet controller when a read request fails
- This simplifies router failure recovery
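The cached-mapping-plus-retry behavior can be sketched as a binary search over interval boundaries; all class and method names here are illustrative, not the actual implementation:

```python
import bisect

class TabletController:
    """Owns the authoritative interval mapping (illustrative sketch)."""
    def __init__(self, boundaries, sus, storage_units):
        self.boundaries, self.sus = boundaries, sus
        self.storage_units = storage_units     # SU name -> {key: record}

    def current_mapping(self):
        return list(self.boundaries), list(self.sus)

class Router:
    """Caches the mapping; refreshes it from the controller on a failed read."""
    def __init__(self, controller):
        self.controller = controller
        self.boundaries, self.sus = controller.current_mapping()

    def _lookup(self, key):
        # boundaries[i] <= key < boundaries[i+1]  ->  tablet lives on sus[i]
        return self.sus[bisect.bisect_right(self.boundaries, key) - 1]

    def read(self, key):
        try:
            return self.controller.storage_units[self._lookup(key)][key]
        except KeyError:
            # Cached mapping was stale: refresh from the controller, retry once.
            self.boundaries, self.sus = self.controller.current_mapping()
            return self.controller.storage_units[self._lookup(key)][key]

ctrl = TabletController([""], ["SU1"], {"SU1": {"apple": 1, "zebra": 2}})
router = Router(ctrl)
print(router.read("apple"))
```

When a tablet split moves keys to another SU, the router's first read fails, which triggers the refresh; the controller is only contacted on misses, keeping it off the hot path.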


ACCESSING A SINGLE RECORD

[Diagram: (1) client sends "Get key k" to the router; (2) router forwards "Get key k" to the SU holding the tablet; (3) the SU returns the record for key k; (4) the router returns the record to the client]

BULK READ

[Diagram: (1) client sends the key set {k1, k2, … kn} to a scatter/gather server; (2) the scatter/gather server issues "Get k1", "Get k2", "Get k3", … to the storage units in parallel]

RANGE QUERIES

Router interval mapping:

MIN–Canteloupe → SU1
Canteloupe–Lime → SU3
Lime–Strawberry → SU2
Strawberry–MAX → SU1

Storage unit 1 holds Apple, Avocado, Banana, Blueberry and Strawberry, Tomato, Watermelon; storage unit 3 holds Canteloupe, Grape, Kiwi, Lemon; storage unit 2 holds Lime, Mango, Orange.

A range query "Grapefruit…Pear?" is split at tablet boundaries into "Grapefruit…Lime?" (sent to SU3) and "Lime…Pear?" (sent to SU2).
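The split in this example can be reproduced by intersecting the query range with each tablet interval. The interval table comes from the slide; the function name and sentinel handling are illustrative:

```python
# Interval mapping from the slide (MIN/MAX are open-ended sentinels).
INTERVALS = [("MIN", "Canteloupe", "SU1"),
             ("Canteloupe", "Lime", "SU3"),
             ("Lime", "Strawberry", "SU2"),
             ("Strawberry", "MAX", "SU1")]

def split_range(low, high):
    """Yield (sub_low, sub_high, su) pieces of [low, high), one per tablet."""
    pieces = []
    for lo, hi, su in INTERVALS:
        lo_k = "" if lo == "MIN" else lo          # MIN sorts before everything
        hi_k = "\uffff" if hi == "MAX" else hi    # MAX sorts after everything
        s, e = max(low, lo_k), min(high, hi_k)
        if s < e:                                  # non-empty intersection
            pieces.append((s, e, su))
    return pieces

print(split_range("Grapefruit", "Pear"))
# -> [('Grapefruit', 'Lime', 'SU3'), ('Lime', 'Pear', 'SU2')]
```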


UPDATES

[Diagram: (1) client sends "Write key k" to the router; (2) router forwards the write to the SU holding the record's master copy; (3) the SU publishes the write to the message broker; (4) the broker logs the message; (5) the broker returns SUCCESS to the SU; (6) the broker asynchronously delivers "Write key k" to the replica SUs; (7) the SU returns the sequence number for key k to the router; (8) the router returns the sequence number to the client]
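The write path can be modeled with a toy broker that delivers updates to every subscriber in publish order; all class names are made up for illustration:

```python
class Broker:
    """Toy stand-in for YMB: in-order delivery to all subscribers."""
    def __init__(self):
        self.subscribers = []

    def publish(self, msg):
        # In PNUTS the message is logged, then delivered in the same order
        # everywhere; here we just deliver synchronously in order.
        for s in self.subscribers:
            s.apply(msg)

class Replica:
    def __init__(self, broker):
        self.data = {}
        broker.subscribers.append(self)

    def apply(self, msg):
        key, seq, value = msg
        self.data[key] = (seq, value)

class Master(Replica):
    """The record's master assigns sequence numbers and publishes updates."""
    def __init__(self, broker):
        super().__init__(broker)
        self.broker = broker
        self.seq = {}

    def write(self, key, value):
        self.seq[key] = self.seq.get(key, 0) + 1      # per-record sequence #
        self.broker.publish((key, self.seq[key], value))
        return self.seq[key]                           # returned to the client

broker = Broker()
master = Master(broker)
remote = Replica(broker)
master.write("k", "v1")
master.write("k", "v2")
```

Because every update to a record funnels through its master before being published, all replicas see that record's updates in the same order.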


YAHOO MESSAGE BROKER

- Distributed publish-subscribe service
- Guarantees delivery once a message is published: the message is logged at the site where it is published, and at other sites when received
- Guarantees that messages published to a particular cluster will be delivered in the same order at all other clusters
- Record updates are published to YMB by the master copy (record-level mastering); all replicas subscribe to the updates and receive them in the same order for a particular record

ASYNCHRONOUS REPLICATION


OTHER FEATURES

Per-record transactions.

Copying a tablet (e.g., on failure):
- Request a copy
- Publish a checkpoint message
- Get a copy of the tablet as of when the checkpoint is received
- Apply later updates

Tablet splits have to be coordinated across all copies.
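The checkpoint-based copy amounts to snapshot-plus-replay; a minimal sketch, with all names illustrative:

```python
def copy_tablet(snapshot, update_log, checkpoint_pos):
    """snapshot: the tablet's contents as of the checkpoint message;
    update_log: ordered (key, value) updates from the broker;
    checkpoint_pos: index of the first update after the checkpoint."""
    copy = dict(snapshot)                    # state when the checkpoint arrived
    for key, value in update_log[checkpoint_pos:]:
        copy[key] = value                    # replay updates that came later
    return copy

snap = {"a": 1, "b": 2}
log = [("a", 1), ("b", 2), ("b", 3), ("c", 4)]
print(copy_tablet(snap, log, 2))
```

The checkpoint message marks where in the broker's ordered stream the snapshot was taken, so the copy never misses or double-applies an update.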

QUERY PROCESSING

Range scans can span tablets and are performed by the scatter-gather engine (in the router). Only one tablet is scanned at a time, since the client may not need all results at once. A continuation object is returned to the client to indicate where the range scan should continue.

Notifications use one pub-sub topic per tablet. The client knows about tables, not tablets, and is automatically subscribed to all of a table's tablets, even as tablets are added or removed. The usual pub-sub problem of undelivered notifications is handled in the usual way.
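The continuation object can be sketched as the key at which the next batch should resume; the function name and batch size here are illustrative:

```python
def scan(sorted_keys, start, limit=2):
    """Return up to `limit` keys from `start` onward, plus a continuation:
    the key to resume from, or None when the scan is finished."""
    batch = [k for k in sorted_keys if k >= start][:limit + 1]
    if len(batch) > limit:
        return batch[:limit], batch[limit]     # continuation = next key
    return batch, None

keys = ["Apple", "Banana", "Kiwi", "Lime", "Mango"]
batch, cont = scan(keys, "Apple")
while cont is not None:                         # client resumes at its own pace
    more, cont = scan(keys, cont)
    batch += more
```

Because the continuation is just a key, the server keeps no per-scan state between batches.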

PNUTS APPLICATIONS

User Database
- Millions of active Yahoo! users: user profiles, IM buddy lists
- Record timeline, relaxed consistency
- Hosted DB: many apps sharing the same data

Social and Web 2.0 Apps
- Rapidly evolving and expanding: flexible schema
- Connections in a social graph: ordered table abstraction

Content Metadata
- Bulk data in a distributed FS, metadata in PNUTS
- Supports high-performance operations like file creation, deletion, and renaming

Listings Management
- Comparison shopping (sorted by price, rating, etc.)
- Ordered tables and views: data sorted by price, ratings, etc.

Session Data
- Large session-state storage
- PNUTS as a service: easy access to a session store

EXPERIMENTAL SETUP

- Production PNUTS code, enhanced with the ordered table type
- Three PNUTS regions: 2 west coast, 1 east coast
- 5 storage units, 2 message brokers, 1 router
- West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
- East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
- Workload: 1200–3600 requests/second, 0–50% writes, 80% locality

INSERTS

- 75.6 ms per insert in West 1 (tablet master)
- 131.5 ms per insert into the non-master West 2
- 315.5 ms per insert into the non-master East

10% writes by default


SCALABILITY

[Graph: average latency (ms, 0–160) vs. number of storage units (1–6), for hash table and ordered table]

REQUEST SKEW

[Graph: average latency (ms, 0–100) vs. Zipf parameter (0–1), for hash table and ordered table]

SIZE OF RANGE SCANS

[Graph: average latency (ms, 0–8000) vs. fraction of table scanned (0–0.12), for 30 clients and 300 clients]

RELATED WORK

- Distributed and parallel databases, especially query processing and transactions
- BigTable, Dynamo, S3, SimpleDB, SQL Server Data Services, Cassandra
- Distributed filesystems: Ceph, Boxwood, Sinfonia
- Distributed (P2P) hash tables: Chord, Pastry, …
- Database replication: master-slave, epidemic/gossip, synchronous, …

CONCLUSIONS AND ONGOING WORK

PNUTS is an interesting research product:
- Research: consistency, performance, fault tolerance, rich functionality
- Product: make it work, keep it (relatively) simple, learn from experience and real applications

Ongoing work:
- Indexes and materialized views
- Bundled updates
- Batch query processing

SUMMARY

Aim of PNUTS: rich database functionality with low latency at massive scale.

Tradeoffs between functionality, performance, and scalability:
- Asynchronous replication: low write latency
- Consistency model: useful guarantees without sacrificing scalability
- Hosted service: minimizes operating costs for applications
- Limited features: preserving reliability and scale

Novel aspects:
- Per-record timeline consistency with asynchronous replication
- Message broker as both replication mechanism and redo log
- Flexible mapping of tablets to storage units: automatic failover, load balancing

THANK YOU!

Questions??