
CS294, Yelick Applications, p1

CS 294-8
Applications of Reliable Distributed Systems
http://www.cs.berkeley.edu/~yelick/294

CS294, Yelick Applications, p2

Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail

CS294, Yelick Applications, p3

Specialization vs. Single Solution
• What does Grapevine do?
  – Message delivery (mail)
  – Naming, authentication, & access control
  – Resource location
• What does Porcupine do?
  – Primarily a mail server
  – Uses DNS for naming
• Difference in distributed systems infrastructure

CS294, Yelick Applications, p4

Grapevine Prototype
• 1983 configuration
  – 17 servers (typically Altos)
    • 128 KB memory, 5 MB disk, 30 usec procedure call
  – 4400 individuals and 1500 groups
  – 8500 messages, 35,000 receptions per day
• Designed for up to
  – 30 servers, 10K users
• Used as actual mail server at PARC
  – Grew from 5 to 30 servers over 3 years

CS294, Yelick Applications, p5

Porcupine Prototype
• 30-node PC cluster (not-quite-identical)
  – Linux 2.2.7
  – 42,000 lines of C++ code
  – 100 Mb/s Ethernet + 1 Gb/s hubs
• Designed for up to
  – 1 billion messages/day
• Synthetic load

CS294, Yelick Applications, p6

Functional Homogeneity
• Any node can perform any function
  – Why did they consider abandoning it in Grapevine? Other internet services?

Principle: Functional homogeneity
Techniques: Replication; Automatic reconfiguration; Dynamic scheduling
Goals: Availability; Manageability; Performance

CS294, Yelick Applications, p7

Evolutionary Growth Principle
• To grow over time, a system must use scalable data structures and algorithms
• Given p nodes
  – O(1) memory per node
  – O(1) time to perform important operations
• This is “ideal” but often impractical
  – E.g., O(log(p)) may be fine
• Each order of magnitude “feels” different: under 10, 100, 1K, 10K
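The per-node cost difference can be seen with back-of-the-envelope arithmetic. The sketch below (in Python, with illustrative user counts that are not figures from either paper) contrasts a fully replicated directory, where every node stores every name, with a hash-partitioned one, where each node stores only about users/p names:

```python
# Illustrative arithmetic, not figures from the papers: per-node directory
# entries under full replication (Grapevine-style registration servers)
# versus hash partitioning (Porcupine-style user map).

def entries_full_replication(users: int, nodes: int) -> int:
    # Every node stores every name: O(users) per node, regardless of p.
    return users

def entries_partitioned(users: int, nodes: int) -> int:
    # Names are spread evenly across nodes: O(users / p) per node.
    return -(-users // nodes)  # ceiling division

for p in (10, 100, 1000):
    print(p, entries_full_replication(100_000, p), entries_partitioned(100_000, p))
```

Under full replication, adding nodes never shrinks per-node state, which is one reason the Grapevine design tops out at tens of servers.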

CS294, Yelick Applications, p8

Separation of Hard/Soft State
• Hard state: information that cannot be lost; must use stable storage. Also uses replication to increase availability.
  – E.g., message bodies, passwords
• Soft state: information that can be reconstructed from hard state; not replicated (except for performance).
  – E.g., list of nodes containing a user’s mail

CS294, Yelick Applications, p9

Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail

CS294, Yelick Applications, p10

Sending a Message: Grapevine
• User calls Grapevine User Package (GVUP) on own machine
• GVUP broadcasts looking for servers
• Name server returns list of registration servers
• GVUP selects one and sends mail to it
• Mail server looks up name in “to” field
• Connects to server with primary or secondary inbox for that name

[Diagram: client running GVUP → registration server → mail server → servers holding the primary and secondary inboxes]
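The steps above can be walked through as a toy sketch. All server names and table layouts below are invented for illustration; real Grapevine resolved these via broadcast and its registration database:

```python
# Toy walk-through of the Grapevine send path. Names are illustrative.

# The name server would return the list of registration servers (steps 2-3).
REGISTRATION_SERVERS = ["reg1", "reg2"]
# Registration data: recipient name -> (primary inbox server, secondary).
INBOX_SITES = {"birrell.pa": ("mail1", "mail2")}

def send_message(to_name: str, body: str) -> str:
    reg = REGISTRATION_SERVERS[0]              # step 4: GVUP picks one server
    primary, secondary = INBOX_SITES[to_name]  # step 5: look up the "to" name
    # Step 6: connect to the server holding the primary inbox
    # (a real client would fall back to the secondary on failure).
    return primary

assert send_message("birrell.pa", "hello") == "mail1"
```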

CS294, Yelick Applications, p11

Replication in Grapevine
• Sender-side replication
  – Any node can accept any mail
• Receiver-side replication
  – Every user has 2 copies of their inbox
• Message bodies are not replicated
  – Stored on disk; almost always recoverable
  – Message bodies are shared among recipients (4.7x sharing on average)
• What conclusions can you draw?

CS294, Yelick Applications, p12

Reliability Limits in Grapevine
• Only one copy of message body
• Direct connection between mail server and (one of 2) inbox machines
• Others?

CS294, Yelick Applications, p13

Limits to Scaling in Grapevine
• Every registration server knows the names of all (15 KB for 17 nodes)
  – Registration servers
  – Registries: logical groups/mailing lists
  – Could add hierarchy for scaling
• Resource discovery
  – Linear search through all servers to find the “closest”
  – How important is distance?

CS294, Yelick Applications, p14

Configuration Questions
• When to add servers:
  – When load is too high
  – When network is unreliable
• Where to distribute registries
• Where to distribute inboxes
• All decisions made by humans in Grapevine
  – Some rules of thumb, e.g., for registration and mail servers, the primary inbox is local, the second is nearby, and a third is at the other end of the internet.
  – Is there something fundamental here? Handling node failures and link failures (partitions)?

CS294, Yelick Applications, p15

Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail

CS294, Yelick Applications, p16

Porcupine Structures
• Mailbox fragment: chunk of mail messages for 1 user (hard)
• Mail map: list of nodes containing fragments for each user (soft)
• User profile db: names, passwords, … (hard)
• User profile soft state: copy of profile, used for performance (soft)
• User map: maps user name (hashed) to node currently storing mail map and profile
• Cluster membership: nodes currently available (soft, but replicated)

Saito, 99
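The structures above can be sketched as a per-node record, with hard vs. soft state noted in comments. Field names and types here are illustrative, not the paper's actual C++ definitions:

```python
# Sketch of Porcupine's per-node state; layouts are illustrative.
from dataclasses import dataclass, field

@dataclass
class PorcupineNode:
    mailbox_fragments: dict = field(default_factory=dict)  # user -> [msgs] (hard)
    mail_map: dict = field(default_factory=dict)           # user -> {nodes} (soft)
    user_profile_db: dict = field(default_factory=dict)    # user -> profile (hard)
    profile_cache: dict = field(default_factory=dict)      # profile copies (soft)
    user_map: list = field(default_factory=list)           # hash bucket -> node
    membership: set = field(default_factory=set)           # live nodes (soft, replicated)

node = PorcupineNode(user_map=["B", "C", "A", "C"])
node.mailbox_fragments["bob"] = ["msg1"]   # hard: written to stable storage
node.mail_map["bob"] = {"A"}               # soft: reconstructible by disk scan
```

Only the two hard fields must survive crashes; everything marked soft can be dropped and rebuilt, which is what makes reconfiguration cheap.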

CS294, Yelick Applications, p17

Porcupine Architecture

[Diagram: nodes A, B, …, Z, each running an SMTP server, POP server, IMAP server, mailbox manager, user DB manager, user manager, replication manager, membership manager, and load balancer, communicating via RPC; per-node soft state includes the user map, mail map, and user profile]

Saito, 99

CS294, Yelick Applications, p18

Porcupine Operations
Protocol handling → user lookup → load balancing → message store

1. “send mail to bob” arrives from the Internet (DNS-RR selects a node)
2. Who manages bob? → A (user map lookup)
3. “Verify bob” (sent to A)
4. “OK, bob has msgs on C and D”
5. Pick the best node to store the new msg → C
6. “Store msg” (sent to C)

Saito, 99

CS294, Yelick Applications, p19

Basic Data Structures

[Diagram: the user map is a small table (B C A C A B A C) replicated on every node. Applying the hash function to “bob” selects an entry naming the node holding bob’s mail map and user info. Mail maps/user info: bob: {A,C} and ann: {B} on one node; suzy: {A,C} and joe: {B} on another. Mailbox storage: node A holds Bob’s and Suzy’s msgs; node B holds Joe’s and Ann’s msgs; node C holds Bob’s and Suzy’s msgs.]

Saito, 99
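The lookup in the figure can be sketched in a few lines: hashing a user name into the replicated user-map table yields the node managing that user's mail map and profile. The hash function and table contents below are illustrative:

```python
# Sketch of the Porcupine user-map lookup; table contents are illustrative.
import hashlib

USER_MAP = ["B", "C", "A", "C", "A", "B", "A", "C"]  # bucket -> managing node

def managing_node(user: str) -> str:
    # Hash the user name into a bucket of the fixed-size user map.
    bucket = int(hashlib.sha1(user.encode()).hexdigest(), 16) % len(USER_MAP)
    return USER_MAP[bucket]

# Every node computes the same answer, since the user map is replicated.
assert managing_node("bob") == managing_node("bob")
assert managing_node("bob") in {"A", "B", "C"}
```

Because the table is small and identical everywhere, any node can route a request for any user in O(1) without consulting a central directory.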

CS294, Yelick Applications, p20

Performance in Porcupine
Goal: Scale performance linearly with cluster size

Strategy: Avoid creating hot spots
• Partition data uniformly among nodes
• Fine-grain data partition

Saito, 99

CS294, Yelick Applications, p21

How does Performance Scale?

[Graph: messages/second (0–800) vs. cluster size (0–30 nodes). Porcupine scales roughly linearly, reaching about 68M messages/day at 30 nodes; sendmail+popd reaches about 25M messages/day.]

Saito, 99

CS294, Yelick Applications, p22

Availability in Porcupine

Goals:
• Maintain function after failures
• React quickly to changes regardless of cluster size
• Graceful performance degradation / improvement

Strategy: Two complementary mechanisms
• Hard state (email messages, user profile): optimistic fine-grain replication
• Soft state (user map, mail map): reconstruction after membership change

Saito, 99

CS294, Yelick Applications, p23

Soft-state Reconstruction

[Diagram timeline for nodes A, B, C across a membership change: 1. the membership protocol triggers user-map recomputation on every node; 2. a distributed disk scan rebuilds the soft mail maps (e.g., bob: {A,C}, joe: {C}, suzy: {A,B}, ann: {B}) on the nodes that now manage those users.]

Saito, 99
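Step 2 above can be sketched directly: once the user map is recomputed, each node rebuilds the soft mail map from the hard mailbox fragments reported by the distributed disk scan. The data layout below is illustrative:

```python
# Sketch of mail-map reconstruction from hard state; layout is illustrative.

def rebuild_mail_map(fragments_on_disk: dict) -> dict:
    # fragments_on_disk: node -> set of users with a mailbox fragment
    # stored there (hard state, found by the disk scan).
    mail_map: dict = {}
    for node, users in fragments_on_disk.items():
        for user in users:
            mail_map.setdefault(user, set()).add(node)
    return mail_map

# Using the example from the figure:
scan = {"A": {"bob", "suzy"}, "B": {"ann", "suzy"}, "C": {"bob", "joe"}}
assert rebuild_mail_map(scan)["bob"] == {"A", "C"}
assert rebuild_mail_map(scan)["suzy"] == {"A", "B"}
```

The key property is that the soft map is a pure function of the hard fragments, so losing it costs only a rescan, never mail.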

CS294, Yelick Applications, p24

How does Porcupine React to Configuration Changes?

[Graph: messages/second (300–700) vs. time (0–800 seconds), comparing no failure with one, three, and six node failures. Throughput drops when nodes fail, recovers once new membership is determined, dips again as nodes recover, and returns to normal after membership is redetermined.]

Saito, 99

CS294, Yelick Applications, p25

Hard-state Replication

Goals:
• Keep serving hard state after failures
• Handle unusual failure modes

Strategy: Exploit Internet semantics
• Optimistic, eventually consistent replication
• Per-message, per-user-profile replication
• Efficient during normal operation
• Small window of inconsistency

Saito, 99

CS294, Yelick Applications, p26

How Efficient is Replication?

[Graph: messages/second (0–800) vs. cluster size (0–30 nodes). Porcupine with no replication reaches 68M messages/day at 30 nodes; with replication=2, 24M messages/day.]

Saito, 99

CS294, Yelick Applications, p27

How Efficient is Replication?

[Graph: same axes as the previous slide, adding a third line: with replication=2 plus NVRAM, throughput improves to 33M messages/day, vs. 68M with no replication and 24M with replication=2.]

Saito, 99

CS294, Yelick Applications, p28

Load Balancing: Deciding Where to Store Messages

Goals:
• Handle skewed workload well
• Support hardware heterogeneity
• No voodoo parameter tuning

Strategy: Spread-based load balancing
• Spread: soft limit on # of nodes per mailbox
  – Large spread → better load balance
  – Small spread → better affinity
• Load balanced within spread
• Use # of pending I/O requests as the load measure

Saito, 99
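The strategy above can be sketched in a few lines: candidate nodes are capped at the spread, and within the spread the node with the fewest pending I/O requests wins. The candidate ordering and numbers below are illustrative:

```python
# Sketch of spread-based load balancing; details are illustrative.

def pick_store_node(candidates: list, pending_io: dict, spread: int = 4) -> str:
    # candidates: nodes already holding the user's fragments first (affinity),
    # padded with other nodes; only the first `spread` are considered.
    within_spread = candidates[:spread]
    # Load measure: number of pending I/O requests on each node.
    return min(within_spread, key=lambda n: pending_io[n])

load = {"A": 12, "B": 3, "C": 7, "D": 1}
# Spread of 2 keeps affinity: only A and B are considered; B is least loaded.
assert pick_store_node(["A", "B", "C", "D"], load, spread=2) == "B"
# Spread of 4 trades affinity for balance: D wins.
assert pick_store_node(["A", "B", "C", "D"], load, spread=4) == "D"
```

The two assertions show the trade-off on the slide directly: enlarging the spread improves balance but scatters a user's mailbox across more nodes.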

CS294, Yelick Applications, p29

How Well does Porcupine Support Heterogeneous Clusters?

[Graph: throughput increase (0–30%) vs. number of fast nodes (0–10% of total). With spread=4, adding 10% fast nodes yields +16.8M messages/day (+25%); with static partitioning, only +0.5M/day (+0.8%).]

Saito, 99

CS294, Yelick Applications, p30

Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail

CS294, Yelick Applications, p31

Other Approaches
• Monolithic server
• Cluster-based OS
• Distributed file system & frontend
• Static partitioning
• Porcupine?

[Diagram comparing the alternatives on manageability and on manageability & availability per dollar]

CS294, Yelick Applications, p32

Consistency
• Both systems use distribution and replication to achieve their goals
• Ideally, these should be properties of the implementation, not the interface, i.e., they should be transparent
• A common definition of “reasonable” behavior is transaction (ACID) semantics

CS294, Yelick Applications, p33

ACID Properties
• Atomicity: A transaction’s changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.
• Consistency: A transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.
• Isolation: Even though transactions execute concurrently, it appears to each transaction T that others executed either before T or after T, but not both.
• Durability: Once a transaction completes successfully (commits), its changes to the state survive failures.

Reuter

CS294, Yelick Applications, p34

Consistency in Grapevine
• Operations in Grapevine are not atomic
  – Add name; put name on list
    • Visible failure: name not available for 2nd op
    • Could stick with single server per session?
    • Problem for sysadmins, not general users
  – Add user to distribution list; mail to list
    • Problem for general users
    • Invisible failure: mail not delivered to someone
  – Distributed garbage collection (GC) is a well-known, hard problem
    • Removing unused distribution lists is related

CS294, Yelick Applications, p35

Human Intervention
• Grapevine has two types of operators
  – Basic administrators
  – Experts
• In what ways is Porcupine easier to administer?
  – Automatic load balancing
  – Both do some dynamic resource discovery

CS294, Yelick Applications, p36

Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail

CS294, Yelick Applications, p37

Characteristics of Mail
• Scale: commercial services handle 10M messages per day
• Write-intensive; the following don’t work:
  – Stateless transformation
  – Web caching
• Consistency requirements fairly weak
  – Compared to file systems or databases

CS294, Yelick Applications, p38

Other Applications
• How would support for other applications differ?
  – Web servers
  – File servers
  – Mobile network services
  – Sensor network services
• Read-mostly, write-mostly, or both
• Disconnected operation (IMAP)
• Continuous vs. discrete input

CS294, Yelick Applications, p39

Harvest and Yield
• Yield: probability of completing a query
• Harvest: (application-specific) fidelity of the answer
  – Fraction of data represented?
  – Precision?
  – Semantic proximity?
• Harvest/yield questions:
  – When can we trade harvest for yield to improve availability?
  – How to measure the harvest “threshold” below which a response is not useful?

Copyright Fox, 1999
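The two metrics above can be made concrete for a query fanned out across a cluster. The threshold value in this sketch is illustrative and, as the slide notes, application-specific:

```python
# Sketch of harvest/yield bookkeeping; the threshold is illustrative.

def harvest(nodes_reporting: int, nodes_total: int) -> float:
    # Fraction of the data represented in the answer.
    return nodes_reporting / nodes_total

def useful(h: float, threshold: float = 0.5) -> bool:
    # Below some harvest threshold, a degraded answer is not worth returning.
    return h >= threshold

assert harvest(28, 32) == 0.875
assert useful(harvest(28, 32)) and not useful(harvest(8, 32))
```

Returning the 87.5%-harvest answer instead of waiting for the last four nodes is exactly the harvest-for-yield trade the slide asks about.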

CS294, Yelick Applications, p40

Search Engine
• Stripe database randomly across all nodes, replicate high-priority data
  – Random striping: worst case == average case
  – Replication: high-priority data unlikely to be lost
  – Harvest: fraction of nodes reporting
• Questions…
  – Why not just wait for all nodes to report back?
  – Should harvest be reported to the end user?
  – What is the “useful” harvest threshold?
  – Is nondeterminism a problem?
• Trade harvest for yield/throughput

Copyright Fox, 1999
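The placement policy above can be sketched as follows: documents are striped randomly across nodes so the worst case equals the average case, and high-priority documents get extra replicas so they are unlikely to be lost. The priority rule and replica count here are illustrative:

```python
# Sketch of random striping with priority replication; details illustrative.
import random

def place(doc_id: str, nodes: list, high_priority: bool, replicas: int = 2) -> list:
    rng = random.Random(doc_id)        # deterministic placement per document
    k = replicas if high_priority else 1
    return rng.sample(nodes, k)        # distinct nodes, chosen "randomly"

NODES = ["n0", "n1", "n2", "n3"]
assert len(place("doc-a", NODES, high_priority=True)) == 2
assert len(place("doc-b", NODES, high_priority=False)) == 1
assert place("doc-a", NODES, True) == place("doc-a", NODES, True)  # repeatable
```

Seeding the RNG with the document id keeps placement repeatable without a directory, while still spreading unrelated documents uniformly.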

CS294, Yelick Applications, p41

General Questions
• What do both systems do to achieve:
  – Parallelism (scalability)?
    • Partitioned data structures
  – Locality (performance)?
    • Replication, scheduling of related tasks/data
  – Reliability?
    • Replication, stable storage
• What are the trade-offs?

CS294, Yelick Applications, p42

Administrivia
• Read wireless (Baker) paper for 9/7
  – Short discussion next Thursday 9/7 (4:30–5:00 only)
• Read Network Objects paper for Tuesday
• How to get the Mitzenmacher paper for next week
  – Read tornado codes as well, if interested