CS294, Yelick Applications, p1
CS 294-8: Applications of Reliable Distributed Systems
http://www.cs.berkeley.edu/~yelick/294
CS294, Yelick Applications, p2
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p3
Specialization vs. Single Solution
• What does Grapevine do?
  – Message delivery (mail)
  – Naming, authentication, & access control
  – Resource location
• What does Porcupine do?
  – Primarily a mail server
  – Uses DNS for naming
• Difference in distributed systems infrastructure
CS294, Yelick Applications, p4
Grapevine Prototype
• 1983 configuration
  – 17 servers (typically Altos: 128 KB memory, 5 MB disk, 30 µs procedure call)
  – 4400 individuals and 1500 groups
  – 8500 messages, 35,000 receptions per day
• Designed for up to 30 servers, 10K users
• Used as the actual mail server at PARC
  – Grew from 5 to 30 servers over 3 years
CS294, Yelick Applications, p5
Porcupine Prototype
• 30-node PC cluster (not-quite-identical nodes)
  – Linux 2.2.7
  – 42,000 lines of C++ code
  – 100 Mb/s Ethernet + 1 Gb/s hubs
• Designed for up to 1 billion messages/day
• Synthetic load
CS294, Yelick Applications, p6
Functional Homogeneity
• Any node can perform any function
  – Why did they consider abandoning it in Grapevine? Other Internet services?

[Diagram: Principle → Technique → Goals. The principle of functional homogeneity supports the techniques of replication, automatic reconfiguration, and dynamic scheduling, which in turn serve the goals of availability, manageability, and performance.]
CS294, Yelick Applications, p7
Evolutionary Growth Principle
• To grow over time, a system must use scalable data structures and algorithms
• Given p nodes:
  – O(1) memory per node
  – O(1) time to perform important operations
• This is “ideal” but often impractical
  – E.g., O(log p) may be fine
• Each order of magnitude “feels” different: under 10, 100, 1K, 10K
CS294, Yelick Applications, p8
Separation of Hard/Soft State
• Hard state: information that cannot be lost; it must use stable storage, and replication is also used to increase its availability
  – E.g., message bodies, passwords
• Soft state: information that can be reconstructed from hard state; not replicated (except for performance)
  – E.g., the list of nodes containing a user’s mail
CS294, Yelick Applications, p9
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p10
Sending a Message: Grapevine
• The user calls the Grapevine User Package (GVUP) on their own machine
• GVUP broadcasts, looking for servers
• A name server returns a list of registration servers
• GVUP selects one and sends the mail to it
• The mail server looks up the name in the “to” field
• It connects to the server holding the primary or secondary inbox for that name

[Diagram: the client’s GVUP talks to a registration server, then to a mail server, which connects to the primary or secondary inbox server.]
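A minimal sketch of this delivery path; the server names, the inbox table, and the helper functions below are hypothetical stand-ins, not Grapevine’s actual interfaces:

```python
import random

REGISTRATION_SERVERS = ["gv-reg-1", "gv-reg-2"]       # located via broadcast
INBOX_SITES = {"bob.pa": ("server-A", "server-B")}    # (primary, secondary)

def gvup_send(recipient: str, body: str) -> str:
    """Client side: pick any server and hand the message to it."""
    server = random.choice(REGISTRATION_SERVERS)      # any node accepts mail
    return server_accept(server, recipient, body)

def server_accept(server: str, recipient: str, body: str) -> str:
    """Server side: look up the recipient, then forward to an inbox site."""
    primary, secondary = INBOX_SITES[recipient]       # registration lookup
    for site in (primary, secondary):                 # fall back on failure
        if store_in_inbox(site, recipient, body):
            return site
    raise RuntimeError("no inbox site reachable")

def store_in_inbox(site: str, recipient: str, body: str) -> bool:
    print(f"{site}: queued message for {recipient}")
    return True

print(gvup_send("bob.pa", "hello"))
```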
CS294, Yelick Applications, p11
Replication in Grapevine
• Sender-side replication
  – Any node can accept any mail
• Receiver-side replication
  – Every user has 2 copies of their inbox
• Message bodies are not replicated
  – Stored on disk; almost always recoverable
  – Message bodies are shared among recipients (4.7x sharing on average)
• What conclusions can you draw?
CS294, Yelick Applications, p12
Reliability Limits in Grapevine
• Only one copy of each message body
• Direct connection between the mail server and (one of 2) inbox machines
• Others?
CS294, Yelick Applications, p13
Limits to Scaling in Grapevine
• Every registration server knows the names of all (15 KB for 17 nodes):
  – Registration servers
  – Registries: logical groups/mailing lists
  – Could add hierarchy for scaling
• Resource discovery
  – Linear search through all servers to find the “closest”
  – How important is distance?
CS294, Yelick Applications, p14
Configuration Questions
• When to add servers:
  – When load is too high
  – When the network is unreliable
• Where to distribute registries
• Where to distribute inboxes
• All decisions in Grapevine are made by humans
  – Some rules of thumb, e.g., for registration and mail servers: the primary inbox is local, the secondary is nearby, and a third is at the other end of the internet
  – Is there something fundamental here? Handling node failures and link failures (partitions)?
CS294, Yelick Applications, p15
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p16
Porcupine Structures
• Mailbox fragment: chunk of mail messages for 1 user (hard)
• Mail map: list of nodes containing fragments for each user (soft)
• User profile db: names, passwords, … (hard)
• User profile soft state: copy of the profile, used for performance (soft)
• User map: maps a user name (hashed) to the node currently storing that user’s mail map and profile
• Cluster membership: nodes currently available (soft, but replicated)
Saito, 99
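As a concrete rendering of these structures, here is a dictionary-based Python sketch annotated with each structure’s hard/soft classification; Porcupine itself is C++, and the field shapes here are assumptions for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hard state: survives crashes on this node's disk.
    mailbox_fragments: dict = field(default_factory=dict)  # user -> [messages]
    user_profile_db: dict = field(default_factory=dict)    # user -> password, ...

    # Soft state: reconstructable from hard state after a failure.
    mail_map: dict = field(default_factory=dict)           # user -> {nodes with fragments}
    profile_cache: dict = field(default_factory=dict)      # user -> cached profile
    user_map: list = field(default_factory=list)           # hash bucket -> managing node
    membership: set = field(default_factory=set)           # nodes believed alive
```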
CS294, Yelick Applications, p17
Porcupine Architecture

[Diagram: every node (A, B, …, Z) runs the same components: SMTP, POP, and IMAP servers; a load balancer; RPC; a user manager holding the mail map and user-profile soft state; a mailbox manager; a user DB manager; a replication manager; a membership manager; and the user map.]

Saito, 99
CS294, Yelick Applications, p18
Porcupine Operations

[Diagram: delivery of one message across nodes A, B, C, …, in four phases:]
1. Protocol handling: an Internet client says “send mail to bob”; DNS round-robin selects a node, e.g., B
2. User lookup: B asks “Who manages bob?”; the user map answers node A
3. B asks A to “verify bob” against the user profile
4. A replies “OK, bob has msgs on C and D” (the mail map)
5. Load balancing: B picks the best node to store the new message, e.g., C
6. Message store: B tells C to “store msg”
Saito, 99
CS294, Yelick Applications, p19
Basic Data Structures

[Diagram: the name “bob” is run through the hash function to index the user map, an array such as B C A C A B A C that is replicated on every node. The selected entry names the node holding bob’s mail map and user info (e.g., one node holds bob: {A,C} and ann: {B}; another holds suzy: {A,C} and joe: {B}). The mail maps point at the mailbox storage on nodes A, B, and C (bob’s and suzy’s messages on A and C; joe’s and ann’s on B).]
Saito, 99
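A minimal sketch of this lookup path, with hypothetical names and an assumed 8-bucket user map; it also shows how the soft mail maps can be derived from the hard mailbox fragments:

```python
import hashlib

NODES = ["A", "B", "C"]
USER_MAP = [NODES[i % 3] for i in range(8)]   # bucket -> managing node, replicated everywhere
FRAGMENTS = {                                 # hard state: node -> users with mail stored there
    "A": {"bob", "suzy"},
    "B": {"joe", "ann"},
    "C": {"bob", "suzy"},
}

def manager_of(user: str) -> str:
    """Apply the hash function to find the node managing this user."""
    bucket = int(hashlib.sha1(user.encode()).hexdigest(), 16) % len(USER_MAP)
    return USER_MAP[bucket]

def build_mail_maps() -> dict:
    """Soft state: each manager's mail map, derived from the hard fragments."""
    maps = {n: {} for n in NODES}
    for node, users in FRAGMENTS.items():
        for user in users:
            maps[manager_of(user)].setdefault(user, set()).add(node)
    return maps

MAIL_MAPS = build_mail_maps()

def fragment_nodes(user: str) -> set:
    """Which nodes hold this user's mailbox fragments?"""
    return MAIL_MAPS[manager_of(user)].get(user, set())

print(fragment_nodes("bob"))   # {'A', 'C'}
```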
CS294, Yelick Applications, p20
Performance in Porcupine
• Goal: scale performance linearly with cluster size
• Strategy: avoid creating hot spots
  – Partition data uniformly among nodes
  – Fine-grain data partitioning
Saito, 99
CS294, Yelick Applications, p21
How does Performance Scale?

[Graph: messages/second (0-800) vs. cluster size (0-30 nodes). Porcupine’s throughput grows with cluster size, reaching about 68M messages/day at 30 nodes; a conventional sendmail+popd setup reaches only about 25M/day.]
Saito, 99
CS294, Yelick Applications, p22
Availability in Porcupine
• Goals:
  – Maintain function after failures
  – React quickly to changes, regardless of cluster size
  – Graceful performance degradation/improvement
• Strategy: two complementary mechanisms
  – Hard state (email messages, user profiles): optimistic, fine-grain replication
  – Soft state (user map, mail map): reconstruction after membership changes
Saito, 99
CS294, Yelick Applications, p23
Soft-state Reconstruction

[Diagram: timeline of nodes A and B recovering after a membership change. Step 1, membership protocol: the user map is recomputed, reassigning buckets owned by the failed node (e.g., B C A B A B A C becomes B A A B A B A B). Step 2, distributed disk scan: the mail map entries (bob: {A,C}, joe: {C}, suzy: {A,B}, ann: {B}) are rebuilt on their new managing nodes.]
Saito, 99
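A sketch of these two recovery phases under assumed data layouts; the function names and bucket count are hypothetical, and real nodes would scan their local disks in parallel rather than in one loop:

```python
def recompute_user_map(live_nodes, buckets=8):
    # 1. Membership protocol ran; deterministically reassign hash buckets.
    return [live_nodes[i % len(live_nodes)] for i in range(buckets)]

def rebuild_mail_maps(live_nodes, fragments, user_map):
    # 2. Distributed disk scan: every node reports the users found in its
    #    local mailbox fragments; the new managers reassemble the mail maps.
    def manager_of(user):
        return user_map[hash(user) % len(user_map)]
    mail_maps = {n: {} for n in live_nodes}
    for node in live_nodes:
        for user in fragments.get(node, ()):
            mail_maps[manager_of(user)].setdefault(user, set()).add(node)
    return mail_maps

# Node C fails: recover using only the hard state left on A and B.
fragments = {"A": {"bob", "suzy"}, "B": {"joe", "ann"}}
user_map = recompute_user_map(["A", "B"])
print(rebuild_mail_maps(["A", "B"], fragments, user_map))
```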
CS294, Yelick Applications, p24
How does Porcupine React to Configuration Changes?

[Graph: messages/second (roughly 300-700) over an 800-second run, comparing no failure against one, three, and six node failures. Throughput drops when nodes fail, recovers once the new membership is determined, dips again when the nodes recover, and returns to normal after the next membership agreement.]
Saito, 99
CS294, Yelick Applications, p25
Hard-state Replication
• Goals:
  – Keep serving hard state after failures
  – Handle unusual failure modes
• Strategy: exploit Internet semantics
  – Optimistic, eventually consistent replication
  – Per-message, per-user-profile replication
  – Efficient during normal operation
  – Small window of inconsistency
Saito, 99
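A toy sketch of optimistic, eventually consistent replication in this style — acknowledge after the first durable write, then push to the remaining replicas in the background. The classes and their interfaces are assumptions for illustration, not Porcupine’s API:

```python
class Replica:
    """Stand-in for a node's stable message store."""
    def __init__(self, name):
        self.name, self.log = name, {}
    def write(self, msg_id, body):
        self.log[msg_id] = body

class ReplicatedStore:
    def __init__(self, replicas):
        self.replicas = replicas      # e.g., two nodes chosen per message
        self.pending = []             # updates not yet on every replica

    def store(self, msg_id, body):
        first, *rest = self.replicas
        first.write(msg_id, body)     # durable on one node: ack the sender now
        self.pending.append((msg_id, body, list(rest)))

    def propagate(self):
        """Background pass; the window of inconsistency closes here."""
        still_pending = []
        for msg_id, body, targets in self.pending:
            remaining = [r for r in targets if not try_write(r, msg_id, body)]
            if remaining:             # a replica was down: retry next pass
                still_pending.append((msg_id, body, remaining))
        self.pending = still_pending

def try_write(replica, msg_id, body):
    try:
        replica.write(msg_id, body)
        return True
    except OSError:
        return False

store = ReplicatedStore([Replica("A"), Replica("C")])
store.store("msg-1", "hi bob")        # acked after A's write completes
store.propagate()                     # C converges shortly afterwards
```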
CS294, Yelick Applications, p26
How Efficient is Replication?

[Graph: messages/second (0-800) vs. cluster size (0-30 nodes). Without replication, Porcupine reaches about 68M messages/day at 30 nodes; with each message replicated on 2 nodes, about 24M/day.]
Saito, 99
CS294, Yelick Applications, p27
How Efficient is Replication?

[Graph: the same experiment with a third configuration. Adding NVRAM to the replication=2 case raises throughput from about 24M to about 33M messages/day, versus 68M/day without replication.]
Saito, 99
CS294, Yelick Applications, p28
Load Balancing: Deciding Where to Store Messages
• Goals:
  – Handle skewed workloads well
  – Support hardware heterogeneity
  – No voodoo parameter tuning
• Strategy: spread-based load balancing (see the sketch below)
  – Spread: a soft limit on the number of nodes per mailbox
    • Large spread → better load balance
    • Small spread → better affinity
  – Load is balanced within the spread
  – The number of pending I/O requests is used as the load measure
Saito, 99
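A minimal sketch of spread-based selection; the node list, the load numbers, and the way candidates are derived from the hash are all assumptions, not the paper’s code:

```python
import hashlib

NODES = ["A", "B", "C", "D", "E"]
PENDING_IO = {"A": 4, "B": 1, "C": 7, "D": 2, "E": 5}   # load measure (stubbed)
SPREAD = 2                                               # soft limit per mailbox

def spread_of(user: str) -> list:
    """Deterministic candidate nodes for this user's new fragments."""
    h = int(hashlib.sha1(user.encode()).hexdigest(), 16)
    return [NODES[(h + i) % len(NODES)] for i in range(SPREAD)]

def choose_store_node(user: str) -> str:
    # Balance within the spread: fewest pending I/O requests wins.
    return min(spread_of(user), key=PENDING_IO.__getitem__)

print(choose_store_node("bob"))   # least-loaded node within bob's spread
```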
CS294, Yelick Applications, p29
How Well does Porcupine Support Heterogeneous Clusters?

[Graph: throughput increase (0-30%) vs. number of fast nodes (0-10% of the cluster). With spread=4, a few fast nodes raise throughput by up to 16.8M messages/day (+25%); with static partitioning, by only 0.5M/day (+0.8%).]
Saito, 99
CS294, Yelick Applications, p30
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p31
Other Approaches

[Diagram: alternative architectures compared on manageability, and on manageability & availability per dollar: monolithic server, cluster-based OS, distributed file system plus front end, and static partitioning — where does Porcupine fall?]
CS294, Yelick Applications, p32
Consistency
• Both systems use distribution and replication to achieve their goals
• Ideally, these should be properties of the implementation, not the interface, i.e., they should be transparent
• A common definition of “reasonable” behavior is transaction (ACID) semantics
CS294, Yelick Applications, p33
ACID Properties
• Atomicity: a transaction’s changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.
• Consistency: a transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.
• Isolation: even though transactions execute concurrently, it appears to each transaction T that the others executed either before T or after T, but not both.
• Durability: once a transaction completes successfully (commits), its changes to the state survive failures.
Reuter
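To make atomicity and durability concrete, here is a generic toy transaction, not drawn from either system: writes are buffered, logged to stable storage, and then made visible all at once. Everything here — the class, the log format — is illustrative only:

```python
import json

class MiniTxn:
    def __init__(self, db, log_path="txn.log"):
        self.db, self.log_path, self.writes = db, log_path, {}

    def write(self, key, value):
        self.writes[key] = value              # buffered: invisible until commit

    def commit(self):
        with open(self.log_path, "a") as log: # durability: redo record first,
            log.write(json.dumps(self.writes) + "\n")
            log.flush()                       # so recovery can replay it
        self.db.update(self.writes)           # atomicity: apply all at once

    def abort(self):
        self.writes.clear()                   # none of the changes happen

db = {}
t = MiniTxn(db)
t.write("bob/pwd", "secret")
t.write("bob/quota", 100)
t.commit()                                    # both visible, or neither
```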
CS294, Yelick Applications, p34
Consistency in Grapevine
• Operations in Grapevine are not atomic
  – Add a name; put the name on a list
    • Visible failure: the name is not yet available for the 2nd operation
    • Could stick with a single server per session?
    • A problem for sysadmins, not general users
  – Add a user to a distribution list; send mail to the list
    • A problem for general users
    • Invisible failure: mail is not delivered to someone
  – Distributed garbage collection (GC) is a well-known, hard problem
    • Removing unused distribution lists is a related problem
CS294, Yelick Applications, p35
Human Intervention
• Grapevine has two types of operators
  – Basic administrators
  – Experts
• In what ways is Porcupine easier to administer?
  – Automatic load balancing
  – Both do some dynamic resource discovery
CS294, Yelick Applications, p36
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p37
Characteristics of Mail
• Scale: commercial services handle 10M messages per day
• Write-intensive, so the following don’t work:
  – Stateless transformation
  – Web caching
• Consistency requirements are fairly weak
  – Compared to file systems or databases
CS294, Yelick Applications, p38
Other Applications
• How would support for other applications differ?
  – Web servers
  – File servers
  – Mobile network services
  – Sensor network services
• Read-mostly, write-mostly, or both?
• Disconnected operation (IMAP)
• Continuous vs. discrete input
CS294, Yelick Applications, p39
Harvest and Yield
• Yield: the probability of completing a query
• Harvest: the (application-specific) fidelity of the answer
  – Fraction of the data represented?
  – Precision?
  – Semantic proximity?
• Harvest/yield questions (see the sketch below):
  – When can we trade harvest for yield to improve availability?
  – How do we measure the harvest “threshold” below which a response is not useful?
Copyright Fox, 1999
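A small sketch of measuring harvest for a scatter-gather query: yield counts whether the query completed at all, while harvest is the fraction of data partitions that actually contributed to the answer. The partition names and the notion of "alive" below are assumptions:

```python
def query(partitions, alive):
    answered = [p for p in partitions if p in alive]   # partitions that responded
    harvest = len(answered) / len(partitions)          # fidelity of this answer
    return answered, harvest

partitions = ["shard0", "shard1", "shard2", "shard3"]
answered, harvest = query(partitions, alive={"shard0", "shard2", "shard3"})
print(f"harvest = {harvest:.0%}")   # 75%: a degraded answer instead of an error
```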
CS294, Yelick Applications, p40
Search Engine
• Stripe the database randomly across all nodes; replicate high-priority data
  – Random striping: worst case == average case
  – Replication: high-priority data is unlikely to be lost
  – Harvest: the fraction of nodes reporting
• Questions (see the sketch below)…
  – Why not just wait for all nodes to report back?
  – Should harvest be reported to the end user?
  – What is the “useful” harvest threshold?
  – Is nondeterminism a problem?
• Trade harvest for yield/throughput
Copyright Fox, 1999
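A sketch of this striping scheme under assumed parameters: documents are striped randomly across nodes, high-priority documents get two copies, and a query’s harvest is computed from whichever nodes answer. The node names and priority rule are hypothetical:

```python
import random

NODES = ["n0", "n1", "n2", "n3"]

def place(doc_id, high_priority):
    copies = 2 if high_priority else 1            # replicate important data
    return random.sample(NODES, copies)           # random striping

placement = {d: place(d, d < 10) for d in range(100)}   # docs 0-9 high priority

def harvest(alive):
    """Fraction of documents still reachable with only `alive` nodes reporting."""
    served = sum(any(n in alive for n in nodes) for nodes in placement.values())
    return served / len(placement)

print(f"harvest with n3 down: {harvest({'n0', 'n1', 'n2'}):.0%}")
```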
CS294, Yelick Applications, p41
General Questions
• What do both systems do to achieve:
  – Parallelism (scalability)?
    • Partitioned data structures
  – Locality (performance)?
    • Replication, scheduling of related tasks/data
  – Reliability?
    • Replication, stable storage
• What are the trade-offs?
CS294, Yelick Applications, p42
Administrivia
• Read the wireless (Baker) paper for 9/7
  – Short discussion next Thursday, 9/7 (4:30-5:00 only)
• Read the Network Objects paper for Tuesday
• How to get the Mitzenmacher paper for next week
  – Read the tornado codes paper as well, if interested