CS294, Yelick Applications, p1
CS 294-8: Applications of Reliable Distributed Systems
http://www.cs.berkeley.edu/~yelick/294
CS294, Yelick Applications, p2
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p3
Specialization vs. Single Solution
• What does Grapevine do?
  – Message delivery (mail)
  – Naming, authentication, & access control
  – Resource location
• What does Porcupine do?
  – Primarily a mail server
  – Uses DNS for naming
• Difference in distributed systems infrastructure
CS294, Yelick Applications, p4
Grapevine Prototype
• 1983 configuration
  – 17 servers (typically Altos: 128 KB memory, 5 MB disk, 30 µs procedure call)
  – 4400 individuals and 1500 groups
  – 8500 messages, 35,000 receptions per day
• Designed for up to 30 servers, 10K users
• Used as the actual mail server at PARC
  – Grew from 5 to 30 servers over 3 years
CS294, Yelick Applications, p5
Porcupine Prototype
• 30-node PC cluster (not-quite-identical nodes)
  – Linux 2.2.7
  – 42,000 lines of C++ code
  – 100 Mb/s Ethernet + 1 Gb/s hubs
• Designed for up to 1 billion messages/day
• Synthetic load
CS294, Yelick Applications, p6
Functional Homogeneity
• Any node can perform any function
  – Why did they consider abandoning it in Grapevine? Other Internet services?

[Diagram: Principle → Technique → Goals. The principle of functional homogeneity supports the techniques of replication, automatic reconfiguration, and dynamic scheduling, which in turn serve the goals of availability, manageability, and performance.]
CS294, Yelick Applications, p7
Evolutionary Growth Principle
• To grow over time, a system must use scalable data structures and algorithms
• Given p nodes:
  – O(1) memory per node
  – O(1) time to perform important operations
• This is “ideal” but often impractical
  – E.g., O(log p) may be fine
• Each order of magnitude “feels” different: under 10, 100, 1K, 10K
CS294, Yelick Applications, p8
Separation of Hard/Soft State
• Hard state: information that cannot be lost; it must use stable storage, and replication is also used to increase its availability
  – E.g., message bodies, passwords
• Soft state: information that can be reconstructed from hard state; not replicated (except for performance)
  – E.g., the list of nodes containing a user’s mail
CS294, Yelick Applications, p9
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p10
Sending a Message: Grapevine
• The user calls the Grapevine User Package (GVUP) on their own machine
• GVUP broadcasts, looking for servers
• A name server returns a list of registration servers
• GVUP selects one and sends the mail to it
• The mail server looks up the name in the “to” field
• It connects to the server holding the primary or secondary inbox for that name

[Diagram: the client’s GVUP talks to a registration server, then to a mail server, which connects to the primary or secondary inbox server.]
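A minimal sketch of this delivery path; the server names, the inbox table, and the helper functions below are hypothetical stand-ins, not Grapevine’s actual interfaces:

```python
import random

REGISTRATION_SERVERS = ["gv-reg-1", "gv-reg-2"]       # located via broadcast
INBOX_SITES = {"bob.pa": ("server-A", "server-B")}    # (primary, secondary)

def gvup_send(recipient: str, body: str) -> str:
    """Client side: pick any server and hand the message to it."""
    server = random.choice(REGISTRATION_SERVERS)      # any node accepts mail
    return server_accept(server, recipient, body)

def server_accept(server: str, recipient: str, body: str) -> str:
    """Server side: look up the recipient, then forward to an inbox site."""
    primary, secondary = INBOX_SITES[recipient]       # registration lookup
    for site in (primary, secondary):                 # fall back on failure
        if store_in_inbox(site, recipient, body):
            return site
    raise RuntimeError("no inbox site reachable")

def store_in_inbox(site: str, recipient: str, body: str) -> bool:
    print(f"{site}: queued message for {recipient}")
    return True

print(gvup_send("bob.pa", "hello"))
```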
CS294, Yelick Applications, p11
Replication in Grapevine
• Sender-side replication
  – Any node can accept any mail
• Receiver-side replication
  – Every user has 2 copies of their inbox
• Message bodies are not replicated
  – Stored on disk; almost always recoverable
  – Message bodies are shared among recipients (4.7x sharing on average)
• What conclusions can you draw?
CS294, Yelick Applications, p12
Reliability Limits in Grapevine
• Only one copy of each message body
• Direct connection between the mail server and (one of 2) inbox machines
• Others?
CS294, Yelick Applications, p13
Limits to Scaling in Grapevine
• Every registration server knows the names of all (15 KB for 17 nodes):
  – Registration servers
  – Registries: logical groups/mailing lists
  – Could add hierarchy for scaling
• Resource discovery
  – Linear search through all servers to find the “closest”
  – How important is distance?
CS294, Yelick Applications, p14
Configuration Questions
• When to add servers:
  – When load is too high
  – When the network is unreliable
• Where to distribute registries
• Where to distribute inboxes
• All decisions in Grapevine are made by humans
  – Some rules of thumb, e.g., for registration and mail servers: the primary inbox is local, the secondary is nearby, and a third is at the other end of the internet
  – Is there something fundamental here? Handling node failures and link failures (partitions)?
CS294, Yelick Applications, p15
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p16
Porcupine Structures
• Mailbox fragment: chunk of mail messages for 1 user (hard)
• Mail map: list of nodes containing fragments for each user (soft)
• User profile db: names, passwords, … (hard)
• User profile soft state: copy of the profile, used for performance (soft)
• User map: maps a user name (hashed) to the node currently storing that user’s mail map and profile
• Cluster membership: nodes currently available (soft, but replicated)
Saito, 99
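As a concrete rendering of these structures, here is a dictionary-based Python sketch annotated with each structure’s hard/soft classification; Porcupine itself is C++, and the field shapes here are assumptions for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hard state: survives crashes on this node's disk.
    mailbox_fragments: dict = field(default_factory=dict)  # user -> [messages]
    user_profile_db: dict = field(default_factory=dict)    # user -> password, ...

    # Soft state: reconstructable from hard state after a failure.
    mail_map: dict = field(default_factory=dict)           # user -> {nodes with fragments}
    profile_cache: dict = field(default_factory=dict)      # user -> cached profile
    user_map: list = field(default_factory=list)           # hash bucket -> managing node
    membership: set = field(default_factory=set)           # nodes believed alive
```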
CS294, Yelick Applications, p17
Porcupine Architecture

[Diagram: every node (A, B, …, Z) runs the same components: SMTP, POP, and IMAP servers; a load balancer; RPC; a user manager holding the mail map and user-profile soft state; a mailbox manager; a user DB manager; a replication manager; a membership manager; and the user map.]

Saito, 99
CS294, Yelick Applications, p18
Porcupine Operations

[Diagram: delivery of one message across nodes A, B, C, …, in four phases:]
1. Protocol handling: an Internet client says “send mail to bob”; DNS round-robin selects a node, e.g., B
2. User lookup: B asks “Who manages bob?”; the user map answers node A
3. B asks A to “verify bob” against the user profile
4. A replies “OK, bob has msgs on C and D” (the mail map)
5. Load balancing: B picks the best node to store the new message, e.g., C
6. Message store: B tells C to “store msg”
Saito, 99
CS294, Yelick Applications, p19
Basic Data Structures

[Diagram: the name “bob” is run through the hash function to index the user map, an array such as B C A C A B A C that is replicated on every node. The selected entry names the node holding bob’s mail map and user info (e.g., one node holds bob: {A,C} and ann: {B}; another holds suzy: {A,C} and joe: {B}). The mail maps point at the mailbox storage on nodes A, B, and C (bob’s and suzy’s messages on A and C; joe’s and ann’s on B).]
Saito, 99
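A minimal sketch of this lookup path, with hypothetical names and an assumed 8-bucket user map; it also shows how the soft mail maps can be derived from the hard mailbox fragments:

```python
import hashlib

NODES = ["A", "B", "C"]
USER_MAP = [NODES[i % 3] for i in range(8)]   # bucket -> managing node, replicated everywhere
FRAGMENTS = {                                 # hard state: node -> users with mail stored there
    "A": {"bob", "suzy"},
    "B": {"joe", "ann"},
    "C": {"bob", "suzy"},
}

def manager_of(user: str) -> str:
    """Apply the hash function to find the node managing this user."""
    bucket = int(hashlib.sha1(user.encode()).hexdigest(), 16) % len(USER_MAP)
    return USER_MAP[bucket]

def build_mail_maps() -> dict:
    """Soft state: each manager's mail map, derived from the hard fragments."""
    maps = {n: {} for n in NODES}
    for node, users in FRAGMENTS.items():
        for user in users:
            maps[manager_of(user)].setdefault(user, set()).add(node)
    return maps

MAIL_MAPS = build_mail_maps()

def fragment_nodes(user: str) -> set:
    """Which nodes hold this user's mailbox fragments?"""
    return MAIL_MAPS[manager_of(user)].get(user, set())

print(fragment_nodes("bob"))   # {'A', 'C'}
```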
CS294, Yelick Applications, p20
Performance in Porcupine
• Goal: scale performance linearly with cluster size
• Strategy: avoid creating hot spots
  – Partition data uniformly among nodes
  – Fine-grain data partitioning
Saito, 99
CS294, Yelick Applications, p21
How does Performance Scale?

[Graph: messages/second (0-800) vs. cluster size (0-30 nodes). Porcupine’s throughput grows with cluster size, reaching about 68M messages/day at 30 nodes; a conventional sendmail+popd setup reaches only about 25M/day.]
Saito, 99
CS294, Yelick Applications, p22
Availability in Porcupine
• Goals:
  – Maintain function after failures
  – React quickly to changes, regardless of cluster size
  – Graceful performance degradation/improvement
• Strategy: two complementary mechanisms
  – Hard state (email messages, user profiles): optimistic, fine-grain replication
  – Soft state (user map, mail map): reconstruction after membership changes
Saito, 99
CS294, Yelick Applications, p23
Soft-state Reconstruction

[Diagram: timeline of nodes A and B recovering after a membership change. Step 1, membership protocol: the user map is recomputed, reassigning buckets owned by the failed node (e.g., B C A B A B A C becomes B A A B A B A B). Step 2, distributed disk scan: the mail map entries (bob: {A,C}, joe: {C}, suzy: {A,B}, ann: {B}) are rebuilt on their new managing nodes.]
Saito, 99
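A sketch of these two recovery phases under assumed data layouts; the function names and bucket count are hypothetical, and real nodes would scan their local disks in parallel rather than in one loop:

```python
def recompute_user_map(live_nodes, buckets=8):
    # 1. Membership protocol ran; deterministically reassign hash buckets.
    return [live_nodes[i % len(live_nodes)] for i in range(buckets)]

def rebuild_mail_maps(live_nodes, fragments, user_map):
    # 2. Distributed disk scan: every node reports the users found in its
    #    local mailbox fragments; the new managers reassemble the mail maps.
    def manager_of(user):
        return user_map[hash(user) % len(user_map)]
    mail_maps = {n: {} for n in live_nodes}
    for node in live_nodes:
        for user in fragments.get(node, ()):
            mail_maps[manager_of(user)].setdefault(user, set()).add(node)
    return mail_maps

# Node C fails: recover using only the hard state left on A and B.
fragments = {"A": {"bob", "suzy"}, "B": {"joe", "ann"}}
user_map = recompute_user_map(["A", "B"])
print(rebuild_mail_maps(["A", "B"], fragments, user_map))
```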
CS294, Yelick Applications, p24
How does Porcupine React to Configuration Changes?

[Graph: messages/second (roughly 300-700) over an 800-second run, comparing no failure against one, three, and six node failures. Throughput drops when nodes fail, recovers once the new membership is determined, dips again when the nodes recover, and returns to normal after the next membership agreement.]
Saito, 99
CS294, Yelick Applications, p25
Hard-state Replication
• Goals:
  – Keep serving hard state after failures
  – Handle unusual failure modes
• Strategy: exploit Internet semantics
  – Optimistic, eventually consistent replication
  – Per-message, per-user-profile replication
  – Efficient during normal operation
  – Small window of inconsistency
Saito, 99
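A toy sketch of optimistic, eventually consistent replication in this style — acknowledge after the first durable write, then push to the remaining replicas in the background. The classes and their interfaces are assumptions for illustration, not Porcupine’s API:

```python
class Replica:
    """Stand-in for a node's stable message store."""
    def __init__(self, name):
        self.name, self.log = name, {}
    def write(self, msg_id, body):
        self.log[msg_id] = body

class ReplicatedStore:
    def __init__(self, replicas):
        self.replicas = replicas      # e.g., two nodes chosen per message
        self.pending = []             # updates not yet on every replica

    def store(self, msg_id, body):
        first, *rest = self.replicas
        first.write(msg_id, body)     # durable on one node: ack the sender now
        self.pending.append((msg_id, body, list(rest)))

    def propagate(self):
        """Background pass; the window of inconsistency closes here."""
        still_pending = []
        for msg_id, body, targets in self.pending:
            remaining = [r for r in targets if not try_write(r, msg_id, body)]
            if remaining:             # a replica was down: retry next pass
                still_pending.append((msg_id, body, remaining))
        self.pending = still_pending

def try_write(replica, msg_id, body):
    try:
        replica.write(msg_id, body)
        return True
    except OSError:
        return False

store = ReplicatedStore([Replica("A"), Replica("C")])
store.store("msg-1", "hi bob")        # acked after A's write completes
store.propagate()                     # C converges shortly afterwards
```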
CS294, Yelick Applications, p26
How Efficient is Replication?

[Graph: messages/second (0-800) vs. cluster size (0-30 nodes). Without replication, Porcupine reaches about 68M messages/day at 30 nodes; with each message replicated on 2 nodes, about 24M/day.]
Saito, 99
CS294, Yelick Applications, p27
How Efficient is Replication?

[Graph: the same experiment with a third configuration. Adding NVRAM to the replication=2 case raises throughput from about 24M to about 33M messages/day, versus 68M/day without replication.]
Saito, 99
CS294, Yelick Applications, p28
Load Balancing: Deciding Where to Store Messages
• Goals:
  – Handle skewed workloads well
  – Support hardware heterogeneity
  – No voodoo parameter tuning
• Strategy: spread-based load balancing (see the sketch below)
  – Spread: a soft limit on the number of nodes per mailbox
    • Large spread → better load balance
    • Small spread → better affinity
  – Load is balanced within the spread
  – The number of pending I/O requests is used as the load measure
Saito, 99
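A minimal sketch of spread-based selection; the node list, the load numbers, and the way candidates are derived from the hash are all assumptions, not the paper’s code:

```python
import hashlib

NODES = ["A", "B", "C", "D", "E"]
PENDING_IO = {"A": 4, "B": 1, "C": 7, "D": 2, "E": 5}   # load measure (stubbed)
SPREAD = 2                                               # soft limit per mailbox

def spread_of(user: str) -> list:
    """Deterministic candidate nodes for this user's new fragments."""
    h = int(hashlib.sha1(user.encode()).hexdigest(), 16)
    return [NODES[(h + i) % len(NODES)] for i in range(SPREAD)]

def choose_store_node(user: str) -> str:
    # Balance within the spread: fewest pending I/O requests wins.
    return min(spread_of(user), key=PENDING_IO.__getitem__)

print(choose_store_node("bob"))   # least-loaded node within bob's spread
```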
CS294, Yelick Applications, p29
How Well does Porcupine Support Heterogeneous Clusters?

[Graph: throughput increase (0-30%) vs. number of fast nodes (0-10% of the cluster). With spread=4, a few fast nodes raise throughput by up to 16.8M messages/day (+25%); with static partitioning, by only 0.5M/day (+0.8%).]
Saito, 99
CS294, Yelick Applications, p30
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p31
Other Approaches

[Diagram: alternative architectures compared on manageability, and on manageability & availability per dollar: monolithic server, cluster-based OS, distributed file system plus front end, and static partitioning — where does Porcupine fall?]
CS294, Yelick Applications, p32
Consistency
• Both systems use distribution and replication to achieve their goals
• Ideally, these should be properties of the implementation, not the interface, i.e., they should be transparent
• A common definition of “reasonable” behavior is transaction (ACID) semantics
CS294, Yelick Applications, p33
ACID Properties
• Atomicity: a transaction’s changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.
• Consistency: a transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.
• Isolation: even though transactions execute concurrently, it appears to each transaction T that the others executed either before T or after T, but not both.
• Durability: once a transaction completes successfully (commits), its changes to the state survive failures.
Reuter
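To make atomicity and durability concrete, here is a generic toy transaction, not drawn from either system: writes are buffered, logged to stable storage, and then made visible all at once. Everything here — the class, the log format — is illustrative only:

```python
import json

class MiniTxn:
    def __init__(self, db, log_path="txn.log"):
        self.db, self.log_path, self.writes = db, log_path, {}

    def write(self, key, value):
        self.writes[key] = value              # buffered: invisible until commit

    def commit(self):
        with open(self.log_path, "a") as log: # durability: redo record first,
            log.write(json.dumps(self.writes) + "\n")
            log.flush()                       # so recovery can replay it
        self.db.update(self.writes)           # atomicity: apply all at once

    def abort(self):
        self.writes.clear()                   # none of the changes happen

db = {}
t = MiniTxn(db)
t.write("bob/pwd", "secret")
t.write("bob/quota", 100)
t.commit()                                    # both visible, or neither
```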
CS294, Yelick Applications, p34
Consistency in Grapevine
• Operations in Grapevine are not atomic
  – Add a name; put the name on a list
    • Visible failure: the name is not yet available for the 2nd operation
    • Could stick with a single server per session?
    • A problem for sysadmins, not general users
  – Add a user to a distribution list; send mail to the list
    • A problem for general users
    • Invisible failure: mail is not delivered to someone
  – Distributed garbage collection (GC) is a well-known, hard problem
    • Removing unused distribution lists is a related problem
CS294, Yelick Applications, p35
Human Intervention
• Grapevine has two types of operators
  – Basic administrators
  – Experts
• In what ways is Porcupine easier to administer?
  – Automatic load balancing
  – Both do some dynamic resource discovery
CS294, Yelick Applications, p36
Agenda
• Design principles
• Specifics of Grapevine
• Specifics of Porcupine
• Comparisons
• Other applications besides mail
CS294, Yelick Applications, p37
Characteristics of Mail
• Scale: commercial services handle 10M messages per day
• Write-intensive, so the following don’t work:
  – Stateless transformation
  – Web caching
• Consistency requirements are fairly weak
  – Compared to file systems or databases
CS294, Yelick Applications, p38
Other Applications
• How would support for other applications differ?
  – Web servers
  – File servers
  – Mobile network services
  – Sensor network services
• Read-mostly, write-mostly, or both?
• Disconnected operation (IMAP)
• Continuous vs. discrete input
CS294, Yelick Applications, p39
Harvest and Yield
• Yield: the probability of completing a query
• Harvest: the (application-specific) fidelity of the answer
  – Fraction of the data represented?
  – Precision?
  – Semantic proximity?
• Harvest/yield questions (see the sketch below):
  – When can we trade harvest for yield to improve availability?
  – How do we measure the harvest “threshold” below which a response is not useful?
Copyright Fox, 1999
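A small sketch of measuring harvest for a scatter-gather query: yield counts whether the query completed at all, while harvest is the fraction of data partitions that actually contributed to the answer. The partition names and the notion of "alive" below are assumptions:

```python
def query(partitions, alive):
    answered = [p for p in partitions if p in alive]   # partitions that responded
    harvest = len(answered) / len(partitions)          # fidelity of this answer
    return answered, harvest

partitions = ["shard0", "shard1", "shard2", "shard3"]
answered, harvest = query(partitions, alive={"shard0", "shard2", "shard3"})
print(f"harvest = {harvest:.0%}")   # 75%: a degraded answer instead of an error
```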
CS294, Yelick Applications, p40
Search Engine
• Stripe the database randomly across all nodes; replicate high-priority data
  – Random striping: worst case == average case
  – Replication: high-priority data is unlikely to be lost
  – Harvest: the fraction of nodes reporting
• Questions (see the sketch below)…
  – Why not just wait for all nodes to report back?
  – Should harvest be reported to the end user?
  – What is the “useful” harvest threshold?
  – Is nondeterminism a problem?
• Trade harvest for yield/throughput
Copyright Fox, 1999
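A sketch of this striping scheme under assumed parameters: documents are striped randomly across nodes, high-priority documents get two copies, and a query’s harvest is computed from whichever nodes answer. The node names and priority rule are hypothetical:

```python
import random

NODES = ["n0", "n1", "n2", "n3"]

def place(doc_id, high_priority):
    copies = 2 if high_priority else 1            # replicate important data
    return random.sample(NODES, copies)           # random striping

placement = {d: place(d, d < 10) for d in range(100)}   # docs 0-9 high priority

def harvest(alive):
    """Fraction of documents still reachable with only `alive` nodes reporting."""
    served = sum(any(n in alive for n in nodes) for nodes in placement.values())
    return served / len(placement)

print(f"harvest with n3 down: {harvest({'n0', 'n1', 'n2'}):.0%}")
```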
CS294, Yelick Applications, p41
General Questions
• What do both systems do to achieve:
  – Parallelism (scalability)?
    • Partitioned data structures
  – Locality (performance)?
    • Replication, scheduling of related tasks/data
  – Reliability?
    • Replication, stable storage
• What are the trade-offs?
CS294, Yelick Applications, p42
Administrivia
• Read the wireless (Baker) paper for 9/7
  – Short discussion next Thursday, 9/7 (4:30-5:00 only)
• Read the Network Objects paper for Tuesday
• How to get the Mitzenmacher paper for next week
  – Read the tornado codes paper as well, if interested