1
Compositional Design and Analysis of Timing-Based Distributed Algorithms
Nancy Lynch
Theory of Distributed Systems
MIT
Third MURI Workshop
Arlington-Ballston, Virginia
December 10, 2002
2
MIT Participants
• Leader
  – Nancy Lynch
• Postdoctoral associates
  – Idit Keidar, Dilsun Kirli
• Graduate students
  – Roger Khazan, Carl Livadas, Ziv Bar-Joseph, Rui Fan, Seth Gilbert, Sayan Mitra
• Collaborators
  – Alex Shvartsman and students, Frits Vaandrager, Roberto Segala
3
Project Overview
• Design and analyze distributed algorithms that implement global services with strong guarantees, e.g.:
  – Reliable communication
  – Strongly coherent data services
• Dynamic environment, where processes join, leave, and fail.
• Algorithms composed of sub-algorithms.
• Analyze performance conditionally, under various assumptions about timing and failures.
• Develop underlying mathematical modeling framework, based on interacting state machines (IOA, TIOA), capable of:
  – Describing precisely all the algorithms we study.
  – Supporting compositional and conditional analysis.
4
Algorithms Studied
• Scalable group communication [Khazan, Keidar]
• Early-delivery dynamic atomic broadcast [Bar-Joseph, Keidar, Lynch]
• Reconfigurable atomic memory [Lynch, Shvartsman]
• Scalable reliable multicast [Livadas, Keidar, Lynch]
• In progress:
  – Reconfigurable atomic memory
  – Peer-to-peer: Fault-tolerant location services, data services
  – Mobile: Topology control, clock synchronization, tracking
5
This Talk
I. Completed work:
  – Scalable group communication
  – Early-delivery dynamic atomic broadcast
II. Reconfigurable atomic memory
III. Reliable multicast
IV. Modeling framework
V. Plans for the next two years
6
I. Completed Work: Scalable Group Communication
[Keidar, Khazan 00, 02], [Khazan 02], [Keidar, Khazan, Lynch, Shvartsman 02], [Taraschanskiy 00]
7
Group Communication Services
• Cope with changing participants using abstract groups of client processes with changing membership sets.
• Processes communicate with group members indirectly, by sending messages to the group as a whole.
• GC services support management of groups:
  – Maintain membership information.
    • Form new views in response to changes.
  – Manage communication.
    • Communication respects views.
    • Provide guarantees about ordering and reliability of message delivery.
    • Virtual synchrony
• Applications: Managing replicated data; distributed multiplayer games; collaborative work
8
Scalable GC Algorithm
• Specification, including virtual synchrony.
• New algorithm:
  – Uses a scalable membership service, implemented on a small set of membership servers.
  – Multicast implemented on all the nodes.
  – View change uses only one round for state exchange, in parallel with membership service’s agreement on views.
  – Participants can join during view formation.
[Figure: GCS, Net, and Memb components]
9
Analysis
• Safety proofs, using incremental proof methods.
• Liveness proofs.
• Performance analysis:
  – Time from when network stabilizes until GCS announces final view.
  – Message latency.
  – Conditional analysis, based on input, failure, and timing assumptions.
  – Compositional analysis, based on performance of membership service and Net.
• Modeled and analyzed data-management application running on top of the new GCS.
• Distributed implementation [Taraschanskiy 00].
[Figure: S, S’, A, A’]
10
Completed Work: Early-Delivery Dynamic Atomic Broadcast
[Bar-Joseph, Keidar, Lynch 02]
11
Dynamic Atomic Broadcast
• Atomic broadcast, where processes may join, leave, or fail.
• Safety: Sending, receiving orders are consistent with a single global message ordering (no gaps).
• Liveness: Eventual completion of joins, leaves. Eventual delivery, including the process’ own messages.
• Fast delivery, even with joins, leaves.
• Application: Distributed multiplayer interactive games.
[Figure: DAB service with actions join, leave, mcast(m), join-ack, leave-ack, rcv(m)]
12
Implementing DAB
• Processes:
  – Timing-dependent, have approximately-synchronized clocks.
• Net:
  – Pairwise FIFO delivery
  – Low latency
  – But does not guarantee a single total order, nor that all processes see the same messages from a failing process.
[Figure: DAB over Net, with actions join and net-join]
13
Dynamic Atomic Bcast Algorithm
• Processes coordinate message delivery:
  – Divide time into slots using local clock, assign messages to slots.
  – Deliver messages in order of (slot, sender id).
  – Determine members of each slot, deliver only messages from members.
• Processes must agree on slot membership:
  – Joining process selects join-slot, informs others.
  – Similarly for leaving process.
  – Failed process results in consensus on failure slot.
• Requires a new kind of consensus service: Consensus with Uncertain Participants (CUP).
  – Participants not known a priori.
  – Each participant submits its perceived “World”.
  – Processes may abstain.
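The delivery rule above (order by (slot, sender id), deliver only messages from slot members) can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the `Message` type, the `delivery_order` function, and the `members_by_slot` map are hypothetical names, and slot membership is assumed to have been agreed on already (the part that CUP is needed for).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    slot: int      # time slot assigned from the sender's local clock
    sender: int    # sender's process id
    payload: str


def delivery_order(messages, members_by_slot):
    """Return messages in (slot, sender id) order, keeping only those
    whose sender is an agreed member of the message's slot.

    members_by_slot is a hypothetical map: slot -> set of member ids.
    """
    eligible = [m for m in messages
                if m.sender in members_by_slot.get(m.slot, set())]
    return sorted(eligible, key=lambda m: (m.slot, m.sender))


msgs = [Message(2, 1, "c"), Message(1, 3, "b"),
        Message(1, 2, "a"), Message(2, 4, "d")]
members = {1: {2, 3}, 2: {1}}   # process 4 is not a member of slot 2
ordered = [m.payload for m in delivery_order(msgs, members)]
```

Because every process sorts by the same key and filters by the same agreed membership, all processes derive the same sequence; the message from process 4 is dropped because it is not a member of slot 2.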
14
The DAB Algorithm Using CUP
[Figure: components DABi1, DABi2, CUP(j), and Net of DAB, with fail actions]
15
Consensus with Uncertain Participants
• CUP Problem:
  – Guarantees agreement, validity, termination.
  – Assumes submitted worlds are “close”:
    • Process that initiates is in other processes’ worlds.
    • Process in anyone’s world initiates, abstains, leaves, or fails.
• CUP Algorithm
  – A new early-stopping consensus algorithm.
  – Similar to [Dolev, Reischuk, Strong 90], but:
    • Tolerates uncertainty about participants.
    • Tolerates processes leaving.
  – Terminates in two rounds when failures stop, even if leaves continue.
  – Latency linear in number of actual failures.
16
Analysis
• Compositional analysis: Properties of CUP used to prove properties of DAB:
  – Safety: CUP agreement and validity imply DAB atomic broadcast consistency guarantees.
  – Liveness: CUP safety and liveness properties (e.g., termination) imply DAB liveness properties (e.g., eventual delivery).
  – Latency: CUP decision bounds imply DAB message delay bounds.
• Message latency:
  – No failures: Constant, even when participants join and leave.
  – With failures: Linear in the number of failures.
  – Improves upon algorithms using group communication.
17
II. Reconfigurable Atomic Memory for Dynamic Distributed Environments
[Lynch, Shvartsman 02]
18
Reconfigurable Atomic Memory
• Implement atomic read/write shared memory in a dynamic network setting.
  – Participants may join, leave, fail.
  – Mobile networks, peer-to-peer networks.
• High availability, low latency.
• Atomicity for all patterns of asynchrony and change.
• Good performance under reasonable limits on asynchrony and change.
• Applications:
  – Battle data for teams of soldiers in military operation.
  – Game data for players in multiplayer game.
19
Approach: Dynamic Quorums
• Objects are replicated at several network locations.
• To accommodate small, transient changes:
  – Uses quorum configurations: members, read-quorums, write-quorums.
  – Maintains atomicity during stable situations.
  – Allows concurrency.
• To handle larger, more permanent changes:
  – Reconfigure.
  – Maintains atomicity across configuration changes.
  – Any configuration can be installed at any time.
  – Reconfigure concurrently with reads/writes; no heavyweight view change.
20
RAMBO
• Reconfigurable Atomic Memory for Basic Objects (dynamic atomic read/write shared memory).
• Global service specification:
  [Figure: RAMBO service]
• Algorithm:
  – Reads and writes objects.
  – Chooses new configurations, notifies members.
  – Identifies, garbage-collects obsolete configurations.
  – All concurrently.
21
RAMBO Algorithm Structure
• Main algorithm + reconfiguration service
• Loosely coupled
• Recon service:
  – Provides a consistent sequence of configurations.
• Main algorithm:
  – Handles reading, writing.
  – Receives, disseminates new configuration information; no formal installation.
  – Garbage-collects old configurations.
  – Reads/writes may use several configurations.
[Figure: RAMBO, with Recon and Net components]
22
Main algorithm: Reading and Writing
• Run a version of the standard static two-phase quorum-based read/write algorithm [Vitanyi, Awerbuch], [Attiya, Bar-Noy, Dolev].
• Use all current configurations.
[Figure: read and write actions; Net; Recon with new-config]
23
Static Read/write Protocol
• Quorum configuration:
  – read-quorums, write-quorums
  – For any R in read-quorums, W in write-quorums, R ∩ W ≠ ∅.
• Replicate object at all locations.
• At each location, keep:
  – value
  – tag = (sequence number, location)
• Read, write use two phases:
  – Phase 1: Read (value, tag) from a read-quorum
  – Phase 2: Write (value, tag) to a write-quorum
• Highly concurrent.
• Quorum intersection implies atomicity
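The quorum intersection requirement can be checked mechanically for a small configuration. The sketch below is only an illustration; the function name `is_valid_configuration` and the representation of quorums as Python sets of location ids are assumptions, not part of the talk.

```python
from itertools import product


def is_valid_configuration(read_quorums, write_quorums):
    """True if every read-quorum intersects every write-quorum."""
    return all(set(r) & set(w)
               for r, w in product(read_quorums, write_quorums))


# Majority quorums over locations {1, 2, 3}: any two majorities overlap.
majorities = [{1, 2}, {1, 3}, {2, 3}]
ok = is_valid_configuration(majorities, majorities)

# A read-quorum {1} misses the write-quorum {2, 3}, so a read could
# fail to see the latest write.
bad = is_valid_configuration([{1}], [{2, 3}])
```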
24
Static Read/write Protocol Details
• Write at location i:
  – Phase 1:
    • Read (value, tag) from a read-quorum.
    • Determine largest seq-number among the tags that are read.
    • Choose new-tag := (larger sequence-number, i).
  – Phase 2:
    • Propagate (new-value, new-tag) to a write-quorum.
• Read at location i:
  – Phase 1:
    • Read (value, tag) from a read-quorum.
    • Determine the (value, tag) pair with the largest tag among those read.
  – Phase 2:
    • Propagate this (value, tag) to a write-quorum.
    • Return value.
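The two-phase protocol can be sketched as a sequential simulation. This is a simplified illustration under stated assumptions: operations run one at a time, there are no failures or message delays, and the names `Replica`, `query`, `propagate`, `write`, and `read` are hypothetical. Tags are compared lexicographically as (sequence number, location) pairs.

```python
class Replica:
    """State kept at one location: a value and its tag."""

    def __init__(self):
        self.value = None
        self.tag = (0, 0)   # (sequence number, location id)

    def store(self, value, tag):
        # Adopt the incoming pair only if its tag is strictly larger.
        if tag > self.tag:
            self.value, self.tag = value, tag


def query(replicas, quorum):
    """Phase 1: return the (tag, value) pair with the largest tag in the quorum."""
    return max(((replicas[j].tag, replicas[j].value) for j in quorum),
               key=lambda tv: tv[0])


def propagate(replicas, quorum, value, tag):
    """Phase 2: push (value, tag) to every member of the quorum."""
    for j in quorum:
        replicas[j].store(value, tag)


def write(replicas, i, new_value, read_quorum, write_quorum):
    (seq, _), _ = query(replicas, read_quorum)
    propagate(replicas, write_quorum, new_value, (seq + 1, i))


def read(replicas, read_quorum, write_quorum):
    tag, value = query(replicas, read_quorum)
    propagate(replicas, write_quorum, value, tag)
    return value


replicas = {j: Replica() for j in (1, 2, 3)}
write(replicas, 1, "x", read_quorum={1, 2}, write_quorum={2, 3})
write(replicas, 3, "y", read_quorum={1, 3}, write_quorum={1, 3})
result = read(replicas, read_quorum={1, 2}, write_quorum={1, 2})
```

In this run the second write sees tag (1, 1) at location 3 and chooses (2, 3); the read then finds that tag at location 1, returns "y", and also brings location 2 up to date in its second phase.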
25
Dynamic Read/write Protocol
• Perform two-phase static protocol, using all current configurations.
  – Phase 1: Collect object values from read-quorums of current configurations.
  – Phase 2: Propagate latest value to write-quorums of current configurations.
• When new configuration is provided by Recon:
  – Start using it too.
  – Do not abort reads/writes in progress, but do extra work to access additional processes needed for new quorums.
• Our communication mechanism:
  – Background gossiping
  – Terminate by fixed-point condition, involving a quorum from each active configuration.
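The termination condition for a phase can be illustrated with a small predicate: a phase is done once the set of processes heard from contains some quorum of every active configuration. This is a loose sketch of the idea only, not the RAMBO fixed-point condition itself; `phase_done` and the dictionary layout of a configuration are hypothetical.

```python
def phase_done(responders, active_configs, kind):
    """True when `responders` contains some `kind` quorum of every
    active configuration (kind is "read_quorums" or "write_quorums")."""
    return all(
        any(q <= responders for q in cfg[kind])
        for cfg in active_configs
    )


c1 = {"read_quorums": [frozenset({1, 2}), frozenset({2, 3})]}
c2 = {"read_quorums": [frozenset({4, 5})]}

before = phase_done({1, 2}, [c1], "read_quorums")
after_recon = phase_done({1, 2}, [c1, c2], "read_quorums")
extra_work = phase_done({1, 2, 4, 5}, [c1, c2], "read_quorums")
```

Here a read that had already satisfied configuration c1 is not aborted when c2 appears; it simply is not finished until a read-quorum of c2 has also responded.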
26
Removing Old Configurations
• Garbage-collect them in the background.
• Two-phase garbage-collection procedure:
  – Phase 1:
    • Inform write-quorum of old configuration about the new configuration.
    • Collect object values from read-quorum of the old configuration.
  – Phase 2:
    • Propagate the latest value to a write-quorum of the new configuration.
• Garbage-collection concurrent with reads/writes.
• Implemented using gossiping and fixed points.
27
Implementation of Recon
• Uses distributed consensus to determine successive configurations 1, 2, 3,…
• Members of old configuration propose new configuration.
• Proposals reconciled using consensus.
• Consensus is a heavyweight mechanism, but:
  – Only used for reconfigurations, infrequent.
  – Does not delay read/write operations.
[Figure: Recon, with Consensus and Net components]
28
Consensus Implementation
• Use a variant of timing-based Paxos algorithm [Lamport]
• Agreement, validity guaranteed absolutely (independent of timing).
• Termination guaranteed when underlying system stabilizes.
• Leader chosen using failure detectors; conducts two-phase algorithm with retries.
[Figure: Consensus service with actions init(v) and decide(v)]
29
Analysis
• We prove atomicity for arbitrary patterns of asynchrony and change, using partial order methods.
• Analyze performance conditionally, based on failure and timing assumptions.
• E.g., under reasonable “steady-state” assumptions:
  – Removing old configurations takes time at most 6d.
  – Reads and writes take time at most 8d.
• LAN implementation [Musial 02].
30
Other Approaches
• Use consensus to agree on total order of operations: [Lamport 89]
  – Not resilient to transient failures.
  – Termination of reads/writes depends on termination of consensus.
• Totally-ordered broadcast over group communication: [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96]
  – View formation takes a long time, delays reads/writes.
  – One change may trigger view formation.
31
III. Reliable Multicast Protocols
[Livadas, Keidar, Lynch 01], [Livadas, Keidar 02], [Livadas, Lynch 02]
32
Physical System Model
• Infinite number of symmetric hosts, i.e., same resources, processes
• Network of interconnected routers
• Failures: fail-stop host crashes and packet drops
[Figure: network of routers r1 to r6 connecting hosts h1 to h6]
33
Reliable Multicast Service (RMS)
Overview:
  – Single reliable multicast group & single client process/host
  – RM() encompasses behavior of all other processes on hosts and functionality of underlying network
  – The parameter of RM() bounds the reliable delivery delay
Membership:
  – A host becomes a member of the group upon the acknowledgment of its join request
  – A host ceases to be a member upon issuing a leave request
[Figure: RM-Client1 and RM-Client2 interacting with RM() through actions rm-join1, rm-join-ack1, rm-send1(p), rm-recv2(p), rm-leave1]
34
Multicast Reliability: Properties
Let h, s be hosts and p, p’ be packets from s such that p < p’.
Eventual Delivery: If p’ remains active forever after its transmission, h delivers p, and h remains a member thereafter, then h delivers p’.
Time-Bounded Delivery: Let T denote the time interval from the transmission time t of p’ to the point in time one delivery-delay bound (the parameter of RM()) past t. If p’ remains active throughout T, h delivers p prior to the expiration of T, and h remains a member thereafter within T, then h delivers p’ within T.
35
Reliable Multicast Implementation (RMI)
• Scalable Reliable Multicast (SRM) [Floyd et al. 97]
  – Retransmission-based protocol using NACKs
  – Uses best-effort IP multicast as communication primitive
• Augment SRM so as to precisely specify:
  – when a host becomes a member of the group
  – which packets each member should attempt to recover
36
SRM’s Recovery Scheme
• Each host schedules a request for each missing packet
• Any capable host schedules a reply to each such request
• Duplicate requests/replies limited using deterministic and probabilistic suppression schemes
[Figure: source s and hosts h, h’ exchanging a request (rqst) and a reply (repl)]
37
RMI Timed I/O Automaton Model
[Figure: RM-Client1 with RM-mem1, RM-rep1, RM-IPbuff1, RM-rec1 and RM-Client2 with RM-mem2, RM-rep2, RM-IPbuff2, RM-rec2, connected through IP-mcast]
38
Analysis of RMI
Correctness Analysis: RMI implements RMS; i.e., RMI delivers appropriate packets to appropriate members of the reliable multicast group as dictated by RMS.
Conditional Timeliness Analysis: Presuming no leaves, no crashes, bounded transmission latencies and latency estimates, bounded loss detection delays, and a fixed number k of packet drops per packet transmission/recovery, packets are guaranteed to be delivered within a delivery delay upper bound that depends on k.
39
Byproduct of RMI Timeliness Analysis
• Constraints on SRM scheduling parameters
  – C3 < C1: back-off abstinence does not affect next round requests
  – D1 + D2 + 2 < 2 C1: replies received prior to transmission of next round requests
  – D1 + D2 + D3 < 2 C1: requests not discarded due to prior round reply abstinence
• Violating these guidelines may lead to superfluous traffic and unwarranted recovery round failure
40
Caching-Enhanced SRM (CESRM)
• Enhance SRM with caching scheme
  – determines and caches optimal requestor/replier pair for each loss
  – expedites recovery of losses based on requestor/replier pair cache
[Figure: source s and hosts h, h’ exchanging an expedited request (exp-rqst) and an expedited reply (exp-repl)]
41
CESRM Timed I/O Automaton Model
[Figure: RM-Client1 with RM-mem1, RM-rep1, RM-IPbuff1, RM-rec1 and RM-Client2 with RM-mem2, RM-rep2, RM-IPbuff2, RM-rec2, connected through IP-mcast and IP-ucast]
42
CESRM: Conditional Timeliness Analysis
Definition: A cache hit corresponds to a recovery scenario in which:
  – hosts that share the loss also share optimal requestor-replier pair,
  – the optimal requestor shares the loss, and
  – the optimal replier does not share the loss.
Claim: For any execution where no recovery packets are dropped, cache hits lead to packet recovery within at most:
  $\text{DET-BOUND} + d_{\text{reorder-delay}} + 2d^{+}$
as opposed to:
  $\text{DET-BOUND} + (C_1 + C_2)\,d^{+} + d^{+} + (D_1 + D_2)\,d^{+} + d^{+}$
For C1=C2=D1=D2=1, worst-case recovery delay following detection reduced from ~3 RTT to ~1 RTT
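As a rough check of the ~3 RTT to ~1 RTT figure, the arithmetic can be worked out under assumptions that are not stated on the slide: each latency term in the two bounds is treated as a single one-way latency bound, written d here; the reordering delay is taken to be negligible; and RTT ≈ 2d. The detection term DET-BOUND is common to both bounds and is left out.

```latex
\begin{align*}
\text{without cache hit } (C_1 = C_2 = D_1 = D_2 = 1):\quad
  (C_1 + C_2)\,d + d + (D_1 + D_2)\,d + d
  &= 2d + d + 2d + d = 6d \approx 3\,\mathrm{RTT} \\
\text{with cache hit:}\quad
  d_{\text{reorder-delay}} + 2d
  &\approx 2d \approx 1\,\mathrm{RTT}
\end{align*}
```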
43
Estimating the Frequency of Cache Hits
• Analyzed 14 multicast transmission traces [Yajnik et al. 95/96]
• On average, ~1/3 of losses recoverable by expedited recoveries
• More precise identification of loss locations may lead to the recovery of ~1/2 of losses by expedited recoveries
[Figure: abstract loss location representation vs. actual loss location representation]
44
IV. Modeling Framework
• To support all this analysis, we need a well-designed mathematical foundation, capable of:
  – Describing all the algorithms we want to consider.
  – Supporting compositional and conditional analysis.
• We use a framework based on interacting state machines.
  – Basic asynchronous model (I/O automata)
  – Augmented models: Timed, hybrid (continuous/discrete), probabilistic.
45
I/O Automata [Lynch, Tuttle 87]
• Nondeterministic, infinite-state automata
  – States, start states
  – Actions: Input, output, internal
  – Transitions (s, a, s’)
  – Executions, traces
  – A implements B if traces(A) ⊆ traces(B)
• Describing system modularity:
  – Parallel composition
  – Levels of abstraction
• Reasoning methods:
  – Invariant assertions
  – Simulation relations
  – Compositional methods
• Used to describe asynchronous distributed algorithms.
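The "A implements B" relation can be illustrated on tiny finite automata by enumerating traces up to a step bound. This is only a bounded sketch under stated assumptions: the automata are finite, input-enabling is ignored, the tuple layout `(start_states, transitions, external_actions)` and the names `traces` and `implements` are hypothetical, and when B has internal actions a larger step bound for B may be needed for the comparison to be meaningful.

```python
def traces(automaton, max_steps):
    """External traces of all executions with at most max_steps steps.

    automaton = (start_states, transitions, external_actions), where
    transitions is a set of (state, action, next_state) triples.
    """
    starts, transitions, external = automaton
    frontier = {(s, ()) for s in starts}
    seen = set(frontier)
    for _ in range(max_steps):
        frontier = {
            # Internal actions change the state but not the trace.
            (s2, (trace + (a,)) if a in external else trace)
            for (s, trace) in frontier
            for (s1, a, s2) in transitions
            if s1 == s
        }
        seen |= frontier
    return {trace for (_, trace) in seen}


def implements(a, b, max_steps):
    """Bounded check of traces(a) being a subset of traces(b)."""
    return traces(a, max_steps) <= traces(b, max_steps)


ext = {"send", "recv"}
# A strictly alternates send and recv; B also allows repeated sends.
A = ({0}, {(0, "send", 1), (1, "recv", 0)}, ext)
B = ({0}, {(0, "send", 1), (1, "recv", 0), (1, "send", 1)}, ext)

a_implements_b = implements(A, B, 4)
b_implements_a = implements(B, A, 4)
```

A implements B here because every alternating trace is also allowed by B, while B does not implement A because B can produce the trace (send, send).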
46
Timed I/O Automata (TIOA) [Merritt, Modugno, Tuttle], [Lynch, Vaandrager]
• Add time-passage actions
• Used to describe:
  – Timeout-based algorithms.
  – Local clocks, clock synchronization.
  – Timing/performance characteristics.
47
Hybrid I/O Automata (HIOA) [Lynch, Segala, Vaandrager 01, 02]
• Automata with continuous and discrete transitions
  – States: Input, output, internal variables; start states
  – Actions: Input, output, internal
  – Discrete transitions (s, a, s’)
  – Trajectories τ, mapping time intervals to states
  – Execution τ0 a1 τ1 a2 τ2 …
  – Trace: Project on external variables, external actions.
  – A implements B if traces(A) ⊆ traces(B).
• Composition, levels of abstraction.
• Invariants, simulation relations, compositional reasoning
• Used to describe:
  – Controlled systems
  – Automated transportation systems
  – Embedded systems
48
Timed I/O Automata (TIOA), Revisited
[Lynch, Segala, Vaandrager, Kirli]
• Have reformulated TIOA as a special case of HIOA:
  – No external variables: states consist of internal variables only.
• Use trajectories to describe time-passage, instead of time-passage actions.
• Monograph on modeling timed systems:
  – Theory
  – Analysis methods
  – Examples
  – Relationships with other timed models [Alur, Dill], [Merritt, Modugno, Tuttle], [Maler, Manna, Pnueli]
49
Probabilistic Automata (PIOA, PTIOA)
[Segala 95] [Segala, Vaandrager, Lynch 02]
• Add probabilistic transitions (s, a, μ), where μ is a probability distribution on states
• Work in progress [Segala, Vaandrager, Lynch], [de Alfaro, Henzinger]:
  – External behavior notion.
  – Composition theorems.
  – Implementation relationships
• Used to describe:
  – Probabilistic and nondeterministic behavior.
  – Randomized distributed algorithms
  – Systems with probabilistic assumptions
50
V. Plans for the Next Two Years
51
Plans: Distributed Algorithms
• Reconfigurable atomic memory
  – LAN implementation [Musial, Shvartsman]
  – More analysis:
    • “Normal behavior” starting from some point
  – Algorithmic improvements:
    • Concurrent garbage-collection [Gilbert]
    • Reduced communication
    • Better join protocol
    • Faster reads
  – Extensions:
    • “Leave” protocol
    • Backup strategies for when configurations fail
    • Support for choosing configurations
52
Plans: Distributed Algorithms
• Reliable multicast protocols [Livadas]:
  – Extend SRM analysis to handle nodes leaving and failing.
  – Finish CESRM analysis.
  – Analyze LMS protocol [Papadopoulos, Varghese 98].
• Mobile systems:
  – Topology control [Hajiaghayi, Mirrokni]
  – Time synchronization
  – Tracking
  – Resource allocation
  – Data management
• Peer-to-peer systems [Lynch, Stoica]:
  – Location services that are provably fault-tolerant under reasonable steady-state assumptions.
  – Data management over location services
53
Plans: Semantic Framework
• Timed models:
  – Composition theorems for timing properties.
  – Structured TIOAs to support conditional performance analysis.
  – Relate TIOA to other models, e.g., reactive modules [Alur, Henzinger].
• Probabilistic models:
  – Composition theorems [de Alfaro, Henzinger]
• Integrate timed and probabilistic models into one semantic framework.