CSCE 668: Distributed Algorithms and Systems
Fall 2011, Prof. Jennifer Welch
Set 16: Distributed Shared Memory
Distributed Shared Memory
A model for inter-process communication: provides the illusion of shared variables on top of message passing.
Shared memory is often considered a more convenient programming platform than message passing.
Formally, we give a simulation of the shared memory model on top of the message passing model.
We consider only the special case of no failures, with only read/write variables to be simulated.
The Simulation
[Diagram: each process runs a simulation algorithm (alg0, …, algn-1) that accepts read/write invocations from the users and issues return/ack responses; the algorithms communicate via send/recv over the Message Passing System, together presenting a read/write Shared Memory interface.]
Shared Memory Issues
A process invokes a shared memory operation (read or write) at some time.
The simulation algorithm running on the same node executes some code, possibly involving exchanges of messages.
Eventually the simulation algorithm informs the process of the result of the shared memory operation.
So shared memory operations are not instantaneous! Operations invoked by different processes can overlap.
What values should be returned by operations that overlap other operations? This is defined by a memory consistency condition.
Sequential Specifications
Each shared object has a sequential specification: it specifies the behavior of the object in the absence of concurrency.
The object supports operations: invocations and matching responses.
The specification is the set of sequences of operations that are legal.
Sequential Spec for R/W Registers
Each operation has two parts, invocation and response
Read operation has invocation readi(X) and response returni(X,v) (subscript i indicates proc.)
Write operation has invocation writei(X,v) and response acki(X) (subscript i indicates proc.)
A sequence of operations is legal iff each read returns the value of the latest preceding write.
Ex: [write0(X,3) ack0(X)] [read1(X) return1(X,3)]
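The legality condition can be checked mechanically. Here is a minimal sketch; the tuple representation of operations is my own, and every variable is assumed to start at 0:

```python
# Sequential spec check for read/write registers: every read of X must
# return the value written by the latest preceding write to X (or the
# initial value 0 if there is none). Ops are (proc, kind, var, value).

def is_legal(ops):
    latest = {}  # var -> value of the latest write seen so far
    for proc, kind, var, value in ops:
        if kind == "write":
            latest[var] = value
        elif value != latest.get(var, 0):
            return False
    return True

# The example from the slide: [write0(X,3)] [read1(X) return1(X,3)]
assert is_legal([(0, "write", "X", 3), (1, "read", "X", 3)])
assert not is_legal([(0, "write", "X", 3), (1, "read", "X", 0)])
```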
Memory Consistency Conditions
Consistency conditions tie together the sequential specification with what happens in the presence of concurrency.
We will study two well-known conditions: linearizability and sequential consistency.
We will only consider read/write registers, in the absence of failures.
Definition of Linearizability
Suppose σ is a sequence of invocations and responses for a set of operations. An invocation is not necessarily immediately followed by its matching response; σ can have concurrent, overlapping ops.
σ is linearizable if there exists a permutation π of all the operations in σ (in which each invocation is immediately followed by its matching response) s.t.
π|X is legal (satisfies the sequential spec) for all vars X, and
if the response of operation O1 occurs in σ before the invocation of operation O2, then O1 occurs in π before O2 (π respects the real-time order of non-overlapping operations in σ).
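For small, complete histories this definition can be checked by brute force over all permutations. The sketch below (my own representation, with explicit invocation and response times, exponential and for illustration only) does exactly that:

```python
from itertools import permutations

# Brute-force linearizability check for complete histories of read/write
# registers. Each operation is (kind, var, value, inv, resp), where inv
# and resp are its invocation and response times. Variables start at 0.

def legal(seq):
    current = {}
    for kind, var, value, _, _ in seq:
        if kind == "write":
            current[var] = value
        elif value != current.get(var, 0):
            return False
    return True

def respects_real_time(seq):
    # If O1's response precedes O2's invocation, O1 must come first.
    return all(not (seq[j][4] < seq[i][3])
               for i in range(len(seq)) for j in range(i + 1, len(seq)))

def is_linearizable(ops):
    return any(legal(p) and respects_real_time(p)
               for p in permutations(ops))

# write(X,1) completes before read(X) begins; the read must return 1.
h1 = [("write", "X", 1, 0, 2), ("read", "X", 1, 4, 6)]
assert is_linearizable(h1)
h2 = [("write", "X", 1, 0, 2), ("read", "X", 0, 4, 6)]
assert not is_linearizable(h2)
```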
Linearizability Examples
Suppose there are two shared variables, X and Y, both initially 0.
[Diagram: p0 performs write(X,1) ack(X) and later read(Y) return(Y,1); p1 performs write(Y,1) ack(Y) followed by read(X) return(X,1).]
Is this sequence linearizable? Yes; a valid linearization order is marked in the original figure.
What if p1's read of X returns 0? Then it is not linearizable: the arrow in the original figure marks the real-time ordering that forces the read of X to follow the write of X.
Definition of Sequential Consistency
Suppose σ is a sequence of invocations and responses for some set of operations.
σ is sequentially consistent if there exists a permutation π of all the operations in σ s.t.
π|X is legal (satisfies the sequential spec) for all vars X, and
if the response of operation O1 occurs in σ before the invocation of operation O2 at the same process, then O1 occurs in π before O2 (π respects the order of operations by the same process in σ).
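The only change from the linearizability check is the ordering constraint: instead of real-time order across all processes, the permutation need only respect each process's own program order. A brute-force sketch (representation my own; seq_no gives each operation's position in its process's program order):

```python
from itertools import permutations

# Brute-force sequential consistency check. Each operation is
# (proc, kind, var, value, seq_no). Variables start at 0.

def legal(seq):
    current = {}
    for _, kind, var, value, _ in seq:
        if kind == "write":
            current[var] = value
        elif value != current.get(var, 0):
            return False
    return True

def respects_process_order(seq):
    # Operations of the same process must appear in program order.
    last = {}
    for proc, _, _, _, seq_no in seq:
        if seq_no < last.get(proc, -1):
            return False
        last[proc] = seq_no
    return True

def is_sequentially_consistent(ops):
    return any(legal(p) and respects_process_order(p)
               for p in permutations(ops))

# Slide example: p0 writes X then reads Y=1; p1 writes Y then reads X=0.
h1 = [(0, "write", "X", 1, 0), (0, "read", "Y", 1, 1),
      (1, "write", "Y", 1, 0), (1, "read", "X", 0, 1)]
assert is_sequentially_consistent(h1)

# If p0's read of Y returns 0 instead, no valid permutation exists.
h2 = [(0, "write", "X", 1, 0), (0, "read", "Y", 0, 1),
      (1, "write", "Y", 1, 0), (1, "read", "X", 0, 1)]
assert not is_sequentially_consistent(h2)
```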
Sequential Consistency Examples
Suppose there are two shared variables, X and Y, both initially 0.
[Diagram: p0 performs write(X,1) ack(X) and later read(Y) return(Y,1); p1 performs write(Y,1) ack(Y) followed by read(X) return(X,0).]
Is this sequence sequentially consistent? Yes; one valid permutation is write(Y,1), read(X) returning 0, write(X,1), read(Y) returning 1.
What if p0's read of Y returns 0? Then it is not sequentially consistent: each read would have to precede the other process's write, contradicting the program order at each process.
Specification of Linearizable Shared Memory Comm. System
Inputs are invocations on the shared objects.
Outputs are responses from the shared objects.
A sequence σ is in the allowable set iff:
Correct Interaction: each proc. alternates invocations and matching responses
Liveness: each invocation has a matching response
Linearizability: σ is linearizable
Specification of Sequentially Consistent Shared Memory
Inputs are invocations on the shared objects.
Outputs are responses from the shared objects.
A sequence σ is in the allowable set iff:
Correct Interaction: each proc. alternates invocations and matching responses
Liveness: each invocation has a matching response
Sequential Consistency: σ is sequentially consistent
Algorithm to Implement Linearizable Shared Memory
Uses totally ordered broadcast as the underlying communication system.
Each proc keeps a replica of each shared variable.
When a read request arrives: send a bcast msg containing the request; when own bcast msg arrives, return the value in the local replica.
When a write request arrives: send a bcast msg containing the request; upon receipt, each proc updates its replica's value; when own bcast msg arrives, respond with ack.
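The algorithm can be sketched as follows. The class and method names are my own, and a synchronous in-memory bus stands in for a real totally ordered broadcast layer (it delivers every message to every process in one fixed order):

```python
# Sketch of the linearizable DSM algorithm over totally ordered broadcast.

class TOBroadcast:
    def __init__(self):
        self.processes = []

    def bcast(self, msg):
        for p in self.processes:       # same delivery order at everyone
            p.deliver(msg)

class Process:
    def __init__(self, pid, bus):
        self.pid, self.bus = pid, bus
        self.replica = {}              # local copy of every shared variable
        self.result = None
        bus.processes.append(self)

    def read(self, var):
        # Broadcast even for reads; answer only when our own msg arrives.
        self.bus.bcast(("read", self.pid, var, None))
        return self.result

    def write(self, var, value):
        self.bus.bcast(("write", self.pid, var, value))

    def deliver(self, msg):
        kind, sender, var, value = msg
        if kind == "write":
            self.replica[var] = value  # every replica applies every write
        if sender == self.pid and kind == "read":
            self.result = self.replica.get(var, 0)

bus = TOBroadcast()
p0, p1 = Process(0, bus), Process(1, bus)
p0.write("X", 3)
assert p1.read("X") == 3
```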
The Simulation
[Diagram: the same layered picture as before, but the simulation algorithms now communicate via to-bc-send and to-bc-recv over a Totally Ordered Broadcast service instead of raw send/recv.]
Correctness of Linearizability Algorithm
Consider any admissible execution α of the algorithm in which the underlying totally ordered broadcast behaves properly and the users interact properly (alternating invocations and responses).
Show that σ, the restriction of α to the events of the top interface, satisfies Liveness and Linearizability.
Correctness of Linearizability Algorithm
Liveness (every invocation has a response): follows from the Liveness property of the underlying totally ordered broadcast.
Linearizability: define the permutation π of the operations to be the order in which the corresponding broadcasts are received.
π is legal because all the operations are consistently ordered by the TO bcast.
π respects the real-time order of operations: if O1 finishes before O2 begins, O1's bcast is ordered before O2's bcast.
Why is Read Bcast Needed?
The bcast done for a read causes no changes to any replicas, just delays the response to the read.
Why is it needed? Let's see what happens if we remove it.
Why Read Bcast is Needed
[Diagram: p0 performs write(X,1), whose to-bc-send reaches p1 quickly but p2 slowly. p1 reads X and returns 1 from its updated replica; p2 then reads X and returns 0 from its stale replica, violating linearizability.]
Algorithm for Sequential Consistency
The linearizability algorithm, without doing a bcast for reads:
Uses totally ordered broadcast as the underlying communication system.
Each proc keeps a replica of each shared variable.
When a read request arrives: immediately return the value stored in the local replica.
When a write request arrives: send a bcast msg containing the request; upon receipt, each proc updates its replica's value; when own bcast msg arrives, respond with ack.
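A sketch of this variant, under the same assumptions as before (names are my own; note that the synchronous in-memory bus hides exactly the asynchrony that makes this algorithm sequentially consistent rather than linearizable):

```python
# Sketch of the sequentially consistent DSM algorithm: identical to the
# linearizable one except reads return the local replica immediately,
# with no broadcast.

class TOBroadcast:
    def __init__(self):
        self.processes = []

    def bcast(self, msg):
        for p in self.processes:
            p.deliver(msg)

class Process:
    def __init__(self, pid, bus):
        self.pid, self.bus = pid, bus
        self.replica = {}
        bus.processes.append(self)

    def read(self, var):
        return self.replica.get(var, 0)   # local, no communication

    def write(self, var, value):
        self.bus.bcast(("write", self.pid, var, value))

    def deliver(self, msg):
        _, _, var, value = msg
        self.replica[var] = value

bus = TOBroadcast()
p0, p1 = Process(0, bus), Process(1, bus)
p0.write("X", 1)
assert p0.read("X") == 1 and p1.read("X") == 1
```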
Correctness of SC Algorithm
Lemma (9.3): The local copies at each proc. take on all the values appearing in write operations, in the same order, which preserves the order of non-overlapping writes; in particular, the per-process order of writes is preserved.
Lemma (9.4): If pi writes Y and later reads X, then pi's update of its local copy of Y (on behalf of that write) precedes its read of its local copy of X (on behalf of that read).
Correctness of the SC Algorithm
(Theorem 9.5) Why does SC hold? Given any admissible execution α, we must come up with a permutation π of the shared memory operations that is legal and respects the per-process ordering of operations.
The Permutation
Insert all writes into π in their to-bcast order.
Consider each read R in α in order of invocation:
suppose R is a read by pi of X;
place R in π immediately after the later of
(a) the operation by pi that immediately precedes R in α, and
(b) the write that R "read from" (the write that caused the latest update of pi's local copy of X preceding the response for R).
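The construction can be sketched in code. The data layout is my own, and it assumes all operation tuples are distinct (a simplification so list lookups work):

```python
# Build the SC permutation pi.
# writes: list of (proc, var, value) in to-bcast order.
# reads:  list of (proc, var, wi, prev) in invocation order, where wi is
#         the index in `writes` of the write the read "read from" (None
#         if it read the initial value 0), and prev is the operation by
#         the same process immediately preceding the read (None if first).

def build_permutation(writes, reads):
    pi = [("write",) + w for w in writes]       # writes in to-bcast order
    for proc, var, wi, prev in reads:
        read_op = ("read", proc, var,
                   writes[wi][2] if wi is not None else 0)
        pos = -1
        if wi is not None:                      # after the write read from
            pos = pi.index(("write",) + writes[wi])
        if prev is not None:                    # and after pi's previous op
            pos = max(pos, pi.index(prev))
        pi.insert(pos + 1, read_op)
    return pi

# A read by p0 that read from p1's write lands before p2's later write.
W1, W2 = (1, "X", 1), (2, "X", 2)
pi = build_permutation([W1, W2], [(0, "X", 0, None)])
assert pi == [("write", 1, "X", 1), ("read", 0, "X", 1),
              ("write", 2, "X", 2)]
```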
Permutation Example
[Diagram: an execution with two writes (values 1 and 2) and two reads (returning 2 and 1), with their to-bc-send events marked; the brown numbers 1-4 in the original figure give the constructed permutation.]
Permutation Respects Per Proc. Ordering
For a specific proc:
The relative ordering of two writes is preserved by Lemma 9.3.
The relative ordering of two reads is preserved by the construction of π.
If write W precedes read R in exec. α, then W precedes R in π by construction.
Suppose read R precedes write W in α. Show the same is true in π.
Permutation Respects Ordering
Suppose in contradiction R and W are swapped in π:
There is a read R' by pi that equals or precedes R, and there is a write W' that equals W or follows W in the to-bcast order, such that R' "reads from" W'.
But R' finishes before W starts in α, and updates are done to local replicas in to-bcast order (Lemma 9.3), so the update for W' does not precede the update for W; hence R' cannot read from W', a contradiction.
α|pi: … R' … R … W …
π: … W … W' … R' … R …
Permutation is Legal
Consider some read R of X by pi and some write W s.t. R reads from W in α.
Suppose in contradiction some other write W' to X falls between W and R in π:
π: … W … W' … R …
Why does R follow W' in π?
Permutation is Legal
Case 1: W' is also by pi. Then R follows W' in π because R follows W' in α.
The update for W at pi precedes the update for W' at pi (Lemma 9.3).
Thus R does not read from W, contradiction.
Permutation is Legal
Case 2: W' is not by pi. Then R follows W' in π due to some operation O, also by pi, s.t. O precedes R in α and O is placed between W' and R in π.
Consider the earliest such O.
Case 2.1: O is a write (not necessarily to X).
The update for W' at pi precedes the update for O at pi (Lemma 9.3).
The update for O at pi precedes pi's local read for R (Lemma 9.4).
So R does not read from W, contradiction.
π: … W … W' … O … R …
Permutation is Legal
Case 2.2: O is a read.
By construction of π, O must read X and in fact read from W' (otherwise O would not be placed after W').
The update for W at pi precedes the update for W' at pi (Lemma 9.3).
The update for W' at pi precedes the local read for O at pi (otherwise O would not read from W').
Thus R cannot read from W, contradiction.
π: … W … W' … O … R …
Performance of SC Algorithm
Read operations are implemented "locally", without requiring any inter-process communication.
Thus reads can be viewed as "fast": time between invocation and response is only that needed for some local computation.
Time for a write is time for delivery of one totally ordered broadcast (depends on how to-bcast is implemented).
Alternative SC Algorithm
It is possible to have an algorithm that implements sequentially consistent shared memory on top of totally ordered broadcast with the reverse performance:
writes are local/fast (bcasts are sent, but the writer does not wait for them to be received)
reads can require waiting for some bcasts to be received
Like the previous SC algorithm, this one does not implement linearizable shared memory.
Time Complexity for DSM Algorithms
One complexity measure of interest for DSM algorithms is how long it takes for operations to complete.
The linearizability algorithm required D time for both reads and writes, where D is the maximum time for a totally-ordered broadcast message to be received.
The sequential consistency algorithm required D time for writes and 0 time for reads, since we are assuming time for local computation is negligible.
Can we do better? To answer this question, we need some kind of timing model.
Timing Model
Assume the underlying communication system is the point-to-point message passing system (not totally ordered broadcast).
Assume that every message has delay in the range [d-u,d].
Claim: Totally ordered broadcast can be implemented in this model so that D, the maximum time for delivery, is O(d).
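One standard way to realize the claim, sketched here as an assumption rather than the course's own construction, is a sequencer: every broadcast is first sent to a fixed process that assigns sequence numbers and forwards it to everyone, so delivery takes at most two message delays, i.e. D ≤ 2d = O(d):

```python
import heapq

# Hypothetical sequencer-based totally ordered broadcast. A message takes
# one delay to reach the sequencer and one more to fan out, so D <= 2d.

class Sequencer:
    def __init__(self):
        self.next_seq = 0
        self.members = []

    def submit(self, msg):                 # one message delay to get here
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        for m in self.members:             # one more delay to fan out
            m.on_deliver(seq, msg)

class Member:
    def __init__(self, sequencer):
        self.pending = []                  # buffer out-of-order arrivals
        self.next_expected = 0
        self.delivered = []
        sequencer.members.append(self)

    def on_deliver(self, seq, msg):
        heapq.heappush(self.pending, (seq, msg))
        while self.pending and self.pending[0][0] == self.next_expected:
            self.delivered.append(heapq.heappop(self.pending)[1])
            self.next_expected += 1

s = Sequencer()
a, b = Member(s), Member(s)
s.submit("m1")
s.submit("m2")
assert a.delivered == b.delivered == ["m1", "m2"]
```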
Time and Clocks in Layered Model
Timed execution: associate an occurrence time with each node input event.
Times of other events are "inherited" from the time of the triggering node input (recall the assumption that local processing time is negligible).
Model hardware clocks as before: they run at the same rate as real time but are not synchronized.
The notions of view, timed view, and shifting are the same; the Shifting Lemma still holds (it relates h/w clocks and msg delays between the original and shifted execs).
Lower Bound for SC
Let Tread = worst-case time for a read to complete.
Let Twrite = worst-case time for a write to complete.
Theorem (9.7): In any simulation of sequentially consistent shared memory on top of point-to-point message passing, Tread + Twrite ≥ d.
SC Lower Bound Proof
Consider any SC simulation with Tread + Twrite < d. Let X and Y be two shared variables, both initially 0.
Let α0 be an admissible execution whose top layer behavior is write0(X,1) ack0(X) read0(Y) return0(Y,0), where the write begins at time 0, the read ends before time d, and every msg has delay d.
Why does α0 exist? The alg. must respond correctly to any sequence of invocations. Suppose the user at p0 wants to do a write, immediately followed by a read. By SC, the read must return 0. By assumption, the total elapsed time is less than d.
SC Lower Bound Proof
[Diagram of α0: on a time axis from 0 to d, p0 performs write(X,1) and then read(Y) returning 0, all before time d; p1 takes no steps.]
SC Lower Bound Proof
Similarly, let α1 be an admissible execution whose top layer behavior is write1(Y,1) ack1(Y) read1(X) return1(X,0), where the write begins at time 0, the read ends before time d, and every msg has delay d.
α1 exists for a similar reason.
SC Lower Bound Proof
[Diagram: α0 (p0 performs write(X,1) then read(Y) returning 0) and α1 (p1 performs write(Y,1) then read(X) returning 0), each on a time axis from 0 to d.]
SC Lower Bound Proof
Now merge p0's timed view in α0 with p1's timed view in α1 to create an admissible execution α'. (Since every msg has delay d and all operations finish before time d, neither process receives a message from the other before completing its operations, so the merge is admissible.)
But α' is not SC, contradiction!
SC Lower Bound Proof
[Diagram: α0 and α1 as before, followed by the merged execution α', in which p0 performs write(X,1) then read(Y) returning 0 while p1 performs write(Y,1) then read(X) returning 0; no legal permutation respects both processes' program orders.]
Linearizability Write Lower Bound
Theorem (9.8): In any simulation of linearizable shared memory on top of point-to-point message passing, Twrite ≥ u/2.
Proof: Consider any linearizable simulation with Twrite < u/2.
Let α be an admissible exec. whose top layer behavior is: p1 writes 1 to X, then p2 writes 2 to X, then p0 reads 2 from X.
Shift α to create an admissible exec. in which p1's and p2's writes are swapped, causing p0's read to violate linearizability.
Linearizability Write Lower Bound
[Diagram of α: on a time axis marked 0, u/2, u: p1 performs write 1, p2 performs write 2 starting u/2 later, and p0 then reads 2. Delay pattern from the original figure: d - u/2 between p0 and each of p1 and p2; d from p1 to p2; d - u from p2 to p1.]
Linearizability Write Lower Bound
[Diagram: shifting p1 by u/2 and p2 by -u/2 swaps the real-time order of the two writes while every delay stays in [d - u, d] (the delays become d or d - u); p0's read still returns 2, violating linearizability.]
Linearizability Read Lower Bound
The approach is similar to the write lower bound. Assume in contradiction there is an algorithm with Tread < u/4.
Identify a particular execution: fix a pattern of read and write invocations occurring at particular times, and fix the pattern of message delays.
Shift this execution to get one that is still admissible but not linearizable.
Linearizability Read Lower Bound
Original execution: p1 reads X and gets 0 (old value). Then p0 starts writing 1 to X. When the write is done, p0 reads X and gets 1 (new value).
Also, during the write, p1 and p2 alternate reading X.
At some point, the reads stop getting the old value (0) and start getting the new value (1).
Linearizability Read Lower Bound
Set all delays in this execution to be d - u/2. Now shift p2 earlier by u/2.
Verify that the result is still admissible (every delay either stays the same or becomes d or d - u).
But in the shifted execution, the sequence of values read is 0, 0, …, 0, 1, 0, 1, 1, …, 1: a read of the new value is followed by a read of the old value, which no linearization allows.
Linearizability Read Lower Bound
[Diagram: the alternating reads by p1 and p2 surrounding p0's write(X,1), before and after shifting p2 earlier by u/2; after the shift a read returning 1 is immediately followed by a read returning 0.]