
1

Principles of Reliable Distributed Systems

Lecture 10: Atomic Shared

Memory Objects and Shared Memory Emulations

Spring 2007

Prof. Idit Keidar

2

Material

• Attiya and Welch, Distributed Computing
– Ch. 9 & 10

• Nancy Lynch, Distributed Algorithms
– Ch. 13 & 17

• Linearizability slides adapted from Maurice Herlihy

3

Shared Memory Model

• All communication is through shared memory!
– No message-passing.

• Shared memory registers/objects.

• Accessed by processes with ids 1,2,…

• Note: we have two types of entities: objects and processes.

4

Motivation

• Multiprocessor architectures with shared memory
• Multi-threaded programs
• Distributed shared memory (DSM)
• Abstraction for message-passing systems

– We will see how to emulate shared memory in message passing systems.

– We will see how to use shared memory for consensus and state machine replication.

5

Linearizability: Semantics for Concurrent Objects

6

FIFO Queue: Enqueue Method

[Diagram: a process invokes q.enq(·) on the FIFO queue object]

7

FIFO Queue: Dequeue Method

[Diagram: a process invokes q.deq() on the FIFO queue object and receives an item]

8

Sequential Objects

• Each object has a state
– Usually given by a set of fields
– Queue example: sequence of items

• Each object has a set of methods
– Only way to manipulate state
– Queue example: enq and deq methods

9

Methods Take Time

[Timeline diagram: a method call q.enq(…) spans an interval of time, from its invocation (12:00) to its response (void, 12:01)]

10

Split Method Calls into Two Events

• Invocation
– method name & args
– q.enq(x)

• Response
– result or exception
– q.enq(x) returns void
– q.deq() returns x
– q.deq() throws empty

11

A Single Process (Thread)

• Sequence of events

• First event is an invocation

• Alternates matching invocations and responses

• This is called a well-formed interaction

12

Concurrent Methods Take Overlapping Time

[Timeline diagram: method calls by concurrent processes span overlapping intervals of time]

13

Concurrent Objects

• What does it mean for a concurrent object to be correct?

• What is a concurrent FIFO queue?
– FIFO means strict temporal order
– Concurrent means ambiguous temporal order

• Help!

14

Sequential Specifications

• Precondition, say for q.deq(…)
– Queue is non-empty

• Postcondition:
– Returns & removes first item in queue

• You got a problem with that?

15

Concurrent Specifications

• Naïve approach
– Object has n methods
– Must specify O(n²) possible interactions
– Maybe more

If the queue is empty, and enq(x) begins, and deq() begins after enq(x) begins but before enq(x) ends, then …

• Linearizability: same as it ever was

16

Linearizability

• Each method should –
– “Take effect”
• Effect defined by the sequential specification
– Instantaneously
• Take 0 time
– Between its invocation and response events.

17

Linearization

• A linearization of a concurrent execution is
– A sequential execution that:
• Pairs each invocation immediately with its response
• Satisfies the object’s sequential specification
– “Looks like” the concurrent execution
• Responses to all invocations are the same as in the concurrent execution
– Preserves real-time order
• Each invocation-response pair occurs between the corresponding invocation and response in the concurrent execution

18

Linearizability and Atomicity

• A concurrent execution that has a linearization is linearizable.

• An object that has only linearizable executions is atomic.

19

Why Linearizability?

• “Religion”, not science

• Scientific justification:
– Facilitates reasoning
– Nice mathematical properties

• Common-sense justification
– Preserves real-time order
– Matches my intuition (sorry about yours)

20

Example

[Timeline diagram: an execution with overlapping calls q.enq(x), q.enq(y), q.deq(x), and q.deq(y)]

21

Example

[Timeline diagram: an execution with calls q.enq(x), q.enq(y), and q.deq(y)]

22

Example

[Timeline diagram: an execution with calls q.enq(x) and q.deq(x)]

23

Example

[Timeline diagram: an execution with calls q.enq(x), q.enq(y), q.deq(y), and q.deq(x)]

24

Read/Write Variable Example

[Timeline diagram: an execution of a read/write variable with write(0), write(1), read(1), and read(0)]

25

Read/Write Variable Example

[Timeline diagram: an execution with write(0), write(1), write(2), and two reads that both return 1]

26

Read/Write Variable Example

[Timeline diagram: an execution with write(0), write(1), write(2), a read returning 1, and a read returning 2]

27

Concurrency

• How much concurrency does linearizability allow?

• When must a method invocation block?

• Focus on total methods
– defined in every state
– why?

28

Concurrency

• Question: when does linearizability require a method invocation to block?

• Answer: never

• Linearizability is non-blocking

29

Non-Blocking Theorem

If a method invocation ⟨A q.invoc()⟩ is pending in a linearizable history H, then there exists a response ⟨A q.resp()⟩ such that H · ⟨A q.resp()⟩ is legal.

30

Note on Non-Blocking

• A given implementation of a linearizable object may be blocking

• The property itself does not mandate blocking
– for every pending invocation, there is always a possible return value that does not violate linearizability
– the implementation may not always know it…

31

Atomic Objects

• An object is atomic if all of its concurrent executions are linearizable

• What if we want an atomic operation on multiple objects?

32

Serializability

• A transaction is a finite sequence of method calls

• A history is serializable if
– transactions appear to execute serially

• Strictly serializable if
– order is compatible with real-time

• Used in databases

33

Serializability is Blocking

[Timeline diagram: two interleaved transactions performing x.read(0), y.read(0), x.write(1), and y.write(1) – the interleaving that shows why serializability is blocking]

34

Comparison

• Serializability appropriate for
– fault-tolerance
– multi-step transactions

• Linearizability appropriate for
– single objects
– multiprocessor synchronization

35

Critical Sections

• Easy way to implement linearizability
– take a sequential object
– make each method a critical section

• Like synchronized methods in Java™ (see the sketch below)

• Problems?
– Blocking
– No concurrency
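
A minimal Java sketch of this approach (the class and names are illustrative, not from the lecture): a sequential FIFO queue whose methods are synchronized, so each call runs as a critical section. It is linearizable but blocking, with no concurrency inside the object.

    import java.util.ArrayDeque;
    import java.util.NoSuchElementException;

    // Illustrative sketch: a sequential queue made linearizable by making
    // every method a critical section (Java synchronized methods).
    class SynchronizedQueue<T> {
        private final ArrayDeque<T> items = new ArrayDeque<>();

        public synchronized void enq(T x) {   // the whole call is one critical section
            items.addLast(x);
        }

        public synchronized T deq() {         // total: throws "empty" when there is nothing to return
            if (items.isEmpty()) throw new NoSuchElementException("empty");
            return items.removeFirst();
        }
    }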

36

Linearizability Summary

• Linearizability
– Operation takes effect instantaneously between invocation and response

• Uses sequential specification
– No O(n²) interactions

• Non-Blocking
– Never required to pause a method call

• Granularity matters

37

Atomic Register Emulation in a Message-Passing System

[Attiya, Bar-Noy, Dolev]

38

Distributed Shared Memory (DSM)

• Can we provide the illusion of atomic shared-memory registers in a message-passing system?

• In an asynchronous system?

• Where processes can fail?

39

Liveness Requirement

• Wait-freedom (wait-free termination): every operation by a correct process p completes in a finite number of p’s steps

• Regardless of steps taken by other processes
– In particular, the other processes may fail or take any number of steps between p’s steps
– But p must be given a chance to take as many steps as it needs (fairness)

40

Register

• Holds a value

• Can be read

• Can be written

• Interface:
– int read(); /* returns a value */
– void write(int v); /* returns ack */
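
In Java, the same interface might look as follows (a direct transliteration of the slide’s signatures; the interface name is illustrative):

    // Illustrative Java form of the register interface from the slide.
    interface Register {
        int read();           // returns a value
        void write(int v);    // returns ("ack") when the write completes
    }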

41

Take I: Failure-Free Case

• Each process keeps a local copy of the register

• Let’s try state machine replication
– Step 1: Implement atomic broadcast (how?)

• Recall: atomic broadcast service interface:
– broadcast(m)
– deliver(m)

42

Emulation with Atomic Broadcast (Failure-Free)

• Upon client request (read/write)
– Broadcast the request

• Upon delivery of a write request
– Write to the local copy of the register
– If from the local client, return ack to the client

• Upon delivery of a read request
– If from the local client, return the local register value to the client
(A sketch of this emulation appears below.)

• Homework questions:
– Show that the emulated register is atomic
– Is broadcasting reads required for atomicity?
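
A rough Java sketch of the failure-free emulation, assuming a hypothetical atomic broadcast service with broadcast/deliver (all names here are illustrative; the waiting and ack plumbing between the register and its local client is elided):

    import java.util.function.Consumer;

    // Assumed (hypothetical) atomic broadcast service: totally ordered delivery.
    interface AtomicBroadcast {
        void broadcast(Object m);
        void onDeliver(Consumer<Object> handler);
    }

    // Each process keeps a local copy x and applies all writes in delivery order.
    class AbcastRegister {
        private final AtomicBroadcast abcast;
        private final int myId;
        private int x = 0;                         // local copy of the register

        record WriteReq(int from, int value) {}
        record ReadReq(int from) {}

        AbcastRegister(AtomicBroadcast abcast, int myId) {
            this.abcast = abcast;
            this.myId = myId;
            abcast.onDeliver(this::deliver);
        }

        // Client operations: broadcast the request, then wait until our own
        // request is delivered (the waiting is elided in this sketch).
        void write(int v) { abcast.broadcast(new WriteReq(myId, v)); /* wait for own delivery, then ack */ }
        int read()        { abcast.broadcast(new ReadReq(myId));     /* wait for own delivery */ return x; }

        private void deliver(Object m) {
            if (m instanceof WriteReq w) {
                x = w.value();                     // every replica applies writes in the same order
                // if w.from() == myId: return ack to the local client
            } else if (m instanceof ReadReq r) {
                // if r.from() == myId: return the local value x to the local client
            }
        }
    }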

43

What If Processes Can Crash?

• Does the same solution work?

44

ABD: Fault-Tolerant Emulation[Attiya, Bar-Noy, Dolev]

• Assumes up to f<n/2 processes can fail

• Main ideas:
– Store the value at a majority of processes before the write completes
– Read from a majority
– Every read majority intersects every write majority, hence the read sees the latest value

45

Take II: 1-Reader 1-Writer (SRSW)

• Single-reader – there is only one process that can read from the register

• Single-writer – there is only one process that can write to the register

• The reader and the writer are just 2 of the processes
– The other n-2 processes are there to help

46

Trivial Solution?

• Writer simply sends a message to the reader
– When does it return ack?
– What about failures?

• We want a wait-free solution:
– if the reader (writer) fails, the writer (reader) should be able to continue writing (reading)

47

SRSW Algorithm: Variables

• At each process:
– x, a copy of the register
– t, initially 0, the unique tag associated with the latest write

48

SRSW Algorithm: Emulating Write

• To perform write(x,v)
– choose tag > t
– set x ← v; t ← tag
– send (“write”, v, t) to all

• Upon receive (“write”, v, tag)
– if (tag > t) then set x ← v; t ← tag fi
– send (“ack”, v, tag) to the writer

• When the writer receives (“ack”, v, t) from a majority (counting an ack from itself too)
– return ack to the client
(A sketch of the write emulation appears below.)
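
A compact Java-style sketch of the write emulation; the messaging helpers sendWriteToAll and awaitAcksFromMajority are assumptions standing in for the message-passing layer, not part of the lecture:

    // Sketch of the writer side of the SRSW emulation (helpers are assumed).
    abstract class SrswWriter {
        protected int x = 0;       // local copy of the register
        protected long t = 0;      // tag of the latest write known here

        // Assumed message-passing layer:
        abstract void sendWriteToAll(int v, long tag);      // ("write", v, tag) to all
        abstract void awaitAcksFromMajority(long tag);      // ("ack", v, tag) from a majority, counting self

        void write(int v) {
            long tag = t + 1;           // any tag strictly greater than t works
            x = v;
            t = tag;
            sendWriteToAll(v, tag);
            awaitAcksFromMajority(tag); // a majority now stores (v, tag); return ack to the client
        }

        // Every process (including the writer) also runs the handler:
        // upon receive ("write", v, tag): if tag > t then { x = v; t = tag; } send ("ack", v, tag) to the writer.
    }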

49

SRSW Algorithm: Emulating Read

• To perform read(x)
– send (“read”) to all

• Upon receive (“read”)
– send (“read-ack”, x, t) to the reader

• When the reader receives (“read-ack”, v, tag) from a majority (including the local values of x and t)
– choose the value v associated with the largest tag
– store these values in x, t
– return x
(A sketch of the read emulation appears below.)
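
A matching sketch of the read emulation; readAcksFromMajority is again an assumed stand-in for the message-passing layer:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of the reader side of the SRSW emulation (helpers are assumed).
    abstract class SrswReader {
        protected int x = 0;       // local copy of the register
        protected long t = 0;      // tag of the latest write known here

        record Copy(int value, long tag) {}

        // Assumed: sends ("read") to all and collects ("read-ack", x, t) from a majority.
        abstract List<Copy> readAcksFromMajority();

        int read() {
            List<Copy> copies = new ArrayList<>(readAcksFromMajority());
            copies.add(new Copy(x, t));                       // include the local copy
            Copy latest = copies.stream()
                    .max(Comparator.comparingLong(Copy::tag))
                    .orElseThrow();
            x = latest.value();                               // adopt the pair with the largest tag
            t = latest.tag();
            return x;
        }

        // Every process also runs: upon receive ("read"): send ("read-ack", x, t) to the reader.
    }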

50

Does This Work?

• The only possible overlap is between a read and a write
– why?

• When a read does not overlap any write –
– it reads at least one copy that was written by the latest write (why?)
– this copy has the highest tag (why?)

• What is the linearization order when there is overlap?

• What if 2 reads overlap the same write?

51

Example

[Timeline diagram: the reader performs read(1) and then read(?), both overlapping the writer’s write(1) – what may the second read return?]

52

Wait-Freedom

• The only waiting is for responses from a majority

• There is a correct majority

• All correct processes respond to all requests
– They respond even if the tag is smaller

53

Take III: n-Reader 1-Writer (MRSW)

• n-reader – all the processes can read

• Does the previous solution work?

• What if 2 reads by different processes overlap the same write?

54

Example

[Timeline diagram: read(1) by one reader and a later read(?) by a different reader both overlap the writer’s write(1) – what may the second read return?]

55

MRSW Algorithm: Extending the Read

• When the reader receives (“read-ack”, v, tag) from a majority
– choose the value v associated with the largest tag
– store these values in x, t
– send (“propagate”, x, t) to all (except the writer)

• Upon receive (“propagate”, v, tag) from process i
– if (tag > t) then set x ← v; t ← tag fi
– send (“prop-ack”, x, t) to process i

• When the reader receives (“prop-ack”, v, tag) from a majority (including itself)
– return x
(A sketch of this write-back phase appears below.)
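
A sketch of just the added write-back phase; sendPropagateToAll and awaitPropAcksFromMajority are assumed stand-ins for the messaging layer. The point: the reader returns only after a majority stores a tag at least as large as the one it is about to return, so a later read by any other reader cannot observe an older value.

    // Sketch of Phase 2 of the multi-reader read (helpers are assumed).
    abstract class ReadPropagation {
        protected int x;     // value chosen in Phase 1 (largest tag seen)
        protected long t;    // its tag

        // Assumed message-passing layer:
        abstract void sendPropagateToAll(int value, long tag);   // ("propagate", x, t) to all (except the writer)
        abstract void awaitPropAcksFromMajority(long tag);       // ("prop-ack", x, t) from a majority, counting self

        int finishRead() {
            sendPropagateToAll(x, t);
            awaitPropAcksFromMajority(t);   // a majority now holds a tag >= t
            return x;                       // only now may the read return
        }

        // Every process also runs the handler:
        // upon receive ("propagate", v, tag) from i: if tag > t then { x = v; t = tag; } send ("prop-ack", x, t) to i.
    }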

56

The Complete Read

[Message-flow diagram of the complete read: in Phase 1 the reader sends (“read”) to all servers S1…Sn and collects (“read-ack”, v, t) from a majority; in Phase 2 (multi-reader only) it sends (“propagate”, v, t) and collects (“prop-ack”) from a majority, and only then does read() return.]

57

Take IV: n-Reader n-Writer (MRMW)

• n-writer – all the processes can write to the register

• Does the previous solution work?

58

Playing Tag

• What if two writers use the same tag for writing different values?

• Need to ensure unique tags
– That’s easy: break ties, e.g., by process id

• What if a later write uses a smaller tag than an earlier one?
– Must be prevented (why?)

59

MRMW Algorithm: Extending the Write

• To perform write(x,v)
– send (“query”) to all

• Upon receive (“query”) from i
– send (“query-ack”, t) to i

• When the writer receives (“query-ack”, tag) from a majority (counting its own tag)
– choose a unique tag > all received tags
– continue as in the 1-writer algorithm
(A sketch of this query phase appears below.)

• What if another writer chooses a higher tag before the write completes?
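
A sketch of the added query phase, with tags made unique as (counter, writer-id) pairs and ties broken by process id as the previous slide suggests; queryAcksFromMajority is an assumed stand-in for the messaging layer:

    import java.util.List;

    // Sketch of Phase 1 of the multi-writer write (helpers are assumed).
    abstract class WriteQuery {
        // A tag is a (counter, writer-id) pair; the id breaks ties, making tags unique.
        record Tag(long counter, int writerId) implements Comparable<Tag> {
            public int compareTo(Tag o) {
                int c = Long.compare(counter, o.counter);
                return c != 0 ? c : Integer.compare(writerId, o.writerId);
            }
        }

        protected Tag t = new Tag(0, 0);   // largest tag this process knows
        protected final int myId;

        WriteQuery(int myId) { this.myId = myId; }

        // Assumed: sends ("query") to all and collects ("query-ack", t) from a majority.
        abstract List<Tag> queryAcksFromMajority();

        // Choose a tag larger than any tag stored at some majority, unique to this writer.
        Tag chooseTag() {
            Tag maxSeen = t;
            for (Tag tag : queryAcksFromMajority())
                if (tag.compareTo(maxSeen) > 0) maxSeen = tag;
            return new Tag(maxSeen.counter() + 1, myId);
            // Phase 2 then runs exactly like the single-writer write, using this tag.
        }

        // Every process also runs: upon receive ("query") from i: send ("query-ack", t) to i.
    }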

60

The Complete Write

[Message-flow diagram of the complete write: in Phase 1 (multi-writer only) the writer sends (“query”) to all servers S1…Sn and collects (“query-ack”, t) from a majority; in Phase 2 it sends (“write”, v, t) and collects (“ack”) from a majority, and only then does write(v) return ack.]

61

How Long Does it Take?

• The write emulation
– Single-writer: 2 rounds (steps)
– Multi-writer: 4 rounds (steps)

• The read emulation
– Single-reader: 2 rounds (steps)
– Multi-reader: 4 rounds (steps)

62

What if A Majority Can Fail?

• You guessed it!

• Homework question

63

Can We Emulate Every Atomic Object the Same Way?

64

Difference from Consensus

• Works even if the system is completely asynchronous

• In Paxos, there is no progress when there are multiple leaders

• Here, there is always progress
– multiple writers can write concurrently
– one will prevail (which?)