Consistency and Replication
Chapter 7
Giving credit where credit is due:
• Most of the lecture notes are based on slides by Prof. Jalal Y. Kawash at Univ. of Calgary
• Some notes are based on slides by Prof. Kenneth Chiu at SUNY Binghamton
• I have modified them and added new slides
CSCE455/855 Distributed Operating Systems
Part I Consistency Models
Reasons for Replication
• Reliability:
  – Mask failures
  – Mask corrupted data
• Performance:
  – Scalability (size and geographical)
• Examples:
  – Web caching
  – Horizontal server distribution
Cost of Replication
• Replicas must be kept consistent
• Dilemma:
  1. Replicate data for better performance
  2. A modification on one copy triggers modifications on all other replicas
  3. Propagating each modification to each replica can degrade performance
Consistency Issues – Access/Update Ratio
[Figure: timeline comparing updates to the Web page with user accesses to the page]
Consistency Model
• When and how the modifications are made = consistency model:
– Weak versus strong consistency model
Consistency Models (cont.)
The general organization of a logical data store, physically distributed and replicated across multiple processes.
Consistency Models (cont)
• A process that performs a read operation on a data item expects the operation to return a value that shows the result of the last write operation on that data item
• No global clock ⇒ difficult to define the last write operation
• Consistency models provide other definitions
• Different consistency models have different restrictions on the values that a read operation can return
[Figure: two read operations, read1 and read2]
Summary of Consistency Models
(a) Consistency models not using synchronization operations
Consistency   Description
Strict        Absolute time ordering of all shared accesses matters.
Sequential    All processes see all shared accesses in the same order. Accesses are not ordered in time.
Causal        All processes see causally-related shared accesses in the same order.
FIFO          All processes see writes from each other in the order they were used. Writes from different processes may not always be seen in that order.
(b) Models with synchronization operations
Consistency   Description
Weak          Shared data can be counted on to be consistent only after a synchronization is done.
Release       Shared data are made consistent when a critical region is exited.
Entry         Shared data pertaining to a critical region are made consistent when a critical region is entered.
Framework for Consistency Partial and Total Orders
Let S be a set, and R ⊆ S × S
• R is anti-reflexive if ∀x ∈ S, (x,x) ∉ R
• R is transitive if ∀x, y, z ∈ S, if (x,y) ∈ R and (y,z) ∈ R then (x,z) ∈ R
• A PO (partial order) is an anti-reflexive, transitive relation
• A PO is denoted by (S,R)
• xRy means (x,y) ∈ R
• A TO (total order) is a PO (S,R) such that ∀x, y ∈ S with x ≠ y, either xRy or yRx
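The definitions above can be checked mechanically on small finite sets. The following is a minimal sketch, assuming a relation is represented as a Python set of pairs; the function names and the sample relations LESS_THAN and FORK are illustrative, not part of the lecture.

```python
from itertools import product

def is_anti_reflexive(S, R):
    # No element may be related to itself.
    return all((x, x) not in R for x in S)

def is_transitive(R):
    # Whenever (x,y) and (y,z) are in R, (x,z) must be in R as well.
    return all((x, z) in R for (x, y) in R for (y2, z) in R if y == y2)

def is_partial_order(S, R):
    return is_anti_reflexive(S, R) and is_transitive(R)

def is_total_order(S, R):
    # A PO in which every pair of distinct elements is ordered one way or the other.
    return is_partial_order(S, R) and all(
        (x, y) in R or (y, x) in R for x, y in product(S, S) if x != y)

S = {1, 2, 3}
LESS_THAN = {(1, 2), (1, 3), (2, 3)}  # total: every distinct pair is ordered
FORK = {(1, 2), (1, 3)}               # partial only: 2 and 3 are unordered
```

Here LESS_THAN is a TO on S, while FORK is a PO that is not total.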
Framework for Consistency Operations and Data Items
• Operations are either writes or reads
• A write is denoted wp(x)v (process p writes value v to data item x)
• A read is denoted rp(x)v (process p reads x and obtains value v)
• A read-write data item is the set of all sequences <o1, o2, …, on> such that
  1. Each oi is either a read or a write
  2. Each read returns the same value written by the most recent preceding write in the sequence
Framework for Consistency Operations and Processes
• Each operation can be decomposed into two components:
  – Invocation and response
• wp(x)v: invocation = wp(x)v; response = empty
• rp(x)v: invocation = rp(x)?; response = v
• A process is a sequence of operation invocations
• A process computation is a sequence of operations obtained by augmenting each invocation in the process by its response
Framework for Consistency Multiprocess Systems
• A (multiprocess) system (P,D) is a set of processes, P, and a set of data items, D, such that all operation invocations of processes in P are applied to items in D
• A (multiprocess) system (P,D) computation is a collection of process computations one for each process in P
Framework for Consistency Example
Program p: x = y
Process p: r(y)v? w(x)v?
Program q: y = x
Process q: r(x)v? w(y)v?
Process p Comp: r(y)5 w(x)5
Process q Comp: r(x)0 w(y)0
System (P,D): P = {p,q}, D = {x,y}
System (P,D) Computation:
  p: r(y)5 w(x)5
  q: r(x)0 w(y)0
Framework for Consistency Program Order
• Define program order, denoted (O, <po), by o1 <po o2 iff o2 follows o1 in some process p's computation
Program p: x = y
Process p: r(y)v? w(x)v?
Process p Comp: r(y)5 w(x)5
Program q: y = x
Process q: r(x)v? w(y)v?
Process q Comp: r(x)0 w(y)0
• All of program order for the example:
  – rp(y)5 <po wp(x)5
  – rq(x)0 <po wq(y)0
Framework for Consistency Consistency Models
• A consistency model is a set of constraints on system computations
• A system computation of (P,D) satisfies a consistency model CM if the computation meets all the constraints in CM
Relation of Consistency Models
• Sequential: All processes see all shared accesses in the same order
• Strict: Absolute time ordering of all shared accesses matters.
• For two consistency models CM1 and CM2, CM1 is stronger than CM2 if the constraints of CM1 imply those of CM2
  – CM2 is weaker than CM1
  – Sequential consistency is weaker than strict consistency
Framework for Consistency – Validity
• Given a set of operations O
• O|w indicates all the write operations in O
• O|r indicates all the read operations in O
• O|p is the subset of O containing p's operations, for some process p
• O|x is the subset of O containing operations on x, for some data item x
• Let (O,<) be a total order of O
• (O,<) is valid if for each data item x, the subsequence (O|x,<) is valid for x (that is, each read in it returns the value written by the most recent preceding write)
Framework for Consistency Valid Total Orders
x and y are initially 0
Computation:
  p: w(x)5 r(y)5
  q: r(x)0 w(y)5 r(x)5
Valid Total Order: rq(x)0 wq(y)5 wp(x)5 rq(x)5 rp(y)5
  Valid for x: the subsequence on x is rq(x)0 wp(x)5 rq(x)5
  Valid for y: the subsequence on y is wq(y)5 rp(y)5
Invalid Total Order: wp(x)5 rq(x)0 wq(y)5 rq(x)5 rp(y)5 (rq(x)0 follows wp(x)5, so the read does not return the most recent write)
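Validity of a total order can be tested by replaying it against a store. This is a small sketch, assuming each operation is encoded as a tuple (process, kind, item, value) with kind "w" or "r" and that every item starts at 0; the encoding and the function name is_valid are assumptions made for illustration.

```python
def is_valid(order, initial=0):
    """Replay a total order; each read must return the latest preceding write."""
    store = {}
    for proc, kind, item, value in order:
        if kind == "w":
            store[item] = value
        elif store.get(item, initial) != value:
            return False
    return True

# The two total orders from the slide, in the tuple encoding assumed above.
valid_order = [("q", "r", "x", 0), ("q", "w", "y", 5), ("p", "w", "x", 5),
               ("q", "r", "x", 5), ("p", "r", "y", 5)]
invalid_order = [("p", "w", "x", 5), ("q", "r", "x", 0), ("q", "w", "y", 5),
                 ("q", "r", "x", 5), ("p", "r", "y", 5)]
```

Because the store is keyed by data item, replaying the whole order checks the subsequence for each item at once.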
Sequential Consistency (SC)
• Two constraints:
  – the result of any execution is the same as if the operations of all the processes were executed in some sequential order, and
  – the operations of each individual process appear in this sequence in the order specified by its program
• Let O be the set of all the operations of a computation C of a system (P,D). Then, C satisfies SC if there is a valid total order (O,<) such that (O,<po) ⊆ (O,<)
SC – Intuition
[Figure: processes connected through FIFO channels and a switch to a single copy of all data items (the set D)]
Sequential Consistency – Example
• C satisfies SC if there is a valid total order (O,<) such that (O,<po) ⊆ (O,<)
C1
  p: w(x)1 r(x)2
  q: r(x)1 w(x)2
C1 satisfies SC
  (O,<) = <wp(x)1, rq(x)1, wq(x)2, rp(x)2>
  (O,<po) = { (wp(x)1, rp(x)2), (rq(x)1, wq(x)2) }
Sequential Consistency – Examples
C2
  p: w(x)1 r(x)2
  q: w(x)2 r(x)1
C2 does not satisfy SC
  (O,<po) = { (wp(x)1, rp(x)2), (wq(x)2, rq(x)1) }
  <wp(x)1, wq(x)2, rp(x)2, rq(x)1> (is not valid)
  <wp(x)1, rq(x)1, wq(x)2, rp(x)2> (violates PO)
C3
  p: w(x)1 w(y)2
  q: r(y)2 r(x)0
Exercise: Does C3 satisfy SC? (x and y are initially 0)
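For computations this small, SC can be decided by brute force: enumerate every total order that extends program order and test each for validity. The sketch below assumes the tuple encoding (process, kind, item, value) and initial value 0; the helper names ops, interleavings, and satisfies_sc are hypothetical, and the search is exponential, so it is only meant for slide-sized examples.

```python
def is_valid(order, initial=0):
    """Each read must return the value of the latest preceding write."""
    store = {}
    for proc, kind, item, value in order:
        if kind == "w":
            store[item] = value
        elif store.get(item, initial) != value:
            return False
    return True

def interleavings(seqs):
    """Yield every merge of the sequences that keeps each one's own order."""
    if not any(seqs):
        yield []
        return
    for i, s in enumerate(seqs):
        if s:
            rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
            for tail in interleavings(rest):
                yield [s[0]] + tail

def satisfies_sc(computation):
    # SC holds if some total order extending program order is valid.
    return any(is_valid(o) for o in interleavings(list(computation.values())))

def ops(proc, text):
    """Parse 'w(x)1 r(x)2' into (proc, kind, item, value) tuples."""
    return [(proc, t[0], t[2], int(t[4:])) for t in text.split()]

C1 = {"p": ops("p", "w(x)1 r(x)2"), "q": ops("q", "r(x)1 w(x)2")}
C2 = {"p": ops("p", "w(x)1 r(x)2"), "q": ops("q", "w(x)2 r(x)1")}
```

Calling satisfies_sc on C1 and C2 reproduces the two verdicts stated above; the same call can be used to check an answer to the exercise.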
Coherence [Goodman]
• SC per data item
• Let O be the set of all the operations of a computation C of a system (P,D). Then, C satisfies Coherence if for each x ∈ D there is a valid total order (O|x,<x) such that (O|x,<po) ⊆ (O|x,<x)
Coherence – Intuition
[Figure: each data item is held in its own store; the processes reach each one-item store through FIFO channels]
Coherence – Examples
C1
  p: w(x)1 r(x)2
  q: r(x)1 w(x)2
C1 satisfies Coherence
  (O|x,<x) = <wp(x)1, rq(x)1, wq(x)2, rp(x)2>
C2
  p: w(x)1 r(x)2
  q: w(x)2 r(x)1
C2 does not satisfy Coherence
C3
  p: w(x)1 w(y)2
  q: r(y)2 r(x)0
C3 satisfies Coherence but not SC
C4
  p: w(x)3 w(x)2 r(y)3
  q: w(y)3 w(y)1 r(x)3
Does C4 satisfy Coherence? SC?
SC versus Coherence
• If a computation C satisfies SC, then it satisfies Coherence + PO
• If a computation C satisfies Coherence, then it does not necessarily satisfy SC
  – Proof: Computation C3 is an example
[Figure: Venn diagram, where C(CM) denotes all computations satisfying consistency model CM. C(SC) lies inside C(Coherence); C3 lies in C(Coherence) but outside C(SC)]
FIFO [Lipton & Sandberg]
• Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes
• When would we like to use the FIFO consistency model?
FIFO – Intuition
[Figure: each process has its own copy of all data items (D); the copies are connected by FIFO channels]
FIFO [Lipton & Sandberg]
• Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes
• Let O be the set of all the operations of a computation C of a system (P,D). Then, C satisfies FIFO if for each p ∈ P there is a valid total order (O|p ∪ O|w,<p) such that (O|p ∪ O|w,<po) ⊆ (O|p ∪ O|w,<p)
FIFO – Examples
C1
  p: w(x)1 r(x)2
  q: r(x)1 w(x)2
C1 satisfies FIFO (also SC and Coherence)
  (O|p ∪ O|w,<p) = <wp(x)1, wq(x)2, rp(x)2>
  (O|q ∪ O|w,<q) = <wp(x)1, rq(x)1, wq(x)2>
C2
  p: w(x)1 r(x)2
  q: w(x)2 r(x)1
C2 satisfies FIFO but not Coherence
C3
  p: w(x)1 w(y)2
  q: r(y)2 r(x)0
C3 satisfies Coherence but neither SC nor FIFO
FIFO – Examples (cont)
C4
  p: w(x)3 w(x)2 r(y)3
  q: w(y)3 w(y)1 r(x)3
Does C4 satisfy FIFO? Coherence? SC?
C5
  p: w(x)3 w(x)1 w(y)2
  q: r(y)2 r(x)3
Does C5 satisfy FIFO? Coherence? SC?
SC versus FIFO
• If a computation C satisfies SC, does it satisfy FIFO?
• If a computation C satisfies FIFO, does it satisfy SC?
SC versus FIFO (cont.)
• If a computation C satisfies SC, then it satisfies FIFO
• If a computation C satisfies FIFO, then it does not necessarily satisfy SC
  – Proof: Computation C4 is an example
C4
  p: w(x)3 w(x)2 r(y)3
  q: w(y)3 w(y)1 r(x)3
[Figure: Venn diagram with C(SC) inside C(FIFO); C4 lies in C(FIFO) but outside C(SC)]
Coherence versus FIFO
• If a computation C satisfies Coherence, does it satisfy FIFO?
  – It does not necessarily satisfy FIFO
  – Proof: Computation C5 is an example
C5
  p: w(x)3 w(x)1 w(y)2
  q: r(y)2 r(x)3
• If a computation C satisfies FIFO, does it satisfy Coherence?
  – It does not necessarily satisfy Coherence
  – Proof: Computation C2 is an example
C2
  p: w(x)1 r(x)2
  q: w(x)2 r(x)1
• There are computations that satisfy both Coherence and FIFO, but not SC
  – Proof: Computation C4
C4
  p: w(x)3 w(x)2 r(y)3
  q: w(y)3 w(y)1 r(x)3
[Figure: Venn diagram in which C(Coherence) and C(FIFO) overlap and C(SC) lies inside the overlap; a computation C in the overlap but outside C(SC) satisfies FIFO and Coherence, but not SC]
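The claims about C2 through C5 can be re-checked by brute force. The sketch below is an illustration only: it assumes the tuple encoding (process, kind, item, value), initial value 0, and hypothetical helper names. Coherence is tested as SC restricted to each data item, and FIFO is tested per process p on p's operations together with all writes of the other processes.

```python
def is_valid(order, initial=0):
    """Each read must return the value of the latest preceding write."""
    store = {}
    for proc, kind, item, value in order:
        if kind == "w":
            store[item] = value
        elif store.get(item, initial) != value:
            return False
    return True

def interleavings(seqs):
    """Yield every merge of the sequences that keeps each one's own order."""
    if not any(seqs):
        yield []
        return
    for i, s in enumerate(seqs):
        if s:
            rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
            for tail in interleavings(rest):
                yield [s[0]] + tail

def some_valid_order(seqs):
    return any(is_valid(o) for o in interleavings(seqs))

def satisfies_sc(comp):
    return some_valid_order(list(comp.values()))

def satisfies_coherence(comp):
    # SC per data item: restrict every process to its operations on x.
    items = {o[2] for seq in comp.values() for o in seq}
    return all(some_valid_order([[o for o in seq if o[2] == x]
                                 for seq in comp.values()]) for x in items)

def satisfies_fifo(comp):
    # View of p: all of p's operations plus the writes of every other process.
    def view(p):
        return [seq if q == p else [o for o in seq if o[1] == "w"]
                for q, seq in comp.items()]
    return all(some_valid_order(view(p)) for p in comp)

def ops(proc, text):
    """Parse 'w(x)1 r(x)2' into (proc, kind, item, value) tuples."""
    return [(proc, t[0], t[2], int(t[4:])) for t in text.split()]

C2 = {"p": ops("p", "w(x)1 r(x)2"), "q": ops("q", "w(x)2 r(x)1")}
C3 = {"p": ops("p", "w(x)1 w(y)2"), "q": ops("q", "r(y)2 r(x)0")}
C4 = {"p": ops("p", "w(x)3 w(x)2 r(y)3"), "q": ops("q", "w(y)3 w(y)1 r(x)3")}
C5 = {"p": ops("p", "w(x)3 w(x)1 w(y)2"), "q": ops("q", "r(y)2 r(x)3")}
```

Under these assumptions the three functions agree with the slides: C2 is FIFO but not coherent, C3 and C5 are coherent but not FIFO, and C4 is both coherent and FIFO but not SC.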
Weak Consistency
• Consider a critical section
  – If a process is in a critical section, the intermediate results of its operations are not necessarily propagated to others.
• Idea
  – Enforce consistency on a group of operations
  – Limit the time when consistency holds
  – Let the programmer explicitly specify this
Synchronization Operations
• In addition to reads and writes, introduce a synchp() operation, which
  – synchronizes all local copies of the data store
    • Propagate local updates
    • Bring in others' updates
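A toy model of the synch operation may help fix the idea. The sketch below assumes a single shared dictionary standing in for "everywhere" and a hypothetical Replica class; it ignores concurrency and the ordering of synch operations, so it illustrates only the push and pull steps, not a full weak-consistency implementation.

```python
class Replica:
    """Local copy of the data store that only exchanges updates on synch()."""

    def __init__(self, store):
        self.store = store        # shared dict standing in for all other copies
        self.local = dict(store)  # this process's local copy
        self.pending = {}         # local writes not yet propagated

    def write(self, item, value):
        self.local[item] = value
        self.pending[item] = value

    def read(self, item):
        return self.local.get(item)

    def synch(self):
        self.store.update(self.pending)  # propagate local updates
        self.pending.clear()
        self.local = dict(self.store)    # bring in others' updates

shared = {"x": 0}
p, q = Replica(shared), Replica(shared)
p.write("x", 3)
before_any_synch = q.read("x")   # q has not seen p's write
p.synch()
after_p_synch = q.read("x")      # still stale: q itself has not synchronized
q.synch()
after_q_synch = q.read("x")      # now q sees the value written by p
```

The three reads show that q observes p's write only after both processes have synchronized.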
Weak Consistency (cont.)
• Three conditions
  1. No operation on a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
  2. No read or write operation on data items is allowed to be performed until all previous operations on synchronization variables have been performed.
  3. Accesses to synchronization variables associated with a data store are sequentially consistent.
Weak Consistency (cont.)
[Figure: two event sequences over processes P1, P2, and P3, labeled "Weak Consistency" and "Not Weak Consistency". Each shows the writes W(x, a), W(x, b), and W(y, c), the synchronization operations S1, S2, and S3, and reads of x and y returning a, b, c, or Nil. The figure is annotated with the first two conditions: no operation on a synchronization variable is performed until all previous writes have completed everywhere, and no read or write on data items is performed until all previous operations on synchronization variables have been performed.]
WC – Example
C6
  p: w(x)3 s()
  q: r(x)0 s() w(y)1 s’() r(x)3
  m: w(x)5 r(y)1 s() r(x)3
• Accesses to synchronization variables associated with a data store are sequentially consistent: all of p, q, and m must agree on a total order of synch operations consistent with program order; for example: <sq(), sp(), s’q(), sm()>
• (O|p ∪ O|w ∪ O|s, <p) = <wm(x)5, sq(), wp(x)3, sp(), wq(y)1, s’q(), sm()>
• (O|q ∪ O|w ∪ O|s, <q) = <rq(x)0, wm(x)5, sq(), wp(x)3, sp(), wq(y)1, s’q(), sm(), rq(x)3>
• (O|m ∪ O|w ∪ O|s, <m) = <wm(x)5, sq(), wp(x)3, sp(), wq(y)1, s’q(), rm(y)1, sm(), rm(x)3>
Summary of Consistency Models
(a) Consistency models not using synchronization operations
Consistency       Description
Strict            Absolute time ordering of all shared accesses matters.
Linearizability   All processes must see all shared accesses in the same order. Accesses are furthermore ordered according to a (nonunique) global timestamp.
Sequential        All processes see all shared accesses in the same order. Accesses are not ordered in time.
Causal            All processes see causally-related shared accesses in the same order.
FIFO              All processes see writes from each other in the order they were used. Writes from different processes may not always be seen in that order.
(b) Models with synchronization operations
Consistency       Description
Weak              Shared data can be counted on to be consistent only after a synchronization is done.
Release           Shared data are made consistent when a critical region is exited.
Entry             Shared data pertaining to a critical region are made consistent when a critical region is entered.
Weaker Models
• Sometimes strong models are needed, when the results of race conditions are very bad.
  – Banks
• Sometimes the results of races are just inefficiency, inconvenience, etc.
• How strong is Orbitz's model?
  – If it shows that a flight ticket with a certain price is available, is it really?
• One kind of weaker model is eventual consistency
  – The data store eventually becomes consistent
Lazy Consistency Models
• When updates are scarce
• When updates are not conflicting
  – Examples: DNS and WWW
• Eventual Consistency (EC): Lazy propagation of updates to all replicas
  – If no updates take place for a long time, all replicas will become consistent
  – Cheap to implement
  – If a client always accesses the same replica, the same or newer data will be read as time passes. EC works.
Eventual Consistency
• How well does EC work for mobile clients?
• Client-centric consistency models address this case: they guarantee consistency for a single client.
Notation
• xi[t] is the version of x at local copy Li at time t.
• Version xi[t] is the result of a series of write operations at Li that took place since initialization. This is WS(xi[t]).
• If operations in WS(xi[t1]) have also been performed at local copy Lj at a later time t2, we write WS(xi[t1];xj[t2]).
Monotonic Reads (1)
• A data store is said to provide monotonic-read consistency:
  – If a process reads the value of a data item x, any successive read operation on x by that process will always return that same value or a more recent value.
Monotonic Reads (2)
• A data store that provides monotonic reads consistency
• A data store that does not
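One way a client could enforce monotonic reads on top of an eventually consistent store is to remember the newest version it has seen. The following is a minimal sketch under that assumption; the Session class, the per-item version numbers, and the dictionary replicas L1 and L2 are all hypothetical.

```python
class Session:
    """Client-side guard that rejects reads older than what was already seen."""

    def __init__(self):
        self.seen = {}  # item -> highest version this client has read

    def read(self, replica, item):
        version, value = replica[item]
        if version < self.seen.get(item, 0):
            raise RuntimeError("stale replica: would violate monotonic reads")
        self.seen[item] = version
        return value

# Two local copies, each mapping an item to (version, value).
L1 = {"x": (2, "b")}   # already has the second write to x
L2 = {"x": (1, "a")}   # lags behind

session = Session()
first = session.read(L1, "x")
try:
    session.read(L2, "x")
    rejected = False
except RuntimeError:
    rejected = True
```

Reading from L1 and then from the lagging copy L2 is rejected, because L2 does not yet reflect the writes the client has already observed.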
Monotonic Writes (1)
• In a monotonic-write consistent store, the following condition holds:
  – A write operation by a process on a data item x is completed before any successive write operation on x by the same process.
Monotonic Writes (2)
• A data store that provides monotonic writes consistency
• A data store that does not
Read Your Writes (1)
• A data store is said to provide read-your-writes consistency if:
  – The effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process.
• Suppose your web browser has a cache.
  – You update your web page on the server.
  – Do you have read-your-writes consistency?
Read Your Writes (2)
• A data store that provides read your writes consistency
• A data store that does not
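Read-your-writes can be phrased with the write-set notation: a client may read from a local copy only if that copy's write set contains all of the client's own writes. The sketch below assumes a replica is modeled as the set of write identifiers it has applied; the Client class and the identifiers are illustrative.

```python
class Client:
    """Tracks its own writes and checks them against a replica's write set."""

    def __init__(self):
        self.my_writes = set()

    def write(self, replica, write_id):
        replica.add(write_id)
        self.my_writes.add(write_id)

    def can_read(self, replica):
        # Safe only if every write by this client has reached the replica.
        return self.my_writes <= replica

L1, L2 = set(), set()      # write sets of two local copies
client = Client()
client.write(L1, "w1")
ok_at_L1 = client.can_read(L1)
ok_at_L2_before = client.can_read(L2)   # w1 has not been propagated to L2
L2 |= L1                                # lazy propagation from L1 to L2
ok_at_L2_after = client.can_read(L2)
```

The check fails at L2 until propagation has carried the client's write there, which mirrors the browser-cache question above.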
Writes Follow Reads (1)
• A data store is said to provide writes-follow-reads consistency if:
  – A write operation by a process on a data item x following a previous read operation on x by the same process is guaranteed to take place on the same or a more recent value of x than the one that was read.
Writes Follow Reads (2)
• A data store that provides writes follow reads consistency
• A data store that does not
Part II Replica Management & Consistency Protocols
Replica Management
• Replica-Server Placement
• Replica Placement:
  – Where is a replica placed?
  – When is a replica created?
  – Who creates the replica?
• How do we distribute updates between replicas?
  – Update propagation
Types of Replicas
Permanent Replicas
• Initial set of replicas
  – Other replicas can be created from them
  – Small and static set
• Example: Web site horizontal distribution
  1. Replicate Web site on a limited number of machines on a LAN
    – Distribute requests in round-robin
  2. Replicate Web site on a limited number of machines on a WAN (mirroring)
    – Clients choose which sites to talk to
Server-Initiated Replicas (1)
• Dynamically created at the request of the owner of the data store
• Example: push-based caches
  – Owners: Web site owners such as CNN, Yahoo
  – Web hosting services: e.g., those provided by Akamai
  – Web hosting servers can dynamically create replicas close to the demanding clients
• Need a dynamic policy to create, migrate, and delete replicas
Server-Initiated Replicas (2)
• One policy is to keep track of Web page hits
  – Keep a counter and access-origin list for each page
[Figure: file F stored at server Q, with a second server P]
Server-Initiated Replicas (3)
• Count access requests from different clients: cntQ(P, F) is the number of requests for F that arrive at Q and are attributed to P. If cntQ(P, F) is more than ½ of all requests for F at Q, then Q attempts to migrate F to P.
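The threshold rule can be written down directly. This is a sketch only: the dictionary layout of the counters and the function name should_migrate are assumptions, and a real policy would also need the replication and deletion thresholds that this slide does not cover.

```python
def should_migrate(cnt_q, target, file):
    """Decide at server Q whether to try migrating `file` to server `target`.

    cnt_q maps (server, file) -> number of requests for `file` seen at Q
    that are attributed to `server`.
    """
    total = sum(n for (server, f), n in cnt_q.items() if f == file)
    return cnt_q.get((target, file), 0) > total / 2

# Hypothetical counters kept at Q for file F.
mostly_p = {("P", "F"): 7, ("Q", "F"): 3}
evenly_split = {("P", "F"): 5, ("Q", "F"): 5}
```

With mostly_p, more than half of the requests for F are attributed to P, so Q would attempt the migration; with evenly_split it would not.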
Client-Initiated Replicas
• These are caches
  – Temporary storage (expire fast)
• Managed by clients
• Cache hit: return data from cache
• Cache miss: load copy from original server
• Kept on the client machine, or on the same LAN
• Multi-level caches
Multi-Level Caches
Design Issues for Update Propagation
• A process modifies a replica.
• Based on the consistency model supported, the update is then propagated to all other replicas at its proper time
• How to carry out update propagation and make two replicas consistent with each other?
  1. Propagate state or operation
  2. Pull or push protocols
  3. Unicast or multicast propagation
State versus Operation Propagation (1)
1. Propagate a notification of update
  • Invalidate protocols
  • When data item x is changed at a replica, it is invalidated at other replicas
  • An attempt to read the item causes an “item-fault”, which triggers an update of the local copy before the read can complete
  • Uses little network bandwidth
  • When is it good to use this distribution protocol?
    • When the read-to-write ratio is low or high?
    • Good when the read-to-write ratio is low (many updates occur between reads, so most propagated values would never be read)
State versus Operation Propagation (2)
When the read-to-write ratio is high:
2. Transfer modified data or a log of changes
  • High network bandwidth usage
3. Propagate the update operation
  • Each replica must have a process capable of performing the update
  • Very low network bandwidth usage
• Other trade-offs between these two protocols?
Pull versus Push
• Push: updates are propagated to other replicas without solicitation
  – Typically, from permanent to server-initiated replicas
  – Used to achieve a high degree of consistency
• Pull: a replica asks another for an update
  – Typically, from a client-initiated replica
  – An inconsistent cache results in longer response time
Pull versus Push Protocols
A comparison between push-based and pull-based protocols in the case of multiple client, single server systems.
Issue                     Push-based                                 Pull-based
State of server           List of client replicas and caches         None
Messages sent             Update (and possibly fetch update later)   Poll and update
Response time at client   Immediate (or fetch-update time)           Fetch-update time
Leases
• Combined push and pull
• A lease is a promise by the server to
  – push updates for a certain time
  – when a lease expires, the client
    • polls the server or requests a new lease
  – length of a lease?
  – different types of leases
    • age based: {time to last modification}
    • renewal-frequency based: long-lasting leases to active users
    • state-space overhead: increasing utilization of a server => lower expiration times of new leases
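The server side of a lease can be sketched as a table of expiry times. The LeaseServer class below is a hypothetical illustration with a fixed lease length and explicit timestamps passed in by the caller; adaptive lease lengths (age based, renewal-frequency based, state-space based) are not modeled.

```python
class LeaseServer:
    """Pushes updates only to clients whose lease has not yet expired."""

    def __init__(self, lease_length):
        self.lease_length = lease_length
        self.leases = {}  # client -> expiry time

    def grant(self, client, now):
        # Granting again before or after expiry simply renews the lease.
        self.leases[client] = now + self.lease_length
        return self.leases[client]

    def clients_to_push(self, now):
        # Clients with an expired lease must poll or request a new lease.
        return sorted(c for c, expiry in self.leases.items() if expiry > now)

server = LeaseServer(lease_length=10)
server.grant("a", now=0)   # expires at time 10
server.grant("b", now=5)   # expires at time 15
```

At time 8 both clients still receive pushed updates, at time 12 only "b" does, and after time 15 neither does until a lease is renewed.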
Unicast versus Multicast
• Unicast: a replica sends n-1 separate updates, one to every other replica
• Multicast: a replica sends one update to multiple replicas
  – Network takes care of multicasting
  – Can be cheaper
  – Suits push-based protocols
Consistency Protocols
• Actual implementation of consistency models
• For instance, how to implement sequential consistency?
• Whether a primary copy exists or not
  – Primary-based protocols
  – Replicated-write protocols
Primary-Based Remote-Write Protocol
• Primary copy and backups for each data item
• Read from local copy
• Write to the (remote) primary server
  – Update backups
• Blocking vs. non-blocking update
Primary-Based Remote-Write Protocol (cont.)
The principle of primary-backup protocol.
[Figure: the primary-backup protocol shown next to the SC intuition figure (processes, switch, FIFO channels, all data items in the set D), annotated with the labels "Sequential consistency" and "Read Your Writes"]
Primary-Based Local Write Protocol
• Primary copy and backups for each data item
• Read from local copy
• Move primary copy to local server and write to it
• Update backups
Primary-Based Local-Write Protocol (cont.)
Primary-backup protocol in which the primary migrates to the process wanting to perform an update.
Example: a mobile PC becomes the primary server for the items it will need
[Figure annotations: Sequential consistency; Read Your Writes]
• Which consistency protocol does DNS (Domain Name System) follow?
No Primary Copy Replicated-Write Protocols
• Active replication
• Quorum-based protocol
Ordering Guarantees
• Updates sent from different processes may be delivered in different orders at different sites
• Totally-ordered multicast
• Causally-ordered multicast
Ordering Guarantees
• The sequencer approach
  – All requests must be sent to a sequencer, where they are given an identifier
  – The sequencer assigns consecutive increasing identifiers as it receives requests
  – Requests arriving at sites are held back until they are next in sequence
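The hold-back rule can be illustrated with a few lines of code. This is a minimal single-threaded sketch with hypothetical Sequencer and Site classes; message transport, sequencer failure, and retransmission are left out.

```python
class Sequencer:
    """Assigns consecutive increasing identifiers to incoming requests."""

    def __init__(self):
        self.next_id = 0

    def assign(self, request):
        seq = self.next_id
        self.next_id += 1
        return seq, request

class Site:
    """Delivers requests strictly in identifier order, holding back the rest."""

    def __init__(self):
        self.expected = 0    # identifier of the next request to deliver
        self.held = {}       # identifier -> request that arrived too early
        self.delivered = []

    def receive(self, seq, request):
        self.held[seq] = request
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1

sequencer = Sequencer()
m0, m1, m2 = (sequencer.assign(r) for r in ("a", "b", "c"))

in_order, reordered = Site(), Site()
for m in (m0, m1, m2):
    in_order.receive(*m)
for m in (m2, m0, m1):       # the network delivers the messages out of order
    reordered.receive(*m)
```

Both sites end up delivering the requests in the same total order, even though the second one received them in a different order.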
Active Replication
The problem of replicated invocations.
Active Replication (cont.)
(a) Forwarding an invocation request from a replicated object.
(b) Returning a reply to a replicated object.
Network Partitions
• Primary-based
  – Remote-write protocol
  – Local-write protocol
• Replicated-write protocol
  – Totally-ordered multicast approach
  – Sequencer approach
• None of them works if the network is partitioned!
Network Partitions
• Idea?
  – Use majority
    • Write
    • Read
Well-known Solution: Quorum-Based Protocols
• Read
  – Contact the number of replicas required for a read quorum
  – Select the one with the latest version.
  – Perform a read on it
• Write
  – Contact the number of replicas required for a write quorum.
  – Find the latest version and increment it.
  – Perform a write on the entire write quorum.
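The read and write steps can be sketched with versioned replicas. The code below is an illustration under simplifying assumptions: the Replica class is hypothetical, the caller picks the quorum members, and there is no locking or failure handling.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

def quorum_read(read_quorum):
    """Return (version, value) of the most recent copy in the read quorum."""
    latest = max(read_quorum, key=lambda r: r.version)
    return latest.version, latest.value

def quorum_write(write_quorum, value):
    """Install `value` under a new version on every member of the write quorum."""
    version = max(r.version for r in write_quorum) + 1
    for r in write_quorum:
        r.version, r.value = version, value
    return version

replicas = [Replica() for _ in range(5)]   # N = 5, using quorums of size 3
quorum_write(replicas[:3], "a")            # replicas 0, 1, 2
quorum_write(replicas[2:], "b")            # replicas 2, 3, 4 overlap at replica 2
result = quorum_read(replicas[:3])         # overlaps the last write at replica 2
```

Because any two quorums of size 3 out of 5 share at least one replica, the read sees the version installed by the latest write.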
Quorum-Based Protocols
• N: Total #Replicas
• NR: #Replicas in Read Quorum
• NW: #Replicas in Write Quorum
• Constraints:
  1. NR + NW > N
  2. NW > N/2
Quorum-Based Protocols
Three examples of the voting algorithm for N = 12 replicas
(a) A correct choice of read and write set
(b) A choice that may lead to write-write conflicts
(c) A correct choice, known as ROWA (read one, write all)
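The two constraints are easy to evaluate for a given configuration. In the sketch below the function name check_quorum is hypothetical, and the NR and NW values are assumed choices picked to match the three captions for N = 12; the figure itself is not reproduced here.

```python
def check_quorum(n, nr, nw):
    """Evaluate the two quorum constraints for N replicas."""
    return {
        "read_write_ok": nr + nw > n,   # every read quorum overlaps every write quorum
        "write_write_ok": nw > n / 2,   # any two write quorums overlap
    }

# Assumed values for N = 12, chosen to fit captions (a), (b), and (c).
case_a = check_quorum(12, nr=3, nw=10)
case_b = check_quorum(12, nr=7, nw=6)
case_c = check_quorum(12, nr=1, nw=12)   # ROWA: read one, write all
```

With these values, cases (a) and (c) satisfy both constraints, while case (b) fails NW > N/2, so two disjoint write quorums of six replicas each could accept conflicting writes.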