Consistency and Replication (3)
Topics
Consistency protocols
Readings
Van Steen and Tanenbaum: 6.5; Coulouris: 11, 14
Introduction
A consistency protocol describes an implementation of a specific consistency model.
We will look at different architectures that can be used to support different consistency models, but first we look at a basic architectural model.
A Basic Architectural Model for the Management of Replicated Data
[Figure: clients (C) exchange requests and replies with front ends (FE), which communicate with the replica managers (RM) that make up the service.]
A collection of replica managers provides a service to clients.
The clients see a service that gives them access to objects (e.g., calendar or bank accounts) which are replicated.
Each client’s requests are handled by a component called a front end.
The purpose of the front end is to hide the replication from the client process.
The client processes do not know how many replicas there are.
A front end may be implemented in the client’s address space or it may be a separate process.
Replicas coordinate in preparation to execute the request consistently.
Replica managers execute requests.
One or more replicas may respond to the application (through the front end).
Primary-Based Protocols
In primary-based protocols, each data item x in the data store has an associated primary, which is responsible for coordinating write operations on x.
Primary-Backup Protocols
Read operations are performed on a locally available copy.
Write operations are done at a fixed primary copy.
The primary performs the update on its local copy of x and then forwards the update to all the other replicas (which are considered to be backups).
Primary-Backup Protocols (cont)
Each backup server performs the update as well and sends an acknowledgement back to the primary.
When all backup servers have updated their local copy, the primary sends an acknowledgement back to the initial process.
This implements sequential consistency.
The primary RM is a performance bottleneck.
Can tolerate F failures for F+1 RMs.
SUN NIS (yellow pages) uses passive replication: clients can contact primary or backup servers for reads, but only primary servers for updates.
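The write path described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class names `Backup` and `Primary` are hypothetical, everything runs in one process, and no failures or message delays are modelled.

```python
class Backup:
    """A backup replica manager holding a local copy of the data store."""

    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        # Perform the update locally, then acknowledge to the primary.
        self.store[key] = value
        return True  # acknowledgement

    def read(self, key):
        # Reads are served from the locally available copy.
        return self.store.get(key)


class Primary(Backup):
    """The fixed primary copy that coordinates all writes."""

    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        # 1. Update the primary's own copy.
        self.store[key] = value
        # 2. Forward the update to every backup and collect acknowledgements.
        acks = [b.apply(key, value) for b in self.backups]
        # 3. Acknowledge the client only once all backups have updated.
        return all(acks)


backups = [Backup(), Backup()]
primary = Primary(backups)
ok = primary.write("x", 42)
values = [rm.read("x") for rm in [primary] + backups]
```

Because the client's acknowledgement waits for every backup, a read at any replica after the acknowledgement returns the new value; this blocking step is also why the primary becomes a bottleneck.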
The Primary-Backup Protocol
[Figure: two clients (C) with front ends (FE), a primary RM, and two backup RMs.]
Replicated-Write Protocols
In replicated-write protocols, write operations can be carried out at multiple replicas instead of only one (as is the case in primary-based protocols).
Operations need to be carried out in the same order everywhere.
We discussed one approach for doing so that uses Lamport’s timestamps.
Using Lamport timestamps does not scale well in large distributed systems.
An alternative approach to achieving total order is to use a central coordinator, which is sometimes called a sequencer.
Forward each operation to the sequencer.
The sequencer assigns a unique sequence number and subsequently forwards the operation to all replicas.
Operations are carried out in the order of their sequence number.
Hmm. This resembles primary-based consistency protocols.
Useful for sequential consistency.
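A minimal sketch of the sequencer idea follows. The names `Sequencer` and `Replica` and the hold-back buffer are assumptions made for illustration; the point is only that replicas apply operations strictly in sequence-number order, even if the operations arrive out of order.

```python
import itertools


class Replica:
    def __init__(self):
        self.expected = 1   # next sequence number to apply
        self.pending = {}   # operations that arrived too early
        self.applied = []   # operations in the order they were carried out

    def deliver(self, seq, op):
        # Buffer the operation, then apply everything that is now in order.
        self.pending[seq] = op
        while self.expected in self.pending:
            self.applied.append(self.pending.pop(self.expected))
            self.expected += 1


class Sequencer:
    def __init__(self, replicas):
        self.replicas = replicas
        self._next = itertools.count(1)

    def submit(self, op):
        # Assign a unique sequence number and forward to all replicas.
        seq = next(self._next)
        for r in self.replicas:
            r.deliver(seq, op)
        return seq


replicas = [Replica(), Replica()]
sequencer = Sequencer(replicas)
sequencer.submit("w1")
sequencer.submit("w2")

# A replica that receives the same operations out of order still applies
# them in sequence-number order.
late = Replica()
late.deliver(2, "w2")
late.deliver(1, "w1")
```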
The use of a sequencer does not solve the scalability problem.
A combination of Lamport timestamps and sequencers may be necessary.
The approach is summarized as follows:
Each process has a unique identifier, pi, and keeps a sent message counter ci. The process identifier and message counter uniquely identify a message.
Active processes (or sequencers) keep an extra counter: ti. This is called the ticket number.
A ticket is a triplet (pi, ti, (pj, cj)).
Approach Summary (cont)
An active process issues tickets for its own messages and for messages from its associated passive processes (these are processes that are not sequencers).
Passive processes multicast their messages to all group processes, which then wait for a ticket stating the total order of each message.
The ticket is sent by each passive process’s sequencer.
Lamport’s totally ordered multicast algorithm is used among the sequencers to determine the order of update operations.
When an operation is allowed, each sequencer sends the ticket to its associated passive processes.
It is assumed that the passive process receives these tickets in the order sent.
Approach Summary (cont)
If a sequencer terminates abnormally, then one of the passive processes associated with it can become the new sequencer.
An election algorithm may be used to choose the new sequencer.
Let’s say that we have 6 processes: p1, p2, p3, p4, p5, p6.
Assume that p1 and p2 are sequencers; p3 and p4 are associated with p1, and p5 and p6 are associated with p2.
Let’s say that p3 sends a message which is identified by (p3, 1).
p1 generates a ticket as follows: (p1, 1, (p3, 1)).
The ticket number is generated using the Lamport clock algorithm.
Let’s say that p5 sends a message which is identified by (p5, 1).
p2 generates a ticket as follows: (p2, 1, (p5, 1)).
Which update gets done first? Basically, p1 and p2 will apply Lamport’s algorithm for totally ordered multicast.
When an update operation is allowed to proceed, the sequencers send messages to their associated processes.
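The two tickets in this example can be ordered with a short Python sketch. It assumes the usual Lamport tie-break, in which equal ticket numbers are ordered by the identifier of the issuing sequencer; the slides do not spell out the tie-break, and the `Ticket` type and `total_order` function are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Ticket(NamedTuple):
    sequencer: str            # pi: the sequencer that issued the ticket
    number: int               # ti: the ticket number
    message: Tuple[str, int]  # (pj, cj): sender id and its message counter


def total_order(tickets):
    # Assumed tie-break: equal ticket numbers are ordered by sequencer id.
    return sorted(tickets, key=lambda t: (t.number, t.sequencer))


tickets = [
    Ticket("p2", 1, ("p5", 1)),
    Ticket("p1", 1, ("p3", 1)),
]
ordered = [t.message for t in total_order(tickets)]
```

Under this assumption both sequencers reach the same decision: the message (p3, 1) is carried out before (p5, 1), because the ticket numbers tie and p1 precedes p2.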
Gossip Architecture
We just studied some architectures for sequential consistency. What about causal consistency?
The Gossip Architecture supports causally-consistent lazy replication which in essence refers to the potential causality between read and write operations.
Clients are allowed to communicate with each other, but will then have to exchange information on the operations they performed on the data store. This exchange of information is done through gossip messages.
Each RMi maintains for its local copy the vector timestamp VAL(i).
VAL(i)[i]: the total number of completed write requests that have been sent from a client to RMi.
VAL(i)[j]: the total number of completed write requests that have been sent from RMj to RMi.
This is referred to as the value timestamp and it reflects the updates that have been completed at the replica.
This timestamp is attached to the reply of a read operation.
Each RMi maintains for its local copy the vector timestamp WORK(i), which represents those write operations that have been received (but not necessarily processed) at RMi.
WORK(i)[i]: the total number of write requests that have been sent from a client to RMi, including those that have been completed by RMi.
WORK(i)[j]: the total number of write requests that have been sent from RMj to RMi, including those that have been completed by RMi.
This is referred to as the replica timestamp.
This timestamp is attached to the reply of a write operation.
Each client keeps track of the writes that it has seen so far.
The client C maintains a vector timestamp LOCAL(C), with LOCAL(C)[i] set equal to the most recent value of the number of writes seen at RMi (from C’s viewpoint).
This vector timestamp is attached to every request sent to a replica.
Note that the client can contact a different replica each time it wants to read or write data.
Two front ends may exchange messages directly; these messages also carry the timestamp represented by LOCAL(C).
Write log (queue)
Every write operation, when received by a replica, is recorded in the update log of the replica. Two reasons for this:
The update cannot be applied yet; it is held back.
It is uncertain if the update has been received by all replicas.
The entries are sorted by timestamp.
A similar log is needed for read operations. This is referred to as the read log (or queue).
The Executed Operation table
The same write operation may arrive at a replica from a front end and in a gossip message from another replica.
To prevent an update from being applied twice, the replica keeps a list of identifiers of the write operations that have been applied so far.
Processing read request R from C
Let DEP(R) be the timestamp associated with R. It is set to LOCAL(C).
The request is sent to RMi (with DEP(R)), which stores the request in its read queue.
The read request is processed if DEP(R)[j] <= VAL(i)[j] (for all j).
This indicates that RMi has seen at least the writes that the client has seen.
As soon as a read operation can be carried out, RMi returns the value of the requested data item to the client, along with VAL(i).
LOCAL(C) is adjusted by setting each jth entry to max{LOCAL(C)[j],VAL(i)[j]}.
This makes sense since the value returned by the read is potentially the cumulative result of all previous writes.
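The read rule can be sketched as follows. The classes `Replica` and `Client` and the helper names are assumptions for illustration; a read that cannot be served yet is simply left in the read queue, and the gossip that would later release it is not modelled here.

```python
def covers(val, dep):
    """True if dep[j] <= val[j] for every j (the replica has seen enough)."""
    return all(d <= v for d, v in zip(dep, val))


def merge(a, b):
    """Entry-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]


class Replica:
    def __init__(self, val, value):
        self.val = list(val)    # VAL(i): value timestamp
        self.value = value      # the replicated data item
        self.read_queue = []    # reads that must wait for gossip

    def read(self, dep):
        if covers(self.val, dep):
            return self.value, list(self.val)
        self.read_queue.append(list(dep))
        return None


class Client:
    def __init__(self, n):
        self.local = [0] * n    # LOCAL(C)

    def read(self, replica):
        reply = replica.read(self.local)   # DEP(R) = LOCAL(C)
        if reply is None:
            return None                    # held back at the replica
        value, val = reply
        self.local = merge(self.local, val)
        return value


rm = Replica(val=[1, 0, 0], value="a")
fresh, ahead = Client(3), Client(3)
ahead.local = [1, 0, 2]        # has seen writes that rm has not
got_fresh = fresh.read(rm)
got_ahead = ahead.read(rm)
```

The client `fresh` is served immediately and advances its LOCAL to (1,0,0); the client `ahead` has seen two writes at replica 2 that `rm` has not, so its read waits in the queue.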
[Figure: Performing a read operation at a local copy.]
Processing a write operation, W, from C
Let DEP(W) be the timestamp associated with W. It is set to LOCAL(C).
When the request is received by RMi, it increments WORK(i)[i] by 1 but leaves the other entries intact.
This is done so that WORK reflects that RMi has received the latest write request. At this point it isn’t known if it can be carried out.
A timestamp ts(W) is derived from DEP(W) by setting ts(W)[i] to WORK(i)[i]; the rest of entries are as found in DEP(W).
This timestamp is sent back as an acknowledgement to the client, which subsequently adjusts LOCAL(C) by setting each kth entry to max{LOCAL(C)[k],ts(W)[k]}.
Processing Write Operations (cont)
The write request W is processed if DEP(W)[j] <= VAL(i)[j] (for all j).
This indicates that RMi has seen at least the writes that the client has seen. This is referred to as the stability condition.
The write operation takes place.
What if there exists a j such that DEP(W)[j] > VAL(i)[j]? This would indicate that there was a write seen by the client that is not yet seen by RMi. W is then held back in the write log.
Processing Write Operations (cont)
VAL(i) is adjusted by setting each jth entry to max{VAL(i)[j],ts(W)[j]}.
Recall that ts(W)[j] is set to DEP(W)[j] for all j != i and is set to WORK(i)[i] for j = i (which had been incremented upon receiving the write request; the end result is that VAL(i)[i] is incremented by 1).
The following two conditions are satisfied:
All operations sent directly to RMi from other clients that preceded W have been processed: ts(W)[i] = VAL(i)[i] + 1.
All write operations that W depends on have been processed: ts(W)[j] <= VAL(i)[j] for all j != i.
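The write steps can be sketched in Python as below. The `Replica` class, its field names, and the string labels for operations are assumptions for illustration; stability is checked with the condition DEP(W)[j] <= VAL(i)[j], and unstable writes stay in the log.

```python
class Replica:
    def __init__(self, i, n):
        self.i = i              # index of this replica manager
        self.val = [0] * n      # VAL(i): value timestamp
        self.work = [0] * n     # WORK(i): replica timestamp
        self.log = []           # held-back writes: (dep, ts, op)
        self.applied = []       # operations carried out so far

    def receive_write(self, dep, op):
        # Count the request, then derive ts(W) from DEP(W).
        self.work[self.i] += 1
        ts = list(dep)
        ts[self.i] = self.work[self.i]
        self.log.append((list(dep), ts, op))
        self.apply_stable()
        return ts               # sent back to the client as the ack

    def apply_stable(self):
        # Apply every logged write whose dependencies are covered by VAL.
        progress = True
        while progress:
            progress = False
            for entry in list(self.log):
                dep, ts, op = entry
                if all(d <= v for d, v in zip(dep, self.val)):
                    self.val = [max(v, t) for v, t in zip(self.val, ts)]
                    self.applied.append(op)
                    self.log.remove(entry)
                    progress = True


r0 = Replica(0, 3)
ts0 = r0.receive_write([0, 0, 0], "W0")   # applied at once

r2 = Replica(2, 3)
ts1 = r2.receive_write([0, 0, 0], "W1")   # applied at once
ts2 = r2.receive_write([1, 0, 0], "W2")   # depends on W0: held back
```

Here W2 carries DEP(W2) = (1,0,0), but replica 2 has VAL = (0,0,1), so W2 stays in the log until replica 2 learns about the write performed at replica 0.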
[Figure: Performing a write operation at a local copy.]
For every gossip message received by RMj from RMi, RMj does the following:
RMj adjusts WORK(j) by setting each kth entry equal to max{WORK(i)[k],WORK(j)[k]}.
RMj merges the write operations sent by RMi with its own.
RMj applies those writes that have become stable, i.e., a write request W is processed if DEP(W)[k] <= VAL(j)[k] (for all k). A write from RMi that is processed should cause VAL(j)[i] to be incremented by 1.
A gossip message need not contain the entire log, if it is certain that some of the updates have been seen by the receiving replica.
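The gossip step can be sketched as follows. This is a simplified illustration: the `Replica` class, the dictionary-based log, and the write identifiers "W0", "W1", "W2" are assumptions, and each gossip message carries the sender's whole log. The set `executed` plays the role of the executed operation table.

```python
class Replica:
    def __init__(self, val, work, log, executed):
        self.val = list(val)            # VAL: value timestamp
        self.work = list(work)          # WORK: replica timestamp
        self.log = dict(log)            # write id -> (DEP, ts)
        self.executed = set(executed)   # ids of writes already applied

    def receive_gossip(self, sender_work, sender_log):
        # Merge the replica timestamps entry by entry.
        self.work = [max(a, b) for a, b in zip(self.work, sender_work)]
        # Merge the logs; a write already in the log is not added twice.
        for wid, entry in sender_log.items():
            self.log.setdefault(wid, entry)
        self.apply_stable()

    def apply_stable(self):
        progress = True
        while progress:
            progress = False
            for wid, (dep, ts) in self.log.items():
                if wid in self.executed:
                    continue
                if all(d <= v for d, v in zip(dep, self.val)):
                    self.val = [max(v, t) for v, t in zip(self.val, ts)]
                    self.executed.add(wid)
                    progress = True


# Replica 0 has applied W0; replica 2 has applied W1 and is holding W2.
r0 = Replica([1, 0, 0], [1, 0, 0],
             {"W0": ([0, 0, 0], [1, 0, 0])}, {"W0"})
r2 = Replica([0, 0, 1], [0, 0, 2],
             {"W1": ([0, 0, 0], [0, 0, 1]),
              "W2": ([1, 0, 0], [1, 0, 2])}, {"W1"})

# Take snapshots first so that each side gossips its pre-exchange state.
w0, l0 = list(r0.work), dict(r0.log)
w2, l2 = list(r2.work), dict(r2.log)
r2.receive_gossip(w0, l0)
r0.receive_gossip(w2, l2)
```

After the exchange both replicas have applied W0, W1, and W2, and under the max rule for VAL both end with VAL = WORK = (1,0,2).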
Gossip Architecture (Example)
Initial state
Replica 0: VAL = (0,0,0), WORK = (0,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Client 0: LOCAL = (0,0,0)
Client 1: LOCAL = (0,0,0)
Gossip Architecture (Example)
Client 0 sends a write, W0, to replica 0 with DEP(W0) = (0,0,0).
Replica 0: VAL = (0,0,0), WORK = (0,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Client 0: LOCAL = (0,0,0)
Client 1: LOCAL = (0,0,0)
Gossip Architecture (Example)
WORK is updated at replica 0.
Replica 0: VAL = (0,0,0), WORK = (1,0,0); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Client 0: LOCAL = (0,0,0)
Client 1: LOCAL = (0,0,0)
Gossip Architecture (Example)
Client 0 receives an ack, carrying ts(W0), from replica 0 for its write. LOCAL changes from (0,0,0) to (1,0,0).
Replica 0: VAL = (0,0,0), WORK = (1,0,0); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,0)
Gossip Architecture (Example)
W0 is applied at replica 0 since DEP(W0) <= VAL; VAL changes.
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,0), WORK = (0,0,0)
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,0)
Gossip Architecture (Example)
State after client 1 sends a write, W1, to replica 2.
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (0,0,1); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1)
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
Client 0 sends a write, W2, to replica 2 with DEP(W2) = (1,0,0). W2 cannot be done yet since replica 2 has not seen the write done at replica 0.
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (0,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,0)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
An ack, carrying ts(W2), has been returned to client 0, which then updates LOCAL from (1,0,0) to (1,0,2).
Replica 0: VAL = (1,0,0), WORK = (1,0,0); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (0,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
Replicas 0 and 2 exchange update propagation messages (gossip). WORK at both replicas is adjusted.
Replica 0: VAL = (1,0,0), WORK = (1,0,2); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (1,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
Replica 0 has one write operation (W0). This is sent to replica 2 with DEP(W0). Replica 2 has write operation W1. This is sent to replica 0 with DEP(W1). Replica 2 also sends W2 with DEP(W2).
Replica 0: VAL = (1,0,0), WORK = (1,0,2); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (1,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
Replica 2 can carry out W0 since DEP(W0) <= VAL. Replica 0 can carry out W1 since DEP(W1) <= VAL.
Replica 0: VAL = (1,0,0), WORK = (1,0,2); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (0,0,1), WORK = (1,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
VAL in replica 0 and replica 2 are updated.
Replica 0: VAL = (1,0,1), WORK = (1,0,2); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (1,0,1), WORK = (1,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Gossip Architecture (Example)
W2 can now be executed at replica 2 since DEP(W2) <= VAL; W2 can also be applied at replica 0.
Replica 0: VAL = (1,0,1), WORK = (1,0,2); log: W0 with DEP(W0) = (0,0,0), ts(W0) = (1,0,0)
Replica 1: VAL = (0,0,0), WORK = (0,0,0)
Replica 2: VAL = (1,0,1), WORK = (1,0,2); log: W1 with DEP(W1) = (0,0,0), ts(W1) = (0,0,1); W2 with DEP(W2) = (1,0,0), ts(W2) = (1,0,2)
Client 0: LOCAL = (1,0,2)
Client 1: LOCAL = (0,0,1)
Summary
There are good reasons to introduce replication.
However, replication introduces consistency problems.
Keeping replicas consistent may severely degrade performance, especially in large-scale systems.
Thus consistency is relaxed.
We have studied consistency models and protocols.