
Consistency and Replication (3)

Topics

Consistency protocols

Readings

Van Steen and Tanenbaum: 6.5

Coulouris: 11, 14

Introduction

A consistency protocol describes an implementation of a specific consistency model.

We will look at different architectures that can be used to support different consistency models, but first we look at a basic architectural model.

A Basic Architectural Model for the Management of Replicated Data

[Figure: clients (C) exchange requests and replies with front ends (FE), which communicate with the replica managers (RM) that make up the service.]


A collection of replica managers provides a service to clients.

The clients see a service that gives them access to objects (e.g., calendar or bank accounts) which are replicated.

Each client’s requests are handled by a component called a front end.


The purpose of the front end is to hide the replication from the client process.

The client processes do not know how many replicas there are.

A front end may be implemented in the client’s address space or it may be a separate process.

Replicas coordinate in preparation to execute the request consistently.


Replica managers execute requests.

One or more replicas may respond to the application (through the front end).

Primary-Based Protocols

In primary-based protocols, each data item x in the data store has an associated primary, which is responsible for coordinating write operations on x.

Primary-Backup Protocols

Read operations are performed on a locally available copy.

Write operations are done at a fixed primary copy.

The primary performs the update on its local copy of x and then forwards the update to all the other replicas (which are considered to be backups).

Primary-Backup Protocols (cont)

Each backup server performs the update as well and sends an acknowledgement back to the primary.

When all backup servers have updated their local copy, the primary sends an acknowledgement back to the initial process.

This implements sequential consistency.

The primary RM is a performance bottleneck.

Can tolerate F failures with F+1 RMs.

Sun NIS (Yellow Pages) uses passive replication: clients can contact primary or backup servers for reads, but only primary servers for updates.
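The write path described above can be sketched in a few lines. This is a minimal, single-process illustration only: the class names `Primary` and `Backup` and the in-memory `store` dictionary are assumptions made for the sketch, and failures and network messages are not modelled.

```python
class Backup:
    """A backup replica manager holding a local copy of the data store."""

    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        # Perform the update locally and acknowledge it to the primary.
        self.store[key] = value
        return True

    def read(self, key):
        # Reads are served from the locally available copy.
        return self.store.get(key)


class Primary(Backup):
    """The primary: coordinates every write and forwards it to the backups."""

    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        # 1. Update the primary's own copy.
        self.store[key] = value
        # 2. Forward the update to every backup and collect acknowledgements.
        acks = [backup.apply(key, value) for backup in self.backups]
        # 3. Acknowledge the client only once all backups have updated.
        return all(acks)


backups = [Backup(), Backup()]
primary = Primary(backups)
ok = primary.write("x", 42)
values = [rm.read("x") for rm in [primary] + backups]
```

Because the acknowledgement is returned only after every backup has applied the update, a read at any replica after the write completes sees the new value.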

The Primary-Backup Protocol

[Figure: the primary-backup protocol, showing clients (C), front ends (FE), a primary RM and two backup RMs.]

Replicated-Write Protocols

In replicated-write protocols, write operations can be carried out at multiple replicas instead of only one (as is the case in primary-based protocols).

Operations need to be carried out in the same order everywhere.

We discussed one approach for doing so that uses Lamport’s timestamps.

Using Lamport timestamps does not scale well in large distributed systems.


An alternative approach to achieving total order is to use a central coordinator, which is sometimes called a sequencer.

Forward each operation to the sequencer.

The sequencer assigns a unique sequence number and subsequently forwards the operation to all replicas.

Operations are carried out in the order of their sequence number.

Hmm. This resembles primary-based consistency protocols.

Useful for sequential consistency.
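A minimal sketch of the sequencer idea follows, assuming a single in-process sequencer and replicas that buffer operations arriving out of order. The names `Sequencer`, `Replica`, `stamp` and `deliver` are illustrative and do not come from the slides.

```python
import itertools


class Replica:
    """Applies operations strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1      # next sequence number this replica may apply
        self.pending = {}      # operations that arrived out of order
        self.applied = []      # operations applied so far, in order

    def deliver(self, seq, op):
        self.pending[seq] = op
        # Apply every operation whose turn has come.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Sequencer:
    """Central coordinator that stamps each operation with a unique number."""

    def __init__(self):
        self.counter = itertools.count(1)

    def stamp(self, op):
        return next(self.counter), op


sequencer = Sequencer()
stamped = [sequencer.stamp(op) for op in ["a", "b", "c"]]

in_order, reordered = Replica(), Replica()
for seq, op in stamped:
    in_order.deliver(seq, op)
for seq, op in reversed(stamped):   # simulate out-of-order network delivery
    reordered.deliver(seq, op)
```

Both replicas end up applying the operations in the same order, regardless of the order in which the stamped operations arrive.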


The use of a sequencer does not solve the scalability problem.

A combination of Lamport timestamps and sequencers may be necessary.

The approach is summarized as follows:

Each process has a unique identifier, pi, and keeps a sent message counter ci. The process identifier and message counter uniquely identify a message.

Active processes (or sequencers) keep an extra counter, ti. This is called the ticket number.

A ticket is a triplet (pi, ti, (pj, cj)).

Approach Summary (cont)

An active process issues tickets for its own messages and for messages from its associated passive processes (passive processes are those that are not sequencers).

Passive processes multicast their messages to all group processes, which then wait for a ticket stating the total order of each message.

The ticket is sent by each passive process’s sequencer.

Lamport’s totally ordered multicast algorithm is used among the sequencers to determine the order of update operations.

When an operation is allowed, each sequencer sends the ticket to its associated passive processes.

It is assumed that the passive processes receive these tickets in the order sent.


Approach Summary (cont)

If a sequencer terminates abnormally, then one of the passive processes associated with it can become the new sequencer.

An election algorithm may be used to choose the new sequencer.


Let’s say that we have 6 processes: p1,p2,p3,p4,p5,p6

Assume that p1,p2 are sequencers; p3,p4 are associated with p1 and p5,p6 are associated with p2

Let’s say that p3 sends a message which is identified by (p3 , 1).

p1 generates a ticket as follows: (p1, 1, (p3 , 1))

The ticket number is generated using the Lamport clock algorithm.


Let’s say that p5 sends a message which is identified by (p5 , 1).

p2 generates a ticket as follows: (p2, 1, (p5, 1))

Which update gets done first? Basically, p1 and p2 will apply Lamport’s algorithm for totally ordered multicast.

When an update operation is allowed to proceed, the sequencers send messages to their associated processes.
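The ticket triplet and the resulting order for this example can be sketched as follows. The tie-breaking rule is an assumption: the slides only say that Lamport’s algorithm is applied among the sequencers, and the sketch follows the usual convention of breaking ties between equal ticket numbers by process identifier. The type and function names are illustrative.

```python
from typing import NamedTuple


class MessageId(NamedTuple):
    sender: str    # pj, the process that sent the message
    counter: int   # cj, the sender's message counter


class Ticket(NamedTuple):
    sequencer: str      # pi, the sequencer that issued the ticket
    number: int         # ti, the ticket number (a Lamport clock value)
    message: MessageId  # (pj, cj), the message being ordered


def total_order(tickets):
    # Assumption: equal ticket numbers are ordered by sequencer identifier,
    # the usual tie-break for Lamport timestamps.
    return sorted(tickets, key=lambda t: (t.number, t.sequencer))


t1 = Ticket("p1", 1, MessageId("p3", 1))   # ticket issued by p1 for p3's message
t2 = Ticket("p2", 1, MessageId("p5", 1))   # ticket issued by p2 for p5's message
order = [t.message for t in total_order([t2, t1])]
```

Under this assumed tie-break, the update from p3 is ordered before the update from p5, because both tickets carry the number 1 and p1 precedes p2.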

Gossip Architecture

We just studied some architectures for sequential consistency. What about causal consistency?

The Gossip Architecture supports causally-consistent lazy replication which in essence refers to the potential causality between read and write operations.

Clients are allowed to communicate with each other, but will then have to exchange information on the operations they performed on the data store. This exchange of information is done through gossip messages.


Each RMi maintains for its local copy the vector timestamp VAL(i).

VAL(i)[i]: the total number of completed write requests that have been sent from a client to RMi.

VAL(i)[j]: the total number of completed write requests that have been sent from RMj to RMi.

This is referred to as the value timestamp and it reflects the updates that have been completed at the replica.

This timestamp is attached to the reply of a read operation.

Each RMi maintains for its local copy the vector timestamp WORK(i), which represents those write operations that have been received (but not necessarily processed) at RMi.

WORK(i)[i]: the total number of write requests that have been sent from a client to RMi, including those that have been completed by RMi.

WORK(i)[j]: the total number of write requests that have been sent from RMj to RMi, including those that have been completed by RMi.

This is referred to as the replica timestamp.

This timestamp is attached to the reply of a write operation.

Each client keeps track of the writes that it has seen so far.

The client C maintains a vector timestamp LOCAL(C), with LOCAL(C)[i] set equal to the most recent value of the number of writes seen at RMi (from C’s viewpoint).

This vector timestamp is attached to every request sent to a replica.

Note that the client can contact a different replica each time it wants to read or write data.

Two front ends may exchange messages directly; these messages also carry the timestamp represented by LOCAL(C).
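The bookkeeping a front end does for LOCAL(C) can be sketched as below. The class name `FrontEnd` and its methods are hypothetical helpers for illustration; the only behaviour taken from the slides is that LOCAL(C) travels with every request and is merged entry by entry with any timestamp the client receives.

```python
class FrontEnd:
    """Tracks LOCAL(C) for one client (hypothetical helper for this sketch)."""

    def __init__(self, n_replicas):
        self.local = [0] * n_replicas

    def request_timestamp(self):
        # LOCAL(C) is attached to every request sent to a replica.
        return tuple(self.local)

    def absorb(self, timestamp):
        # Merge a timestamp from a reply (or from another front end)
        # by taking the entry-wise maximum.
        self.local = [max(a, b) for a, b in zip(self.local, timestamp)]


fe0, fe1 = FrontEnd(3), FrontEnd(3)
fe0.absorb((1, 0, 0))                 # reply from RM0 after a write
fe1.absorb((0, 0, 1))                 # reply from RM2 after a write
fe1.absorb(fe0.request_timestamp())   # fe0 sends a message directly to fe1
```

After the direct exchange, the second front end carries the dependency on the write at RM0 as well, so any replica it contacts next must have seen that write before answering.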

Write log (queue)

Every write operation, when received by a replica, is recorded in the update log of the replica.

Two reasons for this:

The update cannot be applied yet; it is held back.

It is uncertain if the update has been received by all replicas.

The entries are sorted by timestamp.

A similar log is needed for read operations. This is referred to as the read log (or queue).

The Executed Operation table

The same write operation may arrive at a replica from a front end and in a gossip message from another replica.

To prevent an update from being applied twice, the replica keeps a list of identifiers of the write operations that have been applied so far.

Processing read request R from C

Let DEP(R) be the timestamp associated with R. It is set to LOCAL(C).

The request is sent to RMi (with DEP(R)), which stores the request in its read queue.

The read request is processed if DEP(R)[j] <= VAL(i)[j] (for all j).

This indicates that RMi has seen at least all the writes that the client has seen.

As soon as a read operation can be carried out, RMi returns the value of the requested data item to the client, along with VAL(i).

LOCAL(C) is adjusted by setting each jth entry to max{LOCAL(C)[j], VAL(i)[j]}.

This makes sense since the value returned by the read is potentially the cumulative result of all previous writes.
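The read path can be sketched as follows, assuming a replica that holds a single data item and a queue of pending reads. The names `Replica`, `serve_reads` and `update_local` are illustrative, and the direct assignments to `val` stand in for writes being applied elsewhere.

```python
class Replica:
    def __init__(self, n):
        self.val = [0] * n      # VAL(i): writes applied at this replica
        self.value = None       # the replicated data item
        self.read_queue = []    # reads held back until they can be served

    def read(self, dep):
        """Queue a read with DEP(R) = dep; return replies that can be served."""
        self.read_queue.append(list(dep))
        return self.serve_reads()

    def serve_reads(self):
        replies, waiting = [], []
        for dep in self.read_queue:
            # A read is served once DEP(R)[j] <= VAL(i)[j] for all j.
            if all(d <= v for d, v in zip(dep, self.val)):
                replies.append((self.value, tuple(self.val)))
            else:
                waiting.append(dep)
        self.read_queue = waiting
        return replies


def update_local(local, val):
    # LOCAL(C)[j] := max(LOCAL(C)[j], VAL(i)[j]) for all j.
    return tuple(max(l, v) for l, v in zip(local, val))


rm = Replica(3)
rm.val, rm.value = [0, 0, 1], "v1"
held = rm.read((1, 0, 0))            # client has seen a write this replica has not
rm.val, rm.value = [1, 0, 1], "v2"   # the missing write arrives, e.g. via gossip
served = rm.serve_reads()
local = update_local((1, 0, 0), served[0][1])
```

The first call returns no reply because the replica is behind the client; once VAL(i) catches up, the queued read is answered together with VAL(i), and the client merges that timestamp into LOCAL(C).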

[Figure: performing a read operation at a local copy.]

Processing a write operation, W, from C

Let DEP(W) be the timestamp associated with W. It is set to LOCAL(C).

When the request is received by RMi, it increments WORK(i)[i] by 1 but leaves the other entries intact.

This is done so that WORK reflects that RMi has received the latest write request. At this point it isn’t known if it can be carried out.

A timestamp ts(W) is derived from DEP(W) by setting ts(W)[i] to WORK(i)[i]; the rest of the entries are as found in DEP(W).

This timestamp is sent back as an acknowledgement to the client, which subsequently adjusts LOCAL(C) by setting each kth entry to max{LOCAL(C)[k], ts(W)[k]}.

Processing Write Operations (cont)

The write request W is processed if DEP(W)[j] <= VAL(i)[j] (for all j).

This indicates that RMi has seen at least all the writes that the client has seen. This is referred to as the stability condition.

The write operation takes place.

What if there exists a j such that DEP(W)[j] > VAL(i)[j]? This would indicate that there was a write seen by the client that is not yet seen by RMi.

Processing Write Operations (cont)

VAL(i) is adjusted by setting each jth entry to max{VAL(i)[j], ts(W)[j]}.

Recall that ts(W)[j] is set to DEP(W)[j] for all j != i and is set to WORK(i)[i] for j = i (which had been incremented upon receiving the write request; the end result is that VAL(i)[i] is incremented by 1).

The following two conditions are satisfied:

All operations sent directly to RMi from other clients that preceded W have been processed: ts(W)[i] = VAL(i)[i] + 1.

All write operations that W depends on have been processed: ts(W)[j] <= VAL(i)[j] for all j != i.
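The handling of a client write at RMi can be sketched as below. This is a simplified, sequential illustration: the class and method names are assumptions, the acknowledgement is modelled as a return value, and the log is scanned in arrival order rather than kept sorted by timestamp. The numbers in the usage lines mirror the two writes sent to replica 2 in the example that follows in these slides.

```python
class Replica:
    def __init__(self, index, n):
        self.i = index
        self.val = [0] * n     # VAL(i): value timestamp
        self.work = [0] * n    # WORK(i): replica timestamp
        self.log = []          # write log: (ts, dep, op) entries held back
        self.applied = []      # operations applied, in order

    def receive_write(self, dep, op):
        """Handle a client write; returns ts(W), the acknowledgement."""
        # Record that one more write has been received directly by RMi.
        self.work[self.i] += 1
        # ts(W) equals DEP(W) except for entry i, which is WORK(i)[i].
        ts = list(dep)
        ts[self.i] = self.work[self.i]
        self.log.append((tuple(ts), tuple(dep), op))
        self.apply_stable()
        return tuple(ts)

    def apply_stable(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.log):
                ts, dep, op = entry
                # Stability condition: DEP(W)[j] <= VAL(i)[j] for all j.
                if all(d <= v for d, v in zip(dep, self.val)):
                    self.applied.append(op)
                    self.val = [max(v, t) for v, t in zip(self.val, ts)]
                    self.log.remove(entry)
                    progress = True


rm2 = Replica(index=2, n=3)
ts_w1 = rm2.receive_write((0, 0, 0), "W1")   # no dependencies: applied at once
ts_w2 = rm2.receive_write((1, 0, 0), "W2")   # depends on a write at RM0: held back
```

The second write is acknowledged with ts(W2) = (1,0,2) but stays in the log, because entry 0 of its dependency timestamp exceeds entry 0 of VAL at this replica.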

[Figure: performing a write operation at a local copy.]

For every gossip message received by RMj from RMi, RMj does the following:

RMj adjusts WORK(j) by setting each kth entry equal to max{WORK(i)[k], WORK(j)[k]}.

RMj merges the write operations sent by RMi with its own.

RMj applies those writes that have become stable, i.e., a write request W is processed if DEP(W)[k] <= VAL(j)[k] (for all k). A write from RMi that is processed should cause VAL(j)[i] to be incremented by 1.

A gossip message need not contain the entire log if it is certain that some of the updates have been seen by the receiving replica.
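A sketch of how a replica might process an incoming gossip message is given below. The names are illustrative, writes are identified by hypothetical string identifiers, and the log is modelled as a dictionary from write identifier to (ts, dep). Sorting log entries lexicographically by ts is an assumption standing in for "sorted by timestamp". The starting state is that of replica 2 in the example in these slides, just before it receives gossip from replica 0.

```python
class Replica:
    def __init__(self, n):
        self.val = [0] * n     # value timestamp
        self.work = [0] * n    # replica timestamp
        self.log = {}          # write id -> (ts, dep), writes not yet applied
        self.executed = []     # ids of writes applied so far, in order

    def receive_gossip(self, sender_work, sender_log):
        # 1. Merge the replica timestamps entry by entry.
        self.work = [max(a, b) for a, b in zip(sender_work, self.work)]
        # 2. Merge the sender's writes, skipping ones already applied or logged
        #    (the role of the executed operation table).
        for wid, entry in sender_log.items():
            if wid not in self.executed and wid not in self.log:
                self.log[wid] = entry
        # 3. Apply every write that has become stable, repeating until no
        #    further write can be applied.
        progress = True
        while progress:
            progress = False
            for wid, (ts, dep) in sorted(self.log.items(), key=lambda kv: kv[1][0]):
                if all(d <= v for d, v in zip(dep, self.val)):
                    self.val = [max(v, t) for v, t in zip(self.val, ts)]
                    self.executed.append(wid)
                    del self.log[wid]
                    progress = True


rm2 = Replica(3)
rm2.val, rm2.work = [0, 0, 1], [0, 0, 2]
rm2.executed = ["W1"]
rm2.log = {"W2": ((1, 0, 2), (1, 0, 0))}

gossip_work = [1, 0, 0]                         # WORK at replica 0
gossip_log = {"W0": ((1, 0, 0), (0, 0, 0))}     # replica 0's write W0
rm2.receive_gossip(gossip_work, gossip_log)
first = (list(rm2.val), list(rm2.executed))
rm2.receive_gossip(gossip_work, gossip_log)     # the same gossip arrives again
```

Receiving W0 makes W2 stable, so both are applied in one pass. The repeated gossip message has no further effect because W0 is already recorded as executed.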

Gossip Architecture (Example)

The example has three replicas (0, 1, 2) and two clients (0, 1). Replica 1 is not contacted and keeps VAL = (0,0,0), WORK = (0,0,0) throughout.

Initial state

Replicas 0, 1, 2: VAL = (0,0,0), WORK = (0,0,0)

Clients 0, 1: LOCAL = (0,0,0)

Client 0 sends a write, W0, to replica 0

DEP(W0) = (0,0,0)

WORK is updated

Replica 0: VAL = (0,0,0), WORK = (1,0,0), DEP(W0) = (0,0,0), ts(W0) = (1,0,0)

Client 0 receives an ack, carrying ts(W0), from replica 0 for its write

LOCAL at client 0 changes from (0,0,0) to (1,0,0)

W0 is applied since DEP(W0) <= VAL; VAL changes

Replica 0: VAL = (1,0,0), WORK = (1,0,0)

Client 1 sends a write, W1, to replica 2; the state after the write is:

Replica 2: VAL = (0,0,1), WORK = (0,0,1), DEP(W1) = (0,0,0), ts(W1) = (0,0,1)

Client 1: LOCAL = (0,0,1)

Client 0 sends a write, W2, to replica 2, with DEP(W2) = (1,0,0)

Replica 2: VAL = (0,0,1), WORK = (0,0,2), DEP(W2) = (1,0,0), ts(W2) = (1,0,2)

W2 cannot be applied yet since replica 2 has not seen the write done at replica 0

An ack, carrying ts(W2), is returned to client 0

LOCAL at client 0 changes from (1,0,0) to (1,0,2)

Replicas 0 and 2 exchange update propagation messages (gossip); WORK at both replicas is adjusted

Replica 0: VAL = (1,0,0), WORK = (1,0,2)

Replica 2: VAL = (0,0,1), WORK = (1,0,2)

The gossip messages carry the write operations

Replica 0 has one write operation (W0). This is sent to replica 2 with DEP(W0).

Replica 2 has write operation W1. This is sent to replica 0 with DEP(W1). Replica 2 also sends W2 with DEP(W2).

Replica 2 can carry out W0 since DEP(W0) <= VAL; replica 0 can carry out W1 since DEP(W1) <= VAL

VAL in replica 0 and replica 2 is updated

Replica 0: VAL = (1,0,1), WORK = (1,0,2)

W2 can now be executed at replica 2 since DEP(W2) <= VAL; W2 can also be applied at replica 0

Summary

There are good reasons to introduce replication.

However, replication introduces consistency problems.

Keeping the replicas consistent may severely degrade performance, especially in large-scale systems.

Thus consistency is relaxed.

We have studied consistency models and protocols.