Fully Distributed Three-Tier Active Software Replication

Carlo Marchetti, Roberto Baldoni, Sara Tucci-Piergiovanni, and Antonino Virgillito

Abstract—Keeping the state of the replicas of a software service strongly consistent, when the service is deployed across a distributed system prone to crashes and with highly unstable message transfer delays (e.g., the Internet), is a real practical challenge. The solution to this problem is subject to the FLP impossibility result, and thus there is a need for “long enough” periods of synchrony with time bounds on process speeds and message transfer delays to ensure deterministic termination of any run of agreement protocols executed by replicas. This behavior can be abstracted by a partially synchronous computational model. In this setting, before reaching a period of synchrony, the underlying network can arbitrarily delay messages, and these delays can be perceived as false failures by a timeout-based failure detection mechanism, leading to unexpected service unavailability. This paper proposes a fully distributed solution for active software replication based on a three-tier software architecture well-suited to such a difficult setting. The formal correctness of the solution is proved by assuming the middle-tier runs in a partially synchronous distributed system. This architecture separates the ordering of the requests coming from clients, executed by the middle-tier, from their actual execution, done by replicas, i.e., the end-tier. In this way, clients can show up in any part of the distributed system, and replica placement is simplified, since only the middle-tier has to be deployed on a well-behaving part of the distributed system that frequently respects synchrony bounds. This deployment permits rapid timeout tuning, thus reducing unexpected service unavailability.

Index Terms—Dependable distributed systems, software replication in wide-area networks, replication protocols, architectures for dependable services.

1 INTRODUCTION

REPLICATION is a classic technique used to improve the availability of a software service. Architectures for implementing software replication with strong consistency guarantees (e.g., [8], [15], [20], [21], [27], [28], [29], [31], [35], [36]) typically use a two-tier approach. Clients send their requests to the replica tier, which ensures all replicas are in a consistent state before returning a reply to the client. This requires replicas (and sometimes even clients, e.g., [34]) to run complex agreement protocols [12], [23]. From a theoretical viewpoint, a run of these protocols terminates if the underlying distributed system infrastructure ensures a time t after which (unknown) timing bounds on process speeds and message transfer delays will be established, i.e., a partially synchronous computational model [9], [18].¹ Let us remark that in practice partial synchrony only imposes that after t there will be a period of synchrony “long enough” to terminate a run [9].

Before the system reaches a period of synchrony, running distributed agreement protocols among replicas belonging to a two-tier architecture for software replication can be an overkill [1]. Under these conditions, the replicated service can show unavailability periods with respect to clients due only to replication management (even though the service remains correct). Intuitively, this can be explained by noting that replicas use timeouts to detect failures. Hence, if messages can be arbitrarily delayed by the network, then timeouts may expire even if no failure has occurred, causing the protocol to waste time without serving client requests. The use of large timeouts can alleviate this phenomenon at the price of reducing the capability of the system to react upon the occurrence of some real failure. One simple way to mitigate this problem is to observe that in a large and complex distributed system (e.g., the Internet), there can be regions that reach a period of synchrony before others, e.g., a LAN, a CAN, etc. Therefore, placing replicas over one such “early-synchronous” region can reduce such service unavailability, shortening the timeout tuning period. However, in many cases, the deployment of replicas is not in the control of the protocol deployer, but is imposed by organizational constraints of the provider of the service (e.g., a server may not be moved from its physical location).

In this paper, we propose the use of a three-tier architecture for software replication to alleviate the unavailability problem introduced above. This architecture is based on the idea of “physically interposing” a middle-tier between clients (client-tier) and replicas (end-tier) and of “layering” a sequencer service on top of a total order protocol only within the middle-tier. This approach is motivated by the following main observation: three-tier replication facilitates a sharp separation between the replication logic (i.e., the protocols and mechanisms necessary for

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 7, JULY 2006 1

The authors are with the Dipartimento di Informatica e Sistemistica, Universita degli Studi di Roma “La Sapienza,” Via Salaria 113, 00198 Roma, Italy. E-mail: {marchet, baldoni, tucci, virgi}@dis.uniroma1.it.

Manuscript received 7 Mar. 2005; revised 30 June 2005; accepted 20 July 2005; published online 25 May 2006. Recommended for acceptance by D. Blough. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0199-0305.

1. This need of synchrony is a consequence of the fact that the problem of “keeping strongly consistent the state of a set of replicas” boils down to the Consensus problem. Therefore, it is subject to the FLP impossibility result [19], stating that it is impossible to design a distributed consensus protocol ensuring both safety and deterministic termination over an asynchronous distributed system.

1045-9219/06/$20.00 © 2006 IEEE Published by the IEEE Computer Society

managing software replication) and the business logic embedded in the end-tier. Therefore, the middle-tier could be deployed over a region of a distributed system showing an early-synchronous behavior where timeouts can be quickly tuned, thus limiting service unavailability periods.

We exploit the three-tier architecture to implement active replication over a set of deterministic replicas.² To this aim, the middle-tier is in charge of accepting client requests, evaluating a total order on them, and forwarding them to the end-tier formed by deterministic replicas. Replicas process requests according to the total order defined in the middle-tier and return results to the latter. The middle-tier waits for the first reply and forwards it to clients.

We present a fully distributed solution for the middle-tier that does not rely on any centralized service. More specifically, the paper presents in Section 2 the formal specification of active software replication. Section 3 details the three-tier system model. Section 4 introduces the formal specification of the main component of the middle-tier, namely, the sequencer service, which is responsible for associating, in a fault-tolerant manner, a client request with a sequence number. In the same section, a fully distributed implementation of the sequencer service is proposed, based on a total order protocol. Section 5 details the complete three-tier software replication protocol, while its correctness proof is given in Section 6. Even though the paper focuses on problem solvability, it also discusses in Section 7 both the practicality of the assumptions made in the system model and efficiency issues of the proposed protocol. In particular, it points out, first, how deploying the middle-tier in an early-synchronous region can help in reducing the service unavailability problem and, second, the relation of partial synchrony with respect to implementations of total order built on top of different software artifacts, e.g., unreliable failure detectors [9], group toolkits [8], the Timely Computing Base (TCB) [39].

Let us finally remark that, to have a fast client-replica interaction, the three-tier architecture needs the fast response of just one replica, while the two-tier one requires a majority of replicas to reply quickly. The price to pay in a three-tier architecture is an additional hop (i.e., a request/reply interaction) for a client-replica interaction. In the rest of the paper, Section 8 describes the related work and Section 9 draws some conclusions.

2 A SPECIFICATION OF ACTIVE REPLICATION

Active replication [23], [30], [37] can be specified by taking into account a finite set of clients and a finite set of deterministic replicas. Clients invoke operations on a replicated server by issuing requests. A request message req is a pair ⟨id, op⟩ in which req.id is a unique request identifier (unique for each distinct request issued by every distinct client), and req.op is the actual operation that the service has to execute. A request reaches all the available replicas, which process the request by invoking the compute(op) method, which takes an operation as an input parameter and returns a result (res). Replica determinism implies that the result returned by compute(op) only depends on the initial state of the replicas and on the sequence of processed requests. Results produced by replicas are delivered to clients by means of replies. A reply message rep is a pair ⟨id, res⟩ in which rep.id is the unique request identifier of the original client request req (rep.id = req.id) and rep.res is the result of the processing of req. Two requests req1 and req2 are equal, i.e., req1 = req2, iff req1.id = req2.id, and req1 = req2 ⇒ req1.op = req2.op.
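The message pairs above can be sketched as plain data structures. This is a minimal illustration, not code from the paper; the class and field names mirror the notation ⟨id, op⟩ and ⟨id, res⟩, and the equality rule reflects the fact that identifiers are unique, so equal identifiers imply equal operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    id: str   # unique request identifier, e.g., "<client>#<counter>"
    op: str   # operation the service has to execute

@dataclass(frozen=True)
class Reply:
    id: str   # identifier of the original request (rep.id = req.id)
    res: str  # result of processing the request

def equal(req1: Request, req2: Request) -> bool:
    # Two requests are equal iff their identifiers are equal; identifier
    # uniqueness guarantees req1.id == req2.id entails req1.op == req2.op.
    return req1.id == req2.id
```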

A correct implementation of an actively replicated deterministic service satisfies the following properties:³

Termination. If a client issues a request req ≡ ⟨id, op⟩, then it eventually receives a reply rep ≡ ⟨id, res⟩, unless it crashes.

Uniform Agreed Order. If a replica processes a request req, i.e., it executes compute(req.op), as the ith request, then the replicas that process the ith request must process req as the ith request.⁴

Update Integrity. For each request req, every replica executes compute(req.op) at most once, and only if a client has issued req.

Response Integrity. If a client issues a request req and delivers a reply rep, then rep.res has been computed by some replica performing compute(req.op).

3 SYSTEM MODEL

Processes are classified into three disjoint types: a set C = {c1, ..., cl} of client processes (client-tier), a set H = {h1, ..., hn} of active replication handler (ARH) replicas, and a set R = {r1, ..., rm} of deterministic end-tier replicas. A process behaves according to its specification until it possibly crashes. After a crash event, a process stops executing any action. A process is correct if it never crashes; otherwise, it is faulty.

Point-to-point communication primitives. Clients, replicas, and active replication handlers communicate using reliable asynchronous point-to-point channels modelled through the send(m, pj) and deliver(m, pj) primitives. The send primitive is invoked by a process to send a message m to process pj. deliver(m, pj) is an upcall executed upon the receipt of a message m sent by process pj. Channels satisfy the following properties:

(C1) Channel Validity. If a process receives a message m, then m has been sent by some process.

(C2) Channel Nonduplication. Messages are delivered to processes at most once.

(C3) Channel Termination. If a correct process sends a message m to a correct process, the latter eventually delivers m.

Total Order broadcast communication primitives. ARH replicas communicate among themselves using a uniform total order broadcast (or uniform atomic broadcast) primitive, i.e., ARH replicas have access to two primitives, namely TOCast(m) and TODeliver(m, hi), used to broadcast a totally ordered message m to processes in H and to receive


2. In [2], it has been shown that the three-tier approach to replication can also be used to handle nondeterministic replicas.

3. These properties are a specialization to the active replication case of the properties proposed in [16].

4. As replicas are deterministic, if they process all requests in the same order before failing, then they will produce the same result for each request. This satisfies linearizability [26].

a totally ordered message m sent by some process hi ∈ H, respectively. The semantics of these primitives are the following [25]:

(TO1) Validity. If a correct process hi invokes TOCast(m), then all correct processes in H eventually execute TODeliver(m, hi).

(TO2) Uniform Agreement. If a process in H executes TODeliver(m, hℓ), then all correct processes in H will eventually execute TODeliver(m, hℓ).

(TO3) Uniform Integrity. For any message m, every process in H executes TODeliver(m, hℓ) at most once and only if m was previously sent by hℓ ∈ H (invoking TOCast(m)).

(TO4) Uniform Total Order. If a process hi in H first executes TODeliver(m1, hk) and then TODeliver(m2, hℓ), then no process can execute TODeliver(m2, hℓ) if it has not previously executed TODeliver(m1, hk).

We assume that any TO invocation terminates. This means that it is necessary to assume that, in the distributed system formed by ARHs and their communication channels, there is a time t after which there are bounds on process speeds and message transfer delays, but those bounds are unknown, i.e., a partial synchrony assumption [18], [9].
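To make properties TO1-TO4 concrete, the following toy simulation (an illustrative assumption, not the paper's implementation) runs all "processes" inside one program: TOCast appends to a single shared log, and every replica delivers prefixes of that log in log order. In a failure-free run this trivially yields agreement (TO2), at-most-once delivery (TO3), and identical delivery order everywhere (TO4).

```python
class SimulatedTO:
    """Single-process simulation of uniform total order broadcast."""

    def __init__(self, process_ids):
        self.log = []                                  # global total order of (m, sender)
        self.delivered = {p: [] for p in process_ids}  # per-process delivery history

    def to_cast(self, sender, m):
        # TOCast(m): append to the one shared log that defines the order.
        self.log.append((m, sender))

    def to_deliver_all(self):
        # Every process delivers the not-yet-delivered log suffix, in log
        # order (TO4), exactly once (TO3), and at every process (TO1, TO2).
        for p in self.delivered:
            already = len(self.delivered[p])
            self.delivered[p].extend(self.log[already:])
```

Calling to_deliver_all() after a batch of casts leaves every process with the same delivery sequence, which is the property the DSS sequencer in Section 4 relies on.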

Failure Assumptions. The assumption on the termination of the TO primitives implies that, if the specific uniform TO implementation can tolerate up to f failures, then:

(A1) ARH Correctness. There are at least n − f correct ARH replicas.

Moreover, we assume:

(A2) Replica Correctness. There is at least one correct end-tier replica.

The practicality of these assumptions will be further discussed in Section 7.

4 THE SEQUENCER SERVICE

The sequencer service is available to each ARH replica. This service returns a unique and consecutive sequence number for each distinct client request, and it is the basic building block used to satisfy the Uniform Agreed Order property of active replication. Furthermore, the service is able to retrieve the request (if any) associated with a given sequence number. This contributes to the enforcement of the Termination property despite ARH replica crashes. We first propose a specification and then a fully distributed and fault-tolerant implementation of the sequencer service (DSS).

4.1 Sequencer Specification

The sequencer service exposes two methods, namely GETSEQ() and GETREQ(). The first method takes a client request req as input parameter and returns a positive integer sequence number #seq. The second method takes a positive integer #seq as input parameter and returns the client request req previously assigned to #seq (if available), or null otherwise. Formally, the sequencer service is specified as follows:

Properties. We denote with GETSEQi() = v (respectively, GETREQi() = v) the generic invocation of the GETSEQ() (respectively, GETREQ()) method performed by the generic ARH replica hi ∈ H that terminates with a return value v.

A correct implementation of the sequencer service must satisfy properties S1–S6 described below. In particular, to ensure live interactions of correct ARH replicas with the sequencer service, the following liveness property must hold:

(S1) Termination. If hi is correct, GETSEQi() and GETREQi() eventually return a value v.

Furthermore, the following safety properties on the GETSEQi() invocations must hold:

(S2) Agreement. ∀ (GETSEQi(req) = v, GETSEQj(req′) = v′): req = req′ ⇒ v = v′

(S3) Uniqueness. ∀ (GETSEQi(req) = v, GETSEQj(req′) = v′): v = v′ ⇒ req = req′

(S4) Consecutiveness. ∀ GETSEQi(req) = v: (v ≥ 1) ∧ (v > 1 ⇒ ∃ req′ s.t. GETSEQj(req′) = v − 1)

The Agreement property (S2) guarantees that two ARH replicas cannot obtain different sequence numbers for the same client request; the Uniqueness property (S3) prevents two ARH replicas from obtaining the same sequence number for two distinct client requests; finally, the Consecutiveness property (S4) guarantees that ARH replicas invoking GETSEQ() obtain positive integers that are also consecutive, i.e., the sequence of client requests ordered according to the sequence numbers obtained by ARH replicas does not present “holes.”

Finally, upon invoking GETREQ(), ARH replicas must be guaranteed the following safety properties.

(S5) Reading Integrity. ∀ GETREQi(#seq) = v ⇒ ((v = null) ∨ (v = req s.t. GETSEQj(v) = #seq))

(S6) Reading Validity. ∀ GETSEQi(req) = v ⇒ GETREQi(v − k) = v′, 0 ≤ k < v, v′ ≠ null

The Reading Integrity property (S5) defines the possible return values of the GETREQ() method, which returns either null or the client request assigned to the sequence number passed as input parameter. Note that a GETREQ() method implementation that always returns null satisfies this property. To avoid such an undesirable behavior, the Reading Validity property (S6) states that if an ARH replica hi invokes GETSEQi(req) and obtains a return value v = #seq, it will then be able to retrieve all the client requests req1, ..., req#seq assigned to a sequence number #seq′ such that 1 ≤ #seq′ ≤ #seq.

4.2 A Fully Distributed Sequencer Implementation

The implementation is based on a uniform total order broadcast primitive exploitable by ARH replicas (see Section 3), used to let the ARHs agree on a sequence of requests. In particular, each DSS class locally builds a sequence of requests, which is updated upon receiving each request for the first time. Subsequent receipts of requests already inserted in the sequence are simply filtered out. As requests are received in a total order, the local sequence of each DSS class evolves consistently with the others.

The DSS class pseudocode run by each ARH replica hi is presented in Fig. 1. It maintains an internal state composed of the Sequenced array (line 1), which stores in the ith location the client request assigned to sequence number i, and of a #LocalSeq counter (line 2) pointing to the first free array location (initialized to 1).

The class handles three events, i.e., 1) the invocation of the GETSEQ() method (line 3), 2) the invocation of the GETREQ() method (line 10), and 3) the arrival of a totally ordered message (line 14).

In particular, upon the invocation of the GETSEQ() method, it is first checked whether the client request (passed as input argument by the invoker) has already been inserted into a Sequenced array location or not (line 5). If it is not the case, the client request is multicast to all other sequencers (line 6). When the request has been sequenced, i.e., it appears in a location of the Sequenced array (line 7), its position in the array is returned to the invoker as the request sequence number (line 8).

Upon the invocation of the GETREQ() method (line 10), the class simply returns the value contained in the array location indexed by the integer passed as input parameter (line 12). Therefore, if the array location contains a client request, the latter is returned to the invoker; null is returned otherwise.

Finally, when a totally ordered message is delivered to the DSS class by the total order multicast primitive (line 14), it is first checked whether the client request contained in the message already appears in a location of the Sequenced array (line 15). If it is not the case, the client request is inserted into the Sequenced array at the position indexed by #LocalSeq, which is then incremented (lines 16-17).
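The three handlers above can be transliterated into a runnable sketch. This is an assumption-laden simplification of Fig. 1, not the paper's pseudocode itself: total order delivery is simulated synchronously inside the multicast (a shared "bus" list plays the role of TOCast, delivering to every DSS instance in the same order), so get_seq never has to block waiting for the request to appear in the Sequenced array. The line numbers in the comments refer to Fig. 1 as cited in the text.

```python
class DSS:
    """Sketch of the distributed sequencer logic of Fig. 1 (failure-free)."""

    def __init__(self, bus):
        self.sequenced = {}   # Sequenced array: seq number -> request (line 1)
        self.local_seq = 1    # #LocalSeq: first free location (line 2)
        self.bus = bus        # toy stand-in for the total order primitive
        bus.append(self)

    def get_seq(self, req):
        # GETSEQ(): multicast the request if not yet sequenced (lines 5-6) ...
        if req not in self.sequenced.values():
            for dss in self.bus:          # TOCast(req), delivered synchronously
                dss.to_deliver(req)
        # ... then return its position in the array (lines 7-8).
        return next(s for s, r in self.sequenced.items() if r == req)

    def get_req(self, seq):
        # GETREQ(): request assigned to seq, or None (null) (lines 10-12).
        return self.sequenced.get(seq)

    def to_deliver(self, req):
        # TODeliver handler: filter repeats (line 15), else append (lines 16-17).
        if req not in self.sequenced.values():
            self.sequenced[self.local_seq] = req
            self.local_seq += 1
```

Because every instance applies the same delivery order, the local sequences agree (S2, S3) and grow without holes (S4).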

5 A FULLY DISTRIBUTED MIDDLE-TIER PROTOCOL

The proposed protocol strives to maximize service availability by allowing every noncrashed ARH replica to concurrently:

1. accept client requests,
2. order these requests,
3. forward ordered requests to the end-tier,
4. receive results, and
5. return results to clients.

As a consequence, the replication scheme can shift from a passive one (if the clients send their requests to a single ARH replica) to a form of active replication (if clients send their requests to all ARH replicas).⁵

In order to enforce the active replication specification in the presence of ARH replica failures and asynchrony of communication channels, we embed within client and end-tier replica processes two message handlers, i.e., RR (retransmission and redirection handler) within clients, and FO (filtering and ordering handler) within end-tier replicas. These handlers intercept and handle messages sent by and received from the process they are colocated with.

In particular, RR intercepts all the operation invocations of the client and generates request messages that are 1) uniquely identified and 2) periodically sent to all ARH replicas according to some retransmission policy, until a corresponding reply message is received from some ARH replica. Examples of distinct implementations of the retransmission policy could be: 1) sending the client request to all ARH replicas each time a timeout expires, or 2) sending the request to a different ARH replica each time the timeout expires.
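The two example policies can be sketched as generators yielding, at each timeout expiration, the set of ARH replicas to (re)send the request to. The function names and generator formulation are illustrative assumptions; the paper's RR implementation is in [33].

```python
import itertools

def retransmit_to_all(arh_replicas):
    # Policy 1: on every timeout, resend the request to all ARH replicas.
    while True:
        yield list(arh_replicas)

def retransmit_round_robin(arh_replicas):
    # Policy 2: on every timeout, try a different ARH replica, cycling.
    for h in itertools.cycle(arh_replicas):
        yield [h]
```

A client-side RR loop would pull the next destination set from the chosen generator each time its timeout fires, stopping once a matching reply arrives.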

FO intercepts all incoming/outgoing messages from/to ARHs in order to ensure ordered request execution (operations are computed by replicas according to the request sequence number piggybacked by ARHs) and duplicate filtering (the same operation contained in repeated requests is computed only once). Request messages arriving out of order at FO are enqueued until they can be executed. FO also stores the result computed for each operation by its replica, along with the sequence number of the corresponding request message. This allows FO to generate a reply message upon receiving a retransmitted request, thus avoiding duplicate computations and, at the same time, contributing to the implementation of Termination.
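The FO duties just described (in-order execution, duplicate filtering, and replying to retransmissions from a result cache) can be sketched as follows. This is an assumed transliteration, not the implementation of [33]; names are illustrative.

```python
class FO:
    """Sketch of the filtering and ordering handler wrapping a replica."""

    def __init__(self, compute):
        self.compute = compute   # the wrapped replica's compute(op)
        self.next_seq = 1        # sequence number of the next request to execute
        self.pending = {}        # out-of-order requests, keyed by sequence number
        self.results = {}        # seq number -> cached result (duplicate filtering)

    def on_to_request(self, seq, op):
        """Handle a <seq, op> pair from an ARH; return the <seq, res> replies."""
        if seq in self.results:
            # Retransmission: answer from the cache, never recompute.
            return [(seq, self.results[seq])]
        self.pending[seq] = op   # enqueue until executable in order
        replies = []
        while self.next_seq in self.pending:
            # Execute strictly by sequence number, draining any unblocked gap.
            s = self.next_seq
            res = self.compute(self.pending.pop(s))
            self.results[s] = res
            replies.append((s, res))
            self.next_seq += 1
        return replies
```

Note how a request arriving out of order produces no reply until the gap before it is filled, matching the behavior of r3 in the examples of Section 5.1.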

An implementation of FO and RR is presented in [33].

5.1 Introductory Examples

Let us introduce the middle-tier protocol using two simple introductory examples.

Failure-free run (Fig. 2). In this scenario, client c1 invokes the Retransmission/Redirection (RR) INVOKE(op1) method to perform operation op1. This method creates a uniquely identified request message req1 = ⟨reqid1, op1⟩ and then sends req1 to an ARH replica (e.g., h1). Upon receiving req1, h1 invokes GETSEQ(req1) on the DSS class to assign a unique sequence number (1 in the example) to req1. Then, h1 sends a message containing the pair ⟨1, op1⟩ to all end-tier replicas and starts waiting for the first result. The Filtering and Ordering (FO) message handler of each end-tier replica checks whether the sequence number of the request received is the expected one with respect to the computation of the replica it wraps, i.e., whether the request sequence number is 1 in this scenario. In the example, the FO handlers of r1 and r2 immediately verify this condition and, thus, invoke compute(op1) on their replicas, which produce the result res1. Then, FO sends a message to h1 containing the pair ⟨1, res1⟩.


Fig. 1. Pseudocode of the sequencer class run by ARH replica hi.

5. This replication scheme has been named asynchronous replication in [22] (see Section 8).

Upon delivering the first among these messages, h1 sends a reply message ⟨reqid1, res1⟩ back to c1. h1 discards subsequent results produced for operation op1 by end-tier replicas (the corresponding messages are not shown in Fig. 2 for simplicity). Then, h1 serves req2 sent by c2. To do so, h1 gets req2's sequence number (2) from the DSS class, sends a message containing the pair ⟨2, op2⟩ to all end-tier replicas, and waits for the first reply from the end-tier. Note that in this scenario r3 receives ⟨2, op2⟩ before receiving ⟨1, op1⟩. However, FO executes operations in the order imposed by sequence numbers. Therefore, upon receiving ⟨1, req1⟩, the FO handler of r3 executes both operations in the correct order and returns both results to h1. This ensures that the state of r3 evolves consistently with respect to the state of r1 and r2, and contributes to the enforcement of the Uniform Agreed Order property. As soon as h1 receives the first ⟨2, res2⟩ pair, it sends the result back to the client.

Run in the presence of failures (Fig. 3). As in the previous example, c1 invokes op1, which through the RR component reaches h1 in a message containing the ⟨reqid1, op1⟩ pair. Then, h1 gets a unique sequence number (1) for the request by invoking the sequencer. However, in this scenario h1 crashes after having multicast the request to the end-tier. As channels are assumed reliable only among correct processes, the request might not be received by some end-tier replicas. In particular, in Fig. 3, the request is received only by r1 and r2. Furthermore, c1 crashes. This implies that req1 will no longer be retransmitted. Then, h2 serves request req2 sent by client c2. Therefore, it gets a sequence number (2) by invoking GETSEQ(req2). By checking this sequence number against a local variable storing the maximum sequence number assigned by h2 to the requests forwarded to the end-tier, h2 determines that it has not previously sent to end-tier replicas the request assigned to sequence number 1, i.e., req1. As this request could have been sent by a faulty ARH replica, in order to enforce the liveness of the end-tier replicas, h2 sends to the end-tier a message containing the ⟨1, req1⟩ pair, in addition to sending the message containing the ⟨2, req2⟩ pair necessary to obtain a response to the pending client request req2. Therefore, h2 first invokes the sequencer's GETREQ(1) method to obtain req1 and then sends to end-tier replicas both the ⟨1, req1⟩ and ⟨2, req2⟩ pairs. In this way, the unique correct replica of this scenario, i.e., r3, is kept live and consistent by h2. As usual, h2 returns to c2 the result of op2 as soon as it receives res2 from r3.
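The gap-filling step h2 performs in this scenario can be sketched as a small helper: before forwarding a newly sequenced request, an ARH recovers via GETREQ every request in the gap between its last forwarded sequence number and the new one, so that end-tier replicas are never left blocked on a request lost with a crashed ARH. The function and stub names are illustrative assumptions; the actual ARH pseudocode is the subject of Section 5.2.

```python
def to_requests_to_send(sequencer, last_served_req, seq, req):
    """Return the <seq, req> pairs an ARH should multicast to the end-tier."""
    pairs = []
    # Fill the gap [last_served_req + 1, seq - 1] via GETREQ, e.g., <1, req1>.
    for missing in range(last_served_req + 1, seq):
        pairs.append((missing, sequencer.get_req(missing)))
    # Finally, the newly sequenced request itself, e.g., <2, req2>.
    pairs.append((seq, req))
    return pairs
```

In the failure scenario above, h2 (with last served number 0) would thus send both ⟨1, req1⟩ and ⟨2, req2⟩, keeping r3 live and consistent.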

The following section details the protocols run by ARHs.

5.2 ARH Protocol

We distinguish the following message types:

Messages exchanged between the client-tier and the middle-tier. We denote by "Request" the messages sent by clients to ARH replicas, and by "Reply" the messages following the inverse path.

Messages exchanged between the middle-tier and the end-tier. These messages contain sequence numbers produced by the sequencer and used by replicas to execute requests in a unique total order. Therefore, we denote by "TORequest" (totally ordered request) the messages sent by ARH replicas to end-tier replicas, and by "TOReply" (totally ordered reply) the messages following the inverse path.
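The four message types above could be encoded, for instance, as the following value types. The field names are our assumptions, chosen to match the notation used in the rest of the section:

```python
# Illustrative encoding (not from the paper) of the four message types.
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Request:          # client-tier -> middle-tier
    req_id: Any         # unique request identifier
    op: Any             # operation to execute

@dataclass(frozen=True)
class Reply:            # middle-tier -> client-tier
    req_id: Any
    res: Any            # result of the operation

@dataclass(frozen=True)
class TORequest:        # middle-tier -> end-tier (totally ordered request)
    seq: int            # sequence number assigned by the sequencer service
    op: Any

@dataclass(frozen=True)
class TOReply:          # end-tier -> middle-tier (totally ordered reply)
    seq: int
    res: Any
```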

As reported in Fig. 4, each ARH replica embeds a local DSS class (Sequencer) that implements the sequencer service as described in Section 4 and is initialized at line 3. The internal state of each ARH replica is represented by the LastServedReq integer (line 1), which dynamically stores the maximum sequence number among those assigned by h_i to the requests forwarded to the end-tier. #seq (line 2) is a variable used to store the sequence number assigned by Sequencer to the client request currently being served by h_i. ARH replicas handle only one event, i.e., the arrival of a client request in a "Request" message (line 4). In particular, upon the receipt of a client request, h_i first invokes Sequencer

MARCHETTI ET AL.: FULLY DISTRIBUTED THREE-TIER ACTIVE SOFTWARE REPLICATION 5

Fig. 2. A failure-free run of the fully distributed three-tier active replication protocol.

to assign a sequence number to the request (stored in the #seq variable, line 5). Then h_i checks whether #seq is greater than LastServedReq + 1. Note that if #seq > LastServedReq + 1, then some other ARH replica has served other client requests with sequence numbers in the interval [LastServedReq + 1, #seq − 1]. In this case, as shown in the second example of the previous section, h_i sends these requests again to the end-tier (lines 7-9) in order to preserve the protocol Termination property (S1) despite possible ARH failures. Requests are retrieved from Sequencer (line 8) thanks to the Reading Validity property (S6). Then, h_i sends to server replicas the "TORequest" message containing 1) the operation contained in the client request currently being served and 2) the sequence number #seq assigned to the request (line 10). Finally, h_i updates the LastServedReq variable (line 11) and waits for the first "TOReply" message carrying #seq as the sequence number of the result (line 12). Upon the receipt of the result, h_i forwards it to the client in a "Reply" message (line 13).
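The handler just described can be sketched as follows. This is a simplified reconstruction of the logic of Fig. 4, lines 4-13, under stated assumptions: the sequencer, the end-tier transport, and the blocking wait for the first reply are abstract interfaces of our own naming, not the paper's actual code:

```python
# Hedged sketch of the ARH event handler: sequence the request, re-send
# any requests in the gap [LastServedReq + 1, #seq - 1], multicast the
# TORequest, and return the first matching TOReply to the client.

class ARHReplica:
    def __init__(self, sequencer, end_tier):
        self.sequencer = sequencer       # local DSS class: getseq(req), getreq(n)
        self.end_tier = end_tier         # transport to the end-tier replicas
        self.last_served_req = 0         # max sequence number forwarded so far

    def on_request(self, req):
        seq = self.sequencer.getseq(req)              # line 5: assign a number
        if seq > self.last_served_req + 1:
            # Some other ARH sequenced requests we never forwarded; re-send
            # them so end-tier replicas do not block (lines 7-9).
            for n in range(self.last_served_req + 1, seq):
                missing = self.sequencer.getreq(n)
                self.end_tier.multicast(("TORequest", n, missing.op))
        self.end_tier.multicast(("TORequest", seq, req.op))    # line 10
        self.last_served_req = max(self.last_served_req, seq)  # line 11
        res = self.end_tier.wait_first_toreply(seq)            # line 12 (blocking)
        return ("Reply", req.req_id, res)                      # line 13
```

Note that the gap-filling loop is exactly the mechanism exercised in the failure scenario of Fig. 3, where h_2 re-sends ⟨1, req_1⟩ before forwarding ⟨2, req_2⟩.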

6 CORRECTNESS PROOFS

In this section, we first prove the correctness of the sequencer described in Section 4.1 and then that of the complete middle-tier protocol described in Section 5.

6.1 The Sequencer

Theorem 1 ((S1) Termination). If h_i is correct, GETSEQ_i() and GETREQ_i() eventually return a value v.

Proof. By contradiction. Suppose h_i is correct and invokes a sequencer class method that never returns. We distinguish two cases, i.e., either the method is GETREQ() (lines 10-13) or it is GETSEQ() (lines 3-9):

GETREQ() invocation. In this case the invocation can never block: as soon as the content of the jth position of the array is read, the method returns. Contradiction.

GETSEQ() invocation. We further distinguish two cases: either ∃#seq : Sequenced[#seq].id = req.id or ∄#seq : Sequenced[#seq].id = req.id when the if statement at line 5 is evaluated.

- In the first case, line 7 is executed immediately after line 5 and the clause of the wait statement is satisfied. As a consequence, #seq is returned to the invoker at line 7. Contradiction.

- In the second case, statement 6 is executed, i.e., the client request is multicast to the other ARH replicas. As ∄#seq : Sequenced[#seq].id = req.id, the execution blocks at statement 7. As the multicast is executed by a correct replica (by hypothesis), from the Validity property (TO1) of the total order primitive it follows that statement 14 will eventually be executed. Therefore, at the end of statement 17 there holds ∃#seq : Sequenced[#seq].id = req.id, and this in turn lets the execution satisfy the clause of the wait statement at line 7. As a consequence, #seq is returned to the invoker. Contradiction. □
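For readers without Fig. 1 at hand, the sequencer logic these line numbers refer to can be sketched roughly as follows. This is a simplified reconstruction under an explicit assumption: the injected total-order broadcast delivers synchronously to all replicas, which stands in for the blocking wait statement at line 7 of the real pseudocode:

```python
# Hedged reconstruction of the DSS sequencer class (Fig. 1); names and
# data structures are ours. A dict stands in for the Sequenced array.
from collections import namedtuple

Req = namedtuple("Req", "req_id op")

class Sequencer:
    def __init__(self, to_broadcast=None):
        self.to_broadcast = to_broadcast  # TO-multicast to all ARH replicas
        self.sequenced = {}               # Sequenced: seq -> request
        self.index = {}                   # req.id -> seq (for the wait clause)
        self.local_seq = 1                # LocalSeq, never decremented

    def getseq(self, req):
        """Assign (or look up) the unique sequence number of req (lines 3-9)."""
        if req.req_id not in self.index:          # lines 5-6
            self.to_broadcast(req)                # TO-multicast; delivered below
        return self.index[req.req_id]             # lines 7-8

    def getreq(self, seq):
        """Return the request sequenced at position seq, or None (lines 10-13)."""
        return self.sequenced.get(seq)

    def on_to_deliver(self, req):
        """TO-delivery handler (lines 14-17): write-once, in delivery order."""
        if req.req_id not in self.index:          # insert each request at most once
            self.sequenced[self.local_seq] = req
            self.index[req.req_id] = self.local_seq
            self.local_seq += 1
```

Because every replica fills its Sequenced structure in TO-delivery order and writes each position at most once, the Agreement, Uniqueness, and Consecutiveness properties proved below hold for this sketch as well.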

Theorem 2 ((S2) Agreement). ∀ (GETSEQ_i(req) = v, GETSEQ_j(req′) = v′), req = req′ ⇒ v = v′.

Proof. By contradiction. Suppose h_i invokes GETSEQ_i(req), which returns #seq_i, h_j invokes GETSEQ_j(req), which returns #seq_j, and #seq_i ≠ #seq_j.

From the pseudocode of Fig. 1 (lines 7-8), it follows that in h_i, Sequenced[#seq_i] = req and in h_j, Sequenced[#seq_j] = req. To insert a request into the Sequenced array, a generic ARH replica must execute statement 16, which is executed iff the two conditions at lines 14-15 hold. These conditions imply that each ARH replica inserts a client request into Sequenced at most once. Without loss of generality, we suppose that every message delivered to each ARH replica contains a distinct request. As a consequence, statement 16 is executed by both h_i and h_j

6 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 7, JULY 2006

Fig. 3. A run of the fully distributed three-tier active replication protocol in presence of failures.

each time statement 14 is executed, the Sequenced array in each ARH replica reflects the order of its message deliveries, and Sequenced[k] contains the kth message delivered at statement 14. Then, h_i has delivered m = req as its #seq_i-th message, while h_j has delivered m = req as its #seq_j-th message. Without loss of generality, suppose that #seq_i = #seq_j − 1. This implies that h_j delivered at least one message m′ ≠ m before m. This violates property TO4 of the total order primitive. Contradiction. □

Theorem 3 ((S3) Uniqueness). ∀ (GETSEQ_i(req) = v, GETSEQ_j(req′) = v′), v = v′ ⇒ req = req′.

Proof. By contradiction. Suppose that h_i invokes GETSEQ_i(req), which returns #seq, h_j invokes GETSEQ_j(req′), which returns #seq, and req ≠ req′. From the sequencer Agreement property (S2), let us suppose i = j without loss of generality. However, if h_i invokes GETSEQ(req) and GETSEQ(req′), both returning #seq, from statements 7-8 (Fig. 1) it follows that Sequenced[#seq] = req (when GETSEQ(req) is invoked) and Sequenced[#seq] = req′ (when GETSEQ(req′) is invoked), i.e., the value of the Sequenced[#seq] location has been modified between the two method invocations. By noting that the value of a generic Sequenced array location is written at most once (statements 16-17), i.e., once the location indexed by LocalSeq has been written there is no way to write it again, from req ≠ req′ it follows Sequenced[#seq] ≠ Sequenced[#seq]. Contradiction. □

Theorem 4 ((S4) Consecutiveness). ∀ GETSEQ_i(req) = v, (v ≥ 1) ∧ (v > 1 ⇒ ∃req′ s.t. GETSEQ_j(req′) = v − 1).

Proof. By contradiction. First suppose that h_i invokes GETSEQ(req), which returns a sequence number #seq < 1. From the pseudocode of Fig. 1, statements 7-8, it follows that ∃#seq : Sequenced[#seq].id = req.id ∧ #seq < 1. Therefore, h_i previously executed statement 16 with LocalSeq = #seq < 1. However, LocalSeq is initialized to 1 and is never decremented. Contradiction. Therefore, #seq ≥ 1.

Now suppose that h_i invokes GETSEQ_i(req), which returns #seq > 1, and that there do not exist a client request req′ and an ARH replica h_j such that, if h_j invokes GETSEQ_j(req′), it obtains #seq − 1 as the result. As h_i obtained #seq as the result of GETSEQ_i(req), the sequencer Agreement property (S2) ensures that each ARH replica h_j that successfully invokes GETSEQ_j(req) returns #seq. The sequencer Termination property (S1) ensures that the method eventually returns in correct replicas. Then, let h_j be a correct replica that invokes GETSEQ_j(req); from #seq > 1, it follows that when h_j executes statement 16, LocalSeq > 1. Therefore, LocalSeq has been previously incremented at statement 17, i.e., h_j previously inserted into Sequenced[#seq − 1] the content of a message m = req′, and this implies Sequenced[#seq − 1] ≠ null. This implies that eventually, if h_j invokes GETSEQ_j(req′), it will obtain #seq − 1 as the invocation result. Contradiction. □

Theorem 5 ((S5) Reading Integrity). ∀ GETREQ_i(#seq) = v ⇒ ((v = null) ∨ (v = req s.t. GETSEQ_j(v) = #seq)).

Proof. By contradiction. Suppose h_i invokes GETREQ_i(#seq), which returns a value v, with v ≠ null and ∀j GETSEQ_j(v) ≠ #seq. Without loss of generality, suppose i = j and that h_i first invokes GETREQ_i(#seq) = v and then GETSEQ_i(v). From the pseudocode in Fig. 1, it follows that v = Sequenced[#seq] (statement 12) and that Sequenced[#seq] ≠ null (by hypothesis). From statement 16, Sequenced[#seq] ≠ null implies Sequenced[#seq] = v = req. From statement 15, it follows that req is inserted only in Sequenced[#seq]. Therefore, from statements 5-8, GETSEQ_i(req) = #seq. Contradiction. □

Theorem 6 ((S6) Reading Validity). ∀ GETSEQ_i(req) = v ⇒ GETREQ_i(v − k) = v′, 0 ≤ k < v, v′ ≠ null.

Proof. By contradiction. Suppose GETSEQ_i(req) = #seq and GETREQ_i(#seq − k), 0 ≤ k < #seq, returns null. Without loss of generality, suppose v = #seq = 2 and k = 1. By hypothesis, Sequenced[2] ≠ null, and this implies that LocalSeq has been previously incremented (passing from


Fig. 4. Pseudocode of an ARH replica h_i.

1 to 2) at statement 17. This in turn implies that Sequenced[1] has been previously written upon the delivery (at statement 14) of a client request req, i.e., Sequenced[1] = req. As writes to the locations of the Sequenced array are performed at most once (from statements 16-17 and by noting that LocalSeq is never decremented), when h_i invokes GETREQ_i(1) and it returns null, it follows (statement 12) that req = null. Contradiction. □

6.2 The Three-Tier Protocol

The following assumption lets us handle process crashes in a uniform way, i.e., without considering partial or independent failures of colocated components.

No independent failures of colocated components. RR, DSS, and FO are colocated with the client, the ARH replica, and the end-tier replica processes, respectively. We assume that colocated components do not fail independently. This implies that a crash of a client, ARH replica, or end-tier replica process implies the crash of its RR, DSS, or FO component, respectively, and vice versa.

6.2.1 Preliminary Lemmas

Lemma 1. Let req_1 and req_2 be two requests sent to the end-tier by some ARH replica at statement 9 or at statement 10 in two "TORequest" messages [TORequest, ⟨#seq_1, req_1.op⟩] and [TORequest, ⟨#seq_2, req_2.op⟩]; then #seq_1 = #seq_2 ⇔ req_1 = req_2.

Proof. By contradiction. We distinguish the following three cases.

- Both requests are sent by some ARH replica at statement 10. Note that #seq_1 and #seq_2 are the return values of the GETSEQ() method invocations performed at statement 5. Suppose by contradiction that #seq_1 = #seq_2 and req_1 ≠ req_2. From the sequencer Uniqueness property (S3), it follows that #seq_1 = #seq_2 implies req_1 = req_2. Contradiction. On the other hand, suppose by contradiction that req_1 = req_2 and #seq_1 ≠ #seq_2. From the sequencer Agreement property (S2), it follows that req_1 = req_2 implies #seq_1 = #seq_2. Contradiction.

- A request (say, req_1) is sent by some ARH replica at statement 9 and the other (req_2) is sent at statement 10. Note that req_1 is returned at statement 8 from a GETREQ() invocation with input argument #seq_1. As at statement 5 #seq > #seq_1, from the sequencer Reading Validity property (S6) it follows that req_1 ≠ null. Therefore, from the sequencer Reading Integrity property (S5), it follows that GETSEQ_j(req_1) = #seq_1. Furthermore, as in the previous case, #seq_2 is the return value of a GETSEQ() method invocation performed at statement 5, i.e., GETSEQ_i(req_2) = #seq_2. Suppose by contradiction that #seq_1 = #seq_2 and req_1 ≠ req_2. Again, from the sequencer Uniqueness property (S3) it follows that #seq_1 = #seq_2 implies req_1 = req_2. Contradiction. On the other hand, suppose by contradiction that req_1 = req_2 and #seq_1 ≠ #seq_2. From the sequencer Agreement property (S2) it follows that req_1 = req_2 implies #seq_1 = #seq_2. Contradiction.

- Both requests are sent by some ARH replica at statement 9. In this case, both req_1 and req_2 are returned at statement 8 from GETREQ() invocations with input arguments #seq_1 and #seq_2, respectively. In both cases, at statement 5, #seq > #seq_1 and #seq > #seq_2. From the sequencer Reading Validity property (S6), it follows that req_1 ≠ null and req_2 ≠ null. Therefore, from the sequencer Reading Integrity property (S5), it follows that GETSEQ_j(req_1) = #seq_1 and GETSEQ_i(req_2) = #seq_2. Suppose by contradiction that #seq_1 = #seq_2 and req_1 ≠ req_2. Also in this case, from the sequencer Uniqueness property (S3) it follows that #seq_1 = #seq_2 implies req_1 = req_2. Contradiction. On the other hand, suppose by contradiction that req_1 = req_2 and #seq_1 ≠ #seq_2. From the sequencer Agreement property (S2) it follows that req_1 = req_2 implies #seq_1 = #seq_2. Contradiction. □

Lemma 2. If an ARH replica h_i has LastServedReq = k, then it has already sent to end-tier replicas k "TORequest" messages, i.e., [TORequest, ⟨#seq_n, req_n.op⟩] for each n : 1 ≤ n ≤ k.

Proof. By contradiction. Assume that h_i has LastServedReq = k > 0 (k = 0 is a trivial case) and it has not sent to the end-tier a "TORequest" message [TORequest, ⟨#seq_j, req_j.op⟩] for some j : 1 ≤ j ≤ k.

Without loss of generality, consider the first time that h_i sets LastServedReq to k at line 11. As LastServedReq is initialized to 0 at line 2, and for each #seq returned by GETSEQ() at line 5 there holds #seq > 0 (from the sequencer Consecutiveness property (S4)), when LastServedReq is set to #seq = k at line 11, this implies #seq = k at line 5. We distinguish two cases:

- k = #seq = 1. This is a trivial case: the condition at line 6 does not hold, so h_i has sent [TORequest, ⟨1, req_1.op⟩] to all end-tier replicas (line 10). Contradiction.

- k = #seq > 1. In this case, the condition at line 6 holds, so h_i executed lines 7-9 before updating LastServedReq to k at line 11. This implies that h_i has sent to all end-tier replicas a "TORequest" message [TORequest, ⟨#seq_n, req_n.op⟩] for each n : 1 ≤ n ≤ k. Contradiction. □

6.2.2 Theorems

For the sake of brevity, we will refer to the properties introduced so far using their identifiers. As an example, we will refer to Channel Validity, No Duplication, and Termination as C1, C2, and C3.

Theorem 7 (Termination). If a client issues a request req ≡ ⟨id, op⟩ then, unless it crashes, it eventually receives a reply rep ≡ ⟨id, res⟩.


Proof. By contradiction. Assume that a client issues a request req ≡ ⟨id, op⟩, does not crash, and does not deliver a result. The correctness of the client, along with the retransmission mechanism implemented by the RR handler, guarantees that req is eventually sent to all ARH replicas. Therefore, from A1 and C3, it follows that a correct ARH replica h_c eventually delivers the client request message.

From the algorithm of Fig. 4, upon receiving req, h_c invokes GETSEQ(req) (line 5), which terminates due to S1. This method returns the sequence number #seq associated with the current request.

Lemma 2 ensures that at line 11 all requests whose sequence number is lower than or equal to LastServedReq (including the current request) have been sent to the end-tier replicas by h_c. A2 and C3 guarantee that at least one correct end-tier replica r_c receives all the requests. This ensures that the FO handler, which executes requests according to their sequence numbers, eventually invokes compute(req.op) within r_c and then sends back the result in a "TOReply" message.

From the correctness of h_c and r_c, and from C3, it follows that the result is delivered to h_c (Fig. 4, line 12), which thus sends the reply rep ≡ ⟨req.id, res⟩ to the client (line 13). For similar reasons, the RR handler eventually delivers the result to the client, which thus receives the result. Contradiction. □

Theorem 8 (Uniform Agreed Order). If an end-tier replica processes a request req, i.e., executes compute(req.op), as its ith request, then every other end-tier replica that processes an ith request will execute req as its ith request.

Proof. By contradiction. Assume that an end-tier replica r_k executes req as its ith request and another end-tier replica r_h executes as its ith a request req′ with req ≠ req′.

The FO handlers of r_h and r_k ensure that requests are executed at most once and according to the sequence numbers attached to them by ARH replicas at line 10 of the pseudocode depicted in Fig. 4.

Therefore, the ith request processed by r_k, i.e., req, is associated with the sequence number #seq = i, i.e., r_k delivered a "TORequest" message containing ⟨i, req.op⟩. For the same reasons, r_h delivered a "TORequest" message containing ⟨i, req′.op⟩. From Lemma 1, it follows that req = req′. Contradiction. □

Theorem 9 (Update Integrity). For any request req, every end-tier replica executes compute(req.op) at most once, and only if a client has issued req.

Proof. The case in which the same request is executed twice is trivially addressed by noting that FO handlers filter out duplicates of "TORequest" messages.

Assume, thus, by contradiction, that an operation op executed by a replica r_k has not been invoked by a client. The FO handler executes only operations contained in "TORequest" messages delivered to r_k. From C1, it follows that if r_k delivers a "TORequest" message containing ⟨#seq, req.op⟩, then that message has been sent by an ARH replica h_i either at line 9 or at line 10 (see Fig. 4).

If the message has been sent at line 10, h_i has received req at line 4. This request has then been sent by a client in a request message (from C1). Contradiction.

Otherwise, if the "TORequest" has been sent at line 9, from S5 and S6 there exists an ARH replica h_j that has previously executed GETSEQ_j(req). As GETSEQ_j(req) (line 5) is always executed after the delivery of a client request message (line 4), it follows that req has been sent by a client in a request message (from C1). Contradiction. □

Theorem 10 (Response Integrity). If a client issues a request req and delivers a reply rep, then rep.res has been computed by some end-tier replica, which executed compute(req.op).

Proof. By contradiction. Assume that a client issues a request req and delivers a reply rep, and rep.res has not been computed by an end-tier replica.

From C1, if a client delivers rep, then an ARH replica h_i has previously sent a reply message containing rep to the client. From the algorithm of Fig. 4, if h_i sends a reply message containing rep ≡ ⟨req.id, res⟩ (line 13) to the client, then 1) h_i received a client request message req from the client (line 4), 2) h_i invoked GETSEQ_i(req), which returned #seq (line 5), and 3) it subsequently delivered ⟨#seq, res⟩ from a replica (line 12). From C1, ⟨#seq, res⟩ has been sent by the FO handler of an end-tier replica r_k. This implies that the request has been previously executed by invoking compute(req.op). Contradiction. □

7 PRACTICAL ISSUES

7.1 Practicality of the Assumptions

Most of the assumptions introduced in Section 3 are necessary to ensure the termination property of our protocol. These include the assumption of one replica being always up, the assumption of reliable point-to-point communication channels, and the assumption of partial synchrony of the region of the distributed system where ARHs are deployed, which is at the base of the termination of the TO broadcast primitive. Let us note that if one of these assumptions is violated, only liveness is affected (i.e., the three-tier protocol blocks), while safety is always preserved. The assumption on the number of correct replicas is the weakest one under which the system is still able to provide the service. The assumption on point-to-point communication channels allows link failures, as long as they are repaired in a finite time. In practice, it is implemented by message retransmission and duplicate suppression.
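The retransmission-plus-duplicate-suppression realization of a reliable channel mentioned above can be sketched as follows; this is an illustrative minimal scheme (names and the lossy send function are our assumptions), not the paper's implementation:

```python
# Hedged sketch of a reliable point-to-point channel over a lossy link:
# the sender retransmits unacknowledged messages, the receiver suppresses
# duplicates, so each message is eventually delivered exactly once as long
# as the link is repaired in finite time.

class ReliableChannel:
    def __init__(self, lossy_send):
        self.lossy_send = lossy_send      # best-effort send; may drop messages
        self.next_id = 0                  # sender-side message identifiers
        self.unacked = {}                 # msg_id -> payload awaiting an ack
        self.delivered = set()            # receiver-side duplicate suppression

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.unacked[msg_id] = payload
        self.lossy_send((msg_id, payload))

    def retransmit(self):
        """Called periodically until the peer acknowledges (link repaired)."""
        for msg_id, payload in self.unacked.items():
            self.lossy_send((msg_id, payload))

    def on_ack(self, msg_id):
        self.unacked.pop(msg_id, None)    # stop retransmitting this message

    def on_receive(self, msg):
        """Receiver side: deliver each message at most once."""
        msg_id, payload = msg
        if msg_id in self.delivered:
            return None                   # duplicate suppressed
        self.delivered.add(msg_id)
        return payload
```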

The assumption on partial synchrony does not mean that the timing bounds have to hold always. Practically, these bounds have to hold only for a period of time which is long enough to let any run triggered by an invocation of the TO broadcast primitive complete.6

7.2 Efficiency of the Three-Tier Architecture

Up to this section, we have focused on the solvability of active software replication using a three-tier protocol. In the following, we discuss in which


6. This follows from the absence of an explicit notion of time in asynchronous system models, in which the term "long enough" cannot be further characterized and is commonly replaced by "always."

settings the proposed architecture reduces the problem of unexpected service unavailability pointed out in the introduction. First, we assume that a service deployer may place end-tier replicas according to the strategy of the organization that wants to provide the service. Given this condition, a protocol deployer has to select a region of a distributed system in which to deploy the middle-tier, i.e., the ARHs. For the three-tier replication protocol to be efficient, the protocol deployer selects a region that, better than others, enjoys the following two properties:

- The region shows an "early-synchronous" behavior. Early synchronous means that the distributed system will reach the synchrony bounds of a partially synchronous system very early in any run triggered by an invocation of a TO broadcast primitive. Synchronous distributed systems or systems that exhibit a synchronous behavior "most of the time" are specific instantiations of an "early-synchronous" distributed system.

- As many as possible of the point-to-point reliable channels established among ARHs, end-tier replicas, and (possibly) clients show short latency and a low loss rate.

As explained below in this section, the first property enables both fast reaction to real failures within the middle-tier and infrequent false failure suspicions. This reduces unexpected service unavailability.

Once the first property has been guaranteed, the second maximizes the probability of a short service time for a request: in our protocol, the receipt at the middle-tier of the first reply from an end-tier replica triggers the sending of the reply back to the client. This also points out an interesting tradeoff between the number of end-tier replicas (and, thus, also of channels between the middle and end tiers) and the maximization of the probability of providing a short end-to-end service time.

7.3 Total Order Implementation Selection

As pointed out above, the protocol deployer is in charge of deploying the middle-tier in a distributed system region that quickly reaches and maintains synchrony bounds. This follows from the protocol run by ARHs, which, to be efficient, requires rapid termination of the TO primitive most of the time. To this end, it is important to note that TO implementations are typically built on top of software artifacts, i.e., software modules characterized by the properties they provide. Some examples follow:

- TO implementations built on top of an unreliable failure detector, e.g., ◇S, which is characterized by specific completeness (safety) and accuracy (liveness) properties [10].

- TO implementations provided for the virtual synchrony programming model, adopted by several group toolkits (e.g., [8], [5]), which rely on the specification of a membership service to enforce liveness [13].

- TO implementations developed on top of the "Timely Computing Base" (TCB, [39], [40]), which includes a well-specified timed agreement service [14].

The liveness properties of these software artifacts, and of the associated TO implementations, are typically implemented using timeouts. TO implementations are very sensitive to the values of these timeouts, whose definition is up to the protocol deployer. Differently from two-tier approaches, the proposed protocol allows the protocol deployer to account for these issues by selecting a well-defined and possibly highly controlled system region that, independently from end-tier replica deployment, lets a software artifact, and thus the associated TO implementation, work at its best as frequently as possible.

Let us finally remark that clients and replicas have no constraints from the point of view of synchrony requirements. As a consequence, a service deployer does not have to take this issue into account when deploying the end-tier replicas, and clients can show up in any part of the distributed system.

7.4 Garbage Collection

Two points in the proposed three-tier protocol are critical with respect to resource consumption: 1) the memory used by the sequencer service implementation (i.e., by ARH) grows linearly with the number of client requests, and 2) FO handlers store all the results computed by the colocated end-tier replicas.

To address both these issues, it is worth noting that the RR colocated with each client can be configured to ensure that 1) clients may transmit a request for a new operation only if the result of the former operation has already been received, and 2) requests are uniquely identified through a pair composed of a unique client identifier and a local sequence number, i.e., req.id = ⟨c_i, seq_ci⟩ (this is the approach followed in the RR implementation appearing in [33]). Using these simple serialization and request identification mechanisms, upon receiving a client request message (e.g., req = ⟨⟨c_i, seq_ci⟩, op⟩, where seq_ci is incremented by RR each time a request is sent by c_i), ARH and FO can delete from their memories all operations and associated results pertaining to requests issued by the same client and having a local sequence number seq′_ci that satisfies seq′_ci < seq_ci. Furthermore, upon receiving req, ARH and FO are also enabled to discard request messages from the same client that satisfy the former inequality. This follows from noting that if RR is blocking, then a client c_i issuing a request req has certainly received the replies to all its former requests (issued with a local sequence number lower than seq_ci). It is possible to show that the described mechanisms bound the memory consumption of both the ARH and FO components by a linear function of the number of clients without affecting protocol correctness. Let us finally remark that implementing the mechanisms outlined above requires only a simple modification of the specification and design of the distributed sequencer implementation. Moreover, line 9 of the protocol of Fig. 4 has to be modified in order to forward the overall client request to FO (not just the requested operation). A complete proposal for garbage collection has been described in [32].
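The pruning rule just described can be sketched as follows. This is an illustrative fragment under the stated assumptions (blocking RR, req.id = ⟨client id, client sequence number⟩); the data-structure names are ours:

```python
# Hedged sketch of the garbage-collection rule: receiving a request from
# client c_i with local sequence number s proves (blocking RR) that c_i
# already got the replies to all its requests numbered below s, so any
# state retained for those requests can be discarded.

def garbage_collect(store, client_id, client_seq):
    """Drop every entry of client_id with a local sequence number < client_seq.

    `store` maps (client_id, client_seq) -> retained state, e.g., a cached
    result in FO or a sequenced request in ARH."""
    stale = [key for key in store
             if key[0] == client_id and key[1] < client_seq]
    for key in stale:
        del store[key]

def should_discard(client_id, client_seq, last_seen):
    """Also discard *incoming* requests older than the newest one seen
    from the same client (the 'former inequality' in the text)."""
    return client_seq < last_seen.get(client_id, 0)
```

Since at most one entry per client survives each collection, the retained state is bounded by a linear function of the number of clients, as claimed above.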

8 RELATED WORK

In recent years, the three-tier architectural pattern for building distributed systems has been gaining popularity in both the industrial and research communities.


In particular, the most relevant contribution is due to Yin et al. [42]. In this work, the authors exploit a three-tier architecture for implementing active software replication that tolerates Byzantine faults of clients, middle-tier, and end-tier replicas by running agreement protocols exclusively within the middle-tier, and they further exploit the physical separation of the tiers to enforce some degree of confidentiality (a faulty client never receives information that it is not authorized to get). The focus of that work is on separating Byzantine agreement from end-tier replica computation to favor infrastructure scalability and to enforce confidentiality. In our work, we adopt a similar scheme to show how it can be used to isolate the synchrony requirements necessary for building an efficient solution, and to decouple these requirements from replica deployment. Further, tolerating only crash failures enables replication of nondeterministic replicas to be handled through light changes to the code of ARH and FO. In particular, upon getting a result computed by an end-tier replica, each FO returns to ARH, along with the reply, a state update obtained from the same replica. ARHs store ordered state updates, e.g., using a second sequencer instance, and upon receiving the following request from a client they forward to FOs the necessary state update(s) along with the sequenced request(s), so that FOs can update the corresponding replicas and make them consistent before letting them compute the results of new requests. Let us note that this replication scheme differs from passive replication, since there is no notion of a primary replica and thus no delay in the presence of the fault of a specific end-tier replica [2].

A similar architectural pattern is used by Verissimo et al. in [40], which presents an architectural construct, namely the Timely Computing Base (TCB), for real-time applications running in environments with uncertain timeliness. The TCB assumes the existence of a small part of a system satisfying strict synchrony requirements, used to implement a set of services, i.e., timely execution, duration measurement, and timing failure detection. These services are in turn exploited by the remaining large-scale, complex, asynchronous part of the system to run only the control part of its algorithms with the support of the TCB services. Therefore, the TCB can be regarded as a coverage amplifier of synchrony assumptions for the execution of the time-critical functions of a system, e.g., of the TCB services. As remarked in the previous section, the agreement service provided by the TCB is one of the software artifacts on which efficient TO primitives for the three-tier replication protocol can be built.

In [24], Guerraoui and Schiper define a generic consensus service based on a client-server architecture for solving agreement-related problems, e.g., atomic commitment, group membership, total order multicast, etc. In this architecture, a set of consensus servers runs a consensus protocol on behalf of clients. The authors motivate this architectural choice with modularity and verifiability. As in the three-tier protocol, the architecture confines the solution of agreement problems to a well-defined system region.

Three-tier systems have gained notable popularity in the transactional system area, in which these architectures are used to sharply separate the client (or presentation) logic (implemented by the client-tier), the business logic (implemented by the middle-tier), and the data (maintained in the end-tier), thus favoring isolation, modularity, and maintainability. Current solutions to reliability in commercial three-tier systems are typically transactional [6], [7], [11], [41] and incur significant overheads upon the occurrence of failures. As a consequence, recent works, e.g., [17], [38], [43], compose software replication and high availability with transaction processing in three-tier architectures. Notably, in [22], Frølund and Guerraoui adopt a three-tier architecture to coordinate distributed transactions while enforcing exactly-once semantics despite client reinvocations. According to this scheme, the middle-tier acts as a replicated, highly available, and centralized transaction manager that coordinates distributed transactions involving a set of database managers accessed through standard interfaces.

Let us finally remark that the protocol presented in this paper is one of the results derived from the Interoperable Replication Logic (IRL) project, carried out in our department. This project investigates the three-tier approach to software replication in different settings. The main results of this project can be found in [3], [4], [33]. Specifically, [3] exploits the three-tier approach to develop a middleware platform supporting the development of fault-tolerant CORBA applications according to the FT-CORBA specification. A simple three-tier replication protocol is outlined in that paper (middle-tier replicas adopt a primary-backup scheme and use perfect failure detectors). Another three-tier replication protocol, appearing in [4], [33], implements active replication through a centralized sequencer service. This protocol incurs higher message complexity than the one proposed in this paper and exhibits a single point of failure that must be removed using classical replication techniques.

9 CONCLUSION

Software services replicated using a two-tier approach within a partially synchronous distributed system can suffer from unexpected unavailability during periods in which timing bounds do not hold. The problem is even worse if the deployment of server replicas cannot be controlled. In this case, the only way to reduce this undesirable effect is to design replication protocols that can take advantage of regions of a large and complex distributed system that show an early-synchronous behavior. In this paper, we have presented a three-tier protocol for software replication well-suited to such a setting and proved its formal correctness. The protocol aims to reduce the risk of such unexpected service unavailability by deploying the middle-tier over an “early-synchronous” region of the distributed system (e.g., a LAN in a networked distributed system) while leaving, at the same time, clients and end-tier replicas free to be deployed anywhere. The main feature of this protocol is that it allows a fully distributed implementation of the middle-tier and ensures the termination of a request/reply interaction despite the crash of all end-tier replicas but one.
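The request/reply path just summarized can be illustrated with a minimal single-process sketch. Note that the `MiddleTier` and `Replica` classes and their methods are hypothetical illustrations and not the paper's protocol: in particular, the sequence counter below stands in for the ordering agreement that the paper's fully distributed middle-tier replicas reach among themselves.

```python
import itertools

class Replica:
    """End-tier replica: applies totally ordered requests deterministically."""
    def __init__(self, alive=True):
        self.alive = alive
        self.state = []

    def deliver(self, seq, request):
        if not self.alive:
            return None                      # a crashed replica never replies
        self.state.append((seq, request))    # execute in the agreed order
        return "reply({},{})".format(seq, request)

class MiddleTier:
    """Orders client requests and forwards them to every end-tier replica."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.seq = itertools.count()         # stand-in for the ordering agreement

    def handle(self, request):
        seq = next(self.seq)
        replies = [r.deliver(seq, request) for r in self.replicas]
        # One reply suffices: the interaction terminates despite the
        # crash of all end-tier replicas but one.
        return next(rep for rep in replies if rep is not None)

# Two of three end-tier replicas have crashed; the request still completes.
replicas = [Replica(alive=False), Replica(alive=False), Replica(alive=True)]
print(MiddleTier(replicas).handle("op1"))    # prints reply(0,op1)
```

Because ordering (middle-tier) is separated from execution (end-tier), only the ordering components need to live in the well-behaving region; the replicas that actually execute requests can be placed anywhere.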

Let us finally remark that the availability of the middle-tier becomes a crucial point in our protocol. However, if this environment is managed, then all necessary low-level measures can be taken to maximize the probability that a client is able to reach the middle-tier, for example, backup lines to the Internet, multiple connections to external routers, etc.

MARCHETTI ET AL.: FULLY DISTRIBUTED THREE-TIER ACTIVE SOFTWARE REPLICATION 11

ACKNOWLEDGMENTS

The authors would like to thank Adnan Noor Mian and the anonymous reviewers for their valuable comments and suggestions that greatly improved the content of this work.

REFERENCES

[1] O. Bakr and I. Keidar, “Evaluating the Running Time of a Communication Round over the Internet,” Proc. 21st Ann. Symp. Principles of Distributed Computing, pp. 243-252, 2002.

[2] R. Baldoni and C. Marchetti, “Software Replication in Three-Tier Architectures: Is It a Real Challenge?” Proc. Eighth IEEE Workshop Future Trends of Distributed Computing Systems, pp. 133-139, Nov. 2001.

[3] R. Baldoni and C. Marchetti, “Three-Tier Replication for FT-CORBA Infrastructures,” Software: Practice and Experience, vol. 33, no. 8, pp. 767-797, 2003.

[4] R. Baldoni, C. Marchetti, and S. Tucci-Piergiovanni, “Asynchronous Active Replication in Three-Tier Distributed Systems,” Proc. Ninth IEEE Pacific Rim Symp. Dependable Computing, pp. 19-26, 2002.

[5] B. Ban, “Design and Implementation of a Reliable Group Communication Toolkit for Java,” Cornell Univ., Sept. 1998.

[6] P. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Reading, Mass.: Addison-Wesley, 1987.

[7] P.A. Bernstein and E. Newcomer, Principles of Transaction Processing. Morgan-Kaufmann, 1997.

[8] K. Birman and T. Joseph, “Reliable Communication in the Presence of Failures,” ACM Trans. Computer Systems, vol. 5, no. 1, pp. 47-76, Feb. 1987.

[9] T. Chandra and S. Toueg, “Unreliable Failure Detectors for Reliable Distributed Systems,” J. ACM, pp. 225-267, Mar. 1996.

[10] T.D. Chandra, V. Hadzilacos, and S. Toueg, “The Weakest Failure Detector for Solving Consensus,” J. ACM, vol. 43, no. 4, pp. 685-722, July 1996.

[11] D. Chappel, “How Microsoft Transaction Server Changes the COM Programming Model,” Microsoft Systems J., 1998.

[12] M. Chereque, D. Powell, P. Reynier, J.-L. Richier, and J. Voiron, “Active Replication in Delta-4,” FTCS, pp. 28-37, 1992.

[13] G.V. Chockler, I. Keidar, and R. Vitenberg, “Group Communication Specifications: A Comprehensive Study,” ACM Computing Surveys, vol. 33, no. 4, pp. 427-469, Dec. 2001.

[14] M. Correia, L.C. Lung, N.F. Neves, and P. Verissimo, “Efficient Byzantine-Resilient Reliable Multicast on a Hybrid Failure Model,” Proc. 21st IEEE Symp. Reliable Distributed Systems, pp. 2-11, Oct. 2002.

[15] F. Cristian, H. Aghili, R. Strong, and D. Dolev, “Atomic Broadcast: From Simple Diffusion to Byzantine Agreement,” Proc. 15th Int’l Conf. Fault-Tolerant Computing, 1985.

[16] X. Defago, “Agreement-Related Problems: From Semi-Passive Replication to Totally Ordered Broadcast,” PhD thesis no. 2229, École Polytechnique Fédérale de Lausanne, Switzerland, 2000.

[17] Z. Dianlong and W. Zorn, “End-to-End Transactions in Three-Tier Systems,” Proc. Third Int’l Symp. Distributed Objects and Applications (DOA ’01), pp. 330-339, 2001.

[18] C. Dwork, N.A. Lynch, and L. Stockmeyer, “Consensus in the Presence of Partial Synchrony,” J. ACM, vol. 35, no. 2, pp. 288-323, Apr. 1988.

[19] M. Fischer, N. Lynch, and M. Paterson, “Impossibility of Distributed Consensus with One Faulty Process,” J. ACM, vol. 32, no. 2, pp. 374-382, Apr. 1985.

[20] R. Friedman and E. Hadad, “FTS: A High-Performance CORBA Fault-Tolerance Service,” Proc. Seventh IEEE Int’l Workshop Object-Oriented Real-Time Dependable Systems (WORDS ’02), pp. 61-68, 2002.

[21] R. Friedman and A. Vaysburd, “Fast Replicated State Machines over Partitionable Networks,” Proc. 16th IEEE Int’l Symp. Reliable Distributed Systems (SRDS), Oct. 1997.

[22] R. Guerraoui and S. Frølund, “Implementing E-Transactions with Asynchronous Replication,” IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 2, pp. 133-146, Feb. 2001.

[23] R. Guerraoui and A. Schiper, “Software-Based Replication for Fault Tolerance,” Computer, special issue on fault tolerance, vol. 30, pp. 68-74, Apr. 1997.

[24] R. Guerraoui and A. Schiper, “The Generic Consensus Service,” IEEE Trans. Software Eng., vol. 27, no. 1, pp. 29-41, Jan. 2001.

[25] V. Hadzilacos and S. Toueg, “Fault-Tolerant Broadcast and Related Problems,” Distributed Systems, S. Mullender, ed., chapter 16, Addison-Wesley, 1993.

[26] M. Herlihy and J. Wing, “Linearizability: A Correctness Condition for Concurrent Objects,” ACM Trans. Programming Languages and Systems, vol. 12, no. 3, pp. 463-492, 1990.

[27] I. Keidar, “A Highly Available Paradigm for Consistent Object Replication,” master’s thesis, Inst. of Computer Science, Hebrew Univ., Jerusalem, Israel, 1994.

[28] I. Keidar and D. Dolev, “Efficient Message Ordering in Dynamic Networks,” Proc. 15th ACM Symp. Principles of Distributed Computing (PODC), pp. 68-86, May 1996.

[29] B. Kemme and G. Alonso, “A Suite of Database Replication Protocols Based on Group Communications,” Proc. 18th Int’l Conf. Distributed Computing Systems (ICDCS), May 1998.

[30] L. Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System,” Comm. ACM, vol. 21, no. 7, pp. 558-565, 1978.

[31] S. Landis and S. Maffeis, “Building Reliable Distributed Systems with CORBA,” Theory and Practice of Object Systems, vol. 3, no. 1, 1997.

[32] C. Marchetti, “A Three-Tier Architecture for Active Software Replication,” PhD thesis, Dipartimento di Informatica e Sistemistica, Università degli Studi di Roma “La Sapienza,” 2003.

[33] C. Marchetti, S. Tucci-Piergiovanni, and R. Baldoni, “A Three-Tier Replication Protocol for Large Scale Distributed Systems,” IEICE Trans. Information Systems, special issue on dependable computing (selection of PRDC-02 papers), vol. 86-D, no. 12, pp. 2544-2552, 2003.

[34] L. Moser, P.M. Melliar-Smith, D. Agarwal, R. Budhia, and C. Lingley-Papadopoulos, “Totem: A Fault-Tolerant Multicast Group Communication System,” Comm. ACM, vol. 39, no. 4, pp. 54-63, Apr. 1996.

[35] L.E. Moser, P.M. Melliar-Smith, and P. Narasimhan, “Consistent Object Replication in the Eternal System,” Theory and Practice of Object Systems, vol. 4, no. 3, pp. 81-92, 1998.

[36] D. Powell, G. Bonn, D. Seaton, P. Verissimo, and F. Waeselynck, “The Delta-4 Approach to Dependability in Open Distributed Computing Systems,” Proc. 18th IEEE Int’l Symp. Fault-Tolerant Computing (FTCS-18), June 1988.

[37] F.B. Schneider, “Replication Management Using the State Machine Approach,” Distributed Systems, S. Mullender, ed., ACM Press, Addison-Wesley, 1993.

[38] A. Vaysburd, “Fault Tolerance in Three-Tier Applications: Focusing on the Database Tier,” Proc. Int’l Workshop Reliable Middleware Systems (WREMI ’99), pp. 322-327, 1999.

[39] P. Verissimo and A. Casimiro, “The Timely Computing Base Model and Architecture,” IEEE Trans. Computers, special issue on asynchronous real-time systems, vol. 51, no. 8, pp. 916-930, Aug. 2002.

[40] P. Verissimo, A. Casimiro, and C. Fetzer, “The Timely Computing Base: Timely Actions in the Presence of Uncertain Timeliness,” Proc. First Int’l Conf. Dependable Systems and Networks, 2000.

[41] G.R. Voth, C. Kindel, and J. Fujioka, “Distributed Application Development for Three-Tier Architectures: Microsoft on Windows DNA,” IEEE Internet Computing, vol. 2, no. 2, pp. 41-45, 1998.

[42] J. Yin, J.-P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin, “Separating Agreement from Execution for Byzantine Fault Tolerant Services,” Proc. 19th ACM Symp. Operating Systems Principles, pp. 253-267, 2003.

[43] W. Zhao, L.E. Moser, and P.M. Melliar-Smith, “Unification of Replication and Transaction Processing in Three-Tier Architectures,” Proc. 22nd Int’l Conf. Distributed Computing Systems (ICDCS ’02), pp. 263-270, 2002.

12 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 7, JULY 2006

Carlo Marchetti received the doctoral degree in computer engineering in 2003 from the Faculty of Engineering at the University of Rome “La Sapienza,” where he has been teaching a course on applied software engineering since 2002. Since his laurea degree, received in 1999, he has regularly performed both academic and industrial research in the area of distributed systems and, in particular, on group communications, software replication, and publish&subscribe systems. Since 2004, he has served as an officer at the Italian Senate.

Roberto Baldoni is a professor of distributed systems at the University of Rome “La Sapienza.” He has published more than 100 papers (from theory to practice) in the fields of distributed, p2p, and mobile computing, middleware platforms, and information systems. He is the founder of the MIDdleware LABoratory (MIDLAB), whose members participate in many industrial, national, and European research projects. He regularly serves as an expert for the EU Commission in the evaluation of EU projects and for the Italian Government in the definition of the guidelines of the next interoperable software infrastructures for the public administration. He also regularly serves on the organizing and program committees of the most important scientific conferences in his areas of interest (e.g., ICDCS, DSN, SRDS, DISC, EDCC, PERCOM, EUROPAR, ISORC, DOA, CoopIS, ICPS, etc.). He was invited to chair the program committee of the distributed algorithms track of the 19th IEEE International Conference on Distributed Computing Systems (ICDCS-99) and was PC cochair of the ACM International Workshop on Principles of Mobile Computing (POMC-02). He was the general chair of the Eighth IEEE Workshop on Object-Oriented Real-Time Dependable Systems (WORDS).

Sara Tucci-Piergiovanni received the laurea degree in computer engineering from the University of Rome “La Sapienza” in 2002. Her thesis won the 2002 AICA-Confindustria award for the best Italian thesis in ICT. Since November 2002, she has been a PhD student in the Department of Computer Systems and Science at the University “La Sapienza.” Her research interests include fault tolerance, software replication, publish and subscribe systems, distributed shared memories, and dynamic distributed systems.

Antonino Virgillito received the MSc (“Laurea”) and PhD degrees from the University of Roma “La Sapienza,” Italy, in 2000 and 2004, respectively. Currently, he holds a postdoctoral researcher position with the Dipartimento di Informatica e Sistemistica of the University of Roma “La Sapienza.” In 2005, he also worked as a postdoctoral researcher at INRIA-IRISA, Rennes, France. His main research interests concern middleware architectures for large-scale information dissemination, peer-to-peer systems, fault-tolerant middleware, cooperative information systems, and ad hoc networks. He served as a PC member of the IEEE International Conference on Distributed Computing Systems (ICDCS 2006).



