Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey

XAVIER DÉFAGO

Japan Advanced Institute of Science and Technology and Japan Science and Technology Agency

ANDRÉ SCHIPER

École Polytechnique Fédérale de Lausanne, Switzerland

AND

PÉTER URBÁN

Japan Advanced Institute of Science and Technology

Total order broadcast and multicast (also called atomic broadcast/multicast) present an important problem in distributed systems, especially with respect to fault-tolerance. In short, the primitive ensures that messages sent to a set of processes are, in turn, delivered by all those processes in the same total order.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems; C.2.2 [Computer-Communication Networks]: Network Protocols—Applications; Protocol architecture; D.4.4 [Operating Systems]: Communications Management—Message sending; Network communication; D.4.5 [Operating Systems]: Reliability—Fault-tolerance; H.2.4 [Database Management]: Systems—Distributed databases; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems

    General Terms: Algorithms, Reliability, Design

Additional Key Words and Phrases: Distributed systems, distributed algorithms, group communication, fault-tolerance, agreement problems, message passing, total ordering, global ordering, atomic multicast, atomic broadcast, classification, taxonomy, survey

Part of this research was conducted for the program "Fostering Talent in Emergent Research Fields" in Special Coordination Funds for Promoting Science and Technology by the Japan Ministry of Education, Culture, Sports, Science and Technology. This work was initiated during Xavier Défago's Ph.D. research at the Swiss Federal Institute of Technology in Lausanne [Défago 2000]. Péter Urbán was supported by the Japan Society for the Promotion of Science, a Grant-in-Aid for JSPS Fellows from the Japanese Ministry of Education, Culture, Sports, Science and Technology, the Swiss National Science Foundation, and the CSEM Swiss Center for Electronics and Microtechnology, Inc., Neuchâtel.

Authors' addresses: X. Défago and P. Urbán, School of Information Science, JAIST, 1-1 Asahidai, Tatsunokuchi, Nomigun, Ishikawa 923-1292, Japan; email: {defago,urban}@jaist.ac.jp; A. Schiper, IC-LSR, School of Information and Communication, EPFL, CH-1015 Lausanne, Switzerland; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].

© 2004 ACM 0360-0300/04/1200-0001 $5.00

ACM Computing Surveys, Vol. 36, No. 4, December 2004, pp. 1–50.


The problem has inspired an abundance of literature, with a plethora of proposed algorithms. This article proposes a classification of total order broadcast and multicast algorithms based on their ordering mechanisms, and addresses a number of other important issues. The article surveys about sixty algorithms, thus providing by far the most extensive study of the problem so far. The article discusses algorithms for both the synchronous and the asynchronous system models, and studies the respective properties and behavior of the different algorithms.

    1. INTRODUCTION

Distributed systems and applications are notoriously difficult to build. This is mostly due to the unavoidable concurrency in such systems, combined with the difficulty of providing a global control. This difficulty is greatly reduced by relying on group communication primitives that provide higher guarantees than standard point-to-point communication. One such primitive is called total order¹ broadcast.² Informally, the primitive ensures that messages sent to a set of processes are delivered by all those processes in the same order. Total order broadcast is an important primitive that plays a central role, for instance, when implementing the state machine approach (also called active replication) [Lamport 1978a; Schneider 1990; Poledna 1994]. It also has other applications, such as clock synchronization [Rodrigues et al. 1993], computer-supported cooperative writing, distributed shared memory, and distributed locking [Lamport 1978b]. More recently, it was also shown that an adequate use of total order broadcast can significantly improve the performance of replicated databases [Agrawal et al. 1997; Pedone et al. 1998; Kemme et al. 2003].

¹ Total order broadcast is also known as atomic broadcast. Both terminologies are currently in use. There is a slight controversy with respect to using one over the other. We opt for the former, that is, total order broadcast, because the latter is somewhat misleading. Indeed, atomicity suggests a property related to agreement rather than to total order (defined in Sect. 2), and the ambiguity has already been a source of misunderstandings. In contrast, total order broadcast unambiguously refers to the property of total order.

² Total order multicast is sometimes used instead of total order broadcast. The distinction between the two primitives is explained later in the article (Section 3). When the distinction is not important, we use the term total order broadcast.

Literature on total order broadcast. There exists a considerable amount of literature on total order broadcast, and many algorithms, following various approaches, have been proposed to solve this problem. It is, however, difficult to compare them as they often differ with respect to their actual properties, assumptions, objectives, or other important aspects. It is hence difficult to know which solution is best suited to a given application context. When confronted with new requirements, the absence of a roadmap to the problem of total order broadcast can lead engineers and researchers to either develop new algorithms rather than adapt existing solutions (thus reinventing the wheel), or use a solution poorly suited to the application needs. An important step to improve the present situation is to provide a classification of existing algorithms.

Related work. Previous attempts have been made at classifying and comparing total order broadcast algorithms [Anceaume 1993b; Anceaume and Minet 1992; Cristian et al. 1994; Friedman and van Renesse 1997; Mayer 1992]. However, none is based on a comprehensive survey of existing algorithms, and hence they all lack generality.

The most complete comparison so far was done by Anceaume and Minet [1992] (an extended version was later published in French by Anceaume [1993b]), who take an interesting approach based on the properties of the algorithms. Their paper raises some fundamental questions that inspired a part of our work. It is, however, a little outdated now. In addition, the authors only study seven different algorithms, which are not truly representative; for instance, none is based on a communication history approach (one of the five classes of algorithms; details in Section 4.4).


Cristian et al. [1994] take a different approach, focusing on the implementation of the algorithms, rather than their properties. They study four different algorithms, and compare them using discrete event simulation. They find interesting results regarding the respective performance of different implementation strategies. Nevertheless, they fail to discuss the respective properties of the different algorithms. Besides, as they compare only four algorithms, this work is less general than Anceaume's [1993b].

Friedman and van Renesse [1997] study the impact of packing messages on the performance of algorithms. For this purpose, they study six algorithms, including those studied by Cristian et al. [1994]. They measure the actual performance of the algorithms and confirm the observations made by Cristian et al. [1994]. They show that packing several protocol messages into a single physical message indeed provides an effective way to improve the performance of algorithms. The comparison also lacks generality, but this is quite understandable as this is not the main concern of their paper.

Mayer [1992] defines a framework in which total order broadcast algorithms can be compared from a performance point of view. The definition of such a framework is an important step toward an extensive and meaningful comparison of algorithms. However, the paper does not actually compare the numerous existing algorithms.

Contributions. In this article, we propose a classification of total order broadcast algorithms based on the mechanism used to order messages. The reason for this choice is that the ordering mechanism is the characteristic with the strongest influence on the communication pattern of the algorithm: two algorithms of the same class are likely to exhibit similar behaviors. We define five classes of ordering mechanisms: communication history, privilege-based, moving sequencer, fixed sequencer, and destinations agreement.

In this article, we also provide a vast survey of about sixty published total order broadcast algorithms. Wherever possible, we mention the properties and the assumptions of each algorithm. This is, however, not always possible because the information available in the papers is often not sufficient to accurately characterize the behavior of the algorithm (e.g., in the face of a failure).

Structure. The article is logically organized into four main parts: specification, ordering mechanisms and taxonomy, fault-tolerance, and survey. More precisely, the article is structured as follows. Section 2 presents the specification of the total order broadcast problem (also known as atomic broadcast). Section 3 extends the specification by considering the characteristics of destination groups (e.g., single versus multiple groups). In Section 4, we define five classes of total order broadcast algorithms, according to the way messages are ordered: communication history, privilege-based, moving sequencer, fixed sequencer, and destinations agreement. Section 5 discusses system model issues in relation to failures. Section 6 presents the main mechanisms on which total order broadcast algorithms rely to ensure fault-tolerance. Section 7 gives a broad survey of total order broadcast algorithms found in the literature. Algorithms are grouped along their respective classes, and we discuss their principal characteristics. Section 8 discusses some other issues of interest that are related to total order broadcast. Finally, Section 9 concludes the article.

2. SPECIFICATION OF TOTAL ORDER BROADCAST

In this section, we give the formal specification of the total order broadcast problem. As there are many variants of the problem, we present here the simplest specification, and discuss other variants in Section 3.

    2.1. Notation

Table I summarizes some of the notations used throughout the article. M is the set containing all possible valid messages. Π denotes the set of all processes in the system. Given some arbitrary message m, sender(m) designates the process in Π from which m originates, and Dest(m) denotes the set of all destination processes for m.

Table I. Notation
  M           set of all valid messages.
  Π           set of all processes in the system.
  sender(m)   sender of message m.
  Dest(m)     set of destination processes for message m.
  Πsender     set of all sending processes in the system.
  Πdest       set of all destination processes in the system.

In addition, Πsender is the set of all processes in Π that can potentially send some valid message.

Πsender = {p ∈ Π | p can send some message m ∈ M}.   (1)

Likewise, Πdest is the set of all potential destinations of valid messages.

Πdest = ⋃_{m ∈ M} Dest(m).   (2)
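These two derived sets can be computed mechanically from a list of messages. The following Python sketch is our own illustration (the Message record and the sample set M are hypothetical, not part of the article):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: str          # sender(m)
    dest: frozenset      # Dest(m)

# A hypothetical set M of valid messages.
M = [
    Message("p1", frozenset({"p1", "p2", "p3"})),
    Message("p2", frozenset({"p2", "p3"})),
]

# Eq. (1): the set of processes that send at least one valid message.
pi_sender = {m.sender for m in M}

# Eq. (2): the union of all destination sets.
pi_dest = set().union(*(m.dest for m in M))

print(sorted(pi_sender))  # ['p1', 'p2']
print(sorted(pi_dest))    # ['p1', 'p2', 'p3']
```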

    2.2. Process Failures

The specification of total order broadcast requires the definition of the notion of a correct process. The following classes of process failures are commonly considered:

Crash failures. When a process crashes, it ceases functioning forever. This means that it stops performing any activity, including sending, transmitting, or receiving any message.

Omission failures. When a process fails by omission, it omits performing some actions, such as sending or receiving a message.

Timing failures. A timing failure occurs when a process violates some of the timing assumptions of the system model (details in Section 5.1). Obviously, this type of failure does not exist in asynchronous system models, because of the absence of timing assumptions in such systems.

Byzantine failures. Byzantine failures are the most general type of failures. A Byzantine component is allowed any arbitrary behavior. For instance, a faulty process may change the content of messages, duplicate messages, send unsolicited messages, or even maliciously try to break down the whole system.

A correct process is defined as a process that never exhibits any of the faulty behaviors mentioned above.

2.3. Basic Specification of Total Order Broadcast

We can now give the simplest specification of total order broadcast. Formally, the problem is defined in terms of two primitives, which are called TO-broadcast(m) and TO-deliver(m), where m ∈ M is some message. When a process p executes TO-broadcast(m) (respectively TO-deliver(m)), we may say that p TO-broadcasts m (respectively TO-delivers m). We assume that every message m can be uniquely identified, and carries the identity of its sender, denoted by sender(m). In addition, we assume that, for any given message m, and any run, TO-broadcast(m) is executed at most once. In this context, total order broadcast is defined by the following properties [Hadzilacos and Toueg 1994; Chandra and Toueg 1996]:

(VALIDITY) If a correct process TO-broadcasts a message m, then it eventually TO-delivers m.

(UNIFORM AGREEMENT) If a process TO-delivers a message m, then all correct processes eventually TO-deliver m.

(UNIFORM INTEGRITY) For any message m, every process TO-delivers m at most once, and only if m was previously TO-broadcast by sender(m).

(UNIFORM TOTAL ORDER) If processes p and q both TO-deliver messages m and m′, then p TO-delivers m before m′ if and only if q TO-delivers m before m′.
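On finite executions, the ordering and agreement parts of this specification can be checked mechanically over delivery traces. The sketch below is our own illustration, not code from the article; `deliveries[p]` is an assumed finite sequence of messages TO-delivered by process p, and `correct` is the set of correct processes:

```python
def check_uniform_total_order(deliveries):
    """Uniform Total Order: any two processes deliver their
    common messages in the same relative order."""
    procs = list(deliveries)
    for i, p in enumerate(procs):
        for q in procs[i + 1:]:
            common = set(deliveries[p]) & set(deliveries[q])
            # Restrict each trace to the common messages; the
            # resulting subsequences must be identical.
            sp = [m for m in deliveries[p] if m in common]
            sq = [m for m in deliveries[q] if m in common]
            if sp != sq:
                return False
    return True

def check_uniform_agreement(deliveries, correct):
    """Uniform Agreement: every message delivered by ANY process
    (correct or faulty) is delivered by all correct processes."""
    delivered_somewhere = set().union(*map(set, deliveries.values()))
    return all(delivered_somewhere <= set(deliveries[p]) for p in correct)

# Faulty p3 delivered m1 and m2 in the wrong order: violation.
trace = {"p1": ["m1", "m2"], "p2": ["m1", "m2"], "p3": ["m2", "m1"]}
print(check_uniform_total_order(trace))              # False
print(check_uniform_agreement(trace, {"p1", "p2"}))  # True
```

Note that such checkers only falsify a finite prefix of a run; the liveness part of Validity and Agreement ("eventually") cannot be decided from a finite trace.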

A broadcast primitive that satisfies all these properties except Uniform Total Order (i.e., that provides no ordering guarantee) is called a reliable broadcast.


    Fig. 1. Violation of uniform agreement (example).

Validity and Uniform Agreement are liveness properties. Roughly speaking, this means that, at any point in time, no matter what has happened up to that point, it is still possible for the property to eventually hold [Charron-Bost et al. 2000]. Uniform Integrity and Uniform Total Order are safety properties. This means that, if, at some point in time, the property does not hold, then no matter what happens later, the property cannot eventually hold.

    2.4. Nonuniform Properties

In the above definition of total order broadcast, the properties of Agreement and Total Order are uniform. This means that these properties apply not only to correct processes, but also to faulty ones. For instance, with Uniform Total Order, a process is not allowed to deliver any message out of order, even if it is faulty. Conversely, (nonuniform) Total Order applies only to correct processes, and hence puts no restriction on the behavior of faulty processes.

Uniform properties are strong guarantees that might make life easier for application developers. Not all applications need uniformity, however, and enforcing uniformity often has a cost. For this reason, it is also important to consider weaker problems specified using nonuniform properties, though nonuniform properties may lead to inconsistencies at the application level. However, an application might protect itself from nonuniformity by voting (e.g., given an application that collects replies from the destinations of a total order broadcast, the application may vote on the replies received, and consider a reply to be effective only after receiving the same reply from a majority). Nonuniform Agreement and Total Order are specified as follows:

(AGREEMENT) If a correct process TO-delivers a message m, then all correct processes eventually TO-deliver m.

(TOTAL ORDER) If two correct processes p and q both TO-deliver messages m and m′, then p TO-delivers m before m′ if and only if q TO-delivers m before m′.

The combinations of uniform and nonuniform properties define four different specifications of the problem of fault-tolerant total order broadcast. These definitions constitute a hierarchy of problems, as discussed extensively by Wilhelm and Schiper [1995]. However, for simplicity, we say that a total order broadcast algorithm is uniform when it satisfies both Uniform Agreement and Uniform Total Order, and we say that an algorithm is nonuniform when it enforces neither (i.e., only their nonuniform counterparts). We give no special name to the two hybrid definitions.

Figure 1 illustrates a violation of the Uniform Agreement property with a simple example. In this example, the sequencer p1 sends a message m, using total order broadcast. It first assigns a sequence number to m, then sends m to all processes, and finally delivers m. Process p1 crashes shortly afterwards, and no other process receives m (due to message loss). As a result, no correct process (e.g., p2) will ever be able to deliver m. Uniform Agreement is violated, but not (nonuniform) Agreement: no correct process ever delivers m (p1 is not correct).
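The scenario of Figure 1 can be replayed as a toy trace. Everything below (process names, the message-loss step, the two predicates) merely mirrors the figure and is our own illustration, not code from the article:

```python
# Replaying Figure 1: a fixed sequencer delivers its own message
# before the copies sent to the others are lost, then crashes.
deliveries = {"p1": [], "p2": [], "p3": []}

# p1 (the sequencer) assigns a sequence number and delivers m locally.
deliveries["p1"].append("m")

# The copies of m sent to p2 and p3 are lost; p1 then crashes.
crashed = {"p1"}
correct = {"p2", "p3"}

# Nonuniform Agreement looks only at correct processes: not violated.
delivered_by_correct = {m for p in correct for m in deliveries[p]}
agreement_ok = all(m in deliveries[p] for m in delivered_by_correct for p in correct)

# Uniform Agreement also counts the faulty p1: violated.
delivered_anywhere = {m for msgs in deliveries.values() for m in msgs}
uniform_ok = all(m in deliveries[p] for m in delivered_anywhere for p in correct)

print(agreement_ok)  # True  (no correct process ever delivered m)
print(uniform_ok)    # False (p1 delivered m; p2 and p3 never will)
```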

Fig. 2. Contamination of correct processes (p1, p2) by a message (m4), based on an inconsistent state (p3 delivered m3 but not m2).

Note 1 (Byzantine failures and uniformity). Algorithms tolerant to Byzantine failures can guarantee none of the uniform properties given in Section 2.3. This is understandable, as no behavior can be enforced on Byzantine processes. In other words, nothing can prevent a Byzantine process from (1) delivering a message more than once (violates Integrity), (2) delivering a message that is not delivered by other processes (violates Agreement), or (3) delivering two messages in the wrong order (violates Total Order).

Reiter [1994] proposes a more useful definition of uniformity for Byzantine systems. He distinguishes between crashes and Byzantine failures. He says that a process is honest if it behaves according to its specification, and corrupt otherwise (i.e., Byzantine), where honest processes can also fail by crashing. In this context, uniform properties are those that are enforced by all honest processes, regardless of whether they are correct or not. This definition is more sensible than the stricter definition of Section 2.3, as nothing is required from corrupt processes.

Note 2 (Safety/liveness and uniformity). Charron-Bost et al. [2000] have shown that, in the context of failures, some nonuniform properties that are commonly believed to be safety properties are actually liveness properties. They have proposed refinements of the concepts of safety and liveness that avoid the counterintuitive classification.

    2.5. Contamination

The problem of contamination comes from the observation that, even with the strongest specification (i.e., with Uniform Agreement and Uniform Total Order), total order broadcast does not prevent a faulty process p from reaching an inconsistent state before it crashes. This is a serious problem because p can legally TO-broadcast a message based on this inconsistent state, and thus contaminate correct processes [Gopal and Toueg 1991; Anceaume and Minet 1992; Anceaume 1993b; Hadzilacos and Toueg 1994].

2.5.1. Illustration. Figure 2 illustrates an example [Charron-Bost et al. 1999; Hadzilacos and Toueg 1994] in which an incorrect process contaminates the correct processes. Process p3 delivers messages m1 and m3, but not m2. So, its state is inconsistent when it multicasts m4 to the other processes before crashing. The correct processes p1 and p2 deliver m4, thus becoming contaminated by the inconsistent state of p3. It is important to stress again that the situation depicted in Figure 2 satisfies even the strongest specification presented so far.

2.5.2. Specification. It is possible to extend or reformulate the specification of total order broadcast in such a way that it disallows contamination. The solution consists of preventing any process from delivering a message that may lead to an inconsistent state.

Aguilera et al. [2000] propose a reformulation of Uniform Total Order which, unlike the traditional definition, is not prone to contamination, as it does not allow gaps in the delivery sequence:

(GAP-FREE UNIFORM TOTAL ORDER) If some process delivers message m′ after message m, then a process delivers m′ only after it has delivered m.

As an alternative, an older formulation uses the history of delivery and requires that, for any two given processes, the history of one is a prefix of the history of the other. This is expressed by the following property [Anceaume and Minet 1992; Cristian et al. 1994; Keidar and Dolev 2000]:


(PREFIX ORDER) For any two processes p and q, either hist(p) is a prefix of hist(q), or hist(q) is a prefix of hist(p), where hist(p) and hist(q) are the sequences of messages delivered by p and q, respectively.
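Prefix Order is straightforward to state as a predicate over finite delivery histories. A minimal sketch (our own illustration, not from the article):

```python
def is_prefix(a, b):
    """True iff sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and b[:len(a)] == a

def check_prefix_order(histories):
    """Prefix Order: for any two processes, one delivery history
    is a prefix of the other (so no history has gaps)."""
    hs = list(histories.values())
    return all(is_prefix(a, b) or is_prefix(b, a)
               for i, a in enumerate(hs) for b in hs[i + 1:])

# p3 has a gap (it delivered m1 and m3 but not m2): Prefix Order
# fails, even though the relative order of common messages agrees.
print(check_prefix_order({"p1": ["m1", "m2", "m3"], "p3": ["m1", "m3"]}))  # False
print(check_prefix_order({"p1": ["m1", "m2", "m3"], "p2": ["m1", "m2"]}))  # True
```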

Note 3. The specification of total order broadcast using Prefix Order precludes the dynamic join of processes (e.g., with a group membership). This can be circumvented, but the resulting property is much more complicated. For this reason, the simpler alternative proposed by Aguilera et al. [2000] is preferred.

Note 4 (Byzantine failures and contamination). Contamination cannot be avoided in the face of arbitrary failures. This is because a faulty process may be inconsistent even if it delivers all messages correctly. It may then contaminate the other processes by broadcasting a bogus message that seems correct to every other process [Hadzilacos and Toueg 1994].

    2.6. Other Ordering Properties

The Total Order property (see Section 2.3) restricts the order of message delivery based solely on the destinations; that is, the property is independent of the sender processes. The definition can be further restricted by two properties related to the senders, namely, FIFO Order and Causal Order.

2.6.1. FIFO Order. Total Order alone does not guarantee that messages are delivered in the order in which they are sent (i.e., in first-in/first-out order). Yet, this property is sometimes required by applications in addition to Total Order. The property is called FIFO Order:

(FIFO ORDER) If a correct process TO-broadcasts a message m before it TO-broadcasts a message m′, then no correct process delivers m′ unless it has previously delivered m.
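On finite traces, FIFO Order reduces to a per-sender subsequence check. A small sketch (illustrative only; `broadcasts[p]` is the assumed order in which sender p TO-broadcast its messages):

```python
def check_fifo_order(broadcasts, deliveries):
    """FIFO Order on finite traces: every process delivers the
    messages of any single sender in that sender's broadcast order."""
    for sent in broadcasts.values():
        pos = {m: i for i, m in enumerate(sent)}
        for trace in deliveries.values():
            # Project the delivery trace onto this sender's messages;
            # the broadcast indices must appear in increasing order.
            idx = [pos[m] for m in trace if m in pos]
            if idx != sorted(idx):
                return False
    return True

broadcasts = {"p1": ["a", "b"]}
print(check_fifo_order(broadcasts, {"p2": ["a", "b"]}))  # True
print(check_fifo_order(broadcasts, {"p2": ["b", "a"]}))  # False
```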

2.6.2. Causal Order. The notion of causality in the context of distributed systems was first formalized by Lamport [1978b]. It is based on the relation precedes³ (denoted by →), defined in his seminal paper and extended in a later paper [Lamport 1986b]. The relation precedes is defined as follows.

Definition 1. Let ei and ej be two events in a distributed system. The transitive relation ei → ej holds if any one of the following three conditions is satisfied:

(1) ei and ej are two events on the same process, and ei comes before ej;

(2) ei is the sending of a message m by one process, and ej is the receipt of m by another process; or,

(3) there exists a third event ek such that ei → ek and ek → ej (transitivity).

This relation defines an irreflexive partial ordering on the set of events. The causality of messages can be defined by the precedes relationship between their respective sending events. More precisely, a message m is said to precede a message m′ (denoted m → m′) if the sending event of m precedes the sending event of m′.
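On a finite execution, Definition 1 can be evaluated by taking the transitive closure of the direct edges given by conditions (1) and (2); condition (3) is the closure itself. A sketch (the events and edges form a hypothetical two-process execution, not an example from the article):

```python
def transitive_closure(edges, events):
    """Warshall-style closure: reach[e] ends up holding every
    event that e precedes, directly or transitively."""
    reach = {e: set(edges.get(e, ())) for e in events}
    for k in events:          # pivot event (must be the outer loop)
        for i in events:
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach

# Direct 'precedes' edges:
#   e1 -> e2   same process, e1 comes first      (condition 1)
#   f1 -> f2   same process, f1 comes first      (condition 1)
#   e1 -> f1   e1 sends m, f1 is its receipt     (condition 2)
events = ["e1", "e2", "f1", "f2"]
edges = {"e1": {"e2", "f1"}, "f1": {"f2"}}

reach = transitive_closure(edges, events)
print("f2" in reach["e1"])  # True:  e1 -> f1 -> f2 by condition (3)
print("f1" in reach["e2"])  # False: e2 and f1 are concurrent
```

Events related in neither direction (here e2 and f1) are exactly the concurrent events of the irreflexive partial order.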

The property of causal order for broadcast messages is defined as follows [Hadzilacos and Toueg 1994]:

(CAUSAL ORDER) If the broadcast of a message m causally precedes the broadcast of a message m′, then no correct process delivers m′ unless it has previously delivered m.

Hadzilacos and Toueg [1994] also prove that the property of Causal Order is equivalent to combining the property of FIFO Order with the following property of Local Order.

(LOCAL ORDER) If a process broadcasts a message m and a process delivers m before broadcasting m′, then no correct process delivers m′ unless it has previously delivered m.

Note 5 (State-machine approach). A total order broadcast ensuring causal order is, for instance, required by the state machine approach [Lamport 1978a; Schneider 1990]. However, we think that some applications may require causality, and some others not.

³ Lamport initially called the relation happened before [Lamport 1978b], but he renamed it precedes in later work [Lamport 1986a, 1986b].

2.6.3. Source Ordering. Some papers (e.g., Garcia-Molina and Spauster [1991] and Jia [1995]) make a distinction between single source and multiple source ordering. These papers define single source ordering algorithms as algorithms that ensure total order only if a single process broadcasts messages. This is a special case of FIFO broadcast, easily solved using sequence numbers. Source ordering is not particularly interesting in itself, and hence we do not discuss the issue further in this article.

    3. PROPERTIES OF DESTINATION GROUPS

So far, we have presented the problem of total order broadcast, wherein messages are sent to all processes in the system. In other words, all valid messages are addressed to the entire system:

∀m ∈ M (Dest(m) = Π).   (3)

A multicast primitive is more general in the sense that it can send messages to any chosen subset of the processes. In other words, we can have two valid messages sent to different destination sets, or the destination set may not include the message sender:

∃m ∈ M (sender(m) ∉ Dest(m)) ∨ ∃mi, mj ∈ M (Dest(mi) ≠ Dest(mj)).   (4)

Although in wide use, the distinction between broadcast and multicast is not precise enough. This leads us to discuss a more relevant distinction, namely, between closed versus open groups, and between single versus multiple groups.

    3.1. Closed Versus Open Groups

In the literature, many algorithms are designed with the implicit assumption that messages are sent within a group of processes. This originally came from the fact that early work on this topic was done in the context of parallel machines [Lamport 1978a], or highly available storage systems [Cristian et al. 1995]. However, most distributed applications are now developed by considering more open interaction models, such as the client-server model, N-tier architectures, or publish/subscribe. For this reason, it is necessary for a process to be able to multicast messages to a group to which it does not belong. Consequently, we consider it an important characteristic of algorithms that they be easily adaptable to open interaction models.

3.1.1. Closed Group Algorithms. In closed group algorithms, the sending process is always one of the destination processes:

∀m ∈ M (sender(m) ∈ Dest(m)). (5)

So, these algorithms do not allow external processes (processes that are not members of the group) to multicast messages to the destination group.

3.1.2. Open Group Algorithms. Conversely, open group algorithms allow any arbitrary process in the system to multicast messages to a group, whether or not the sender process belongs to the destination group. More precisely, there are some valid messages where the sender is not one of the destinations:

∃m ∈ M (sender(m) ∉ Dest(m)). (6)

Open group algorithms are more general than closed group algorithms: the former can be used with closed groups, while the opposite is not true.

    3.2. Single Versus Multiple Groups

Most algorithms presented in the literature assume that all messages are multicast to one single group of destination processes. Nevertheless, a few algorithms are designed to support multiple groups. In this context, we consider three situations: single group, multiple disjoint groups, and

    ACM Computing Surveys, Vol. 36, No. 4, December 2004.

  • Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey 9

multiple overlapping groups. We also discuss how useless, trivial solutions can be ruled out with the notion of minimality. Since the ability to multicast messages to multiple destination sets is critical for certain classes of applications, we regard this ability as an important characteristic of an algorithm.

3.2.1. Single Group Ordering. With single group ordering, all messages are multicast to one single group of destination processes. As mentioned above, this is the model considered by a vast majority of the algorithms that are studied in this article. Single group ordering can be defined by the following property:⁴

∀mi, mj ∈ M (Dest(mi) = Dest(mj)). (7)

3.2.2. Multiple Groups Ordering (Disjoint). In some applications, the restriction to one single destination group is not acceptable. For this reason, algorithms have been proposed that support multicasting messages to multiple groups. The simplest case occurs when the multiple groups are disjoint groups. More precisely, if two valid messages have different destination sets, then these sets do not intersect:

∀mi, mj ∈ M (Dest(mi) ≠ Dest(mj) ⇒ Dest(mi) ∩ Dest(mj) = ∅). (8)

Adapting algorithms designed for one single group to work in a system with multiple disjoint groups is almost trivial.

3.2.3. Multiple Groups Ordering (Overlapping). In the case of multiple groups ordering, it can happen that groups overlap. This can be expressed by the fact that some pairs of valid messages have different destination sets with a nonempty

⁴This definition and the following ones are static. They do not take into account the fact that processes can join groups and leave groups. Nevertheless, we prefer these simple static definitions, rather than more complex ones that would take dynamic destination groups into account.

    intersection:

∃mi, mj ∈ M (Dest(mi) ≠ Dest(mj) ∧ Dest(mi) ∩ Dest(mj) ≠ ∅). (9)
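The three destination-set patterns of Equations (7)-(9) can be checked mechanically on a finite set of messages. The following is a hypothetical sketch (function and variable names are ours, not from the survey), where `M` maps message identifiers to destination sets:

```python
# Hypothetical sketch: checking the destination-set patterns of Eqs. (7)-(9).
# M maps each message identifier to its destination set Dest(m).

def single_group(M):
    # Eq. (7): all messages share one and the same destination group.
    dests = list(M.values())
    return all(d == dests[0] for d in dests)

def disjoint_groups(M):
    # Eq. (8): any two different destination sets never intersect.
    ds = list(M.values())
    return all(a == b or not (a & b) for a in ds for b in ds)

def overlapping_groups(M):
    # Eq. (9): some pair has different but intersecting destination sets.
    ds = list(M.values())
    return any(a != b and (a & b) for a in ds for b in ds)
```

For instance, `{'m1': {1, 2}, 'm2': {2, 3}}` satisfies the overlapping pattern of Equation (9), while `{'m1': {1, 2}, 'm2': {3, 4}}` satisfies the disjoint pattern of Equation (8).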

The real difficulty of designing total order multicast algorithms for multiple groups arises when the groups can overlap. This is easily understood when one considers the problem of ensuring total order at the intersection of groups. In this context, Hadzilacos and Toueg [1994] give three different properties for total order in the presence of multiple groups: Local Total Order, Pairwise Total Order, and Global Total Order.⁵

(LOCAL TOTAL ORDER) If correct processes p and q both TO-deliver messages m and m′ and Dest(m) = Dest(m′), then p TO-delivers m before m′ if and only if q TO-delivers m before m′.

Local Total Order is the weakest of the three properties. It requires that total order be enforced only for messages that are multicast within the same group.

Note also that multiple unrelated groups can be considered as disjoint groups even if they overlap. Indeed, destination processes belonging to the intersection of two groups can be seen as having two distinct identities, one for each group. It follows that an algorithm for distinct multiple groups can be trivially adapted to support overlapping groups with Local Total Order.

As pointed out by Hadzilacos and Toueg [1994], the total order multicast primitive of the first version of Isis [Birman and Joseph 1987] guaranteed Local Total Order.⁶

(PAIRWISE TOTAL ORDER) If two correct processes p and q both TO-deliver messages m and m′, then p TO-delivers m

⁵The ordering properties cited here are subject to contamination; see Section 2.5. Contamination can be avoided by formulating these properties similarly to the Gap-free Uniform Total Order property.
⁶It should be noted that, if the transformation is trivial from a conceptual point of view, the implementation was certainly a totally different matter, especially in the mid-80s.


before m′ if and only if q TO-delivers m before m′.

Pairwise Total Order is strictly stronger than Local Total Order. Most notably, it requires that total order be enforced for all messages delivered at the intersection of two groups.

As far as we know, there is no straightforward algorithm to transform a total order multicast algorithm that enforces Local Total Order into one that also guarantees Pairwise Total Order (except for trivial solutions; see Section 3.2.4). Hadzilacos and Toueg [1994] observe that, for instance, Pairwise Total Order is the order property guaranteed by the algorithm of Garcia-Molina and Spauster [1989, 1991].

Pairwise Total Order alone may lead to unexpected situations when there are three or more overlapping destination groups. For instance, Fekete [1993] illustrates the problem with the following scenario. Consider three processes pi, pj, pk, and three messages m1, m2, m3 that are respectively sent to three different overlapping groups G1 = {pi, pj}, G2 = {pj, pk}, and G3 = {pk, pi}. Pairwise Total Order allows the following histories on pi, pj, pk:

    pi : TO-deliver(m3) TO-deliver(m1)

    pj : TO-deliver(m1) TO-deliver(m2)

    pk : TO-deliver(m2) TO-deliver(m3)

This situation is prevented by the specification of Global Total Order [Hadzilacos and Toueg 1994], which is defined as follows:

(GLOBAL TOTAL ORDER) The relation < is acyclic, where < is defined as follows: m < m′ if and only if any correct process delivers m and m′, in that order.

Note 6. Fekete [1993] gives another specification for total order multicast which also prevents the scenario just mentioned. The specification, called AMC, is

expressed as an I/O automaton [Lynch and Tuttle 1989; Lynch 1996] and uses the notion of pseudo-time to impose an order on the delivery of messages.
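Fekete's three-process scenario can be checked mechanically: the pairwise-ordered histories induce a cyclic "delivered-before" relation, which Global Total Order forbids. The following is a hypothetical sketch (names are ours) over finite histories:

```python
# Sketch of the Fekete [1993] scenario: build the delivered-before relation
# "<" from per-process delivery histories, and check whether it is acyclic.

def delivered_before(histories):
    # m < m' iff some process delivers m and, later, m'.
    rel = set()
    for h in histories.values():
        for i, m in enumerate(h):
            for m2 in h[i + 1:]:
                rel.add((m, m2))
    return rel

def acyclic(rel):
    # Simple DFS-based cycle check over the relation viewed as a graph.
    graph = {}
    for a, b in rel:
        graph.setdefault(a, []).append(b)
    def reaches(start, target, seen):
        for nxt in graph.get(start, []):
            if nxt == target or (nxt not in seen and reaches(nxt, target, seen | {nxt})):
                return True
        return False
    return not any(reaches(n, n, {n}) for n in graph)

# The histories allowed by Pairwise Total Order in the scenario above:
histories = {'pi': ['m3', 'm1'], 'pj': ['m1', 'm2'], 'pk': ['m2', 'm3']}
```

Here `delivered_before(histories)` contains m3 < m1, m1 < m2, and m2 < m3, so `acyclic` returns `False`: the histories satisfy Pairwise Total Order but violate Global Total Order.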

3.2.4. Minimality and Trivial Solutions. Any algorithm that solves the problem of total order broadcast in a single group can easily be adapted to solve the problem for multiple groups with the following approach:

(1) form a super-group with the union of all destination groups;

(2) whenever a message m is multicast to a group, multicast it to the super-group; and

    (3) processes not in Dest(m) discard m.
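The three-step reduction above can be sketched as follows (a hypothetical sketch with names of our own; a plain loop over the super-group stands in for the single-group total order broadcast primitive, which by assumption delivers in the same order everywhere):

```python
# Hypothetical sketch of the super-group reduction: total order *multicast*
# built on top of a single-group total order *broadcast*.

class SuperGroupMulticast:
    def __init__(self, groups):
        # (1) the super-group is the union of all destination groups
        self.super_group = set().union(*groups)
        self.delivered = {p: [] for p in self.super_group}

    def to_multicast(self, m, dest):
        # (2) TO-broadcast m to the whole super-group: every process sees
        # the same sequence of (m, Dest(m)) pairs, in the same order
        for p in self.super_group:
            # (3) processes outside Dest(m) simply discard m
            if p in dest:
                self.delivered[p].append(m)
```

A process at the intersection of two groups (here, any process in both destination sets) thus delivers the messages of both groups in a single common order, which is why the reduction yields total order across groups.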

The problem with this approach is its inherent lack of scalability. Indeed, in very large distributed systems, even if the destination groups are individually small, their union is likely to cover a very large number of processes.

To avoid this sort of solution, Guerraoui and Schiper [2001] require the implementation of total order multicast for multiple groups to satisfy the following minimality property:

(STRONG MINIMALITY) The execution of the algorithm implementing total order multicast for a message m involves only sender(m), and the processes in Dest(m).

This property is often too strong: it disallows many interesting algorithms that use a small number of external processes for message ordering (e.g., algorithms which disseminate messages along some propagation tree). A weaker property would allow an algorithm to involve a small set of external processes.

3.2.5. Transformation Algorithm. Delporte-Gallet and Fauconnier [2000] propose a generic algorithm that transforms a total order broadcast algorithm for a single closed group into one for multiple groups. The algorithm splits destination groups into smaller entities and supports multiple groups with Strong Minimality.


    3.3. Dynamic Groups

The specification in Section 2 is the standard specification of total order broadcast in a static system, that is, a system in which all processes are created at system initialization. In practice, however, it is often desirable that processes join and leave groups at runtime.

A dynamic group is a group of processes with a membership that can change during the computation: processes can dynamically join or leave the group, or can be removed from the group (removal in the face of failures is discussed later in Section 6.2). With a dynamic group, the successive memberships of the group are called the views of the group [Chockler, Keidar, and Vitenberg 2001].

With dynamic groups, the basic communication abstraction is called view synchrony, which can be seen as the counterpart of reliable broadcast in static systems. Reliable broadcast is defined by the Validity, Agreement, and Uniform Integrity properties of Section 2. Roughly speaking, view synchrony adopts a similar definition, while relaxing the Agreement property.⁷ Total order broadcast in a system with dynamic groups can thus be specified as view synchrony, plus a property of total order.

    3.4. Partitionable Groups

In a wide-area network, the network can temporarily become partitioned; that is, some of the nodes can no longer communicate, as all links between them are broken. When this happens, destination groups can be split into several isolated subgroups (or partitions). There are two main approaches to coping with partitioned groups: (1) the primary partition membership, and (2) the partitionable membership.

With the primary partition membership, one of the partitions is recognized as the primary partition.⁸ Only processes that belong to the primary partition are allowed to deliver messages, while the other processes must wait until they can merge back with the primary partition.

In contrast, the partitionable group membership allows all processes to deliver messages, regardless of the partition they belong to. Doing so requires adapting the specification of total order broadcast. Chockler, Keidar, and Vitenberg [2001] define three order properties in a partitionable system: Strong Total Order (messages are delivered in the same order by all processes that deliver them), Weak Total Order (the order requirement is restricted to within a view), and Reliable Total Order (which extends the Strong Total Order property by requiring processes to deliver a prefix of a common sequence of messages within each view). In other words, with only slight differences, Strong Total Order corresponds to the Uniform Total Order property of Section 2.3, and Reliable Total Order to the Prefix Ordering property of Section 2.5. Other properties, such as Validity, are also defined differently in partitionable systems. This is explained in considerably more detail by Chockler, Keidar, and Vitenberg [2001] and Fekete et al. [2001].

⁷Discussing this primitive in detail is beyond the scope of this survey (see the paper by Chockler, Keidar, and Vitenberg [2001] for details).
⁸A simple way to do this is to recognize as the primary partition only one that retains a majority of the processes from the previous view. This does not ensure that a primary partition always exists, but it guarantees that, if one exists, it is unique.

4. MECHANISMS FOR MESSAGE ORDERING

In this section, we propose a classification of total order broadcast algorithms in the absence of failures. The first question that we ask is: who builds the order? More specifically, we are interested in the entity that generates the information necessary for defining the order of messages (e.g., a timestamp or a sequence number).

We identify three different roles that a participating process can take with respect to the algorithm: sender, destination, or sequencer. A sender process is a process ps from which a message originates (i.e., ps ∈ senders). A destination process is


Fig. 3. Classes of total order broadcast algorithms.

a process pd to which a message is sent (i.e., pd ∈ dest). Finally, a sequencer process is not necessarily a sender or a destination, but is somehow involved in the ordering of messages. A given process may simultaneously take several roles (e.g., sender, sequencer, and destination). However, we represent these roles separately as they are conceptually different.

According to the three different roles, we define three basic classes for total order broadcast algorithms, depending on whether the order is respectively built by a sequencer, the sender, or destination processes. Among algorithms of the same class, significant differences remain. To account for this problem, we introduce a further division, leading to five subclasses in total. These classes are named as follows (see Figure 3): fixed sequencer, moving sequencer, privilege-based, communication history, and destinations agreement. Privilege-based and moving sequencer algorithms are commonly referred to as token-based algorithms.

The terminology defined in this article is partly borrowed from other authors. For instance, communication history and fixed sequencer were proposed by Cristian and Mishra [1995]. The term privilege-based was suggested by Dahlia Malkhi in a private discussion. Finally, Le Lann and Bres [1991] group algorithms into three classes, based on where the order is built. Unfortunately, their definition

    Fig. 4. Fixed sequencer algorithms.

of classes is specific to a client-server architecture.

In the remainder of this section, we present each of the five classes and illustrate each class with a simple algorithm. The algorithms are merely presented for the purpose of illustrating the corresponding category, and should not be regarded as full-fledged working examples. Although inspired by existing algorithms, they are largely simplified, and none of them is fault-tolerant.

Note 7 (Atomic blocks). The algorithms are written in pseudocode, with the assumption that blocks associated with a when-clause are executed atomically with respect to the other when-clauses of the same process, except when a process is blocked on a wait statement. This assumption greatly simplifies the expression of the algorithms with respect to concurrency.

    4.1. Fixed Sequencer

In a fixed sequencer algorithm, one process is elected as the sequencer and is responsible for ordering messages. The sequencer is unique, and the responsibility is not normally transferred to another process (at least in the absence of failures).

The approach is illustrated in Figure 4 and Figure 5. One specific process takes the role of sequencer and builds the total order. To broadcast a message m, a sender sends m to the sequencer. Upon receiving m, the sequencer assigns it a sequence number and relays m with its sequence number to the destinations. The latter then deliver messages according to the sequence numbers. This algorithm does not tolerate the failure of the sequencer.
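The interaction just described can be sketched as follows. This is a minimal, non-fault-tolerant sketch with hypothetical names, mirroring the two roles of the simple algorithm; destinations buffer out-of-order messages and deliver strictly by sequence number:

```python
# Hypothetical sketch of a fixed sequencer scheme (unicast-broadcast style):
# senders hand messages to the sequencer, which assigns sequence numbers.

class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def order(self, m):
        # assign the next sequence number and relay (sn, m) to destinations
        sn, self.next_seq = self.next_seq, self.next_seq + 1
        return sn, m

class Destination:
    def __init__(self):
        self.next_expected = 0
        self.pending = {}        # out-of-order messages, keyed by seq number
        self.delivered = []

    def receive(self, sn, m):
        self.pending[sn] = m
        # deliver strictly in sequence-number order, draining any backlog
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
```

Even if two destinations receive the sequenced messages in different network orders, both deliver them in the sequencer's order.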

In fact, three variants of fixed sequencer algorithms exist. We call these three variants UB (unicast-broadcast),


    Fig. 5. Simple fixed sequencer algorithm.

    Fig. 6. Common variants of fixed sequencer algorithms.

BB (broadcast-broadcast), and UUB (unicast-unicast-broadcast), taking inspiration from Kaashoek and Tanenbaum [1996].

In the first variant, called UB (see Figure 6(a)), the protocol consists of a unicast to the sequencer, followed by a broadcast from the sequencer. This variant generates few messages, and it is the simplest of the three approaches. It is, for instance, adopted by Navaratnam et al. [1988], and corresponds to the algorithm in Figure 5.

In the second variant, called BB (Figure 6(b)), the protocol consists of a broadcast to all destinations plus the sequencer, followed by a second broadcast from the sequencer. This generates more

messages than the previous approach, except in broadcast networks. However, it can reduce the load on the sequencer, and makes it easier to tolerate the crash of the sequencer. Isis (sequencer) [Birman et al. 1991] is an example of the second variant.

The third variant, called UUB (Figure 6(c)), is less common than the others. In short, the protocol consists of the following steps. The sender requests a sequence number from the sequencer (unicast). The sequencer replies with a sequence number (unicast). Then, the sender broadcasts the sequenced message to the destination processes.⁹

⁹The protocol to tolerate failures is complex.


    Fig. 7. Moving sequencer algorithms.

    4.2. Moving Sequencer

Moving sequencer algorithms are based on the same principle as fixed sequencer algorithms, but allow the role of sequencer to be transferred between several processes. The motivation is to distribute the load among them. This is illustrated in Figure 7, where the sequencer is chosen among several processes. The code executed by each process is, however, more complex than with a fixed sequencer, which explains the popularity of the latter approach. Notice that with moving sequencer algorithms, the roles of sequencer and destination processes are normally combined.

Figure 8 shows the principle of moving sequencer algorithms. To broadcast a message m, a sender sends m to the sequencers. Sequencers circulate a token message that carries a sequence number and a list of all messages to which a sequence number has been attributed (i.e., all sequenced messages). Upon receipt of the token, a sequencer assigns a sequence number to all received, yet unsequenced, messages. It sends the newly sequenced messages to the destinations, updates the token, and passes it to the next sequencer.
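The token-handling step above can be sketched as follows (a hypothetical sketch with names of our own; the token carries the next sequence number and the set of already-sequenced messages):

```python
# Hypothetical sketch of a moving sequencer's token handler: on receiving
# the token, a sequencer numbers every received-but-unsequenced message.

class MovingSequencer:
    def __init__(self):
        self.received = []          # broadcast messages awaiting sequencing

    def on_token(self, token):
        out = []
        for m in self.received:
            if m not in token['done']:      # skip already-sequenced messages
                out.append((token['next'], m))
                token['done'].add(m)
                token['next'] += 1
        self.received.clear()
        return out    # newly sequenced messages, relayed to the destinations
```

Because the token records which messages are already sequenced, a message broadcast to all sequencers receives exactly one sequence number, regardless of which sequencer holds the token first.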

Note 8. Similar to fixed sequencer algorithms, it is possible to develop a moving sequencer algorithm according to one of three variants. However, the difference between the variants is not as clear-cut as it is for a fixed sequencer. It turns out that all of the moving sequencer algorithms surveyed follow the equivalent of the fixed sequencer variant BB. Hence, we do not discuss this issue any further.

Note 9. As mentioned, the main motivation for using a moving sequencer is to distribute the load among several processes,

thus avoiding the bottleneck caused by a single process. This is illustrated in several studies (e.g., Cristian et al. [1994] and Urban et al. [2000]). One could then wonder why a fixed sequencer algorithm should be preferred to a moving sequencer algorithm. There are, in fact, at least three possible reasons. First, fixed sequencer algorithms are considerably simpler, leaving less room for implementation errors. Second, the latency of fixed sequencer algorithms is often better, as shown by Urban et al. [2000]. Third, it is often the case that some machines are more reliable, more trusted, better connected, or simply faster than others. When this is the case, it makes sense to use one of them as a fixed sequencer (see MTP in Section 7.1.2).

    4.3. Privilege-Based

Privilege-based algorithms rely on the idea that senders can broadcast messages only when they are granted the privilege to do so. Figure 9 illustrates this class of algorithms. The order is defined by the senders when they broadcast their messages. The privilege to broadcast (and order) messages is granted to only one process at a time, but this privilege circulates from process to process among the senders. In other words, due to the arbitration between senders, building the total order requires solving the problem of FIFO broadcast (easily solved with sequence numbers at the sender), and ensuring that passing the privilege to the next sender does not violate this order.

Figure 10 illustrates the principle of privilege-based algorithms. Senders circulate a token message that carries a sequence number to be used when broadcasting the next message. When a process wants to broadcast a message m, it must first wait until it receives the token message. Then, it assigns a sequence number to each of its messages and sends them to all destinations. Following this, the sender updates the token and sends it to the next sender. Destination processes deliver messages in increasing sequence-number order.
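The sender side of this scheme can be sketched as follows (a hypothetical sketch with names of our own; note that, unlike the sequencer-based sketches, here the *sender* stamps its own messages while holding the token):

```python
# Hypothetical sketch of a privilege-based sender: it may broadcast only
# while holding the token, which carries the next global sequence number.

class Sender:
    def __init__(self, outbox):
        self.outbox = list(outbox)   # application messages waiting to be sent

    def on_token(self, token, network):
        # broadcast all pending messages while holding the privilege,
        # stamping each with the token's current sequence number
        while self.outbox:
            network.append((token['next'], self.outbox.pop(0)))
            token['next'] += 1
        # when this method returns, the token is passed to the next sender
```

Destinations then deliver in increasing sequence-number order; since only the token holder broadcasts, the resulting order is total.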


    Fig. 8. Simple moving sequencer algorithm.

Note 10. In privilege-based algorithms, senders usually need to know each other in order to circulate the privilege. This constraint makes privilege-based algorithms poorly suited to open groups, where there is no fixed and previously known set of senders.

Note 11. In synchronous systems, privilege-based algorithms are based on the idea that each sender process is allowed to send messages only during predetermined time slots. These time slots are attributed to each process in such a way that no two processes can send messages at the same time. By ensuring that the communication medium is accessed in mutual exclusion, the total order is easily guaranteed. The technique is also known as time division multiple access (TDMA).

Note 12. It is tempting to consider that privilege-based and moving sequencer algorithms are equivalent, since both rely

Fig. 9. Privilege-based algorithms.

on a token-passing mechanism. However, they differ in one significant aspect: the total order is built by senders in privilege-based algorithms, while it is built by sequencers in moving sequencer algorithms. This has at least two major consequences. First, moving sequencer algorithms are easily adapted to open groups. Second, in privilege-based algorithms, the passing of the token is necessary to ensure the liveness of the algorithm, while with moving sequencer algorithms, it is mostly used for


    Fig. 10. Simple privilege-based algorithm.

improving performance, for example, by doing load balancing.

Note 13. It is difficult to ensure fairness with privilege-based algorithms. Indeed, if a process has a very large number of messages to broadcast, it could keep the token for an arbitrarily long time, thus preventing other processes from broadcasting their own messages. To overcome this problem, algorithms often enforce an upper limit on the number of messages and/or the time that a process can keep the token. Once the limit is passed, the process is compelled to release the token, regardless of the number of messages remaining to be broadcast.

    4.4. Communication History

In communication history algorithms, as in privilege-based algorithms, the delivery order is determined by the senders. However, in contrast to privilege-based algorithms, processes can broadcast messages at any time, and total order is ensured by delaying the delivery of messages. The messages usually carry a (physical or logical) timestamp. The destinations observe the messages generated by the other processes and their timestamps, that is, the history of communication in the system, to learn when delivering a message will no longer violate the total order.

There are two fundamentally different variants of communication history algorithms. In the first variant, called causal history, communication history algorithms use a partial order, defined by the causal history of messages, and transform this partial order into a total order. Concurrent messages are ordered according to some predetermined function. In the second variant, known as deterministic merge, processes send messages timestamped independently (thus not reflecting causal order), and delivery takes place according to a deterministic policy of merging the streams of messages coming from each process.

Figure 11 illustrates a typical communication history algorithm of the first variant. The algorithm, inspired by Lamport [1978b], works as follows. The algorithm uses logical clocks [Lamport 1978b] to


    Fig. 11. Simple communication history algorithm (causal history).

    Fig. 12. Simple communication history algorithm (deterministic merge).

timestamp each message m with the logical time of the TO-broadcast(m) event, denoted ts(m). Messages are then delivered in the order of their timestamps. However, we can have two messages, m and m′, with the same timestamp. To arbitrate between these messages, the algorithm uses the lexicographical order on the identifiers of the sending processes. In Figure 11, we refer to this order as the (ts(m), sender(m)) order, where sender(m) is the identifier of the sender process.
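The delivery test at the heart of this variant can be sketched as follows (a hypothetical sketch with names of our own; it assumes FIFO channels, so that once every process has been heard from with a larger timestamp, no smaller-stamped message can still arrive):

```python
# Sketch of the causal-history delivery test: a pending message is
# deliverable once every process's latest known timestamp exceeds its own,
# and deliverable messages are totally ordered by (ts(m), sender(m)).

def deliverable(pending, latest_ts):
    # pending:   (ts(m), sender(m), m) triples not yet delivered
    # latest_ts: largest timestamp received from each process so far
    ready = [m for m in pending if all(t > m[0] for t in latest_ts.values())]
    return sorted(ready)      # lexicographic (ts(m), sender(m)) order
```

For example, with `latest_ts = {'p1': 5, 'p2': 4}`, a message stamped 4 is held back (p2 might still send a message stamped 4 from a smaller sender identifier), while messages stamped 2 and 3 are delivered in timestamp order.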

A simple example of the second variant is illustrated in Figure 12. The algorithm assumes that communication is FIFO, and that sender processes broadcast messages at the same rate. Destination processes execute an infinite loop where they accept, in a round-robin fashion, a single message from each sender process. Aguilera and Strom [2000] (Section 7.4.9), for instance, propose a more elaborate algorithm based on the same principle.
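The round-robin merge can be sketched as follows (a hypothetical sketch over finite streams; like Figure 12's algorithm, it assumes FIFO channels and that every sender broadcasts at the same rate, i.e., equal-length streams):

```python
# Sketch of a deterministic merge: consume one message per sender per
# round, in a fixed sender order agreed upon by all destinations.

def round_robin_merge(streams):
    order = sorted(streams)       # fixed, globally agreed sender order
    rounds = len(next(iter(streams.values())))
    return [streams[s][i] for i in range(rounds) for s in order]
```

Since every destination applies the same fixed policy to the same FIFO streams, all destinations compute the same merged delivery sequence without exchanging any ordering information.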

Note 14. The algorithms of Figure 11 and Figure 12 are not live. Indeed, consider the algorithm of Figure 11 and a scenario where a single process p broadcasts a single message m, while no other process ever broadcasts any message. According to the algorithm in Figure 11, a process q can deliver m only after it has received, from every process, a message that was broadcast after the reception of m. This is, of course, impossible if at least one of the processes never broadcasts any message. To overcome this problem, communication history algorithms proposed in the literature usually send empty messages when no application messages are broadcast.

Note 15. In synchronous systems, communication history algorithms rely on synchronized clocks, and use physical timestamps (timestamps coming from the synchronized clocks) instead of logical ones. The nature of such systems makes


    Fig. 13. Destinations agreement algorithms.

it unnecessary to send empty messages in order to ensure liveness. Indeed, this can be seen as an example of the use of time to communicate [Lamport 1984].

    4.5. Destinations Agreement

In destinations agreement algorithms, as the name indicates, the delivery order results from an agreement between destination processes (see Figure 13). We distinguish three different variants of agreement: (1) agreement on a message sequence number, (2) agreement on a message set, or (3) agreement on the acceptance of a proposed message order.

Figure 14 illustrates an algorithm of the first variant: for each message, the destination processes reach an agreement on a unique (yet not consecutive) sequence number. The algorithm is adapted from Skeen's algorithm (Section 7.5.1), although it operates in a decentralized manner. Briefly, the algorithm works as follows. To broadcast a message m, a sender sends m to all destinations. Upon receiving m, a destination assigns it a local timestamp and sends this timestamp to all destinations. Once a destination process has received a local timestamp for m from all destinations, a unique global timestamp sn(m) is assigned to m, calculated as the maximum of all local timestamps. Messages are delivered in the order of their global timestamps, that is, a message m can only be delivered once it has been assigned its global timestamp sn(m), and no other undelivered message m′ can possibly receive a timestamp sn(m′) smaller than or equal to sn(m). As with the communication history algorithm (Figure 11), the identifier of the message sender is used to break ties between messages with the same global timestamp.
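The two key steps of this Skeen-style agreement can be sketched as follows (a hypothetical sketch with names of our own): the global timestamp of a message is the maximum of the local timestamps proposed by the destinations, and delivery follows (sn(m), sender(m)) order to break ties.

```python
# Sketch of the first destinations-agreement variant: agree on a global
# timestamp per message, then deliver in global-timestamp order.

def global_timestamp(local_proposals):
    # local_proposals: one local logical-clock value per destination process
    return max(local_proposals.values())

def delivery_order(stamped):
    # stamped: (sn(m), sender(m), m) triples with final global timestamps;
    # lexicographic sorting breaks timestamp ties by sender identifier
    return [m for _, _, m in sorted(stamped)]
```

For example, if the destinations propose local timestamps 3, 5, and 4 for m, then sn(m) = 5, and m is delivered after any message whose global timestamp is smaller.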

The most representative algorithm of the second variant of agreement is the algorithm proposed by Chandra and Toueg [1996] (Section 7.5.4). The algorithm transforms total order broadcast into a sequence of consensus problems.¹⁰ Each instance of the consensus decides on a set of messages to deliver, that is, consensus number k allows the processes to agree on a set Msgk of messages. For k < k′, the messages in Msgk are delivered before the messages in Msgk′. The messages in a set Msgk are delivered according to some predetermined order (e.g., in lexical order of their identifiers).
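The delivery rule of this reduction can be sketched as follows (a hypothetical sketch with names of our own; the consensus instances themselves are abstracted away as a map from instance numbers to decided sets):

```python
# Sketch of delivery from agreed message sets: consensus instance k decides
# a set Msg_k; sets are delivered in instance order, and messages within a
# set in a fixed deterministic order (here, lexical order of identifiers).

def deliver_from_consensus(decisions):
    # decisions: maps consensus instance number k to the decided set Msg_k
    delivered = []
    for k in sorted(decisions):
        delivered.extend(sorted(decisions[k]))
    return delivered
```

Because every process applies the same two deterministic rules to the same decided sets, all processes deliver the same total sequence of messages.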

With the third variant of agreement, a tentative message delivery order is first proposed (usually by one of the destinations). Then, the destination processes must agree to either accept or reject the proposal. In other words, this variant of destinations agreement relies on an atomic commitment protocol. The algorithm proposed by Luan and Gligor [1990] typically belongs to the third variant.

Note 16. There is a thin line between the second and the third variants of agreement. For instance, Chandra and Toueg's [1996] total order broadcast algorithm relies on consensus, as described. However, when it is combined with the rotating coordinator consensus algorithm [Chandra and Toueg 1996], the resulting algorithm can be seen as an algorithm of the third form. Indeed, the coordinator proposes a tentative order (given as a set of messages plus message identifiers) that it tries to validate. Thus it is important to note that two seemingly identical algorithms may use different forms of agreement, simply because they are described at different levels of abstraction.

    4.6. Time-Free Versus Time-Based Ordering

We introduce a further distinction between algorithms, orthogonal to the above

¹⁰The consensus problem is informally defined as follows: every process proposes some value, and all processes must eventually decide on the same value, which must be one (any one) of the proposed values.


    Fig. 14. Simple destinations agreement algorithm.

classification. The distinction is between algorithms that use physical time for message ordering, and algorithms that do not use physical time. For instance, in Section 4.4 (see Figure 11), we presented a simple communication history algorithm based on logical time. It is indeed possible to design a similar algorithm that uses physical time (and synchronized clocks) instead.

In short, we distinguish algorithms with time-based ordering, which rely on physical time, from algorithms with time-free ordering, which do not use physical time.

    5. CONCEPTUAL ISSUES RELATED TOFAILURES

In Section 4, we discussed ordering mechanisms while ignoring the problem of failures. Mechanisms for fault-tolerance are discussed below in Section 6. However, fault-tolerance cannot be discussed without some prior discussion of system model issues. This is done in this section.

    5.1. Synchrony and Timeliness

The synchrony of a system defines the timing assumptions that are made on the behavior of processes and communication channels. More specifically, one usually considers two major parameters. The first parameter is the process speed interval, which is given by the difference between the speed of the slowest and the fastest processes in the system. The second parameter is the communication delay, which is given by the time elapsed between the sending and the receipt of messages. The synchrony of the system is defined by considering various bounds on these two parameters.

A system where both parameters have a known upper bound is called a synchronous system. At the other extreme, a system in which process speed and communication delays are unbounded is called an asynchronous system. Between those two extremes lie the definitions of various partially synchronous system models [Dolev et al. 1987; Dwork et al. 1988].


  • 20 X. Defago et al.

A third model that is considered by several total order broadcast algorithms is the timed asynchronous model defined by Cristian and Fetzer [1999]. In its most simple form, this model can be seen as an asynchronous model with the notion of physical time and an assumption that most messages are likely to reach their destination within a known delay [Cristian et al. 1997; Cristian and Fetzer 1999].

    5.2. Impossibility Results

There is an important theoretical result related to the consensus problem (see Footnote 10). It has been proven that there is no deterministic solution to the problem of consensus in asynchronous systems if just a single process can crash [Fischer et al. 1985]. Dolev et al. [1987] have shown that total order broadcast can be transformed into consensus, thus proving that the impossibility of consensus also holds for total order broadcast. These impossibility results were the motivation to extend the asynchronous system with the introduction of oracles to make consensus and total order broadcast solvable.11

    5.3. Oracles

In short, a (distributed) oracle can be seen as some component that processes can query. An oracle provides information that algorithms can use to guide their choices. The oracles most frequently considered in distributed systems are failure detectors and coin flips. Since the information provided by these oracles makes consensus and total order broadcast solvable, they augment the power of the asynchronous system model.

5.3.1. Failure Detectors. A failure detector is an oracle that provides information about the current status of processes,

    11Chandra and Toueg [1996] show that consensuscan be transformed into total order broadcast. Theresult holds also for arbitrary failures. So, consensusand total order broadcast are equivalent problems,that is, if there exists an algorithm that solves oneproblem, then it can be transformed into an algo-rithm that solves the other problem.

for instance, whether a given process has crashed or not.

The notion of failure detectors has been formalized by Chandra and Toueg [1996]. Briefly, a failure detector is modeled as a set of distributed modules, one module FDi attached to each process pi. Any process pi can query its failure detector module FDi about the status of other processes.

Failure detectors may be unreliable, in the sense that they provide information that may not always correspond to the real state of the system. For instance, a failure detector module FDi may provide the erroneous information that some process pj has crashed while, in reality, pj is correct and running. Conversely, FDi may provide the information that a process pk is correct, while pk has actually crashed.

To reflect the unreliability of the information provided by failure detectors, we say that a process pi suspects some process pj whenever FDi, the failure detector module attached to pi, returns the (unreliable) information that pj has crashed. In other words, a suspicion is a belief (e.g., pi believes that pj has crashed) as opposed to a known fact (e.g., pj has crashed and pi knows that).
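A failure detector module FDi of this kind can be sketched as a simple timeout-based module. The sketch below (Python; the names and the heartbeat scheme are our own illustration, not a detector from the literature) shows why suspicions are beliefs rather than facts: a slow process looks exactly like a crashed one.

```python
class FailureDetector:
    """Per-process module FDi: suspects any process whose last heartbeat
    is older than `timeout` time units.  Suspicions may be wrong -- a slow
    process is indistinguishable from a crashed one -- and are revoked
    when a late heartbeat arrives."""
    def __init__(self, processes, timeout):
        self.timeout = timeout
        self.last_heartbeat = {p: 0 for p in processes}

    def heartbeat(self, p, now):
        self.last_heartbeat[p] = now

    def suspects(self, p, now):
        return now - self.last_heartbeat[p] > self.timeout

fd = FailureDetector(processes=["p1", "p2"], timeout=3)
fd.heartbeat("p1", now=5)
assert not fd.suspects("p1", now=7)   # recent heartbeat: trusted
assert fd.suspects("p2", now=7)       # silent since time 0: suspected
fd.heartbeat("p2", now=8)             # a late heartbeat arrives...
assert not fd.suspects("p2", now=9)   # ...and the suspicion is revoked
```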

There exist several classes of failure detectors, depending on how unreliable the information provided by the failure detector can be. Classes are defined by two properties, called completeness and accuracy, that constrain the range of possible mistakes. In this article, we consider four different classes of failure detectors, called P (perfect), ◊P (eventually perfect), S (strong), and ◊S (eventually strong). The four classes share the same property of completeness, and only differ by their accuracy property [Chandra and Toueg 1996]:

(STRONG COMPLETENESS) Eventually every faulty process is permanently suspected by all correct processes.

(STRONG ACCURACY) No process is suspected before it crashes. [class P]

(EVENTUAL STRONG ACCURACY) There is a time after which correct processes are not suspected by any correct process. [class ◊P]


(WEAK ACCURACY) Some process is never suspected. [class S]

(EVENTUAL WEAK ACCURACY) There is a time after which some correct process is never suspected by any correct process. [class ◊S]

A failure detector of class ◊S with a majority of correct processes allows us to solve consensus [Chandra and Toueg 1996]. Moreover, Chandra et al. [1996] have shown that a failure detector of class ◊S is the weakest failure detector that allows us to solve consensus.12

5.3.2. Random Oracle. Another approach to extend the power of the asynchronous system model is to introduce the ability to generate random values. For instance, processes could have access to a module that generates a random bit when queried (i.e., a Bernoulli random variable).

This approach is used by a class of algorithms called randomized algorithms. These algorithms can solve problems such as consensus (and so total order broadcast) in a probabilistic manner. The probability that such algorithms terminate before some time t goes to one as t goes to infinity (e.g., Ben-Or [1983] and Chor and Dwork [1989]). Note that solving a problem deterministically and solving it with probability 1 are not the same.
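The difference between deterministic termination and termination with probability 1 can be seen in a toy simulation. The model below is ours and purely illustrative, far simpler than an actual randomized consensus algorithm such as Ben-Or's: each round, n processes flip fair coins, and the run ends once all flips agree.

```python
import random

def rounds_until_agreement(n, rng):
    """Toy model of a randomized algorithm: each round, n processes flip
    a fair coin; the run terminates once all flips agree.  No fixed round
    bound guarantees termination, but the probability of running beyond
    round t is (1 - 2/2**n)**t, which vanishes as t grows."""
    t = 0
    while True:
        t += 1
        flips = [rng.random() < 0.5 for _ in range(n)]
        if all(flips) or not any(flips):
            return t

rng = random.Random(42)    # fixed seed for a reproducible experiment
samples = [rounds_until_agreement(3, rng) for _ in range(1000)]
# For n = 3 the per-round success probability is 1/4, so runs are short
# with high probability, yet no deterministic bound on t exists.
assert sum(samples) / len(samples) < 10
```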

    5.4. Uniformity for Free

In Section 2, we explained the difference between uniform and nonuniform specifications. Guerraoui [1995] shows that any algorithm that solves Consensus with ◊P (S, ◊S, respectively) also solves Uniform Consensus with ◊P (S, ◊S, respectively).

It is easy to show that this result also holds for total order broadcast. Assume that there exists an algorithm that solves nonuniform total order broadcast (nonuniform Agreement, nonuniform Total Order)

12 The weakest failure detector to solve consensus is usually said to be ◊W, which differs from ◊S by satisfying a weak completeness property instead of Strong Completeness. However, Chandra and Toueg [1996] prove the equivalence of ◊S and ◊W.

with ◊P, S, or ◊S, but does not solve uniform total order broadcast. Using the transformation of total order broadcast to consensus (see Section 5.2), this algorithm could be used to obtain an algorithm that solves nonuniform consensus, but not uniform consensus. This is in contradiction with Guerraoui [1995]. Hence, we have proven that enforcing uniformity has no additional cost in the asynchronous models with ◊P, S, and ◊S failure detectors.

Note, however, that the result does not hold for total order broadcast algorithms that rely on a perfect (P), or almost perfect, failure detector (see Section 5.5).

    5.5. Process Controlled Crash

Process controlled crash is the ability given to processes to kill other processes or to commit suicide. In other words, this is the ability to artificially force the crash of a process. Allowing process controlled crash in a system model augments its power. Indeed, this makes it possible to transform severe failures (e.g., omission, Byzantine) into less severe failures (e.g., crash), and to emulate an almost perfect failure detector. However, this power does not come without a price.

Automatic transformation of failures. Neiger and Toueg [1990] present a technique that uses process controlled crash to transform severe failures (e.g., omission, Byzantine) into less severe ones (i.e., crash failures). In short, the technique is based on the idea that processes have their behavior monitored. Then, whenever a process begins to behave incorrectly (e.g., omission, Byzantine), it is killed.13

However, this technique cannot be used in systems with lossy channels, or those subject to partitions. Indeed, in such contexts, processes might end up killing each other until not a single one is left alive in the system.

Emulation of an almost perfect failure detector. A perfect failure detector (P) satisfies both strong completeness and strong accuracy (no process is suspected before

13 The actual technique is more complex than what is described here, but this gives the basic idea.


it crashes [Chandra and Toueg 1996]). In practical systems, perfect failure detectors are extremely hard to implement because of the difficulty in distinguishing crashed processes from very slow ones. The idea of the emulation is simple: whenever a failure detector suspects a process p, then p is killed (forced to crash). Fetzer [2003] proposes a different emulation, based on reliable watchdogs, to ensure that no process is suspected before it crashes.
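The first emulation can be sketched in a few lines (Python; the class is our own illustration, not Fetzer's watchdog-based construction): force-crashing every suspected process makes each suspicion accurate after the fact.

```python
class AlmostPerfectFD:
    """Emulation sketch: killing any suspected process makes suspicions
    accurate a posteriori -- a suspected process is dead, if not by a
    genuine crash then by a provoked one (process controlled crash)."""
    def __init__(self, alive):
        self.alive = set(alive)
        self.suspected = set()

    def suspect(self, p):
        self.suspected.add(p)
        self.alive.discard(p)   # process controlled crash: kill p

fd = AlmostPerfectFD(alive={"p1", "p2", "p3"})
fd.suspect("p2")    # possibly a false suspicion of a slow process...
# ...but after the provoked crash, the suspicion is no longer wrong:
assert "p2" not in fd.alive and "p2" in fd.suspected
```

Note the price discussed below: the provoked crash of a merely slow process counts against the bounded number of failures the algorithm tolerates.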

Cost of a free lunch. Process controlled crash has a price. A fault-tolerant algorithm can only tolerate the crash of a bounded number of processes. In a system with process controlled crash, this limit includes not only genuine failures, but also failures provoked through process controlled crash. This means that each provoked failure effectively decreases the number of genuine failures that can be tolerated, thus degrading the actual fault-tolerance of the system.

    6. MECHANISMS FOR FAULT-TOLERANCE

The total order broadcast algorithms described in Section 4 are not tolerant to failures: if a single process crashes, the properties specified in Section 2.3 are not satisfied. To be fault-tolerant, total order broadcast algorithms rely on various techniques presented in this section. Note that it is difficult to discuss these techniques without getting into specific implementation details. Nevertheless, we try to keep the discussion as general as possible. Notice also that algorithms may actually combine several of these techniques, for example, failure detection (Section 6.1) with resilient communication patterns (Section 6.3).

    6.1. Failure Detection

A recurrent pattern in all distributed algorithms is for a process p to wait for a message from some other process q. If q crashes, process p is blocked. Failure detection is one basic mechanism to prevent p from being blocked.

Unreliable failure detection has been formalized by Chandra and Toueg [1996] in terms of two properties: accuracy and completeness (see Section 5.3.1). Completeness prevents the blocking problem just mentioned. Accuracy prevents algorithms from running forever without solving the problem.

Unreliable failure detectors might be too weak for some total order broadcast algorithms, which require reliable failure detection information provided by a perfect failure detector, known as P (see Section 5.5).

    6.2. Group Membership Service

The low-level failure detection mechanism is not the only way to address the blocking problem mentioned in the previous section. Blocking can also be prevented by relying on a higher-level mechanism, namely a group membership service.

A group membership service is a distributed service that is responsible for managing the membership of groups of processes (see Section 3.4 and the survey by Chockler, Keidar, and Vitenberg [2001]). The successive memberships of a group are called the views of the group. Whenever the membership changes, the service reports the change to all group members by providing them with the new view.

A group membership service usually provides strong completeness: if a process p, member of some group G, crashes, the membership service provides to the surviving members of G a new view from which p is excluded. In the primary-partition model (see Section 3.4), the accuracy of failure notifications is ensured by forcing the crash of processes that have been incorrectly suspected and excluded from the membership, a mechanism called process-controlled crash (see Section 5.5).

Moreover, in the primary-partition model, the group membership service provides consistent notifications to the group members: the successive views of a group are notified in the same order to all of its members.

To summarize, while failure detectors provide inconsistent failure notifications, a group membership service provides


consistent failure notifications. Moreover, total order algorithms that rely on a group membership service for fault-tolerance exploit another property that is usually provided along with the membership service, namely view synchrony (see Section 3.3). Roughly speaking, view synchrony ensures that, between two successive views v and v′, processes in the intersection v ∩ v′ deliver the same set of messages. Group membership services and view synchrony have been used to implement complex group communication systems (e.g., Isis [Birman and van Renesse 1994], Totem [Moser et al. 1996], Transis [Dolev and Malkhi 1994, 1996; Amir et al. 1992], Phoenix [Malloth et al. 1995; Malloth 1996]).

    6.3. Resilient Communication Patterns

As shown in the previous sections, an algorithm can rely on a failure detection mechanism, or on a group membership service, to avoid the blocking problem. Another solution for achieving fault-tolerance is to avoid any potentially blocking pattern.

Consider, for example, a process p waiting for n − f messages, where n is the number of processes in the system, and f is the maximum number of processes that may crash. If all correct processes send a message to p, then the above pattern is nonblocking. We call such a pattern a resilient pattern. If an algorithm uses only resilient patterns, it avoids the blocking problem without using any failure detection mechanism or group membership service. Such algorithms have, for instance, been proposed by Rabin [1983], Ben-Or [1983], and Pedone et al. [2002] (the first two are consensus algorithms, see Footnote 10).
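A resilient pattern of this kind can be sketched as follows (Python; function and parameter names are our own illustration). Waiting for n − f messages never blocks, because at least n − f correct processes eventually send.

```python
def await_quorum(incoming, n, f):
    """Resilient communication pattern: wait for n - f messages rather
    than a message from every process.  Since at most f of the n
    processes crash, at least n - f correct processes eventually send,
    so this wait cannot block -- no failure detector or membership
    service is needed."""
    received = []
    for msg in incoming:            # `incoming` yields messages as they arrive
        received.append(msg)
        if len(received) >= n - f:
            return received         # enough messages: proceed

# 4 processes, at most 1 crash: 3 messages suffice, even if p4 stays silent.
msgs = iter([("p1", "a"), ("p2", "b"), ("p3", "c")])
assert len(await_quorum(msgs, n=4, f=1)) == 3
```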

    6.4. Message Stability

Avoiding blocking is not the only problem that fault-tolerant total order broadcast algorithms have to address. Figure 1 illustrates a violation of the Uniform Agreement property. Notice that this problem is unrelated to blocking.

The mechanism that solves the problem is called message stability. A message m is said to be k-stable if m has been received by k processes. In a system in which at most f processes may crash, (f+1)-stability is the important property to detect: if some message m is (f+1)-stable, then m has been received by at least one correct process. With such a guarantee, an algorithm can easily ensure that m is eventually received by all correct processes. (f+1)-stability is often simply called stability. The detection of stability is generally based on some acknowledgment scheme or token passing.

Another use for message stability is the reclaiming of resources. Indeed, when a process detects that a message has become stable throughout the system, it can release the resources associated with that message.
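Acknowledgment-based stability detection can be sketched as follows (Python; the class is our own illustration of the idea, not the mechanism of any particular surveyed algorithm):

```python
class StabilityTracker:
    """Detects (f+1)-stability from acknowledgments: once f+1 processes
    have acknowledged message m, at least one correct process holds m,
    so m can still reach every correct process even after f crashes.
    Stable messages may also have their resources reclaimed."""
    def __init__(self, f):
        self.f = f
        self.acks = {}          # msg_id -> set of acknowledging processes

    def ack(self, msg_id, pid):
        self.acks.setdefault(msg_id, set()).add(pid)

    def stable(self, msg_id):
        return len(self.acks.get(msg_id, set())) >= self.f + 1

st = StabilityTracker(f=1)
st.ack("m1", "p1")
assert not st.stable("m1")     # one ack only: m1 could vanish with p1
st.ack("m1", "p2")
assert st.stable("m1")         # f+1 = 2 acks: some correct process has m1
```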

    6.5. Consensus

The mechanisms described so far are low-level mechanisms on which fault-tolerant total order broadcast algorithms may rely.

Another option for a fault-tolerant total order broadcast algorithm is to rely on higher-level mechanisms that solve all the problems related to fault-tolerance (i.e., the problems previously mentioned). The consensus problem (see Footnote 10) is such a mechanism. Some algorithms solve total order broadcast by reducing it to a consensus problem. This way, fault-tolerance, including failure detection and message stability detection, is hidden within the consensus abstraction.
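The reduction can be sketched as follows (Python). The sketch is ours and glosses over all distributed aspects: `consensus` stands for a black-box consensus primitive that returns the same decision at every process, in the spirit of the transformation of Chandra and Toueg [1996].

```python
def total_order_deliver(undelivered, consensus):
    """Reduction sketch: repeatedly run consensus on the set of
    received-but-undelivered messages, then deliver each decided batch
    in a fixed deterministic order (here: sorted).  Since every correct
    process decides the same batches in the same sequence, all deliver
    in the same total order; fault-tolerance is hidden in `consensus`."""
    delivered = []
    while undelivered:
        batch = consensus(undelivered)   # same decision at every process
        for m in sorted(batch):          # deterministic in-batch order
            delivered.append(m)
        undelivered -= batch
    return delivered

def decide_all(proposal):
    # Stub consensus: decides the whole proposed set.  Any rule on which
    # all processes agree would do; real algorithms use failure detectors
    # or randomization to reach that agreement despite crashes.
    return set(proposal)

assert total_order_deliver({"b", "a", "c"}, decide_all) == ["a", "b", "c"]
```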

    6.6. Mechanisms for Lossy Channels

Apart from the mechanisms used to tolerate process crashes, we need to say a few words about mechanisms to tolerate channel failures. First, it should be mentioned that several total order broadcast algorithms avoid the issue by relying on some communication layer that takes care of message loss (i.e., these algorithms assume reliable channels, and hence do not discuss message loss). In contrast, other algorithms are built directly on top of lossy channels, and so address message loss explicitly.


To address message loss, the standard solution is to rely on a positive or a negative acknowledgment mechanism. With positive acknowledgment, the receipt of messages is acknowledged; with negative acknowledgment, the detection of a missing message is signaled. The two schemes can be combined.

Token-based algorithms (i.e., moving sequencer or privilege-based algorithms) rely on token passing to detect message losses: the token can be used to convey acknowledgments, or to detect missing messages. So token-based algorithms use the token for ordering purposes, but also for implementing reliable channels.
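Negative acknowledgment over sequence-numbered messages can be sketched as follows (Python; the function is our own illustration): a receiver detects losses as gaps in the sequence and can request retransmission of exactly the missing messages.

```python
def detect_losses(received_seqs):
    """Negative-acknowledgment sketch: messages carry consecutive
    sequence numbers starting at 1, so a receiver detects a loss as a
    gap in the sequence and signals the sender (or the token holder)
    to retransmit the missing ones."""
    missing = []
    expected = 1
    for seq in sorted(received_seqs):
        missing.extend(range(expected, seq))  # every skipped number is lost
        expected = seq + 1
    return missing

# Messages 3 and 5 were lost on the channel:
assert detect_losses({1, 2, 4, 6}) == [3, 5]
```

A positive-acknowledgment scheme would instead have the receiver confirm each sequence number it holds, with the sender retransmitting anything unconfirmed.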

    7. SURVEY OF EXISTING ALGORITHMS

This section provides an extensive survey of total order broadcast algorithms. We present about sixty algorithms published in scientific journals or conference proceedings over the past three decades. We have made every possible effort to be exhaustive, and we are quite confident that this article presents a good picture of the field at the time of writing. However, because of the continuous flow of papers on the subject, we might have overlooked one algorithm or two.

In Tables III–V, we present a condensed overview of all surveyed algorithms, in which we summarize the important characteristics of each algorithm. The tables present only factual information about the algorithms as it appears in the relevant papers. In particular, the tables do not present information that is the result of extrapolation, or nonobvious deduction; the exception is when we had to interpret information to overcome differences in terminology. Also, properties that are discussed in the original paper, yet not proved correct, are reported as informal in the tables. For the sake of conciseness, several symbols and abbreviations have been used throughout the tables. They are explained in Table II. For each algorithm, Tables III–V provide the following information:

(1) General information, that is, the ordering mechanism (see Section 4), and whether the mechanism is time-based or not (Section 4.6).

Table II. Abbreviations Used in Tables III–V

  ✓      yes
  (✓)    somewhat (explained in the text)
  ✗      no
  spec.  special (explained in the text)
  inf.   informal (explained in the text)
  NS     not specified (means also not discussed)
  n/a    not applicable
  +a     positive acknowledgment
  −a     negative acknowledgment
  GM     group membership
  FD     failure detector/detection
  Cons.  consensus
  RCP    resilient communication patterns
  ByzA.  Byzantine agreement

(2) The General information rows are followed by rows describing the assumptions upon which the algorithm is based, that is, what is provided to it:

(a) The System model rows specify the synchrony assumptions, the assumptions made about process failures (rows: crash, omission, Byzantine), and communication channels (rows: reliable, FIFO). Reliable channels guarantee that if a correct process p sends a message m to a correct process q, then q will eventually receive m [Aguilera et al. 1999]. The row partitionable indicates whether or not the algorithm works with partitionable membership semantics (see Section 3.4). In particular, algorithms in which only processes in a primary partition can work are not considered partitionable.

(b) The rows called Condition for liveness discuss the assumptions necessary to ensure the liveness of the algorithm: The row live... X means that the liveness of the algorithm requires the liveness of the building block X (on which the algorithm relies). For example, live... GM means that the algorithm is live if the group membership building block on which the algorithm relies is itself live.


Table III. Overview of Total Order Broadcast Algorithms (Part I).


Table IV. Overview of Total Order Broadcast Algorithms (Part II).


Table V. Overview of Total Order Broadcast Algorithms (Part III).
