RBFT: Redundant Byzantine Fault Tolerance

Pierre-Louis Aublin
Grenoble University

Sonia Ben Mokhtar
CNRS - LIRIS

Vivien Quéma
Grenoble INP

Abstract—Byzantine Fault Tolerant state machine replication (BFT) protocols are replication protocols that tolerate arbitrary faults of a fraction of the replicas. Although significant efforts have recently been made, existing BFT protocols do not provide acceptable performance when faults occur. As we show in this paper, this comes from the fact that all existing BFT protocols targeting high throughput use a special replica, called the primary, which indicates to other replicas the order in which requests should be processed. This primary can be smartly malicious and degrade the performance of the system without being detected by correct replicas. In this paper, we propose a new approach, called RBFT for Redundant-BFT: we execute multiple instances of the same BFT protocol, each with a primary replica executing on a different machine. All the instances order the requests, but only the requests ordered by one of the instances, called the master instance, are actually executed. The performance of the different instances is closely monitored, in order to check that the master instance provides adequate performance. If that is not the case, the primary replica of the master instance is considered malicious and replaced. We implemented RBFT and compared its performance to that of other existing robust protocols. Our evaluation shows that RBFT achieves performance similar to that of the most robust protocols when there is no failure and that, under faults, its maximum performance degradation is about 3%, whereas it is at least equal to 78% for existing protocols.

I. INTRODUCTION

Byzantine Fault Tolerant (BFT) state machine replication is an efficient and effective approach to deal with arbitrary software and hardware faults [1], [5], [7], [8], [18]. The wide range of research carried out in the field of BFT in the last decade primarily focused on building fast BFT protocols, i.e., protocols that are designed to provide the best possible performance in the common case (i.e., in the absence of faults) [4], [9], [11], [12]. More recently, interest has been given to robustness, i.e., building BFT protocols that achieve good performance when faults occur. Three protocols have been proposed to address this issue: Prime [2], Aardvark [6], and Spinning [17]. Unfortunately, as shown in Table I (details are provided in Section III), these protocols are not effectively robust: the maximum performance degradation they can suffer when some faults occur is at least 78%, which is not acceptable.

                                   Prime   Aardvark   Spinning
  Maximum throughput degradation    78%      87%        99%

TABLE I: Performance degradation of “robust” BFT protocols under attack.

The reason why the above-mentioned BFT protocols are not robust is that they rely on a dedicated replica, called the primary, to order requests. Even if several mechanisms exist to detect and recover from a malicious primary, the primary can be smartly malicious. Despite efforts from other replicas to control that it behaves correctly, it can slow the performance down to the detection threshold without being caught. To design a really robust BFT protocol, a legitimate idea that comes to mind is to avoid using a primary. One such protocol has been proposed by Boran and Schiper [3]. This protocol has a theoretical interest, but no practical interest. Indeed, the price to pay to avoid using a primary is that, before ordering every request, replicas need to be sure that they received a message from all other correct replicas. As replicas do not know which replicas are correct, they need to wait for a timeout (which is increased if it is not long enough). This yields very poor performance, which explains why this protocol has never been implemented. A number of other protocols have been devised to enforce intrusion tolerance (e.g., [16]). These protocols rely on what is called proactive recovery, in which nodes are periodically rejuvenated (e.g., their cryptographic keys are changed and/or a clean version of their operating system is loaded). If performed sufficiently often, node rejuvenation makes it difficult for an attacker to corrupt enough nodes to harm the system. These solutions are complementary to the robustness mechanisms studied in this paper.

In this paper, we propose RBFT (Redundant Byzantine Fault Tolerance), a new approach to designing robust BFT protocols. In RBFT, multiple instances of a BFT protocol are executed in parallel. Each instance has a primary replica. The various primary replicas are all executed on different machines. While all protocol instances order requests, only one instance (called the master instance) effectively executes them. Other instances (called backup instances) order requests in order to compare the throughput they achieve to that achieved by the master instance. If the master instance is slower, the primary of the master instance is considered malicious and the replicas elect a new primary at each protocol instance. Note that RBFT is intended for open-loop systems (such as the Zookeeper [10] or Boxwood [14] asynchronous APIs), i.e., systems where a client may send multiple requests in parallel without waiting for the replies to previous requests. Indeed, in a closed-loop system, the rate of incoming requests would be conditioned by the rate of the master instance. Said differently, backup instances would never be faster than the master instance. RBFT further implements a fairness mechanism between clients by monitoring the latency of requests, which ensures that client requests are fairly processed.

We implemented RBFT and compared its performance to that achieved by Prime, Aardvark, and Spinning. Our evaluation on a cluster of machines shows that RBFT achieves performance comparable to the most robust protocols in the fault-free case, and that it only suffers a 3% performance degradation under failures.

The rest of the paper is organized as follows. We first present the system model in Section II. We then present an analysis of state-of-the-art robust BFT protocols in Section III. In Section IV we present the design and principles of Redundant Byzantine Fault Tolerance, and we present in Section V an instantiation of it: the RBFT protocol. In Section VI we present a theoretical analysis of RBFT. In Section VII we present our experimental evaluation of RBFT. Finally, we conclude the paper in Section VIII.

II. SYSTEM MODEL

The system is composed of N nodes. We assume the Byzantine failure model, in which any finite number of faulty clients can behave arbitrarily and at most f = ⌊(N − 1)/3⌋ nodes are faulty, which is the theoretical lower bound [13]. We consider the physical machine as the smallest faulty component: if a single process is compromised, then we consider that the whole machine is compromised. Faulty nodes and clients can collude to compromise the replicated service. Nevertheless, they cannot break cryptographic techniques (e.g., signatures, message authentication codes (MACs), collision-resistant hashing). Furthermore, we assume an asynchronous network where synchronous intervals, during which messages are delivered within an unknown bounded delay, occur infinitely often. Finally, we denote a message m signed by node i's private key by 〈m〉σi, a message m authenticated by a node i with a MAC for a node j by 〈m〉µi,j, and a message m authenticated by a node i with a MAC authenticator, i.e., an array containing one MAC per node, by 〈m〉~µi. Our system model is in line with the assumptions of other papers in the field, e.g., [4].
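As a concrete reading of these bounds, here is a minimal sketch (the function names are ours, not part of any protocol) of the fault threshold f = ⌊(N − 1)/3⌋ and of the quorum sizes that appear throughout the paper:

```python
def max_faults(n: int) -> int:
    """Maximum number of faulty nodes tolerated: f = floor((N - 1) / 3)."""
    return (n - 1) // 3

def quorum_sizes(n: int) -> tuple[int, int]:
    """Return (2f + 1, f + 1): the agreement quorum, and the smallest
    set of nodes guaranteed to contain at least one correct node."""
    f = max_faults(n)
    return 2 * f + 1, f + 1

# With N = 4 nodes, at most f = 1 node may be Byzantine.
assert max_faults(4) == 1
assert quorum_sizes(4) == (3, 2)
# Tolerating f = 2 faults requires N = 3f + 1 = 7 nodes.
assert max_faults(7) == 2
```

Conversely, N = 3f + 1 is the smallest system size for which these quorums intersect in at least one correct node.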

We address in this paper the problem of robust Byzantine Fault Tolerant state machine replication in open-loop systems, i.e., systems in which clients do not need to wait for the reply to a request before sending new requests [15]. In an open-loop system, even if the malicious primary of the master instance delays requests, correct primaries of the backup instances will still be able to order new requests coming from clients and thus to detect the misbehaving primary. This is not the case in closed-loop systems. We will consider the robustness of BFT protocols intended for closed-loop systems in our future work.

III. ANALYSIS OF EXISTING ROBUST BFT PROTOCOLS

We present in this section an analysis of existing robust BFT protocols, i.e., Prime [2], Spinning [17] and Aardvark [6]. These three protocols are the only protocols designed to target the robustness problem. Indeed, an analysis of other famous BFT protocols, e.g., PBFT [4], Q/U [1], HQ [7] and Zyzzyva [12], performed in [6], has shown that these protocols suffer from a robustness issue. Specifically, although they are built to eventually recover from attacks, the throughput of all of them drops to zero during a possibly long time interval corresponding to the duration of the attack, which is not acceptable for their clients. In all these protocols, the system is composed of N = 3f + 1 replicas, among which one, the primary, has the role of proposing sequence numbers for requests. Moreover, in order to compute the theoretical performance loss of the various protocols, we examine the protocol actions performed by the different nodes to process a batch of requests of size b. We consider the computational costs of the cryptographic operations. Specifically, we consider that verifying and generating a signature costs θ cycles, verifying and generating a MAC authenticator costs α cycles, and generating the digest of a message costs δ cycles.

A. Prime

In Prime [2], clients send their requests to any replica in the system. Replicas periodically exchange the requests they receive from clients. As such, they are aware of current requests to order and start expecting ordering messages from the primary which should contain them. Furthermore, whether there are requests to order or not, the primary must periodically send (possibly empty) ordering messages, every ∆pp cycles. This allows non-primary replicas to expect ordering messages with a given frequency. In order to improve the accuracy of the expected frequency at which a primary should send messages, replicas monitor the network performance. Specifically, replicas periodically measure the round-trip time between each pair of them. This measure allows them to compute the maximum delay that should separate the sending of two ordering messages performed by a correct primary. This delay is computed as a function of three parameters: the round-trip time between replicas, noted rtt; the time needed to execute a batch of requests, noted E; and a constant that accounts for the variability of the network latency, noted Klat, which is set by the developer. If the primary becomes slower than what is expected by the replicas, then it is replaced.

The Prime protocol is not robust for the following reason. If the monitoring is inaccurate, the delay expected for a primary to send ordering messages can be too long, which gives the opportunity for a malicious primary to delay ordering messages. We performed the following experiment (in order to increase the round-trip time): a malicious primary colludes with a single faulty client. The latter sends a request that is heavier to process than other requests (1 ms vs 0.1 ms in our experiments). This increases the monitored round-trip time, which gives the malicious primary the opportunity to delay requests issued by correct clients. Figure 1 presents the throughput under attack relative to the throughput in the fault-free case, in percentage, as a function of the request size, for both a static and a dynamic load (details on the two workloads are given in Section VII). The theoretical performance loss has been computed using the Prime formulas of Table II. We observe that the primary is able to degrade the system throughput down to 22% of the performance in the fault-free case. In other words, the throughput under attack drops by up to 78% (the theoretical value being 73.2%).

Fig. 1: Prime throughput under attack relative to the throughput in the fault-free case. [Plot: relative throughput (%) vs. request size (kB), for a static and a dynamic load.]

            Fault-free case                                     Under attack
  C_FF = (θ(2b + 11f + 5) + δ(6f + 1) + ∆pp) / b      C_FF + ((rtt + E) · Klat) / b

TABLE II: Number of cycles required by Prime to process one request in the fault-free case and under attack.
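Table II can be read as code as follows. This is an illustrative sketch only: the function names are ours, and the parameter values below are arbitrary round numbers, not measurements from the evaluation.

```python
def prime_cost_fault_free(b, f, theta, delta, delta_pp):
    """Cycles per request in the fault-free case (Table II):
    C_FF = (theta*(2b + 11f + 5) + delta*(6f + 1) + delta_pp) / b."""
    return (theta * (2 * b + 11 * f + 5) + delta * (6 * f + 1) + delta_pp) / b

def prime_cost_under_attack(c_ff, rtt, e, k_lat, b):
    """Cycles per request under attack: C_FF + (rtt + E) * Klat / b.
    A malicious primary can stretch each batch by the full slack
    (rtt + E) * Klat that the monitoring tolerates."""
    return c_ff + (rtt + e) * k_lat / b

# Arbitrary illustrative values, in cycles.
c_ff = prime_cost_fault_free(b=10, f=1, theta=100_000, delta=1_000, delta_pp=50_000)
c_att = prime_cost_under_attack(c_ff, rtt=1_000_000, e=200_000, k_lat=2, b=10)

# The relative degradation is 1 - C_FF / C_attack: the larger the
# monitoring slack (rtt + E) * Klat, the worse the degradation.
degradation = 1 - c_ff / c_att
assert c_att > c_ff and 0 < degradation < 1
```

The sketch makes the weakness visible: the attacker's slack grows with the measured rtt, which is exactly what the colluding heavy-request client inflates.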

B. Aardvark

Aardvark [6] is a BFT protocol based on PBFT [4], the practical BFT protocol presented by Castro and Liskov. An important principle in the robustness of Aardvark is the presence of regular changes of the primary replica. Each time the primary is changed, a new configuration, called a view, is started. The authors of Aardvark argue that regularly changing the primary limits the throughput degradation a malicious primary may cause. These regular primary changes are performed as follows. A primary replica is required to achieve, at the beginning of a view, a throughput at least equal to 90% of the maximum throughput achieved by the primary replicas of the last N views (where N is the number of replicas). After an initial grace period of 5 seconds where the required throughput is stable, the non-primary replicas periodically raise this required throughput by a factor of 0.01, until the primary replica fails to provide it. At that point, a primary change occurs and a new replica becomes the primary. In addition to expecting a minimal throughput from the primary, the replicas monitor the frequency at which the primary sends ordering messages. Specifically, replicas start a timer, called the heartbeat timer, after the reception of each ordering message from the primary. If this timer expires before the primary sends another ordering message, a primary change is voted by the replicas. In addition to regular view changes and heartbeat timers, Aardvark implements a number of robustness mechanisms to deal with malicious clients and replicas. For instance, it uses separate Network Interface Controllers (NICs) for clients and replicas. This prevents client traffic from slowing down the replica-to-replica communication. Further, this enables the isolation of replicas that would flood the network with unfaithful messages.
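The adaptive throughput requirement described above can be sketched as follows. The class and method names are ours, and we read "raise by a factor of 0.01" as a 1% multiplicative increment; treat this as an illustration of the mechanism, not Aardvark's actual implementation.

```python
class PrimaryMonitor:
    """Sketch of Aardvark's per-view throughput requirement: at the start
    of a view the primary must sustain 90% of the best throughput seen
    over the last N views; after a 5-second grace period the bar is
    periodically raised by 1% until the primary fails to meet it,
    at which point a primary change is voted."""

    def __init__(self, history, grace_period=5.0):
        # `history`: throughputs achieved by the primaries of the last N views.
        self.required = 0.9 * max(history)
        self.grace_period = grace_period

    def raise_bar(self, elapsed):
        """Called periodically; ratchets the requirement after the grace period."""
        if elapsed > self.grace_period:
            self.required *= 1.01

    def primary_ok(self, observed):
        """True while the primary keeps up with the current requirement."""
        return observed >= self.required

m = PrimaryMonitor(history=[900, 1000, 950, 980])
assert m.required == 0.9 * 1000   # 90% of the best recent view
m.raise_bar(elapsed=6.0)          # past the grace period: bar becomes ~909
assert m.primary_ok(observed=950)
assert not m.primary_ok(observed=900)
```

The ratchet explains the static-load robustness result below: under saturation the bar quickly climbs close to the real sustainable throughput.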

As long as the system is saturated, the amount of damage a faulty primary can do to the system is limited, as the throughput expected by replicas is close to the maximal throughput clients can sustain. We performed an experiment in which the primary tries to delay requests as much as it can, under a static load. Results, depicted in Figure 2, show that the throughput provided by the system under attack is at least 76% of the throughput observed in the fault-free case. When the load is dynamic, however, the performance degradation can potentially be much higher. Indeed, when the load is low, the expected throughput computed by replicas is also low. If the load suddenly increases, a malicious primary can benefit from the low expectations computed by replicas to delay requests. More precisely, the malicious primary measures the observed throughput obs, computes the next expected throughput exp, and delays the requests by Adelay(obs, exp) such that it provides exactly this expected throughput. We performed different experiments under the dynamic load described in Section VII, for different message sizes. Results, depicted in Figure 2, show that because of a malicious primary under a dynamic load, the throughput of the system can drop down to 13% of the throughput that would have been provided if the primary had been detected (the maximum throughput degradation being 87%), and up to 85.9% using the theoretical formulas presented in Table III.

Fig. 2: Aardvark throughput under attack relative to the throughput in the fault-free case. [Plot: relative throughput (%) vs. request size (kB), for a static and a dynamic load.]

          Fault-free case                               Under attack
  C_FF = max(θ + α + δ, (α(10f + b) + δb) / b)    C_FF + Adelay(obs, exp)

TABLE III: Number of cycles required by Aardvark to process one request in the fault-free case and under attack. The delay Adelay depends on the observed throughput obs and on the expected throughput exp.

C. Spinning

Spinning [17] is a BFT protocol also based on PBFT [4].

Similarly to Aardvark, Spinning performs regular primary changes. The particularity of Spinning is that these primary changes are automatically performed after the primary has ordered a single batch of requests. In this protocol, requests are sent by clients to all replicas. As soon as a non-primary replica receives a request, it starts a timer and waits for a request ordering message from the primary containing this request. If the timer expires after a duration Stimeout, then the current primary is blacklisted (i.e., it will no longer become a primary in the future¹), another replica becomes the primary, and Stimeout is doubled. As soon as a request has been successfully ordered, the primary is automatically changed (i.e., there is no message exchange between replicas) and the value of Stimeout is reset to its initial value. Note that the value of Stimeout is a system parameter that is statically defined; it does not depend on live monitoring of the system.

¹ If f replicas are already blacklisted, then the oldest one is removed from the blacklist, to ensure the liveness of the system.

Fig. 3: Spinning throughput under attack relative to the throughput in the fault-free case. [Plot: relative throughput (%) vs. request size (kB), for a static and a dynamic load.]
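The Stimeout management described above can be sketched as follows (the class and method names are ours; the 40 ms initial value is the one used by the authors of Spinning in [17], expressed here in milliseconds):

```python
class SpinningTimeout:
    """Sketch of Spinning's ordering timeout: doubled on each expiry
    (the primary is blacklisted and the next replica takes over) and
    reset to its initial, statically defined value after a successful
    ordering. Values are in milliseconds."""

    def __init__(self, initial=40):
        self.initial = initial
        self.current = initial

    def on_expiry(self):
        """Timer expired: blacklist the primary, rotate, double the timeout."""
        self.current *= 2

    def on_ordered(self):
        """A batch was ordered: rotate the primary, reset the timeout."""
        self.current = self.initial

t = SpinningTimeout()
t.on_expiry(); t.on_expiry()   # two consecutive expiries: 40 -> 80 -> 160
assert t.current == 160
t.on_ordered()                 # successful ordering resets the timeout
assert t.current == 40
```

Note that the reset on every successful ordering is what a smartly malicious primary exploits: by staying just under Stimeout it is never blacklisted and the timeout never grows.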

The Spinning protocol is not robust for the following reason. A malicious primary can delay the request ordering messages by a little less than Stimeout. This drastically reduces the throughput without the primary being detected. As a result, it can continue to delay future ordering messages the next time it becomes the primary. We ran several experiments where the malicious primary delayed the sending of the ordering messages by 40 ms (which is the value used by the authors of Spinning in [17]), under both a static and a dynamic workload. Results, depicted in Figure 3, show that the throughput drops dramatically, down to 1% and 4.5% of the fault-free throughput under the static and dynamic workloads, respectively. This throughput degradation of up to 99% is clearly not acceptable. These results are close to the theoretical result of 96.9%, computed from the formulas of Table IV.

          Fault-free case                                 Under attack
  C_FF = (α(2b + 13f) + δ(2b + 1)) / b      ((n − f) · C_FF + f · (C_FF + Stimeout)) / n

TABLE IV: Number of cycles required by Spinning to process one request in the fault-free case and under attack.
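Table IV can be read as code as follows; the function names are ours and the parameter values are arbitrary illustrative numbers, not measurements from the paper.

```python
def spinning_cost_fault_free(b, f, alpha, delta):
    """Cycles per request, fault-free (Table IV):
    C_FF = (alpha*(2b + 13f) + delta*(2b + 1)) / b."""
    return (alpha * (2 * b + 13 * f) + delta * (2 * b + 1)) / b

def spinning_cost_under_attack(c_ff, n, f, s_timeout):
    """Under attack, f out of every n primary rotations are led by a
    malicious primary that stretches its turn by S_timeout:
    ((n - f) * C_FF + f * (C_FF + S_timeout)) / n."""
    return ((n - f) * c_ff + f * (c_ff + s_timeout)) / n

# Arbitrary illustrative values, in cycles. S_timeout (tens of
# milliseconds of CPU cycles) dwarfs the per-request cryptographic
# cost, which is why the degradation is so severe.
c_ff = spinning_cost_fault_free(b=10, f=1, alpha=500, delta=1_000)
c_att = spinning_cost_under_attack(c_ff, n=4, f=1, s_timeout=10_000_000)
assert c_att > 100 * c_ff
```

Algebraically the under-attack cost simplifies to C_FF + f · Stimeout / n, which shows directly that the damage scales with the timeout value.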

D. Summary

In this section we have seen that so-called robust BFT protocols, i.e., Prime, Aardvark and Spinning, are not effectively robust. A primary node can be smartly malicious and cause huge performance degradation without being caught. Prime is robust as long as the network meets a certain level of synchrony. If the variance of the network latency is too high, then a malicious primary can heavily impact the performance of the system. Aardvark is robust as long as the load is static. If Aardvark is fed a dynamic load (for instance, a load corresponding to connections to a website, which may contain many spikes), then it no longer guarantees good performance. Spinning is robust only for 2f + 1 requests out of every 3f + 1, i.e., when the current primary is a correct replica. However, Spinning has an important weakness every time a malicious primary is in place: the malicious primary can delay the sending of the ordering messages up to the maximal allowed time.

IV. THE RBFT PROTOCOL

We have seen in the previous section that existing BFT protocols that claim to be robust are not actually robust. In this section, we present RBFT, a new BFT protocol that stems from the concept of Redundant Byzantine Fault Tolerance: multiple instances of a BFT protocol are executed simultaneously, each one with a primary replica running on a different machine. We start with an overview of RBFT. We then detail the various steps followed by replicas. Finally, we detail two key mechanisms that are used in RBFT: the monitoring mechanism and the protocol instance change mechanism.

A. Protocol overview

As for the other robust BFT protocols, RBFT requires 3f + 1 nodes (i.e., 3f + 1 physical machines). Each node runs f + 1 protocol instances of a BFT protocol in parallel (see Figure 4). As we theoretically show in Section VI, f + 1 protocol instances are necessary and sufficient to detect a faulty primary and ensure the robustness of the protocol. This means that each of the N nodes in the system runs locally one replica for each protocol instance. Note that the different instances order the requests following a 3-phase commit protocol similar to PBFT [4]. Primary replicas of the various instances are placed on nodes in such a way that, at any time, there is at most one primary replica per node. One of the f + 1 protocol instances is called the master instance, while the others are called the backup instances. All instances order client requests, but only the requests ordered by the master instance are executed by the nodes. Backup instances only order requests in order to be able to monitor the master instance. For that purpose, each node runs a monitoring module that computes the throughput of the f + 1 protocol instances. If 2f + 1 nodes observe that the ratio between the performance of the master instance and the best backup instance is lower than a given threshold, then the primary of the master instance is considered to be malicious, and a new one is elected. Intuitively, this means that a majority of correct nodes agree on the fact that a protocol instance change is necessary (the full correctness proof of RBFT can be found in Section VI). An alternative could be to switch the master instance to the instance that provides the highest throughput. This would require a mechanism to synchronize the state of the different instances when switching, similar to the switching mechanism of Abstract [9]. We will explore this design in our future work.
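The instance-change decision just described can be sketched as follows; the function names and the 0.9 threshold are ours (the actual threshold is a protocol parameter), so treat this as an illustration of the decision rule only.

```python
def master_is_suspect(master_tput, backup_tputs, threshold=0.9):
    """A node's local check: the master's primary is suspected when the
    ratio between the master instance's throughput and that of the best
    backup instance falls below a given threshold."""
    return master_tput / max(backup_tputs) < threshold

def instance_change_needed(local_votes, f):
    """The primary of the master instance is replaced only when 2f + 1
    nodes (i.e., a majority of correct nodes) agree that it is slow."""
    return sum(local_votes) >= 2 * f + 1

# Four nodes (f = 1), each comparing the master against one backup.
votes = [master_is_suspect(800, [1000]),   # ratio 0.80 -> suspect
         master_is_suspect(980, [1000]),   # ratio 0.98 -> fine
         master_is_suspect(700, [1000]),   # suspect
         master_is_suspect(600, [1000])]   # suspect
assert instance_change_needed(votes, f=1)  # 3 >= 2f + 1 votes
```

Requiring 2f + 1 matching observations prevents up to f faulty nodes from triggering spurious instance changes on their own.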

In RBFT, the high-level goal is the same as for the other robust BFT protocols we have studied previously: replicas monitor the throughput of the primary and trigger the recovery mechanism when the primary is slow. The approach we use, however, is radically different. It is not possible for replicas to guess what the throughput of a non-malicious primary would be. Therefore, the key idea is to leverage multicore architectures to run multiple instances of the same protocol in parallel. Nodes then compare the throughput achieved by the different instances to know whether a protocol instance change is required or not. Given the theoretical and experimental analysis, we are confident that this approach allows us to build an effectively robust BFT protocol.

Fig. 4: RBFT overview (f = 1). [Figure: nodes 0–3, each running one replica of the master protocol instance and one replica of the backup protocol instance; the primaries of the two instances run on different nodes; clients send their requests to all nodes.]

For RBFT to work correctly, it is important that the f + 1 instances receive the same client requests. To that purpose, when a node receives a client request, it does not give it directly to the f + 1 replicas it is hosting. Rather, it forwards the request to all other nodes. When a node has received f + 1 copies of a client request (possibly including its own copy), it knows that every correct node will eventually receive the request (because the request has been sent to at least one correct node). Consequently, it gives the request to the f + 1 replicas it hosts.
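This dissemination rule can be sketched as follows (the class and attribute names are ours):

```python
class Node:
    """Sketch of RBFT's request dissemination: a node hands a request to
    its local replicas only once it holds f + 1 copies of it, so that at
    least one copy came from a correct node, which in turn guarantees
    that every correct node eventually receives the request."""

    def __init__(self, f):
        self.f = f
        self.copies = {}        # request id -> number of copies seen
        self.delivered = set()  # requests given to the f + 1 local replicas

    def on_request_copy(self, rid):
        """Called for the node's own copy and for each forwarded copy."""
        self.copies[rid] = self.copies.get(rid, 0) + 1
        if self.copies[rid] >= self.f + 1 and rid not in self.delivered:
            self.delivered.add(rid)  # deliver to the local replicas

n = Node(f=1)
n.on_request_copy("r1")           # the node's own copy: not enough yet
assert "r1" not in n.delivered
n.on_request_copy("r1")           # a second copy: f + 1 reached, deliver
assert "r1" in n.delivered
```

The f + 1 threshold is the same "at least one correct node" quorum used throughout the protocol.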

Finally, note that each protocol instance implements a full-fledged BFT protocol, very similar to the Aardvark protocol described in the previous section. There is nevertheless a significant difference: a protocol instance does not proceed to a view change on its own. Indeed, the view changes in RBFT are controlled by the monitoring mechanism and apply to every protocol instance at the same time.

B. Detailed protocol steps

RBFT protocol steps are described hereafter and depicted in Figure 5. The numbering of the steps is the same as the one used in the figure.

Fig. 5: RBFT protocol steps (f = 1). [Figure: the client sends a REQUEST to all nodes (step 1); nodes exchange PROPAGATE messages (step 2); the replicas of each instance run the redundant agreement through the PRE-PREPARE, PREPARE and COMMIT phases (steps 3–5); nodes send a REPLY to the client (step 6).]

1. The client sends a request to all the nodes. A client c sends a REQUEST message 〈〈REQUEST, o, rid, c〉σc, c〉~µc to all the nodes (Step 1 in the figure). This message contains the requested operation o, a request identifier rid, and the client id c. It is signed with c's private key, and then authenticated with a MAC authenticator for all nodes. On reception of a REQUEST message, a node i verifies the MAC authenticator. If the MAC is valid, it verifies the signature of the request. If the signature is invalid, then the client is blacklisted: further requests will not be processed². If the request has already been executed, i resends the reply to the client. Otherwise, it moves to the following step. We use signatures because the request needs to be forwarded by nodes to each other, and using a MAC authenticator alone would fail to guarantee non-repudiation, as detailed in [6]. Similarly, using signatures alone would open the possibility for malicious clients to overload nodes by sending bogus requests with invalid signatures, which are more costly to verify than MACs.

2. The correct nodes propagate the request to all the nodes. Once the request has been verified, the node sends a 〈PROPAGATE, 〈REQUEST, o, rid, c〉σc, i〉~µi message to all nodes. This step ensures that every correct node will eventually receive the request as long as the request has been sent to at least one correct node. On reception of a PROPAGATE message coming from node j, node i first verifies the MAC authenticator. If the MAC is valid, and it is the first time i receives this request, i verifies the signature of the request. If the signature is valid, i sends a PROPAGATE message to the other nodes. When a node receives f + 1 PROPAGATE messages for a given request, the request is ready to be given to the replicas of the f + 1 protocol instances running locally, for ordering. As proved in Section VI, f + 1 PROPAGATE messages are sufficient to guarantee that if malicious primaries can order a given request, all correct primaries will eventually be able to order the same request. Note that the replicas do not order the whole request but only its identifiers (i.e., the client id, request id and digest). Not only is the whole request unnecessary for the ordering phase, but ordering identifiers alone also improves performance, as there is less data to process.

3., 4. and 5. The replicas of each protocol instance execute a three-phase commit protocol to order the request. When the primary replica p of a protocol instance receives a request, it sends a PRE-PREPARE message 〈PRE-PREPARE, v, n, c, rid, d〉~µp authenticated with a MAC authenticator for every replica of its protocol instance (Step 3 in the figure). A replica that is not the primary of its protocol instance stores the message and expects a corresponding PRE-PREPARE message. When a replica receives a PRE-PREPARE message from the primary of its protocol instance, it verifies the validity of the MAC. It then replies to the PRE-PREPARE message by sending a PREPARE message to all other replicas, but only if the node it is running on has already received f + 1 copies of the request. Without this verification, a malicious primary could collude with faulty clients that send correct requests only to it, in order to boost the performance of its protocol instance at the expense of the other protocol instances. Following the reception of 2f matching PREPARE messages from distinct replicas of the same protocol instance that are consistent with a PRE-PREPARE message, a replica r sends a commit message 〈COMMIT, v, n, d, r〉~µr that is authenticated with a MAC authenticator (Step 5 in the figure). After the reception of 2f + 1 matching COMMIT messages from distinct replicas of the same protocol instance, a replica gives back the ordered request to the node it is running on.

²Deviations performed by clients or nodes on the security protocols, e.g., a client that continuously initiates the key exchange protocol, are considered out of the scope of this paper and will be studied in future work.
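A minimal sketch of the quorum thresholds used in steps 3-5 (state layout is an assumption; message validation and view/sequence bookkeeping are omitted):

```python
class OrderingState:
    """Per-request state at one replica: a replica prepares after a
    PRE-PREPARE plus 2f matching PREPAREs, and commits (returning the
    ordered request to its node) after 2f+1 matching COMMITs."""

    def __init__(self, f):
        self.f = f
        self.pre_prepared = False
        self.prepares = set()  # replicas that sent a matching PREPARE
        self.commits = set()   # replicas that sent a matching COMMIT

    def on_pre_prepare(self):
        self.pre_prepared = True

    def prepared(self):
        # PRE-PREPARE from the primary + 2f matching PREPAREs
        return self.pre_prepared and len(self.prepares) >= 2 * self.f

    def committed(self):
        # 2f+1 matching COMMITs from distinct replicas
        return len(self.commits) >= 2 * self.f + 1
```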

6. The nodes execute the request and send a reply message to the client. Each time a node receives an ordered request from a replica of the master instance, the request operation is executed. After the operation has been executed, the node i sends a REPLY message 〈REPLY, u, i〉µi,c to client c that is authenticated with a MAC, where u is the result of the request execution (Step 6 in the figure). When the client c receives f + 1 valid and matching 〈REPLY, u, i〉µi,c messages from different nodes i, it accepts u as the result of the execution of the request.
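The client-side acceptance rule can be sketched as follows (function and names are illustrative assumptions):

```python
from collections import defaultdict

def accept_result(replies, f):
    """replies: iterable of (node_id, result) from valid REPLY messages.
    Return the accepted result, or None if no value yet has f+1 matching
    replies from distinct nodes."""
    votes = defaultdict(set)
    for node_id, result in replies:
        votes[result].add(node_id)       # distinct nodes only: a set per value
        if len(votes[result]) >= f + 1:
            return result
    return None
```

Requiring f + 1 matching replies guarantees that at least one of them comes from a correct node.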

C. Monitoring mechanism

RBFT implements a monitoring mechanism to detect whether the master protocol instance is faulty. This monitoring mechanism works as follows. Each node keeps a counter nbreqs_i for each protocol instance i, which corresponds to the number of requests that have been ordered by the replica of the corresponding instance (i.e., for which 2f + 1 COMMIT messages have been collected). Periodically, the node uses these counters to compute the throughput of each protocol instance replica and then resets the counters. The throughput values are compared as follows. If the ratio between the throughput of the master instance tmaster and the average throughput of the backup instances tbackup is lower than a given threshold ∆, then the primary of the master protocol instance is suspected to be malicious, and the node initiates a protocol instance change, as detailed in the next section. The value of ∆ depends on the ratio between the throughput observed in the fault-free case and the throughput observed under attack. Note that the state of the different protocol instances is not synchronized: they can diverge and order requests in different orders without compromising either the safety or the liveness of the protocol.
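This monitoring loop can be sketched as follows; the class and parameter names are assumptions, and the acceptance test (tmaster − tbackup)/tmaster < ∆ is the form that Section VI-B makes precise:

```python
class ThroughputMonitor:
    """Sketch of a node's per-period throughput monitoring."""

    def __init__(self, n_instances, master, delta, period):
        self.nbreqs = [0] * n_instances  # requests ordered per local replica
        self.master = master             # index of the master instance
        self.delta = delta               # threshold Delta (Section VI-B)
        self.period = period             # monitoring period, in seconds

    def on_ordered(self, instance):
        """Called when 2f+1 COMMITs have been collected for a request."""
        self.nbreqs[instance] += 1

    def check(self):
        """Return True if the master's primary should be suspected."""
        tp = [c / self.period for c in self.nbreqs]
        self.nbreqs = [0] * len(self.nbreqs)       # reset the counters
        backups = [t for i, t in enumerate(tp) if i != self.master]
        t_backup = sum(backups) / len(backups)     # average backup throughput
        t_master = tp[self.master]
        if t_master == 0:
            return t_backup > 0
        return (t_master - t_backup) / t_master < self.delta
```

Note that with the definition of ∆ given in Section VI-B, ∆ is not positive (tpeak ≥ tuncivil), so the master is allowed to lag behind the backups by a bounded relative margin before being suspected.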

In addition to monitoring the throughput, the monitoring mechanism also tracks the time needed by the replicas to order the requests. This mechanism ensures that the primary of the master protocol instance is fair towards all the clients. Specifically, each node measures, for each replica running on the same node, the individual latency of each request, denoted latreq, and the average latency for each client, denoted latc. A configuration parameter, Λ, defines the maximal acceptable latency for any given request. Another parameter, Ω, defines the maximal acceptable difference between the average latencies of a client on the different protocol instances. These two parameters depend on the workload and on the experimental settings. When the node sends a request to the various replicas running locally for ordering, it records the current time. Then, when it receives the corresponding ordered request, it computes its latency latreq and the average latency latc for all the requests of this client. If this request has been ordered by the replica of the master instance and if latreq is greater than Λ, or if the difference between latc and the average latency of this client on the other protocol instances is greater than Ω, then the node starts a protocol instance change. Note that we set the values of the different parameters, ∆, Λ and Ω, both theoretically and experimentally, as described in Section VI: their values depend on the cost of the cryptographic operations and on the network conditions.
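The two latency checks can be sketched as follows (function and variable names are assumptions):

```python
def master_latency_suspect(lat_req, lat_c_master, lat_c_backups, Lambda, Omega):
    """Sketch of the per-request latency checks for a request ordered by
    the master instance.

    lat_req       -- latency of this request on the master instance
    lat_c_master  -- average latency of this client on the master instance
    lat_c_backups -- average latency of this client on each backup instance
    Return True if the node should start a protocol instance change."""
    if lat_req > Lambda:
        return True  # per-request bound exceeded
    avg_backup = sum(lat_c_backups) / len(lat_c_backups)
    # master unfair to this client relative to the other instances
    return lat_c_master - avg_backup > Omega
```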

D. Protocol instance change mechanism

In this section, we describe the protocol instance change mechanism that is used to replace the faulty primary of the master protocol instance. Because each node runs at most one primary in RBFT, this implies replacing the primaries of all the protocol instances.

Each node i keeps a counter cpi_i, which uniquely identifies a protocol instance change message. When a node i detects too great a difference between the performance of the master instance and the performance of the backup instances (as detailed in the previous section), it sends an 〈INSTANCE CHANGE, cpi_i, i〉~µi message, authenticated with a MAC authenticator, to all the other nodes.

When a node j receives an INSTANCE CHANGE message from node i, it verifies the MAC and handles it as follows. If cpi_i < cpi_j, then this message concerns a previous protocol instance change and is discarded. On the contrary, if cpi_i ≥ cpi_j, then the node checks whether it should also send an INSTANCE CHANGE message. It does so only if it also observes too great a difference between the performance of the replicas. Upon the reception of 2f + 1 valid and matching INSTANCE CHANGE messages, the node increments cpi_j and initiates a view change on every protocol instance that runs locally. As a result, each protocol instance elects a new primary, and the malicious replica of the master instance is no longer a primary.
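A simplified sketch of this handling (MAC verification and the local performance re-check are abstracted away; class and field names are assumptions):

```python
class InstanceChanger:
    """Sketch of INSTANCE_CHANGE handling at one node."""

    def __init__(self, f):
        self.f = f
        self.cpi = 0        # identifier of the current instance change round
        self.senders = set()  # nodes whose INSTANCE_CHANGE matches this round

    def on_instance_change(self, cpi_i, sender):
        """Return True when a protocol instance change is triggered."""
        if cpi_i < self.cpi:
            return False    # concerns a previous instance change: discard
        self.senders.add(sender)
        if len(self.senders) >= 2 * self.f + 1:
            self.cpi += 1          # increment cpi and initiate a view change
            self.senders.clear()   # on every local protocol instance
            return True
        return False
```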

V. IMPLEMENTATION

We have implemented RBFT in C++ using the Aardvark code base. Similarly to Aardvark, RBFT uses separate Network Interface Controllers (NICs). Not only does this prevent client traffic from slowing down node-to-node communication, it also protects against flooding attacks performed by faulty nodes. In this situation, RBFT closes the NIC of the faulty node for a given time period, which gives the faulty node time to restart or be repaired without penalizing the performance of the whole system. Moreover, like our implementation of Aardvark, RBFT uses TCP. We use it because it eases the development task: unlike UDP, TCP provides a lossless, FIFO communication channel. Previous works (e.g., Zyzzyva [12] or PBFT [4]) have shown that the bottleneck in BFT protocols is actually cryptography, not network usage. Interestingly, the BFT protocol that currently provides the highest throughput, called Chain, uses TCP [9]. Chain requires the smallest number of cryptographic operations at the bottleneck replicas, hence its very good performance. For comparison, we have also implemented a UDP version of RBFT. We show, in Section VII-B, that this implementation provides the same peak throughput.

Figure 6 depicts the architecture of a single node in RBFT in the case where one fault is tolerated (i.e., f = 1). We observe that two replicas belonging to two different protocol instances are hosted by the node: replica 0 of protocol instance 0 (noted p0,0 in the figure) and replica 0 of protocol instance 1 (noted p0,1 in the figure). We also observe that the node uses 3f + 1 = 4 NICs, as in Aardvark: 1 NIC for the communication with the clients, and 1 NIC per other node.

As depicted in the figure, a number of modules run on eachnode. We describe the behavior of these modules below.

[Figure 6: block diagram of Node 0, showing the Verification, Propagation, Dispatch & Monitoring, and Execution modules, the replicas p0,0 and p0,1, one NIC towards the clients, and one NIC towards each of Nodes 1, 2 and 3.]

Fig. 6: Architecture of a single node (f = 1). The replicas (in circles) are processes that are independent from the different modules, implemented as threads (in rectangles).

When a request is received from a client, it is verified by the Verification module. If the request is correct, then the Propagation module disseminates it to all the other nodes through the node-to-node NICs and waits for similar messages from them. Once it has received f + 1 such messages, the Propagation module sends the request to the Dispatch & Monitoring module, which logs the request and gives it to the replicas of the protocol instances running locally (i.e., p0,0 and p0,1 in the figure). The replicas interact with the other replicas of the same protocol instance (running on the other nodes) via separate NICs, to order the request. Then, each replica gives back the ordered request to the Dispatch & Monitoring module. The latter sends the ordered requests coming from the master protocol instance to the Execution module, which executes them and replies to the client.

The Verification, Propagation, Dispatch & Monitoring and Execution modules are implemented as separate threads. Moreover, the replicas are implemented as independent processes. The various threads and processes are deployed on distinct cores (our machines have 8 cores). Leveraging multicore machines improves the performance of the system, as the different protocol instances can execute truly concurrently.

VI. RBFT THEORETICAL ANALYSIS

We present in this section a theoretical analysis of RBFT. We first present a throughput analysis in Section VI-A. Given this analysis, we show in Section VI-B how to compute the minimal acceptable ratio between the performance of the master instance and the performance of the backup instances. Then, in Section VI-C, we prove that having f + 1 protocol instances is necessary and sufficient. Finally, we prove that, thanks to the PROPAGATE phase, all the correct nodes eventually send each request to the ordering stage (Section VI-D) and that if 2f + 1 correct nodes decide to perform a protocol instance change, then a protocol instance change will eventually be performed by every correct node (Section VI-E).

A. Throughput analysis

We analyze in this section the throughput of RBFT during both gracious and uncivil intervals. We compute RBFT's peak performance in Theorem 1 and show in Theorem 2 that the performance during an uncivil execution is within a constant factor of the peak performance.

We consider a multi-core machine where each replica and each module (i.e., the verification, propagation, dispatch and monitoring, and execution modules) run in parallel on distinct cores. Each core has a speed of κ Hz. We consider only the computational costs of the cryptographic operations (i.e., verifying signatures, generating MACs, and verifying MACs, requiring θ, α and α cycles, respectively). We ignore the cost of request execution and of the generation of client replies, as the throughput of the protocol instances is monitored ahead of the execution of client requests. Moreover, this execution takes place on a separate core and thus does not impact the performance of the individual protocol instances.

Theorem 1. Consider a gracious view during which the system is saturated, all requests come from correct clients, all nodes are correct, and the replicas order batches of requests of size b. For each protocol instance, the throughput is at most

tpeak = κ / max(θ + α(n + f), α(2n + 4f − 1)/b) req/s.

Proof. First, we compute the number of operations performed by each node to send a request to the ordering stage. For one request, there is at most one MAC and one signature verification, performed either when the request is received from the client, or when it is received during the PROPAGATE phase. Then, the node generates n − 1 MACs to create the PROPAGATE. Each node waits for f PROPAGATE messages, on which it verifies the MAC. Therefore, the cost for one request up to the ordering stage is θ + α(n + f).

Then we differentiate the operations performed by the primary of a protocol instance and by a non-primary replica. The primary generates n − 1 MACs for the PRE-PREPARE, while a non-primary replica verifies the PRE-PREPARE and generates n − 1 MACs for the PREPARE. Then, each replica verifies 2f PREPARE messages, after which it generates n − 1 MACs for the COMMIT, and finally verifies 2f + 1 COMMIT messages. As a result, for each request, the cost at the primary is α(2n + 4f − 1)/b and the cost at a non-primary replica is α(2n + 4f)/b.

Given that the propagate module and the replicas run in parallel, the throughput is at most tpeak = κ / max(θ + α(n + f), α(2n + 4f − 1)/b) req/s.
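The bound of Theorem 1 can be evaluated numerically as follows; the cycle counts used in the example call are made-up illustrative values, not measurements:

```python
def t_peak(kappa, theta, alpha, n, f, b):
    """Peak throughput bound of Theorem 1 (requests per second).
    kappa: core speed (Hz); theta: signature-verification cost (cycles);
    alpha: MAC cost (cycles); n: number of nodes; f: faults tolerated;
    b: batch size."""
    propagation = theta + alpha * (n + f)       # per-request cost, propagation stage
    ordering = alpha * (2 * n + 4 * f - 1) / b  # per-request cost at the primary
    return kappa / max(propagation, ordering)

# Example with assumed costs (signatures roughly 100x a MAC):
# t_peak(kappa=2.4e9, theta=200_000, alpha=2_000, n=4, f=1, b=64)
```

The bound is governed by the slower of the two pipelined stages; with a large batch size b, the ordering stage is amortized and propagation dominates.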

Lemma 1. Consider an uncivil execution during which the system is saturated, but only a fraction g of the requests generated by clients are correct. Moreover, the node running the primary of the master instance is correct and at most f nodes are Byzantine. All the nodes receive the same proportion of correct requests and the replicas receive batches of requests of size b. For each protocol instance, the throughput is

tuncivil = κ / max(θ/g + α(1/g + n + f − 1), α(2n + 4f − 1)/b) req/s.

Proof. We compute the number of operations performed by each node to send a request to the ordering stage. Note that, thanks to the blacklisting mechanism, a node that sends too many invalid messages, or simply too many messages, is isolated and no longer takes part in the system. For one correct request, there is at most one MAC and one signature verification. The propagate thread generates n − 1 MACs to compute the PROPAGATE. Every correct node waits for at most f PROPAGATE messages, for which it verifies the MAC. Consequently, the cost for one request up to the ordering stage is θ/g + α(1/g + n + f − 1).

Then, the cost at the ordering step is the same as in the fault-free case: the cost at the primary is α(2n + 4f − 1)/b and the cost at a non-primary replica is α(2n + 4f)/b.

Consequently, the throughput is tuncivil = κ / max(θ/g + α(1/g + n + f − 1), α(2n + 4f − 1)/b) req/s.

Theorem 2. Consider an uncivil execution during which the system is saturated, but only a fraction g of the requests generated by clients are correct. Moreover, the node running the primary of the master instance is correct and at most f nodes are Byzantine. All nodes receive the same proportion of correct requests and the replicas order batches of requests of size b. Then the performance is within a constant factor C ≤ 1 of the performance in a gracious execution in which the batch size is the same.

Proof. From Theorem 1 and Lemma 1 we know the performance during a gracious and an uncivil execution. Moreover, the ratio α/θ tends to 0. Let v1 = (θ/α + 1 − 1/b) / (10/b − 4) and v2 = (θ/(gα) + 1/g − 1/b) / (10/b − 4), with v1 ≤ v2. Moreover, from the formulas of tpeak and tuncivil, let A = θ + α(n + f), B = α(2n + 4f − 1)/b and C = θ/g + α(1/g + n + f − 1). We need to distinguish three cases³:

1) A > B and C > B. This happens when f < v1. In this case, the ratio tuncivil/tpeak between the uncivil and gracious throughputs is C = g.

2) A > B and C ≤ B. This happens when v1 ≤ f < v2. In this case, the ratio tuncivil/tpeak between the uncivil and gracious throughputs tends to C = 0.

3) A ≤ B and C ≤ B. This happens when f ≥ v2. In this case, the ratio tuncivil/tpeak between the uncivil and gracious throughputs is C = 1.

³The fourth combination, A ≤ B and C > B, cannot happen, as it would mean that v1 > f > v2.

B. Monitoring the protocol instances

We define in this section how to compute the minimal acceptable ratio between the performance of the master instance and the performance of the backup instances (Definition 1).

Definition 1. In Section IV-C we have seen that the performance of the master instance is compared with the max performance of the backup instances. If the difference exceeds the threshold ∆, then the node initiates a protocol instance change.

More precisely, the nodes monitor the throughput of the different protocol instances. They compute the ratio between the throughput of the master instance (noted tmaster) and the max throughput of the backup instances (noted tbackup). They expect (tmaster − tbackup)/tmaster < ∆. We know that the maximum acceptable difference, such that no protocol instance change is initiated, lies between the performance during a gracious execution (Theorem 1) and the performance during an uncivil execution (Lemma 1). Consequently, ∆ = (tuncivil − tpeak)/tuncivil.
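Under the assumptions of Section VI-A, ∆ can be computed directly from the two throughput formulas; the numeric inputs below are illustrative assumptions, not measured values:

```python
def t_peak(kappa, theta, alpha, n, f, b):
    """Peak throughput bound of Theorem 1."""
    return kappa / max(theta + alpha * (n + f),
                       alpha * (2 * n + 4 * f - 1) / b)

def t_uncivil(kappa, theta, alpha, n, f, b, g):
    """Uncivil-execution throughput of Lemma 1 (fraction g of correct requests)."""
    return kappa / max(theta / g + alpha * (1 / g + n + f - 1),
                       alpha * (2 * n + 4 * f - 1) / b)

def delta(kappa, theta, alpha, n, f, b, g):
    """Threshold of Definition 1: Delta = (t_uncivil - t_peak) / t_uncivil."""
    tu = t_uncivil(kappa, theta, alpha, n, f, b, g)
    return (tu - t_peak(kappa, theta, alpha, n, f, b)) / tu
```

Since tpeak ≥ tuncivil, ∆ is never positive: it quantifies how far the master instance may fall behind the backups before an instance change is warranted.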

Lemma 2. If the master protocol instance is malicious andprovides a throughput lower than the throughput during anuncivil execution, then it is detected and a protocol instancechange is initiated.

Proof. Let tdelay be the throughput of the master instance. According to Lemma 1, the throughput of the backup instances is in the worst case tuncivil. By Definition 1, a protocol instance change is initiated if (tmaster − tbackup)/tmaster < ∆. We now compute tdelay:

(tmaster − tbackup)/tmaster < ∆
⇔ (tdelay − tuncivil)/tdelay < (tuncivil − tpeak)/tuncivil
⇔ tuncivil(tdelay − tuncivil) < tdelay(tuncivil − tpeak)
⇔ tuncivil² > tdelay · tpeak
⇔ tdelay < tuncivil²/tpeak.

From Theorem 2, we know that tuncivil/tpeak tends to C. Thus, a protocol instance change is initiated if tdelay < C · tuncivil. As C ≤ 1, a protocol instance change is initiated if the throughput during the attack is lower than the throughput during an uncivil execution. In other words, the malicious primary of the master instance cannot provide a throughput lower than the throughput during an uncivil execution without being replaced.

C. Number of protocol instances

In this section we prove that having f+1 protocol instancesis necessary and sufficient (Theorem 3).

Lemma 3. A protocol instance cannot provide a throughputhigher than the throughput observed in the fault-free case.

Proof. From Theorem 1 we know that the throughput depends on the number of MAC and signature verifications, and on the number of MAC generations. In order to provide a throughput higher than the peak throughput in the fault-free case, a protocol instance would need to perform fewer verifications and generations. However, the formula of tpeak considers the case where, with at most f malicious replicas (i.e., the safety property of the protocol holds), the number of verifications and generations is minimal. Consequently, a protocol instance cannot provide a throughput higher than tpeak.

Invariant 1. Consider an uncivil execution during which the system is saturated and at most f nodes are Byzantine. As long as the primary of the master instance is correct, a protocol instance change never happens.

Proof. We consider an uncivil execution, i.e., a proportion g of the requests are correct and at most f nodes are Byzantine. Moreover, the primary of the master instance is correct. We suppose that a protocol instance change eventually happens and show that this leads to a contradiction.

A protocol instance change eventually happens only if at least 2f + 1 nodes eventually receive 2f + 1 INSTANCE CHANGE messages.

There are at most f Byzantine nodes, which may continuously send valid INSTANCE CHANGE messages. Thus it is sufficient that f + 1 correct nodes send an INSTANCE CHANGE. For that, they need to observe performance at the backup instances that exceeds the performance at the master instance by the pre-determined threshold ∆ (cf. Definition 1). Specifically, the malicious nodes have to increase the performance of the backup instances and reduce the performance of the master instance.

We consider the following attack, which maximizes the difference between the master instance and the backup instances. The clients send a proportion g < 1 of correct requests to the master instance only, and a proportion g = 1 of correct requests to the backup instances. Moreover, the malicious nodes flood only the master instance and the node on which its primary runs (however, they do not flood the other instances running on this node). Thus the throughput of the master instance (noted tmaster) is given by Lemma 1, and the maximal throughput of the backup instances (noted tbackup) is given by Theorem 1.

An INSTANCE CHANGE is sent by a correct node if the ratio between tmaster and tbackup is lower than ∆. As shown in Definition 1, ∆ is precisely computed as the ratio between the peak throughput and the uncivil throughput. Moreover, we know from Lemma 3 that tbackup ≤ tpeak. As a consequence, an INSTANCE CHANGE is never sent, which contradicts our assumption.

Invariant 2. Consider an uncivil execution during which the system is saturated and at most f nodes are Byzantine. If the primary of the master instance is Byzantine and its performance is not acceptable, then a protocol instance change eventually happens.

Proof. We consider an uncivil execution, i.e., a proportion g of the requests are correct and at most f nodes are Byzantine. Moreover, the primary of the master instance is malicious. We suppose that a protocol instance change never happens and show that this leads to a contradiction.

We consider the following attack, which maximizes the difference between the master instance and the backup instances. The clients send a proportion g < 1 of correct requests to the backup instances. As the master instance is malicious, there is no requirement on the value of g for it. Moreover, the malicious nodes flood all the correct nodes. Thus the throughput of the master instance (noted tmaster) is the throughput resulting from the attack, the throughput of the other instances with a malicious primary is also tmaster, and the throughput of the correct instances is tuncivil (cf. Lemma 1). The total number of instances per node is f + 1. Consequently, there are f backup instances, among which at most f − 1 have a malicious primary and at least 1 is correct. Because tmaster < tuncivil, the maximal throughput of the backup instances is equal to tuncivil.

An INSTANCE CHANGE is sent by a correct node if the ratio between tmaster and tuncivil is lower than ∆ (cf. Definition 1). That is to say, the ratio between tmaster and tuncivil is lower than the ratio between the throughput during an uncivil execution and the throughput during a gracious execution. Given Lemma 2, if tmaster < tuncivil, then every correct node detects the attack and sends an INSTANCE CHANGE. Therefore, every correct node eventually receives 2f + 1 INSTANCE CHANGE messages and votes for a protocol instance change, which contradicts our assumption.

Theorem 3. Consider an uncivil execution during which the system is saturated and at most f nodes are Byzantine. Then, having f + 1 protocol instances is necessary and sufficient for the malicious nodes and the clients to be unable to alter the performance of the different instances in such a way that the performance provided by the system is not adequate.

Proof. From Invariants 1 and 2 we know that if the number of protocol instances is f + 1, then the malicious nodes and clients can neither force a protocol instance change nor alter the performance provided by the system without triggering a protocol instance change. We now present an attack where, with fewer than f + 1 protocol instances, one of the invariants is violated.

We consider f protocol instances and the attack presented in the proof of Invariant 2. We show that this invariant is violated, i.e., the primary of the master instance is malicious and is not detected. The proof is trivial: with f malicious nodes, every protocol instance can run a malicious primary. As a result, they can all provide a null throughput. The primary of the master protocol instance is not detected, which violates Invariant 2.

D. Propagate phase

In this section, we prove that, thanks to the PROPAGATE phase, all the correct nodes eventually send each request to the ordering stage.

Invariant 3. If a correct node sends a PROPAGATE, thenevery correct node will eventually receive f + 1 PROPAGATE.

Proof. By contradiction: we suppose there exists a correct node n1 which sends a PROPAGATE to all the nodes and there exists a correct node n2 which never receives f + 1 PROPAGATE messages. We show that this leads to a contradiction.

The node n1 sends a PROPAGATE to all the nodes. It will eventually be received by all the correct nodes (as the system is eventually synchronous). Let n3 be a correct node which receives the PROPAGATE from n1. We need to distinguish three cases:

1) n3 has already received the REQUEST. According to the protocol algorithm (see Section IV-B), it has already sent a PROPAGATE to all the nodes;

2) n3 has never received the REQUEST and has already received a valid PROPAGATE for this request. According to the protocol algorithm, it has already sent a PROPAGATE to all the nodes;

3) n3 has never received the REQUEST nor a valid PROPAGATE for this request. According to the protocol algorithm, it sends a PROPAGATE to all the nodes.

In every case, a PROPAGATE is sent by n3. As there are at least 2f + 1 correct nodes, at least 2f + 1 valid PROPAGATE messages are sent and eventually received by every correct node. Thus, every correct node, including n2, eventually receives at least 2f + 1 PROPAGATE messages. This contradicts the fact that n2 never receives f + 1 PROPAGATE messages.

Invariant 4. If a request is sent to the ordering by a correct node, then it will eventually be sent to the ordering by all the correct nodes.

Proof. By contradiction: we suppose a request is sent to the ordering by a correct node n1 and there exists a correct node n2 which never sends the request to the ordering. We show that this leads to a contradiction.

Saying that n2 never sends the request to the ordering is equivalent to saying that it never receives at least f + 1 valid PROPAGATE messages.

Saying that n1 sends the request to the ordering is equivalent to saying that it has received f + 1 valid PROPAGATE messages, which means that f + 1 nodes have sent a valid PROPAGATE to n1. According to the algorithm, a correct node that receives a valid PROPAGATE or a valid REQUEST sends a PROPAGATE. Consequently, there exists a correct node (n1 in this case) which has sent a valid PROPAGATE to all the nodes.

According to Invariant 3, if a correct node sends a valid PROPAGATE, then every correct node will eventually receive f + 1 PROPAGATE messages.

Therefore n2, which is correct, eventually receives f + 1 PROPAGATE messages. This contradicts the fact that n2 never receives at least f + 1 valid PROPAGATE messages.

E. Protocol instance change

In this section we prove that if 2f + 1 correct nodes decide to perform a protocol instance change, then a protocol instance change will eventually be performed by every correct node. This proof refers to the algorithm presented in Section IV-D.

Invariant 5. If 2f + 1 correct nodes decide to performa protocol instance change, then a protocol instance changewill eventually be performed by every correct node.

Proof. By contradiction: we suppose that 2f + 1 correct nodes decide to perform a protocol instance change and that there exists a correct node that never performs a protocol instance change. We show that this leads to a contradiction.

According to the algorithm, saying that 2f + 1 correct nodes decide to perform a protocol instance change is equivalent to saying that 2f + 1 nodes send a valid INSTANCE CHANGE to all the nodes. Thus every correct node will eventually receive 2f + 1 valid INSTANCE CHANGE messages.

We supposed that there exists a correct node that never performs a protocol instance change. This means that there exists a correct node that never receives 2f + 1 valid INSTANCE CHANGE messages. This contradicts the fact that every correct node eventually receives 2f + 1 valid INSTANCE CHANGE messages.

VII. PERFORMANCE EVALUATION

In this section we present a performance analysis of RBFT. After describing the hardware and software settings we use, we start by comparing the performance of RBFT against the state-of-the-art robust BFT protocols described in Section III in the fault-free case. Second, we assess the monitoring precision of RBFT. Finally, we present a performance analysis of RBFT under attack.

Our evaluation makes the following points. We first show that RBFT has performance comparable to state-of-the-art protocols in the fault-free case. Second, we show that under the two worst possible attacks, where faulty clients collude with faulty replicas, the throughput degradation caused to RBFT is limited to 3%.

A. Experimental settings

We evaluate the performance of Prime, Aardvark, Spinning and RBFT on a cluster composed of eight Dell PowerEdge T610 machines and two Dell Precision WorkStation T7400 machines. The T610 machines host two quad-core Intel Xeon E5620 processors clocked at 2.40GHz, with 16GB of RAM and ten network interfaces. The T7400 machines host two quad-core Intel Xeon E5410 processors clocked at 2.33GHz, with 8GB of RAM and five network interfaces. All these machines run a Linux kernel version 2.6.32 and are interconnected via a Gigabit switch. We run experiments with up to two Byzantine faults, i.e., f ≤ 2. The nodes are always launched on the T610 machines. The remaining machines are used to run the clients. Unless specified otherwise, we consider the TCP implementation of RBFT configured with f = 1.

[Figure 7: latency (in ms) vs. throughput (in kreq/s) for RBFT w/ TCP, RBFT w/ UDP, Prime, Aardvark and Spinning; (a) with requests of 8B, (b) with requests of 4kB.]

Fig. 7: Latency vs. throughput for various robust BFT protocols (f = 1).

We run our experiments under two different workloads: a static workload, where the system is saturated and the clients send their requests at a constant rate, and a dynamic workload, where the incoming throughput varies. We present the workload used for requests of 8B. Similar workloads have been used for the other request sizes, possibly with fewer clients, as the peak throughput is reached with fewer clients. The experiment starts with a single client. We then progressively increase the number of clients up to 10. Then we simulate a load spike, with 50 clients. At last, the number of clients progressively decreases, until there is only one client issuing requests. When the load is dynamic, we consider the average throughput observed over the whole experiment. A similar workload has already been used in [9].

Finally, the clients send their requests in an open loop, as defined in [15], i.e., they do not wait for the reply to a request before sending a new one.

B. Fault-free case

In this section we present the fault-free performance of RBFT, Prime, Aardvark and Spinning. We also show the performance of RBFT when the communication protocol used between the nodes is UDP instead of TCP. We run the experiments under a static load, with requests of either 8B or 4kB. Figures 7a and 7b present the latency achieved by the different protocols as a function of the throughput, for requests of 8B and 4kB, respectively.

First of all, we observe that Spinning provides the highest peak throughput and the lowest latency. Specifically, with requests of 8B, its throughput is 20% higher than the throughput of RBFT and Aardvark, and 183% higher than the throughput of Prime. Similarly, with requests of 4kB, the throughput of Spinning is 30% higher than the throughput of RBFT. Its good performance is mostly due to the fact that it relies only on MACs, while the other protocols (RBFT, Prime and Aardvark) use signatures in addition to MACs. Although signatures are an order of magnitude more costly than MACs, they are necessary to prevent certain attacks [6]. Moreover, Spinning uses UDP multicast both for the communication between the replicas and for the communication between the clients and the replicas. These two properties allow Spinning to provide a very low latency.

The second observation we make is that the performance of RBFT is higher than the performance of Aardvark. For requests of 8B, the peak throughput of RBFT is 35 kreq/s, while the peak throughput of Aardvark is 31.6 kreq/s. Similarly, for requests of 4kB, the peak throughput of RBFT is 5 kreq/s, while the peak throughput of Aardvark is 1.7 kreq/s. This might seem surprising, as the BFT protocol that is run by each protocol instance in RBFT mimics Aardvark and uses the same code base. The reason why RBFT is more efficient is that it does not perform regular view changes (remember that view changes are replaced by protocol instance changes, which occur only when there are faults). To confirm this explanation, we disabled the view changes in Aardvark and obtained the same performance as RBFT for small requests. For bigger requests, the better performance of RBFT is due to the fact that the protocol instances only order request identifiers instead of whole requests. When they order whole requests, the peak throughput drops to 1.8 kreq/s for requests of 4kB.

Third, we observe that Prime provides the worst performance, especially in terms of latency. Indeed, its latency is an order of magnitude higher than the latency of the other protocols for both request sizes. This high latency has two causes: first, the Prime protocol solely relies on signatures, which are known to be slower than MACs; second, in Prime, the primary does not send request ordering messages following the flow of arrival of requests, but periodically.

Finally, we observe that the UDP and TCP implementations of RBFT exhibit the same peak throughput. The only difference is that the UDP implementation provides a latency 22% lower (resp. 18% lower) than the TCP implementation, for requests of 8B (resp. 4kB). The additional latency of the TCP implementation is due to the mechanisms TCP uses to enforce robustness (acknowledgements, flow control, etc.). Note that the choice of the communication mechanism has no impact on the performance of RBFT under attack, as we found similar results using TCP or UDP.

C. RBFT monitoring precision

RBFT detects that the primary of the master protocol instance is faulty by periodically monitoring the performance of the protocol instances. In this section, we analyse the precision of this monitoring under both a static and a dynamic workload. More specifically, we are interested in the impact of the monitoring period on the minimal throughput ratio observed by the nodes, and on the distribution of the observed throughput ratios. The minimal ratio allows one to detect whether the primary of the master protocol instance is faulty: a node considers that this primary is faulty as soon as the observed ratio is lower than a specific threshold. If the minimal ratio is too low in the fault-free case, then it becomes more difficult to detect an attack. Furthermore, if the monitoring period is high, then an attacker is able to cause damage to the system during a long time. Finally, the distribution of the observed ratios assesses their volatility.
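This section does not spell out the ratio formula; one definition consistent with the values reported below (in particular, −∞ when the master orders nothing during a window) is the throughput difference normalised by the master's throughput. A hedged sketch of the detection check, with a hypothetical threshold value:

```python
import math

# Hypothetical detection threshold; a node suspects the master's primary
# when the observed ratio falls below it.
THRESHOLD = -0.25

def throughput_ratio(master_tput: float, best_backup_tput: float) -> float:
    """Assumed formula: (master - backup) / master. A master instance that
    ordered nothing during the monitoring window yields -infinity."""
    if master_tput == 0:
        return -math.inf
    return (master_tput - best_backup_tput) / master_tput

def master_suspected(master_tput: float, best_backup_tput: float) -> bool:
    return throughput_ratio(master_tput, best_backup_tput) < THRESHOLD

assert master_suspected(0, 5)          # no progress at all -> ratio is -inf
assert not master_suspected(10, 10.2)  # instances performing similarly
```

With this assumed formula, a master ordering 1 kreq/s against a backup at 8 kreq/s would give a ratio of −7, which is the order of magnitude of the short-window values reported below.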

Table V presents the minimal ratio observed under a static and a dynamic workload, respectively, for requests of 8B and 4kB. We vary the monitoring period from 1ms up to 1 second. Note that a monitoring period greater than 1 second does not improve the monitoring precision significantly. We observe that for a period of 1ms, which is lower than the system latency, the nodes cannot detect an attack, as the minimal observed ratio is −∞. As the monitoring period increases, the minimal ratio becomes closer to 0; that is to say, the nodes observe, in the worst case, less difference between the performance of the protocol instances. For instance, with requests of 8B and under a static load, it is −7 with a monitoring period of 10ms, and only −0.003 with a monitoring period of 1 second.

(a) Static workload:

       1ms   10ms   100ms   1s
8B     −∞    −7     −0.2    −0.003
4kB    −∞    −4     −0.1    −0.002

(b) Dynamic workload:

       1ms   10ms   100ms   1s
8B     −∞    −1     −0.07   −0.004
4kB    −∞    −0.7   −0.05   −0.005

TABLE V: RBFT monitoring precision: minimal ratio observed for different monitoring periods, for requests of 8B and 4kB, under both a static and a dynamic workload.

Figure 8 presents the cumulative percentage of the observed ratios for different monitoring periods in the fault-free case, with requests of 4kB and a static load. Results are similar with requests of 8B and/or under a dynamic workload. We observe that, as the monitoring period increases, the distribution of

Fig. 8: Cumulative percentage of the observed ratios for different monitoring periods (1ms, 10ms, 100ms, 1s) in the fault-free case with requests of 4kB and a static load.

observed ratios becomes closer to 0. This implies that the difference between the observed performance of the different protocol instances becomes smaller; hence a faulty primary at the master protocol instance has less liberty to attack the system.
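The tightening of the distribution with longer windows is essentially an averaging effect: per-window request counts fluctuate, and the relative fluctuation shrinks as the window grows. A toy simulation with independent random request counts for two equally fast instances (a modelling assumption for illustration, not the paper's traces):

```python
import random
import statistics

random.seed(42)

def observed_ratios(window_ms: int, n_windows: int = 100, rate_per_ms: float = 5.0):
    """Draw per-window request counts for two equally fast instances and
    compute the ratio (master - backup) / master for each window."""
    ratios = []
    n_trials = int(2 * rate_per_ms * window_ms)  # Binomial(n, 0.5): mean = rate
    for _ in range(n_windows):
        m = sum(random.random() < 0.5 for _ in range(n_trials))
        b = sum(random.random() < 0.5 for _ in range(n_trials))
        if m > 0:
            ratios.append((m - b) / m)
    return ratios

short = observed_ratios(window_ms=10)
long_ = observed_ratios(window_ms=1000)
# Longer monitoring periods concentrate the observed ratios around 0.
assert statistics.pstdev(long_) < statistics.pstdev(short)
```

The same logic explains the −∞ entries in Table V: with a 1ms window the master can simply have ordered nothing yet, making the ratio meaningless.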

In the rest of the evaluation, we set the monitoring period to 1 second.

D. RBFT under attack

Fig. 10: Throughput measured by the different nodes under worst-attack-1 (f = 1, static workload, 4kB requests), for the master and backup protocol instances.

In this section, we first present the two worst attacks that can be mounted against RBFT. These attacks show the maximal damage an attacker can inflict on the protocol in two different situations: (1) when the primary of the master instance is correct (worst-attack-1), and (2) when the primary of the master instance is malicious (worst-attack-2). We consider these attacks as the worst attacks in their respective contexts because, in each case, all the f malicious nodes and any number of malicious clients collude to reduce the system performance. We run the experiments in two configurations: f = 1 and f = 2. We show that the maximal throughput degradation is 3%. Finally, we show that the monitoring of the latency prevents a faulty primary at the master instance from being unfair to a subset of the clients when constructing an ordering message.

Fig. 9: RBFT throughput under worst-attack-1 relative to the throughput in the fault-free case, for both a static and a dynamic load ((a) f = 1, (b) f = 2; request sizes from 8B to 4kB).

Fig. 11: RBFT throughput under worst-attack-2 relative to the throughput in the fault-free case, for both a static and a dynamic load ((a) f = 1, (b) f = 2; request sizes from 8B to 4kB).

1) Worst-attack-1: In this attack, there are f faulty nodes and all clients are faulty. The primary of the master protocol instance is correct (i.e., it runs on a correct node). The goal of the attack is to decrease as much as possible the performance of the master instance, in order to induce a protocol instance change. If this attack is successful, then it is possible for malicious nodes to put a malicious replica as the primary of the master instance. A faulty node can cause the following damages: first, it can flood other nodes; second, the replicas it hosts are also faulty and can take actions to reduce the throughput of the protocol. Let p be the node on which the primary of the master protocol instance runs. The attack we perform is the following: (i) the clients (which are all faulty) send requests that can be verified by all nodes except the one on which the primary of the master protocol instance runs; (ii) the f faulty nodes flood p with invalid PROPAGATE messages of the maximal size; (iii) the faulty replicas of the master protocol instance flood the correct ones with invalid messages of the maximal size; (iv) the faulty replicas of the master protocol instance do not take part in the protocol.
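A correct node bounds the damage of such floods with a standard defensive pattern: it authenticates every incoming message before doing any further work, so each invalid PROPAGATE costs only one MAC verification. A sketch of this pattern (keys and message layout are hypothetical, not RBFT's exact code path):

```python
import hashlib
import hmac

# Hypothetical pairwise keys between this node and its peers.
PEER_KEYS = {1: b"k1" * 8, 2: b"k2" * 8, 3: b"k3" * 8}

def accept_propagate(sender: int, payload: bytes, mac: bytes) -> bool:
    """Verify the MAC first; a flood of invalid messages then costs one
    cheap check per message and touches no protocol state. Repeated
    failures from the same peer can additionally justify rate-limiting it."""
    expected = hmac.new(PEER_KEYS[sender], payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, mac):
        return False  # drop silently
    # ... hand the request over to the ordering protocol ...
    return True
```

Flooding therefore consumes network bandwidth and some CPU for verification, but cannot corrupt the ordering itself, which is what keeps the throughput loss small in the results below.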

We run this experiment with a request size ranging from 8B to 4kB. Figure 9 presents the throughput under attack relative to the throughput in the fault-free case, under both the static and dynamic loads described earlier. We observe that RBFT is robust: when f = 1, the throughput loss is below 2.2% under a static load and null under the dynamic load. When at most two faults are tolerated, i.e., f = 2, we observe that the throughput loss is even lower: at most 0.4%.

In order to illustrate the behavior of RBFT, we recorded the throughput measurements performed by the monitoring mechanism of the different nodes. Results are depicted in Figure 10 for the static workload, with a request size of 4kB, when at most one fault is tolerated (we observed similar results in other configurations). This figure first shows that each node measures the same throughput. Moreover, it shows that the throughput measured for the master protocol instance is very close to the throughput measured for the backup protocol instance (2% difference). Note that we do not represent the values reported by the monitoring module of node 3, as this is the faulty node in this experiment (it can thus report arbitrary values).

2) Worst-attack-2: In this attack, there are f faulty nodes and all clients are faulty. The primary of the master protocol instance is faulty (i.e., it runs on a faulty node). Faulty clients and nodes aim at decreasing the performance of the backup protocol instances, in order to give the faulty primary of the master protocol instance a margin to delay requests without being detected. Towards this purpose, the faulty nodes collude with the faulty clients and take the following actions: (i) the (faulty) clients send invalid requests to the correct nodes; (ii) the f faulty nodes flood the correct nodes with invalid messages of the maximal size and do not participate in the PROPAGATE phase; (iii) the replicas of the backup protocol instances executing on the f faulty nodes flood the correct ones with invalid messages of the maximal size, and do not take part in the protocol.

Fig. 12: Throughput measured by the different nodes under worst-attack-2 (f = 1, static workload, 4kB requests), for the master and backup protocol instances.

We run this experiment with a request size ranging from 8B to 4kB. Figure 11 presents the throughput under attack relative to the throughput in the fault-free case, for both the static and dynamic loads described earlier. The malicious primary of the master instance decreases its throughput by delaying the requests, down to the limit value such that the throughput ratio observed at the correct nodes remains greater than or equal to ∆. Indeed, a lower ratio observed at the correct nodes implies a protocol instance change. We again observe that RBFT is robust: the maximum throughput loss is below 3% when f = 1. It is even lower when f = 2: less than 1%.
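The margin available to a smart malicious primary can be made explicit. Assuming the monitored ratio is (master − backup) / master and must stay at or above a threshold ∆ (with ∆ < 0), solving the inequality gives the lowest throughput the primary can sustain undetected. This derivation is my reading of the mechanism, not a formula from the paper:

```python
def min_undetected_tput(backup_tput: float, delta: float) -> float:
    """Lowest master throughput m keeping (m - backup) / m >= delta.
    Rearranging: 1 - backup/m >= delta, hence m >= backup / (1 - delta)
    (valid for m > 0 and delta < 1)."""
    return backup_tput / (1 - delta)

# With a hypothetical delta of -0.02, a malicious primary can only shave
# about 2% off the backups' throughput before triggering an instance change.
assert abs(min_undetected_tput(100.0, -0.02) - 98.04) < 0.01
```

This matches the observation that degrading the backup instances is the attacker's only lever: the closer the backups stay to fault-free performance, the less room the malicious primary has.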

As for the first worst attack, we present in Figure 12 the throughput measurements performed by the monitoring mechanism running on the different nodes. Similarly to the previous attack, we observe that all nodes measure the same throughput. Moreover, we observe that the throughput measured for the master protocol instance is very close to the one measured for the backup instance. This explains why RBFT is robust.

3) Unfair primary: In this attack, the primary of the master instance is not fair and proposes an ordering for the requests of a given client less frequently than for the other clients. We show that the malicious primary cannot increase the latency of the requests of a single client beyond a small limit. We run an experiment with f = 1, 2 clients and requests of 4kB. The maximal acceptable latency, Λ, is 1.5ms. We also set a high value for Ω, the maximal acceptable difference between the average latency of a client on the different protocol instances. Consequently, we can easily observe that the primary of the master instance cannot increase the latency of a client beyond Λ.
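The two latency checks described here can be sketched as a single predicate: a node votes for a protocol instance change when a request's latency exceeds Λ, or when a client's average latency on the master instance drifts more than Ω above the backup instances. Only Λ = 1.5ms is given in the text; the Ω value below is a placeholder:

```python
LAMBDA_MS = 1.5   # maximal acceptable latency (from the experiment)
OMEGA_MS = 10.0   # placeholder for the per-client average-latency gap

def latency_violation(request_latency_ms: float,
                      avg_master_ms: float,
                      avg_backup_ms: float) -> bool:
    """True when either latency condition justifies a protocol instance
    change (a sketch of the two checks, not RBFT's actual code)."""
    return (request_latency_ms > LAMBDA_MS
            or (avg_master_ms - avg_backup_ms) > OMEGA_MS)

assert latency_violation(1.6, 0.8, 0.8)       # the 1.6ms request in Fig. 13
assert not latency_violation(1.3, 1.3, 0.8)   # delayed, but under both limits
```

Because Ω is set high in this experiment, the single-request bound Λ is the check that fires, as the following results show.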

Figure 13 presents the latency of each request for both clients. At the beginning, and for 500 requests, the malicious primary of the master protocol instance acts normally. The average latency is 0.8ms. Then it proposes an ordering for the requests of the first client less frequently, so that the average latency observed for this client is 1.3ms, during 500 more requests. At request 1000, it increases the latency observed for this client even more. This single request has a latency of 1.6ms. As it is higher than the maximal acceptable latency, the different nodes vote for a protocol instance change. As a result, the malicious primary of the master instance is evicted and replaced by a correct replica. This correct replica is fair and provides the same latency for the requests of both clients. Note that in this experiment the throughput monitoring mechanism did not trigger a protocol instance change, because the malicious primary of the master instance provided a throughput similar to that of the backup protocol instances.

Fig. 13: Ordering latencies for the requests of two clients on the master protocol instance with an unfair primary, which starts to delay the requests of one of the clients after 500 requests.

VIII. CONCLUSION

In this paper we have demonstrated that state-of-the-art robust BFT protocols are not effectively robust. We have shown that a malicious primary replica can drastically degrade performance. To tackle the robustness problem, we have proposed a new approach: RBFT, for Redundant Byzantine Fault Tolerance. The key idea in RBFT is to run several instances of a BFT protocol in parallel, and to monitor their performance in order to detect a malicious primary. We have shown that the performance of RBFT in the fault-free case is equivalent to the performance of the state-of-the-art robust BFT protocols. Moreover, we have shown that RBFT is more robust than existing protocols, as it bounds the throughput degradation caused by a malicious primary to 3%, even in drastic conditions where an unlimited number of malicious clients collude with the f malicious nodes to harm the system. Not only did we find a small performance loss for the case f = 1, but this loss is even smaller for f = 2. In future work, we plan to study how RBFT can be extended to deal with closed-loop systems, i.e., systems in which clients wait for replies to their previous requests before issuing new requests.

ACKNOWLEDGMENTS

We are grateful to the anonymous reviewers for the insightful comments that helped improve the paper. We would also like to thank F. Bongiovanni, R. Lachaize, B. Lepers and A. Pace for their helpful feedback on this work. This work was partially funded by the French ANR SocEDA project.

REFERENCES

[1] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. Fault-scalable Byzantine fault-tolerant services. In SOSP. ACM, 2005.

[2] Yair Amir, Brian Coan, Jonathan Kirsch, and John Lane. Byzantine replication under attack. In DSN, pages 105–114, 2008.

[3] Fatemeh Borran and Andre Schiper. Brief announcement: a leader-free Byzantine consensus algorithm. DISC'09, 2009.

[4] Miguel Castro and Barbara Liskov. Practical Byzantine fault tolerance. In OSDI, 1999.

[5] Allen Clement, Manos Kapritsos, Sangmin Lee, Yang Wang, Lorenzo Alvisi, Mike Dahlin, and Taylor Riche. UpRight cluster services. In SOSP, pages 277–290, New York, NY, USA, 2009. ACM.

[6] Allen Clement, Edmund Wong, Lorenzo Alvisi, Mike Dahlin, and Mirco Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In NSDI, pages 153–168, Berkeley, CA, USA, 2009. USENIX Association.

[7] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Rodrigues, and Liuba Shrira. HQ replication: a hybrid quorum protocol for Byzantine fault tolerance. In OSDI, 2006.

[8] Rui Garcia, Rodrigo Rodrigues, and Nuno Preguica. Efficient middleware for Byzantine fault tolerant database replication. EuroSys, 2011.

[9] Rachid Guerraoui, Nikola Knezevic, Vivien Quema, and Marko Vukolic. The next 700 BFT protocols. In EuroSys, pages 363–376, 2010.

[10] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: wait-free coordination for internet-scale systems. USENIX ATC, 2010.

[11] M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin. Eve: Execute-verify replication for multi-core servers. In OSDI 2012, Oct 2012.

[12] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine fault tolerance. ACM Trans. Comput. Syst., 27(4):1–39, 2009.

[13] Leslie Lamport. Lower bounds for asynchronous consensus, 2004.

[14] John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, and Lidong Zhou. Boxwood: abstractions as the foundation for storage infrastructure. OSDI, pages 8–8, 2004.

[15] Bianca Schroeder, Adam Wierman, and Mor Harchol-Balter. Open versus closed: a cautionary tale. NSDI, pages 18–18, Berkeley, CA, USA, 2006. USENIX Association.

[16] Paulo Sousa, Alysson Neves Bessani, Miguel Correia, Nuno Ferreira Neves, and Paulo Verissimo. Highly available intrusion-tolerant services with proactive-reactive recovery. IEEE Trans. Parallel Distrib. Syst., 21(4):452–465, April 2010.

[17] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, and Lau Cheuk Lung. Spin one's wheels? Byzantine fault tolerance with a spinning primary. In SRDS, pages 135–144, 2009.

[18] Timothy Wood, Rahul Singh, Arun Venkataramani, Prashant Shenoy, and Emmanuel Cecchet. ZZ and the art of practical BFT execution. EuroSys, 2011.

