Page 1: BFT: Speculative Byzantine Fault Tolerance With Minimum Cost


hBFT: Speculative Byzantine Fault Tolerance With Minimum Cost

Sisi Duan, Sean Peisert, and Karl Levitt

Abstract—We present hBFT, a hybrid, Byzantine fault-tolerant, replicated state machine protocol with optimal resilience. Under normal circumstances, hBFT uses speculation, i.e., replicas directly adopt the order from the primary and send replies to the clients. As in prior work such as Zyzzyva, when replicas are out of order, clients can detect the inconsistency and help replicas converge on the total ordering. However, we take a different approach than previous work that has four distinct benefits: it requires many fewer cryptographic operations, it moves critical jobs to the clients with no additional costs, faulty clients can be detected and identified, and performance in the presence of client participation will not degrade as long as the primary is correct. The correctness is guaranteed by a three-phase checkpoint subprotocol similar to PBFT, which is tailored to our needs. The protocol is triggered by the primary when a certain number of requests are executed or by clients when they detect an inconsistency.

Index Terms—Byzantine, fault tolerance, state machine replication, performance, speculation


1 INTRODUCTION

As distributed systems develop and grow in size, Byzantine failures generated by malicious attacks, and software and hardware errors, must be tolerated. Byzantine agreement protocols are attractive because they enhance the reliability of replicated services in the presence of arbitrary failures. However, Byzantine protocols come at the cost of high overhead in messages and cryptographic operations. Therefore, protocols that can reduce this overhead are attractive building blocks.

A number of existing protocols also reduce overhead on Byzantine agreement by moving some critical jobs to clients [13, 17, 19, 21, 33, 34]. But these protocols come with trade-offs that we seek to avoid. Specifically, while they all provide better fault-free cases and reduce the message complexity, they sacrifice the performance of normal cases and may even degrade the performance of fault-free cases. For instance, the Zyzzyva [21] protocol is able to use roughly half of the amount of messages and cryptographic operations that PBFT [7] requires. However, Zyzzyva's performance can be even worse than PBFT's if at least one backup fails. Additionally, these protocols simplify the design by involving clients in the agreement. However, they all require clients to be correct in order to achieve correctness.

Therefore, our motivation for developing a new protocol is to improve performance over PBFT without being encumbered by some of these trade-offs. Specifically, we have three key goals: first, we wish to show how critical jobs can be moved to the clients without additional costs. Second, we wish to tolerate Byzantine faulty clients. Third, we define the notion of normal cases, which means the primary is correct and the number of faulty backups does not exceed the threshold. We wish to provide better performance for both fault-free cases and normal cases.

This paper presents hBFT, a leader-based protocol that uses speculation to reduce the cost of Byzantine agreement protocols with optimal resilience, utilizing n ≥ 3f + 1 replicas to tolerate f failures. hBFT satisfies all of our stated goals. To accomplish this, hBFT employs several techniques. First, it uses speculation: backups speculatively execute requests ordered by the primary and reply to the clients. As a result, correct replicas may be temporarily inconsistent. hBFT employs a three-phase PBFT-like checkpoint subprotocol for both garbage collection and contention resolution. The checkpoint subprotocol can be triggered by the replicas when they execute a certain number of operations, or by clients when they detect a divergence of replies. In this way replicas are able to detect any inconsistency through internal message exchanges. Even though the three-phase protocol is expensive, it is not triggered frequently. Eventually hBFT can ensure the total ordering of requests for all correct replicas at very low cost.

1.1 Motivation

Our goal for hBFT is to offer better performance by moving some critical jobs to the clients while minimizing the side effects that can actually reduce performance in many cases in previous work [17, 21, 33, 34].

First, hBFT moves some critical jobs to the clients without additional costs. Moving critical jobs to the clients is effective in simplifying the design and reducing the message complexity, partly because replicas do not need to run expensive protocols to establish the order for every request. Nevertheless, it does not necessarily make protocols more practical. Indeed, it may sacrifice performance in normal cases or even in fault-free cases, e.g., the output commit in Zyzzyva slows down both. hBFT achieves a simplified design and better performance for both fault-free and normal cases.

Second, hBFT can tolerate an unlimited number of faulty clients. Previous protocols all rely on the correctness of clients. However, Byzantine clients can dramatically decrease performance. For instance, in the protocols that switch between subprotocols [17, 33, 34] (called abstracts in [17]), a faulty client can stay silent when it detects an inconsistency. Even if the next client is correct and makes the protocol switch to another abstract, replicas are still inconsistent because of this "faulty request". Similarly, in Zyzzyva, faulty clients can keep silent when they are supposed to send a commit certificate to make all correct replicas converge. Faulty primaries in this case cannot be detected, eventually leading to inconsistencies in replica states. Faulty clients can also intentionally send commit certificates to all replicas even if they receive 3f + 1 matching messages, which slows down performance.

Third, hBFT uses the same operations for both fault-free and normal cases. This shows that in leader-based protocols, when the primary is correct, all the requests are totally ordered by all correct replicas. Previous protocols all achieve impressive performance in fault-free cases but employ different operations when failures occur, resulting in lower performance. Although Zyzzyva5 [21] makes the faulty cases faster, it requires 5f + 1 replicas to tolerate f failures. In hBFT, we achieve better performance in both fault-free and normal cases using 3f + 1 replicas.

2 RELATED WORK

Fig. 1 compares several features of BFT protocols in normal cases, a selection of which are plotted in Fig. 9. We provide the values for fault-free cases in the caption for protocols that have different values. The table is constructed based on the models to tolerate f failures. We measure throughput using the number of MACs each replica performs, including both generation and verification. Latency is evaluated by the number of communication steps (the critical path), which refers to the number of one-way latencies. Since we can batch concurrent requests for only the agreement subprotocol (as discussed in Section ??), the number of cryptographic operations between replicas and the clients becomes the major obstacle to throughput, especially as f grows. Compared to other known prior work, hBFT achieves the lowest bound for almost every feature under high concurrency, and can handle faulty clients.

Most current practical Byzantine fault-tolerant protocols are developed based on PBFT [7], which is a three-phase leader-based protocol. Several subsequent works focus either on increasing the number of faults systems can tolerate or on improving performance. There are trade-offs between the two. For instance, FaB [26] is a two-phase PBFT-style protocol that is proved to tolerate f failures by requiring at least 5f + 1 replicas in total. The garbage collection in our protocol uses a tailored PBFT scheme, since it can guarantee correctness, but it is too expensive to be used for fault-free and normal cases when the primary is correct.

Several protocols [17, 19, 21, 33, 34] move some critical jobs to the clients to improve performance. Zyzzyva and its variant [19] move the output commit to the clients to reduce message complexity in fault-free cases. Other protocols [17, 33, 34] move the job of switching between subprotocols to the clients. When one subprotocol aborts, the protocol switches to another. hBFT also switches between normal case operation and the checkpoint subprotocol. However, hBFT does not order any single request using the checkpoint subprotocol; instead, it is used only for contention resolution and garbage collection. Clients can facilitate progress in hBFT, but clients do not need to provide any "proof" to replicas.

Byzantine quorum systems [1, 25] tolerate Byzantine faults under low concurrency. HQ [13] is a hybrid quorum and Byzantine agreement protocol that also uses a PBFT-like subprotocol to resolve contention. Compared to HQ, hBFT does not have an additional garbage collection scheme and it works well under high concurrency.

3 SYSTEM MODEL

We consider a distributed system that tolerates a maximum of f faulty replicas using 3f + 1 replicas, and an unlimited number of faulty clients. We consider the Byzantine fault-tolerant replication problem, where faulty replicas and clients behave arbitrarily. In addition, we assume independent node failures, which can be obtained through techniques such as N-version programming [28].

Safety, which means that requests are totally ordered by correct replicas, holds in any asynchronous system, where messages can be delayed, dropped, or delivered out of order. Liveness, which means that correct clients eventually receive replies to their requests, is ensured assuming partial synchrony [15]: synchrony holds only after some unknown global stabilization time, and the bounds on communication and processing delays are themselves unknown.

Operations are executed in an atomic broadcast model, where correct replicas agree on the set of requests and their order. In the description that follows, when we refer to fault-free cases, we mean there are no replica failures; when we refer to normal cases, we mean the primary is correct and the number of faulty backups is between 1 and f.
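For concreteness, the replica counts and the case definitions above can be sketched as follows (a minimal sketch; the helper names are ours, not the paper's):

```python
# Sketch of the fault/quorum arithmetic implied by the system model.
# Helper names (quorum_size, classify_case) are illustrative.

def total_replicas(f: int) -> int:
    """Replicas needed to tolerate f Byzantine faults: n = 3f + 1."""
    return 3 * f + 1

def quorum_size(f: int) -> int:
    """A quorum of 2f + 1 replicas always contains a majority of correct ones."""
    return 2 * f + 1

def classify_case(faulty_backups: int, primary_correct: bool, f: int) -> str:
    """'fault-free': no replica failures; 'normal': correct primary and
    1..f faulty backups; anything else falls outside the normal case."""
    if primary_correct and faulty_backups == 0:
        return "fault-free"
    if primary_correct and 1 <= faulty_backups <= f:
        return "normal"
    return "other"
```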

We use digital signatures, message authentication codes (MACs), and message digests to prevent spoofing and to detect corrupted messages. For a message m, 〈m〉i denotes the message with a digital signature signed by replica pi, D(m) denotes the message digest, and 〈m〉 denotes the message with a MAC µi,j(m). The MAC µi,j(m) is generated using a secret key shared by replicas pi and pj.

                               PBFT [7]      Q/U [1]   HQ [13]   FaB [26]      Zyzzyva [21]     hBFT
Total replicas                 3f + 1        5f + 1    3f + 1    5f + 1        3f + 1           3f + 1
Throughput (MAC ops/request)
  Primary                      2 + 8f/b      2 + 8f‖   4 + 4f‖   1 + 5f/b      4 + 5f + 3f/b†   2 + 3f/b
  Backup                       2 + (8f+1)/b  2 + 8f‖   4 + 4f‖   1 + (2f+2)/b  4 + 5f + 1/b∗    2 + 3f/b
  Client                       2 + 2f        2 + 8f    4 + 4f    1 + 5f        4 + 10f‡         2 + 6f
1-way latency (critical path)  4             2         4         3             5§               3
Works well under concurrency?  Yes           No        No        Yes           Yes              Yes
Handles faulty clients?        No            No        No        No            No               Yes

Fig. 1. Comparison of BFT protocols in normal cases tolerating f faults and using batch size b. †Fault-free cases: 2 + 3f/b. ∗Fault-free cases: 2 + 1/b. ‡Fault-free cases: 2 + 6f. §Fault-free cases: 3. ‖Q/U and HQ are leader-free quorum systems that do not differentiate primary and backups.

[Fig. 2. Layered structure of hBFT. The agreement subprotocol (2 phases) provides speculative execution and identical fault-free and normal cases. The checkpoint subprotocol (3 phases) provides garbage collection and contention resolution; it is entered when a replica executes a number of requests or a client sends 〈Panic〉. The view change subprotocol elects a new primary; it is entered when a replica times out. The primary's 〈New-View〉 message leads back into a checkpoint, and once the checkpoint is done the system resumes agreement.]

4 THE hBFT PROTOCOL

The hBFT protocol is a hybrid, replicated state machine protocol. It includes four major components: (1) agreement, (2) checkpoint, (3) view change, and (4) client suspicion. As illustrated in Fig. 2, we employ a simple agreement protocol for fault-free and normal cases, and use a three-phase checkpoint subprotocol for contention resolution as well as garbage collection. The checkpoint subprotocol can be triggered by replicas when they execute a certain number of requests or by clients when they detect a divergence of replies. The view change subprotocol ensures the liveness of the system and coordinates the change of the primary. View changes can occur during normal operations or in the checkpoint subprotocol. In both cases, the new primary initializes a checkpoint subprotocol immediately and resumes the agreement protocol once a checkpoint becomes stable. The client suspicion subprotocol prevents faulty clients from attacking the system.

[Fig. 3. Fault-free and Normal Cases of Zyzzyva. (a) Fault-free case: the client sends a request (step 1), the primary orders it (step 2), and all replicas reply speculatively (step 3); the client completes the request after 3f + 1 matching replies. (b) Normal case: after collecting between 2f + 1 and 3f matching replies, the client sends a commit certificate carrying 2f + 1 signatures to all replicas and completes after 2f + 1 acknowledgments.]

Why another speculative BFT protocol?

hBFT uses speculation in a way that overcomes some problems Zyzzyva experiences. Zyzzyva [21] also uses speculation, and it moves the output commit to the clients to enhance performance. If we replace digital signatures with MACs and batch concurrent requests in Zyzzyva, performance degrades in normal cases and even in fault-free cases. Fig. 3 illustrates the behavior of Zyzzyva [21]. Replicas speculatively execute requests and respond to the client. The client collects 3f + 1 matching responses to complete the request. If the client receives between 2f + 1 and 3f matching responses, it sends a commit certificate to all replicas, which contains the response with 2f + 1 signatures. This helps replicas converge on the total ordering. However, a commit certificate must be verified by every other replica, which causes computational overhead for both clients and replicas. The use of MACs instead of digital signatures makes Zyzzyva perform even worse than PBFT under certain configurations¹. For a reply message r by replica pi, 〈r′, µi,c(r′)〉 must be sent to the client, where r′ = 〈r, µi,1(r), µi,2(r), · · · , µi,n(r)〉 and µx,y(r) denotes the MAC generated using the secret key shared by px and py. Therefore, every replica must include 3f + 1 MACs in every reply message (compared with 1 if digital signatures are used), and performance is dramatically degraded. Assuming b is the batch size, the primary must perform 4 + 5f + 3f/b MACs per request in normal cases, which is even worse than the 2 + 8f/b MACs of PBFT for some b and f. Thus in hBFT, we seek to avoid this problem.

1. The use of MACs instead of digital signatures makes protocols much faster. As mentioned in Aardvark [11], on a 2.0 GHz Pentium-M, OpenSSL 0.9.8g can compute over 500,000 MACs per second for 64-byte messages, but it can only verify 6,455 1024-bit RSA signatures per second or produce 309 1024-bit RSA signatures per second.

[Fig. 4. The agreement protocol: the client sends a request to the replicas (1), the primary sends 〈Prepare〉 to all replicas (2), and each replica sends 〈Commit〉 to the other replicas and a reply to the client (3).]
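The per-request MAC counts just quoted can be compared directly (a small sketch using the formulas in this section; the function names are ours):

```python
# Primary-side MAC operations per request in the normal case, batch size b,
# per the formulas quoted in the text. Function names are illustrative.

def zyzzyva_primary_macs(f: int, b: int) -> float:
    return 4 + 5 * f + 3 * f / b

def pbft_primary_macs(f: int, b: int) -> float:
    return 2 + 8 * f / b

# With batching, Zyzzyva's client-facing MACs dominate: for f = 1, b = 10,
# the primary performs 9.3 MACs/request in Zyzzyva versus 2.8 in PBFT.
```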

4.1 Agreement Protocol

The agreement protocol orders requests for execution by replicas. The algorithms for the primary, backups, and clients are given in Algorithms 1 to 3. As illustrated in Fig. 4, a client c invokes an operation by sending m = 〈Request, o, t, c〉c to all replicas, where o is the operation and t is the local timestamp. Upon receiving a request, the primary pi assigns a sequence number seq and then sends a 〈Prepare, v, seq, D(m), m, c〉 to all replicas, where v is the view number and D(m) is the message digest.

A 〈Prepare〉 message will be accepted by a backup pj provided that:

- it verifies the MAC;
- the message digest is correct;
- it is in view v;
- seq = seql + 1, where seql is the sequence number of its last accepted request;
- it has not accepted a 〈Prepare〉 message of the same view and sequence number but a different request.

If a backup pj accepts the 〈Prepare〉 message, it speculatively executes the operation and sends a reply message 〈Reply, v, t, seq, δseq, c〉 to c and also a commit message 〈Commit, v, seq, δseq, m, D(m), c〉 to all other replicas except the primary, where δseq contains the speculative execution history.
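The acceptance checks above can be sketched as follows (a hypothetical sketch: the message layout and the `verify_mac` and `digest` hooks are illustrative assumptions, not the paper's code):

```python
# Sketch of a backup's <Prepare> acceptance checks. The message is modeled
# as a dict; verify_mac and digest are caller-supplied hooks (assumptions).

def accept_prepare(msg, state, verify_mac, digest) -> bool:
    """state holds the current view, the last accepted sequence number
    seq_l, and a log of accepted (view, seq) -> request digest."""
    if not verify_mac(msg):                       # the MAC verifies
        return False
    if digest(msg["request"]) != msg["digest"]:   # the digest is correct
        return False
    if msg["view"] != state["view"]:              # the message is in view v
        return False
    if msg["seq"] != state["seq_l"] + 1:          # seq = seq_l + 1
        return False
    prior = state["accepted"].get((msg["view"], msg["seq"]))
    if prior is not None and prior != msg["digest"]:  # no conflicting <Prepare>
        return False
    return True
```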

In order to verify the correctness of the speculatively executed request, a replica collects 2f + 1 matching 〈Commit〉 messages from other replicas to complete a request. If a replica receives f + 1 matching 〈Commit〉 messages from different replicas but has not accepted any 〈Prepare〉 message, it also speculatively executes the operation, sends a 〈Commit〉 message to all replicas, and sends a reply to the corresponding client. When the replica collects 2f + 1 matching messages, it puts the corresponding request in its speculative execution history. However, it is possible that a replica receives f + 1 matching 〈Commit〉 messages from other replicas that differ from its accepted 〈Prepare〉 message. Under such circumstances, the replica can simply send a 〈View-Change〉 message to all replicas. If a replica votes for a view change, it stops receiving any messages except the 〈New-View〉 and checkpoint messages. See Section 4.3 for the details of the view change subprotocol. This is to ensure that if at least f + 1 correct replicas speculatively execute a request, all the correct replicas learn the result. If any other correct replicas receive inconsistent messages, the primary must be faulty and the replicas stop receiving messages until a view change occurs.

A client sets a timeout for each request. If it gathers 2f + 1 matching speculative replies from different replicas before the timeout expires, it completes the request. If it receives fewer than f + 1 matching replies before the timeout expires, it retransmits the request. Otherwise, when contention occurs (the client receives between f + 1 and 2f + 1 matching replies before the timeout expires), the client can facilitate progress by sending a 〈PANIC, D(m), t, c〉c message to all replicas. If a replica receives a 〈PANIC〉 message, it forwards the message to all replicas. If a replica does not receive a 〈PANIC〉 message from the client but receives one from other replicas, it forwards the 〈PANIC〉 message to all replicas. A 〈PANIC〉 message is valid if the replica has speculatively executed m. If a replica accepts a 〈PANIC〉 message, it stops receiving any messages except the view change and checkpoint messages. If the primary does not initialize the checkpoint subprotocol, the replica votes for a view change. The forwarding of 〈PANIC〉 messages aims at two goals. On the one hand, it prevents the checkpoint protocol from happening too frequently: all the correct replicas receive the 〈PANIC〉 message before the checkpoint protocol is triggered. On the other hand, it prevents clients from attacking the system by sending a 〈PANIC〉 message to only a portion of the replicas. If a faulty client sends a 〈PANIC〉 message to a correct backup, that replica would stop receiving any messages while the other replicas continue the agreement protocol. The forwarding mechanism ensures that if at least one correct replica receives the 〈PANIC〉 message, all the replicas receive it and enter the checkpoint protocol.
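The client's three-way decision at the timeout can be summarized as follows (a sketch; the function name and labels are ours, and we read "between f + 1 and 2f + 1" as the range that neither completes nor forces retransmission):

```python
# Sketch of the client's reply handling per the thresholds above.
# Illustrative names; the decision boundaries follow the text.

def client_action(matching_replies: int, f: int) -> str:
    if matching_replies >= 2 * f + 1:
        return "complete"      # enough matching speculative replies
    if matching_replies < f + 1:
        return "retransmit"    # too few replies: resend the request
    return "panic"             # contention: send <PANIC> to all replicas
```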

The primary initializes the checkpoint subprotocol if it receives a 〈PANIC〉 message from the client or 2f + 1 〈PANIC〉 messages from other replicas. The correctness of the protocol is therefore guaranteed by the three-phase checkpoint subprotocol.

The panic mechanism facilitates progress when the primary is faulty. Specifically, in a partial synchrony model where the value of the client's timeout is properly set up, if a correct client does not receive enough matching replies before the timer expires, the primary has either sent inconsistent 〈Prepare〉 messages to the replicas or failed to send consistent messages to the replicas. In this case, instead of using the traditional approach where replicas detect the fault themselves by waiting for a longer period of time, the client can directly trigger the checkpoint protocol in order to verify the correctness of the primary. See Section 4.2 for the details of the checkpoint subprotocol.

Algorithm 1 Primary
1: Initialization:
2:   A {All replicas}
3:   seq ← 0 {Sequence number}
4: on event 〈Request, o, t, c〉c
5:   seq ← seq + 1
6:   send 〈Prepare, v, seq, D(m), m, c〉 to A
7:   send 〈Reply, v, t, seq, δseq, c〉 to c

Algorithm 2 Backup
1: Initialization:
2:   A {All replicas}
3:   cnt ← 0 {Counter of 〈Commit〉 messages}
4:   seqi ← 0 {Sequence number}
5: on event 〈Request, o, t, c〉c
6:   send 〈Request, o, t, c〉c to the primary
7: on event 〈Prepare, v, seq, D(m), m, c〉
8:   if seq = seqi + 1 then
9:     seqi ← seq
10:    send 〈Commit, v, seq, δseq, m, D(m), c〉 to A
11:    send 〈Reply, v, t, seq, δseq, c〉 to c
12: on event 〈Commit, v, seq, δseq, m, D(m), c〉
13:   cnt ← cnt + 1
14:   if cnt = f + 1 and seq = seqi + 1 then
15:     seqi ← seq
16:     send 〈Commit, v, seq, δseq, m, D(m), c〉 to A
17:     send 〈Reply, v, t, seq, δseq, c〉 to c
18:   if cnt = 2f + 1 then
19:     cnt ← 0 {Complete the request}

hBFT guarantees correctness with only two phases. If the client has received 2f + 1 matching replies, at least f + 1 correct replicas received a consistent order from the primary. Therefore, all correct replicas receive at least f + 1 matching 〈Commit〉 messages. If those replicas have not received the 〈Prepare〉 message, they will execute the request. Otherwise, if they detect an inconsistency, they stop receiving any messages until the current primary is replaced or the checkpoint subprotocol is triggered. In the latter case, the inconsistency will be exposed and fixed in the checkpoint subprotocol.

Algorithm 3 Client
1: Initialization:
2:   A {All replicas}
3:   cnt ← 0 {Counter of reply messages}
4:   send 〈Request, o, t, c〉c to A
5:   start(∆) {Start a timer}
6: on event 〈Reply, v, t, seq, δseq, c〉
7:   cnt ← cnt + 1
8:   if cnt = 2f + 1 then
9:     cnt ← 0 {Complete the request}
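As a sanity check of the counting in Algorithms 1 to 3, the fault-free flow can be simulated in a single process (a toy sketch with f = 1; messages are delivered synchronously and losslessly, and we additionally assume a replica counts its own speculative execution as one matching 〈Commit〉, which the algorithms leave implicit):

```python
# Toy single-process simulation of one fault-free round of Algorithms 1-3
# with f = 1 (n = 4: one primary, three backups). Illustrative sketch only.

F = 1
QUORUM = 2 * F + 1          # 2f + 1 matching messages complete a request

class Client:
    def __init__(self):
        self.replies = 0
        self.completed = False

    def on_reply(self):
        self.replies += 1
        if self.replies == QUORUM:
            self.completed = True    # Algorithm 3: complete the request

class Backup:
    def __init__(self):
        self.seq = 0                 # last accepted sequence number
        self.commits = 0             # counter of matching <Commit> messages
        self.request_done = False

    def on_prepare(self, seq, peers, client):
        if seq == self.seq + 1:      # Algorithm 2, line 8: accept and execute
            self.seq = seq
            self.on_commit()         # assumption: count own matching commit
            for b in peers:
                if b is not self:
                    b.on_commit()    # broadcast <Commit> to the other backups
            client.on_reply()

    def on_commit(self):
        self.commits += 1
        if self.commits == QUORUM:
            self.request_done = True  # request enters the execution history

def run_round():
    client = Client()
    backups = [Backup() for _ in range(3)]
    client.on_reply()                # the primary executes and replies too
    for b in backups:
        b.on_prepare(1, backups, client)
    return client, backups
```

Running one round, the client completes after 2f + 1 = 3 matching replies and every backup reaches its 2f + 1 commit quorum.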

4.2 Checkpoint

We use a three-phase PBFT-like checkpoint protocol. The reasons are three-fold. First, the agreement protocol uses speculative execution, so replicas may be temporarily out of order; the three-phase checkpoint protocol resolves the inconsistencies. Second, if a correct client triggers the checkpoint protocol through the panic mechanism, the checkpoint protocol resolves the inconsistencies immediately. Third, the checkpoint protocol detects the behavior of faulty clients if they intentionally trigger it.

The checkpoint protocol works as follows. Only the primary can initialize the checkpoint subprotocol, which is initiated under either of two conditions:

- the primary executes a certain number of requests;
- the primary receives a 〈PANIC〉 message from the client or receives 2f + 1 forwarded 〈PANIC〉 messages from other replicas.

In the latter condition, as mentioned in Section 4.1, when a replica receives a valid 〈PANIC〉 message, it forwards it to all replicas. The goal is to ensure that all the replicas receive the 〈PANIC〉 message, and to prevent faulty clients from sending a 〈PANIC〉 message to only the backups so that replicas would suspect the primary even if it is correct.

The three-phase checkpoint subprotocol works as follows: the current primary pi sends a 〈Checkpoint-I, seq, D(M)〉 to all replicas, where seq is the sequence number of the last executed operation and D(M) is the message digest of the speculative execution history M. Upon receiving a well-formatted 〈Checkpoint-I〉 message, a replica sends a 〈Checkpoint-II, seq, D(M)〉 to all replicas. If the digest and execution history do not match its local log, the replica instead sends a 〈View-Change〉 message directly to all replicas and stops receiving any messages other than the 〈New-View〉 message.

2f + 1 matching 〈Checkpoint-II〉 messages from different replicas form a certificate, denoted by CER1(M, v). Any replica pj that has the certificate sends a 〈Checkpoint-III, seq, D(M)〉j to all replicas. Similarly, 2f + 1 〈Checkpoint-III〉 messages form a certificate, denoted by CER2(M, v). After collecting CER2(M, v), the checkpoint becomes stable. All the previous checkpoint messages, as well as 〈Prepare〉, 〈Commit〉, 〈Request, o, t, c〉c, and 〈Reply〉 messages with smaller sequence numbers than the checkpoint, are deleted.
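The certificate assembly and garbage collection just described can be sketched as follows (illustrative function names; the message layout is an assumption):

```python
# Sketch of checkpoint certificate assembly: 2f + 1 matching <Checkpoint-II>
# messages form CER1(M, v), 2f + 1 matching <Checkpoint-III> messages form
# CER2(M, v), after which the checkpoint is stable and old state is dropped.

def certificate(messages, f):
    """messages: iterable of (sender, seq, digest) tuples. Returns the
    (seq, digest) pair backed by at least 2f + 1 distinct senders, or None."""
    votes = {}
    for sender, seq, digest in messages:
        votes.setdefault((seq, digest), set()).add(sender)
    for key, senders in votes.items():
        if len(senders) >= 2 * f + 1:
            return key
    return None

def garbage_collect(log, stable_seq):
    """Once a checkpoint is stable, delete logged messages with sequence
    numbers smaller than the checkpoint's."""
    return [(seq, m) for seq, m in log if seq >= stable_seq]
```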

If a view change occurs in the checkpoint subprotocol, as described in Section 4.3, the new primary initializes a checkpoint immediately after the 〈New-View〉 message. The same three-phase checkpoint subprotocol continues until one checkpoint is completed and the system stabilizes.

4.3 View Changes

The view change subprotocol elects a new primary. By default, the primary has id p = v mod n, where n is the total number of replicas and v is the current view number. View changes may take place in the checkpoint subprotocol or during normal operations. In both cases, the new primary reorders requests using a 〈New-View〉 message and then initializes a checkpoint immediately. The checkpoint subprotocol continues until one checkpoint is committed.

A 〈View-Change, v + 1, P, Q, R〉i message will be sent by a replica if any of the following conditions is true, where P contains the execution history M from the CER1(M, v) the replica collected in the previous view v, Q denotes the execution history from the accepted 〈Checkpoint-I〉 message, and R denotes the speculatively executed requests with sequence numbers greater than its last accepted checkpoint:

- It starts a timer for the first request in the queue, and the request is not executed before the timer expires;
- It starts a timer after collecting f + 1 〈PANIC〉 messages, and it has not received any checkpoint messages before the timer expires;
- It starts a timer after it executes a certain number of requests, and it has not received any checkpoint messages before the timer expires;
- It receives f + 1 valid 〈View-Change〉 messages from other replicas.

Timers with different values are set for each case andare reset periodically.
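The four vote conditions above can be collapsed into a single predicate (a sketch; the parameter names are ours, and the expired timers are assumed to be tracked elsewhere):

```python
# Sketch of the view-change vote decision; each argument mirrors one of the
# four conditions listed above. Illustrative names.

def should_vote_view_change(request_timer_expired: bool,
                            panic_timer_expired: bool,
                            checkpoint_timer_expired: bool,
                            view_change_msgs: int,
                            f: int) -> bool:
    return (request_timer_expired          # queued request not executed in time
            or panic_timer_expired         # f+1 <PANIC>s seen, no checkpoint
            or checkpoint_timer_expired    # executed enough requests, no checkpoint
            or view_change_msgs >= f + 1)  # join f+1 other view-change votes
```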

When the new primary pj receives 2f 〈View-Change〉 messages, it constructs a 〈New-View〉 message to order all the speculatively executed requests. The system then moves to a new view. The principle is that any request committed by the clients must be committed by all correct replicas. The new primary picks an execution history M from the P components and a set of requests from the R components of the checkpoint messages. To select a speculative execution history M, there are two rules:

A. If some correct replica has committed on a checkpoint that contains execution history M, M must be selected, provided that:
   A1. at least 2f + 1 replicas have CER1(M, v);
   A2. at least f + 1 replicas have accepted 〈Checkpoint-I〉 in a view v′ > v.
B. If at least 2f + 1 replicas have empty P components, then the new primary uses its last stable checkpoint.

Similarly, for each sequence number greater than that of the execution history M and smaller than the largest sequence number in the R components of the checkpoint messages, the primary assigns a request according to R. A request m is chosen if at least f + 1 replicas include it in the R of their checkpoint messages. Otherwise, NULL is chosen. We claim that it is impossible for f + 1 replicas to include one request m while another f + 1 replicas include a different request m′ with the same sequence number. Namely, if f + 1 replicas include a request m, at least one correct replica received 2f + 1 〈Commit〉 messages for m. Similarly, at least one correct replica received 2f + 1 〈Commit〉 messages for m′. The two quorums intersect in at least one correct replica, which must have sent both a 〈Commit〉 message with m and a 〈Commit〉 message with m′, a contradiction.
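The quorum-intersection arithmetic behind this contradiction can be checked directly (a sketch):

```python
# Two quorums of 2f + 1 replicas out of n = 3f + 1 always overlap in at
# least 2(2f + 1) - (3f + 1) = f + 1 replicas; since at most f of those can
# be faulty, the overlap always contains a correct replica.

def min_quorum_overlap(f: int) -> int:
    n = 3 * f + 1
    quorum = 2 * f + 1
    return 2 * quorum - n

def overlap_contains_correct(f: int) -> bool:
    # At most f of the overlapping replicas can be faulty.
    return min_quorum_overlap(f) > f
```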

The execution history M and the set of requests form M′, which is composed of the requests with sequence numbers between the last stable checkpoint and the largest sequence number used by at least one correct replica. The new primary then sends a 〈New-View, v + 1, V, X, M′〉j message to all replicas, where V contains f + 1 valid 〈View-Change〉 messages and X contains the selected checkpoint. The replicas then run the checkpoint subprotocol on M′, which continues until a checkpoint is committed.
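As an illustrative sketch of these selection rules (the dictionary encoding of the 〈View-Change〉 contents and all function names are our own simplification, not the paper's message format):

```python
from collections import Counter

NULL = None  # ordered when no request gathers f+1 votes for a sequence number

def select_checkpoint(msgs, f, last_stable):
    """Pick the speculative execution history M from the P components.
    Rule A: M is selected if >= 2f+1 messages report CER1(M, v) and
    >= f+1 replicas accepted a Checkpoint-I for M in a later view.
    Rule B: if >= 2f+1 P components are empty, keep the last stable
    checkpoint."""
    cer1 = Counter(m["P"] for m in msgs if m["P"] is not None)
    later = Counter(m["later"] for m in msgs if m["later"] is not None)
    for hist, n in cer1.items():
        if n >= 2 * f + 1 and later[hist] >= f + 1:
            return hist                                     # rule A
    if sum(1 for m in msgs if m["P"] is None) >= 2 * f + 1:
        return last_stable                                  # rule B
    return last_stable

def select_requests(msgs, f, low, high):
    """For each sequence number in (low, high], order request m iff at
    least f+1 replicas include it in their R component, else NULL."""
    chosen = {}
    for seq in range(low + 1, high + 1):
        votes = Counter(m["R"][seq] for m in msgs if seq in m["R"])
        top = votes.most_common(1)
        chosen[seq] = top[0][0] if top and top[0][1] >= f + 1 else NULL
    return chosen
```

By the quorum-intersection argument above, at most one request can gather f + 1 votes for a given sequence number, so the choice made by `select_requests` is unambiguous.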

4.4 Client Suspicion

Faulty clients may degrade the performance of the system, especially in protocols that move critical jobs to the clients. In hBFT, an unlimited number of faulty clients can be detected. We focus on the unfaithful but “legal” messages a faulty client can craft to slow down performance or cause incorrectness. Specifically, a faulty client can do the following:

- It sends inconsistent requests to different replicas. The primary may not be able to order “every” request before the timeout expires; in this case, a correct primary may be removed.
- It intentionally sends 〈PANIC〉 messages while there is no contention. An unnecessary checkpoint subprotocol will be triggered, which slows down performance. Even if the client only triggers “valid” checkpoint operations, frequent triggering still degrades overall throughput.
- It does not send 〈PANIC〉 messages when it receives divergent replies, which leaves replicas temporarily inconsistent.

The client suspicion subprotocol in hBFT focuses on the first two behaviors. If the third occurs, the checkpoint subprotocol can be triggered by the next correct client that detects divergent replies, or by the primary after replicas execute a certain number of requests.


To solve the first problem, we ask clients to multicast each request to all replicas, and every replica forwards the request to the primary. The primary orders a request if it receives the request directly or if it receives f + 1 matching requests forwarded by backups. If a replica pi receives a 〈Prepare〉 message with a request that is not in its queue, it still executes the operation. Nevertheless, such faulty client behavior is recorded as suspicious, and if the number of suspicious incidents from the same client exceeds a certain threshold, pi sends a 〈Suspect, c〉i message to all replicas.
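A minimal sketch of this ordering rule, assuming requests are identified by digests (the class and method names are hypothetical):

```python
from collections import defaultdict

class PrimaryInbox:
    """The primary orders a request once it receives it directly from the
    client, or once f+1 distinct backups forward matching copies of it."""

    def __init__(self, f):
        self.f = f
        self.direct = set()                 # digests received straight from clients
        self.forwarders = defaultdict(set)  # digest -> backups that forwarded it

    def on_client_request(self, digest):
        self.direct.add(digest)

    def on_forward(self, digest, backup_id):
        self.forwarders[digest].add(backup_id)

    def can_order(self, digest):
        # Order if received directly, or if f+1 distinct backups forwarded it.
        return digest in self.direct or len(self.forwarders[digest]) >= self.f + 1
```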

Another reason clients send their requests to all replicas is that sending requests only to the primary has several drawbacks2. For instance, a faulty primary can delay any request, whether received from the client or from other replicas, which eventually makes all clients multicast their requests to all replicas and experience long latency. A faulty primary can also mount performance attacks such as the timeout manipulation discussed in previous work [2, 11, 29]. Furthermore, it is difficult for clients to keep track of the current primary. If a client sends its request to a faulty backup, the backup can simply ignore the request even though it is supposed to forward it to the primary. All these problems shift the correctness of such protocols onto the detection of faulty replicas.

For the second problem, where a faulty client intentionally sends a 〈PANIC〉 message to the replicas to trigger the checkpoint subprotocol, the protocol naturally detects the faulty behavior. Intuitively, if the request is committed in both the agreement protocol and the checkpoint protocol without a view change, the client can be suspected. Nevertheless, a correct client might be suspected as well. For instance, the following two cases are indistinguishable:

- The replicas are correct and reach agreement in the agreement protocol. When they receive the 〈PANIC〉 message from a faulty client, the request is committed without a view change and the client is suspected.
- The primary is faulty and the client is correct. The primary sends the request to f + 1 correct replicas and a fake request to the remaining f correct replicas, which therefore do not execute the request. When the replicas receive the 〈PANIC〉 message and start the checkpoint protocol, the f faulty replicas collude to get the request committed. Although the f correct replicas learn the result and remain consistent, the correct client will be suspected.

2. In some Byzantine agreement protocols, clients send requests only to their known primary. If a backup receives the request, it forwards the request to the primary, expecting the request to be executed. The client sets a timeout for each request it has. If it does not receive a sufficient number of matching responses before the timeout expires, it retransmits the request to all replicas.

To distinguish the above two cases, we modify the agreement protocol by simply replacing the MAC of the 〈Prepare〉 message with a digital signature; we call the result Almost-MAC-agreement. When a replica sends a 〈Commit〉 message, it appends the 〈Prepare〉 message. If a replica does not receive a valid 〈Prepare〉 message from the primary but receives one from other replicas, it still executes the requests, sends 〈Commit〉 messages to the other replicas, and sends a reply to the client. Otherwise, if a replica receives two valid and conflicting 〈Prepare〉 messages, it directly sends the inconsistent messages to all replicas and votes for a view change. As proved in Claim 2, the protocol guarantees that correct clients will not be removed. This optimization also solves the problem discussed in Section 5.1.
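The conflict check that makes signed 〈Prepare〉 messages useful as transferable evidence can be sketched as follows (the dictionary layout of a 〈Prepare〉 is our simplification):

```python
def conflicting(p1, p2):
    """Two valid Prepare messages conflict when they bind different request
    digests to the same (view, sequence) slot. Because Prepares in
    Almost-MAC-agreement are signed rather than MACed, the conflicting pair
    is proof of the primary's misbehavior that any replica can verify."""
    return (p1["view"], p1["seq"]) == (p2["view"], p2["seq"]) \
        and p1["digest"] != p2["digest"]

def on_prepare(log, p):
    """Sketch of a replica's reaction: store the Prepare; if the slot was
    already filled differently, vote for a view change and forward both
    messages as evidence."""
    slot = (p["view"], p["seq"])
    if slot in log and conflicting(log[slot], p):
        return "view-change", (log[slot], p)  # broadcast the evidence
    log.setdefault(slot, p)
    return "ok", None
```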

The modification of the agreement protocol results in 2 + 1(sig)/b cryptographic operations for the primary. To reduce the overall number of cryptographic operations, hBFT switches between the agreement protocol and Almost-MAC-agreement after executing a certain number of requests.

A client can only be suspected while replicas are running Almost-MAC-agreement, and it must be suspected by 2f + 1 replicas to be removed. If the number of suspicious incidents exceeds a certain threshold, a replica suspects the client and sends a 〈Suspect〉 message to all replicas. Similar to the view change subprotocol, if a replica receives f + 1 〈Suspect〉 messages, it generates its own 〈Suspect〉 message and sends it to all replicas. If a replica receives 2f + 1 〈Suspect〉 messages, indicating that at least one correct replica suspects the client, the client can be prevented from accessing the system in the future.
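The two thresholds can be captured directly (the function names are ours):

```python
def should_echo_suspect(num_suspects, f):
    """Among f+1 Suspect messages at least one sender is correct, so a
    replica that sees f+1 of them multicasts its own Suspect message."""
    return num_suspects >= f + 1

def client_removed(num_suspects, f):
    """2f+1 Suspect messages leave at least f+1 correct suspecters even
    after discounting the f possibly faulty senders, so the client can be
    barred from the system."""
    return num_suspects >= 2 * f + 1
```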

Worst Case Scenario. We analyze the worst case, in which a correct client can be suspected, mainly due to network failure. It happens if either of the following is true:

(1) The request from the client fails to reach f + 1 correct backups before the backups receive the 〈Prepare〉 message. In this case, since the f + 1 correct backups have not received the request carried in the 〈Prepare〉 message, they will suspect the client.

(2) Reply message(s) from correct replica(s) fail to reach the client before the timeout expires. Since the client does not receive 2f + 1 matching replies before the timeout expires, it sends 〈PANIC〉 messages while there is no contention.

The latter condition may occur because the timeout value is inappropriate for the network conditions, or due to an attack by the primary. For instance, a faulty primary can intentionally delay the 〈Prepare〉 message for some correct replicas, making correct clients send 〈PANIC〉 messages while the replicas are “consistent”. However, if the value of the timeout is appropriately set, as proved in Claim 2, a correct client will not be removed. We set a large enough timeout for clients so that if at least f + 1 correct replicas execute a request, all the replicas will send reply messages to the client before its timer expires.

4.5 Correctness

In this section, we sketch proofs for the safety and liveness properties of hBFT under optimal resilience. For simplicity, we assume there are 3f + 1 replicas.

4.5.1 Safety

Theorem 1 (Safety): If requests m and m′ are committed at two correct replicas pi and pj, then m is committed before m′ at pi if and only if m is committed before m′ at pj.

Proof: The proof proceeds as follows. We first prove the correctness of the checkpoint subprotocol, which follows the correctness of PBFT, as shown in Claim 1. We then prove the theorem based on this claim.

Claim 1 (Safety of Checkpoint): The checkpoint subprotocol guarantees the safety property.

Proof: We now prove that if checkpoints M and M′ are committed at two correct replicas pi and pj in the checkpoint subprotocol, whether in the same view or across views, then M = M′.

(Within one view) If pi and pj both commit in view v, then pi has collected CER2(M, v), which indicates that at least f + 1 correct replicas sent 〈Checkpoint-III〉 for M. Similarly, pj has CER2(M′, v), which indicates that at least f + 1 correct replicas sent 〈Checkpoint-III〉 for M′. Excluding the f faulty replicas, if M and M′ are different, at least one correct replica has sent two conflicting messages for M and M′, which contradicts our assumption. Therefore, M = M′.

(Across views) If M is committed at pi in view v and M′ is committed at pj in view v′ > v, then M = M′. If M′ is committed in view v′, then either condition A or B must be true in the construction of the 〈New-view〉 message in view v′ (see Section 4.3). However, if M is committed at pi in view v, pi has CER2(M, v), which indicates that at least f + 1 correct replicas have CER1(M, v) and M in their P components. Therefore, condition B cannot be true. For condition A, pj can commit on M′ in view v′ only if both A1 and A2 are true. A2 can be true if a faulty replica sends a 〈View-Change〉 message that includes 〈M′, D(M′), v1〉, where v < v1 < v′. However, condition A1 requires that at least f + 1 correct replicas have CER1(M′, v′). Since at least f + 1 correct replicas have CER1(M, v), they will not accept M′ in any later view, so at least one correct replica must have sent conflicting message(s), which contradicts our assumption. Therefore, M = M′.

To prove Theorem 1, we first show that if two requests m and m′ are committed at correct replicas pi and pj with the same sequence number, then m equals m′; we show this both within the same view and across views. We then show that if m1 is committed before m2 at pi, then m1 is committed before m2 at pj.

- (Within the same view) There are three cases: both requests are committed in the agreement subprotocol, both are committed in the checkpoint subprotocol, or one of them is committed in the checkpoint subprotocol. In each case, if m is committed at pi, pi receives 2f + 1 〈Commit〉 messages (if the request is committed in the agreement protocol) or 2f + 1 checkpoint messages as a certificate (if the request is committed in the checkpoint protocol). On the other hand, if m′ is committed at pj, pj receives 2f + 1 〈Commit〉 messages or 2f + 1 checkpoint messages. The two quorums intersect in at least one correct replica, which must have sent inconsistent messages, a contradiction. Therefore, m equals m′.
- (Across views) If m is committed at replica pj, 2f + 1 replicas sent 〈Commit〉 messages, so at least f + 1 correct replicas accepted m and will include it in their 〈View-Change〉 messages. On every view change, the new primary initiates a checkpoint subprotocol through the 〈New-view〉 message so that the same order of requests is committed at all correct replicas. The correctness follows from Claim 1.

Then we show that if m1 is committed before m2 at pi, m1 is committed before m2 at pj. If a request is committed at a correct replica, 2f + 1 replicas sent 〈Commit〉 messages. Since two quorums of 2f + 1 replicas intersect in at least one correct replica, m1 is committed with a smaller sequence number than m2. By the previous argument, if m1 and m2 are committed at pj, they are committed with the same sequence numbers.

By combining all the above, safety is proved.
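The quorum-intersection arithmetic used throughout these proofs is easy to check numerically: two quorums of size 2f + 1 among n = 3f + 1 replicas overlap in at least f + 1 replicas, at least one of which is correct.

```python
def min_overlap(n, q):
    """Smallest possible intersection of two quorums of size q among n replicas."""
    return 2 * q - n

for f in range(1, 6):
    n, q = 3 * f + 1, 2 * f + 1
    overlap = min_overlap(n, q)
    assert overlap == f + 1   # the two quorums share at least f+1 replicas
    assert overlap - f >= 1   # even if f of those are faulty, one is correct
```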

4.5.2 Liveness

Theorem 2 (Liveness): Correct clients eventually receive replies to their requests.

Proof: It is trivial to show that if the primary is correct, clients receive replies to their requests. In the following, we first show that correct clients will not be removed. We then prove that faulty replicas and faulty clients cannot impede progress by removing a correct primary.

Claim 2 (Correct Client Condition): If the values of the timeouts are appropriately set, correct clients will not be removed if they trigger a checkpoint.

Proof: If a correct client receives between f + 1 and 2f + 1 matching replies for a request m, it triggers the checkpoint subprotocol. To remove a correct client, m must be executed by f + 1 replicas in the Almost-MAC-agreement protocol and committed in the checkpoint subprotocol without a view change. Among the f + 1 replicas that accept the 〈Prepare〉 message in the agreement protocol, at least one is correct. If it receives a 〈Prepare〉 message, it appends the message to its 〈Commit〉 message and sends it to all replicas. If at least one correct replica receives valid but conflicting 〈Prepare〉 messages from the primary, it sends the inconsistent messages and eventually all the correct replicas vote for a view change, contradicting the assumption that no view change occurs. Therefore, no correct replica receives conflicting 〈Prepare〉 messages. In addition, if a correct replica does not receive a valid 〈Prepare〉 message from the primary but receives a valid 〈Prepare〉 message appended to a 〈Commit〉 message, it accepts the 〈Prepare〉 message and sends a 〈Reply〉 message to the client. In this case, the client receives 2f + 1 matching replies, contradicting the assumption that it received fewer. Therefore, correct clients will not be removed by the client suspicion protocol.

Claim 3 (Faulty Replica Condition): Faulty replicascannot impede progress by causing view changes.

Proof: To begin with, faulty replicas cannot cause a view change by sending 〈View-Change〉 messages: at least f + 1 〈View-Change〉 messages are required, so even if all f faulty replicas vote for a view change, they cannot cause one. A faulty primary can cause a view change, but the primary cannot be faulty for more than f consecutive views.

In addition, no 〈View-Change〉 message makes a correct primary incapable of generating a 〈New-view〉 message. First, a correct new primary is able to pick a stable checkpoint: since at least f + 1 correct replicas have CER2 for the checkpoint, the new primary can pick it up. Second, the new primary is able to pick a sequence of requests based on condition A or B: either some correct replica has committed on a checkpoint or no correct replica has. Condition A1 can be verified because non-faulty replicas will not commit on two different checkpoints. Condition A2 is satisfied if at least one correct replica accepts a 〈Checkpoint-I〉 message for the same checkpoint and vouches for its authenticity; the checkpoint can therefore be selected since it is authentic. Similarly, a set of executed requests can be selected based on the R components during the view change. Namely, if a client completes a request, the request must be ordered and accepted by at least 2f + 1 replicas, of which at least f + 1 are correct. If other replicas receive inconsistent 〈Prepare〉 messages and f + 1 〈Commit〉 messages, they will abort. Therefore, it is not possible that one set of f + 1 replicas includes one request while another set of f + 1 replicas includes a different request for the same sequence number. In conclusion, the new primary is always able to construct a 〈New-view〉 message.

Claim 4 (Faulty Client Condition): A faulty client cannot impede progress by causing view changes.

Proof: If a faulty client intentionally triggers the checkpoint subprotocol while the replicas are consistent, a view change will not occur. The replicas will run an additional three-phase checkpoint subprotocol so that requests committed in the agreement subprotocol are also committed in the checkpoint subprotocol. Since such faulty client behavior is detected, the client will be removed.

To summarize, faulty backups cannot cause view changes, according to Claim 3, and faulty clients are not capable of removing a correct primary, according to Claim 4. Since faulty clients are eventually removed, replicas will handle requests from correct clients. Finally, since the primary cannot be faulty for more than f consecutive views, clients eventually receive replies to their requests.

5 DISCUSSION

5.1 Timeouts

Existing protocols rely on various timeouts to provide liveness. As discussed in Section 4.4, the values of the timeouts are key to avoiding certain uncivil attacks. Since we assume the weak synchrony model, it is reasonable to set timeouts according to the round-trip time, using a technique such as the one in Prime [2]. However, in several corner cases, inappropriate timeout values or network congestion can make a correct replica suspect or remove a correct primary.

hBFT employs a client suspicion subprotocol to detect faulty clients. A faulty primary can play tricks with the timeouts to frame correct clients. For instance, the primary can send the 〈Prepare〉 message to f correct replicas immediately and delay it to the other f + 1 correct replicas until the very end of the client's timeout. The f + 1 correct replicas receive the 〈Prepare〉 message and execute the request, but they do not reply to the client “on time”. Since the client does not receive enough replies before the timeout expires, it sends a 〈PANIC〉 message, even though all replicas are “consistent” because the primary still sent consistent 〈Prepare〉 messages. Correct clients would then be suspected.

We solve this problem with the Almost-MAC-agreement protocol discussed in Section 4.4. The optimization allows all replicas to execute the request on time if at least one correct replica receives a valid 〈Prepare〉 message, which prevents a faulty primary from framing clients.

5.2 Speculation

Speculation reduces the cost and simplifies the design of Byzantine agreement protocols, and it works especially well for systems with highly concurrent requests. Speculation has been used by fault-free systems and by systems that tolerate crash failures; hBFT therefore also adapts well from tolerating crash failures to tolerating Byzantine failures. hBFT uses speculation because replicas are always consistent in both the fault-free case and normal cases where the primary is correct. Every request takes three communication steps to complete, which is the theoretical lower bound for agreement-based protocols.

Speculation does not work well for systems with highly computationally intensive tasks or systems with a high attack rate. The former problem can be handled by separating execution from agreement [32]. The latter degrades performance with or without recovery. For instance, faulty clients can simply trigger the three-phase checkpoint subprotocol on every request, which makes hBFT perform similarly to PBFT until the faulty clients are removed. The advantage of hBFT rests on the three-phase checkpoint subprotocol being triggered rarely. Therefore, hBFT improves performance in fault-free and normal cases and achieves performance comparable to PBFT in the worst case.

6 EVALUATION

We evaluate the system on Emulab [31], utilizing up to 45 pc3000 machines connected through a 100 Mbps switched LAN. Each machine is equipped with a 2 GHz, 64-bit Xeon processor and 2 GB of RAM. 64-bit Ubuntu 10 with Linux kernel 2.6.32 is installed on every machine.

We identify several provably secure cryptographic tools for our scheme. In particular, we use RSA-FDH [4] for the digital signature scheme and HMAC-MD5 [5, 6] for the MAC algorithm. Both are simple, fast, and provably secure. We employ these tools because we want to abandon the vulnerable cryptographic tools used in papers such as PBFT and Zyzzyva, while still being able to compare our scheme with them: RSA-FDH and HMAC-MD5 have almost exactly the same efficiency as a naked RSA signature and keyed MD5, respectively.

We compare our work with Castro et al.'s implementation of PBFT [7] and Kotla et al.'s implementation of Zyzzyva [21]. All the experiments are carried out in the normal case, where a backup is faulty. Four micro-benchmarks, also developed by Castro et al., are used in the evaluation. An x/y benchmark means x KB requests from clients and y KB replies from the replicas.

6.1 Throughput

Fig. 5 compares the throughput achieved for the 0/0 benchmark by PBFT, Zyzzyva, and hBFT, where B is the size of the batch. Fig. 6 presents the performance for the four benchmarks with B = 1. All the data are collected in the configuration f = 1.

As the number of clients increases, Zyzzyva performs even worse than PBFT. As indicated in Section 1.1, without batching (B = 1, f = 1), the bottleneck server of Zyzzyva (4 + 5f + 3f/b MAC operations) performs 1.2 times more MAC operations than that of PBFT (2 + 8f/b) and 2.4 times more than that of hBFT (2 + 3f/b). With batching (B = 10, f = 1), Zyzzyva performs 3.3 times more MAC operations than PBFT and 4.0 times more than hBFT.

Fig. 5. Throughput vs. number of clients for the 0/0 benchmark (PBFT, Zyzzyva, and hBFT with B = 1 and B = 10; throughput in Kops/sec).

Fig. 6. Throughput for the 0/0, 0/4, 4/0, and 4/4 benchmarks for systems that tolerate f = 1 faults (read-only, hBFT, PBFT, Zyzzyva).
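The ratios quoted above follow directly from the per-request MAC-count formulas given in the text, with b the batch size:

```python
def macs_pbft(f, b):
    return 2 + 8 * f / b

def macs_zyzzyva(f, b):
    return 4 + 5 * f + 3 * f / b

def macs_hbft(f, b):
    return 2 + 3 * f / b

f = 1
# Without batching (b = 1): Zyzzyva vs. PBFT and hBFT.
print(round(macs_zyzzyva(f, 1) / macs_pbft(f, 1), 1))    # 1.2
print(round(macs_zyzzyva(f, 1) / macs_hbft(f, 1), 1))    # 2.4
# With batching (b = 10).
print(round(macs_zyzzyva(f, 10) / macs_pbft(f, 10), 1))  # 3.3
print(round(macs_zyzzyva(f, 10) / macs_hbft(f, 10), 1))  # 4.0
```

Evaluating the same formulas at f = 5 reproduces the counts quoted in Section 6.3: 42 and 6 MACs for PBFT, 44 and 30.5 for Zyzzyva, and 17 and 3.5 for hBFT, without and with batching respectively.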

The simulation validates the theoretical results. According to Fig. 5, without batching, hBFT achieves throughput more than 40% higher than that of PBFT and 20% higher than that of Zyzzyva. With batching, hBFT achieves a peak throughput 2 times that of Zyzzyva and still a 40% improvement over PBFT. The difference is due to the cryptographic overhead of each protocol.

Additionally, hBFT works better than both Zyzzyva and PBFT under high concurrency. As the number of clients grows, all three protocols scale better with batching than without. When the number of clients exceeds 40, the throughput of Zyzzyva degrades noticeably; all other configurations remain stable once the number of clients exceeds 30. When the number of clients is fewer than 30, hBFT with batching shows outstanding growth, and the throughput of PBFT with batching also grows faster than the remaining configurations. Reply messages cannot be batched and replicas need to reply to every client in the batch, which explains why Zyzzyva achieves the lowest throughput in normal cases.

Fig. 7. Latency vs. number of clients (PBFT, Zyzzyva, and hBFT with B = 1 and B = 10; update latency in ms).

For all benchmarks, as shown in Fig. 6, hBFT achieves higher throughput as well. All three protocols achieve the best throughput for the 0/0 benchmark and the worst for the 4/4 benchmark. Zyzzyva performs worse for the 0/4 and 4/4 benchmarks than for the 4/0 benchmark, and hBFT shows a similar pattern, while PBFT achieves almost the same throughput for the 0/4 and 4/0 benchmarks. This implies that the size of the reply messages has a greater effect on speculation-based protocols. In addition, without batching, PBFT performs worse than both Zyzzyva and hBFT. The outstanding performance of read-only requests is due to the read-only optimization, in which replicas send replies directly to the clients without running the agreement protocol.

To summarize, hBFT outperforms both Zyzzyva and PBFT in normal cases. Since PBFT achieves almost the same throughput for the 0/4 and 4/0 benchmarks and achieves higher throughput with batching, it works well for systems with more computationally intensive tasks. By comparison, hBFT and Zyzzyva work well for systems with highly concurrent but lightweight requests.

6.2 Latency

The performance depends on both cryptographic overhead and one-way message latencies: cryptographic overhead governs the cost of processing one message, and the number of one-way latencies governs the number of phases the agreement protocol goes through. In terms of critical paths between sending and completing a request, PBFT has four if replicas send replies to the clients after the prepare phase. hBFT has only three, which is the theoretical lower bound for agreement protocols. Even though the checkpoint subprotocol takes three phases, in contrast to two in other protocols, it does not decrease the overall performance significantly since it is triggered rarely. Zyzzyva takes three in fault-free cases and five in normal cases.

Fig. 8. Latency for the 0/0, 0/4, 4/0, and 4/4 benchmarks for systems that tolerate f = 1 faults (read-only, hBFT, Zyzzyva, PBFT; latency in ms).

Additionally, the performance of all protocols is related to the frequency of the checkpoint subprotocol. This has a direct impact on hBFT because the checkpoint subprotocol of hBFT is more expensive than those of the other two protocols. By default, we assume that a checkpoint subprotocol starts every 1000 requests or batches; hBFT outperforms the other two under this setting. If the checkpoint subprotocol runs less frequently, hBFT can be expected to achieve even better performance, and vice versa.

As illustrated in Fig. 7 and Fig. 8, without batching, hBFT achieves 40% lower latency than PBFT and 30% lower latency than Zyzzyva. With batching, similar to the throughput results, Zyzzyva has higher latency than PBFT, and hBFT outperforms both. As the number of clients increases, all the protocols scale well without an obvious increase in latency, which shows that all three work well under high concurrency. When the number of clients exceeds 40 and batching is used, the latency of Zyzzyva increases. Since every 〈Reply〉 message in Zyzzyva contains 3f + 1 MACs and cannot be batched, this increase indicates that the cryptographic operations in the 〈Reply〉 message limit the performance of the protocol.

The four benchmarks show similar results, as indicated in Fig. 8. All three protocols have the lowest latency for the 0/0 benchmark and the highest for the 4/4 benchmark. hBFT and PBFT achieve almost the same latency for both the 4/0 and 0/4 benchmarks. Zyzzyva achieves lower latency for the 4/0 benchmark than for the 0/4 benchmark: the length of the reply message also affects the latency per request for Zyzzyva, although the effect is not as apparent as its effect on throughput. While hBFT performs better on throughput for the 4/0 benchmark than for the 0/4 benchmark, it achieves almost the same latency for both, which indicates that the checkpoint subprotocol has a more direct effect on throughput than on latency.


Overall, the latency results validate the throughput results, and our statements in Section 5 are verified by them. The latency curves summarize the performance of the protocols under normal operation; the throughput curves additionally include the effects of the other subprotocols.

6.3 Fault Scalability

We also examine the performance as the number of replicas increases. As indicated in Fig. 1, the throughput is related to f. We view the primary as the bottleneck server not only because of the number of MAC operations in the agreement but also because of other work such as processing requests. For PBFT and hBFT, the backups do not perform many fewer cryptographic operations than the primary. By comparison, backups in Zyzzyva perform many fewer cryptographic operations than the primary, which can be viewed as an advantage of Zyzzyva. However, this does not have a direct positive effect on throughput and latency, since the primary performs more cryptographic operations. As f increases, the performance of all three protocols degrades due to the cryptographic overhead, especially without batching.

Fig. 9 compares the number of cryptographic op-erations that the primary and clients perform asthe number of faults increases. In addition to PBFT,Zyzzyva and hBFT, we also include Q/U and HQ,which are two (hybrid) Byzantine quorum protocols.For the performance of a primary with or withoutbatching, as illustrated in Fig. 9(a) and Fig. 9(b), itcan be observed that batching greatly reduces thenumber of cryptographic operations as the numberof total replicas increases. For instance, although thenumber of cryptographic of PBFT is the most out-standing without batching and increases quite fast,the cryptographic overhead is almost the smallestwithout batching and remains stable as the numberof faults increases. Comparably, the number of cryp-tographic operations of Zyzzyva does not decreasetoo much without batching. Since both HQ and Q/Uare quorum-based protocols, they cannot use batch-ing and work better under low concurrency. hBFTachieves the smallest numbers with or without batch-ing. Combined with Fig. 9(c), we observe that thereis a trade-off between the number of cryptographicoperations of the bottleneck server and the clients.HQ requires the most cryptographic operations forthe bottleneck server but the fewest for clients amongall the five protocols. Zyzzyva requires a comparablylarge number for both the bottleneck server and theclients. hBFT requires the fewest for the bottleneckserver but still a high number for clients.

Fig. 9. Server and client cryptographic operations per request as the number of tolerated faults grows, for PBFT, Q/U, HQ, Zyzzyva, and hBFT: (a) bottleneck server, b = 1; (b) bottleneck server, b = 10; (c) client.

As illustrated in Fig. 10, as the number of faulty replicas increases, which implies an increase in the total number of replicas, the latency of PBFT increases quickly without batching. With batching, PBFT achieves a more stable curve. Zyzzyva achieves higher latency than the other two protocols in each case. The latency of hBFT, on the other hand, stabilizes and does not show any trend of growing to a large degree, with or without batching. The key factors in the performance are not only the critical paths and the number of cryptographic operations, but also the message complexity. Although Zyzzyva has more cryptographic overhead, it requires the same number of messages as hBFT, which explains why both scale better than PBFT.

Fig. 10. Fault scalability: latency (ms) for f = 1 to f = 5 (hBFT, PBFT, and Zyzzyva with B = 1 and B = 10).

Fig. 11. Fault scalability: throughput (Kops/sec) vs. number of faults (PBFT, hBFT, and Zyzzyva with B = 1 and B = 10).

Not surprisingly, as shown in Fig. 11, the throughput shows a similar trend to the latency. As the system scales, when f is greater than 2, the throughput of Zyzzyva decreases noticeably, especially without batching. Zyzzyva scales better than PBFT, but its performance degrades noticeably when f is greater than 4. hBFT scales better than both Zyzzyva and PBFT, with or without batching. The difference between the numbers of cryptographic operations is still the key to the overall performance. When the number of faults is 5 (with b = 10 when batching is used), PBFT requires 42 MACs without batching and only 6 with batching, Zyzzyva requires 44 MACs without batching and 30.5 with batching, and hBFT requires 17 MACs without batching and 3.5 with batching. For systems with high concurrency, PBFT and hBFT are preferred and scale well as the number of faults increases.

6.4 A BFT Network File System

This section describes our evaluation of a BFT-NFS service implemented using PBFT [7], Zyzzyva [21], and hBFT, respectively. As before, we evaluate the performance in the normal case in which a backup server fails. The NFS service exports a file system, which can then be mounted on a client machine. When replicas receive client requests, the replication library and the NFS daemon are called to reach agreement on the order. Once processing is done, replies are sent to the clients. The NFS daemon is implemented using a fixed-size memory-mapped file.
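The paper does not describe the daemon's internals beyond this, but the idea of a fixed-size memory-mapped backing store can be sketched as follows. The block size, capacity, and the `MappedStore` API are our own illustrative choices, not taken from the implementation.

```python
# Hedged sketch of a fixed-size memory-mapped file used as a block store.
# BLOCK, NBLOCKS, and the MappedStore API are illustrative assumptions.
import mmap
import os
import tempfile

BLOCK = 4096     # bytes per block (assumed)
NBLOCKS = 256    # fixed capacity (assumed)

class MappedStore:
    def __init__(self, path: str):
        # Pre-size the file to its fixed capacity, then map it once.
        with open(path, "wb") as f:
            f.truncate(BLOCK * NBLOCKS)
        self._f = open(path, "r+b")
        self._mm = mmap.mmap(self._f.fileno(), BLOCK * NBLOCKS)

    def write_block(self, i: int, data: bytes) -> None:
        assert 0 <= i < NBLOCKS and len(data) <= BLOCK
        self._mm[i * BLOCK : i * BLOCK + len(data)] = data

    def read_block(self, i: int) -> bytes:
        return self._mm[i * BLOCK : (i + 1) * BLOCK]

    def close(self) -> None:
        self._mm.close()
        self._f.close()

# Minimal usage: write a block, read it back.
path = os.path.join(tempfile.mkdtemp(), "nfs.dat")
store = MappedStore(path)
store.write_block(0, b"hello")
print(store.read_block(0)[:5])  # -> b'hello'
store.close()
```

Because the file is pre-sized and mapped once, reads and writes are plain memory operations, which matches the goal of keeping the daemon's I/O path cheap relative to the agreement protocol.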

We use the Bonnie++ benchmark [12] to compare our three implementations with NFS-std, an unreplicated NFS V3 implementation, using an I/O-intensive workload. The Bonnie++ benchmark includes the following directory operations (DirOps): (1) create files in numeric order; (2) stat() the files in the same order; (3) delete them in the same order; (4) create files in an order that appears random to the file system; (5) stat() random files; (6) delete the files in random order.
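The six DirOps phases can be reproduced as a small script against any mounted directory. The sketch below is our own illustration of the workload, not Bonnie++ itself; `dir_ops` and the file-count parameter are hypothetical names.

```python
# Illustrative reproduction of the six Bonnie++ DirOps phases (not Bonnie++
# itself), run against a scratch directory.
import os
import random
import tempfile

def dir_ops(root: str, n: int = 100) -> None:
    names = [f"{i:06d}" for i in range(n)]
    # (1) create files in numeric order
    for name in names:
        open(os.path.join(root, name), "w").close()
    # (2) stat() the files in the same order
    for name in names:
        os.stat(os.path.join(root, name))
    # (3) delete them in the same order
    for name in names:
        os.remove(os.path.join(root, name))
    # (4) create files in an order that appears random to the file system
    for name in random.sample(names, len(names)):
        open(os.path.join(root, name), "w").close()
    # (5) stat() random files
    for name in random.sample(names, len(names)):
        os.stat(os.path.join(root, name))
    # (6) delete the files in random order
    for name in random.sample(names, len(names)):
        os.remove(os.path.join(root, name))

with tempfile.TemporaryDirectory() as root:
    dir_ops(root, n=50)
    print("all six DirOps phases completed")
```

Phases (1)-(3) stress sequential metadata operations, while (4)-(6) defeat any ordering assumptions in the file system, which is why DirOps is a useful probe of agreement-protocol overhead on metadata-heavy workloads.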

We evaluate the performance when a failure occurs at time zero, as detailed in Fig. 12. In addition, up to 20 clients run the Bonnie++ benchmark concurrently. The results show that hBFT completes every type of operation with lower latency than all of the other protocols. The main difference lies in the write operations, because all three protocols use the read-only optimization, in which replicas send reply messages to the clients directly without running the agreement protocol. Compared with NFS-std, hBFT incurs only 5% overhead, while PBFT and Zyzzyva incur 10% and 15% overhead, respectively.

[Chart omitted: completion time in seconds (0-140 s) under NFS-std, hBFT, Zyzzyva, and PBFT for Write(char), Write(block), Read(char), Read(block), and DirOps.]

Fig. 12. NFS Evaluation with the Bonnie++ benchmark.

7 CONCLUSION

In this paper, we presented hBFT, a hybrid, Byzantine fault-tolerant, replicated state machine protocol with optimal resilience. By re-exploiting speculation, hBFT achieves the theoretical lower bounds for throughput and latency, as well as the minimal requirement on client participation, in both fault-free and normal cases. hBFT is a fast protocol that moves some jobs to the clients but can still tolerate faulty clients. We have also proven the safety and liveness properties of hBFT and demonstrated how hBFT improves on the performance of PBFT without several of the trade-offs of other protocols, some of which also use speculation.


ACKNOWLEDGEMENTS

We would like to thank Matt Bishop, Jeff Rowe, Haibin Zhang, Hein Meling, Tiancheng Chang, and Leander Jehl for their helpful comments and contributions to the paper. This research is based on work supported by the National Science Foundation under Grant Number CCF-1018871. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation.

REFERENCES

[1] M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In SOSP, 2005: 59-74.

[2] Y. Amir, B. Coan, J. Kirsch, and J. Lane. Byzantine replication under attack. In DSN, 2008: 197-206.

[3] Y. Amir, C. Danilov, D. Dolev, J. Kirsch, J. Lane, C. Nita-Rotaru, J. Olsen, and D. Zage. Scaling Byzantine fault-tolerant replication to wide area networks. In DSN, 2006: 105-114.

[4] M. Bellare and P. Rogaway. The exact security of digital signatures: How to sign with RSA and Rabin. In Advances in Cryptology - Eurocrypt 96, Lecture Notes in Computer Science Vol. 1070, Springer-Verlag, 1996.

[5] M. Bellare. New proofs for NMAC and HMAC: Security without collision-resistance. In Advances in Cryptology - Crypto 2006, Lecture Notes in Computer Science Vol. 4117, Springer-Verlag, 2006.

[6] M. Bellare, R. Canetti, and H. Krawczyk. Keying hash functions for message authentication. In Advances in Cryptology - Crypto 96, Lecture Notes in Computer Science Vol. 1109, Springer-Verlag, 1996.

[7] M. Castro and B. Liskov. Practical Byzantine fault tolerance. In OSDI, 1999: 173-186.

[8] T. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. In J. ACM 43(4): 685-722, 1996.

[9] T. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. In PODC, 1991: 325-340.

[10] B. Chun, P. Maniatis, S. Shenker, and J. Kubiatowicz. Attested append-only memory: making adversaries stick to their word. In SOSP, 2007.

[11] A. Clement, M. Marchetti, E. Wong, L. Alvisi, and M. Dahlin. Making Byzantine fault tolerant systems tolerate Byzantine faults. In NSDI, 2009: 153-168.

[12] R. Coker. www.coker.com.au/bonnie++.

[13] J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In OSDI, 2006: 177-190.

[14] A. Doudou, B. Garbinato, and R. Guerraoui. Encapsulating failure detection: from crash to Byzantine failures. In Ada-Europe, 2002: 24-50.

[15] C. Dwork and N. Lynch. Consensus in the presence of partial synchrony. In J. ACM 35(2): 288-323, 1988.

[16] M. Fischer, N. Lynch, and M. Paterson. Impossibility of distributed consensus with one faulty process. In J. ACM 32(2): 374-382, 1985.

[17] R. Guerraoui, N. Knezevic, V. Quema, and M. Vukolic. The next 700 BFT protocols. In EuroSys, 2010: 363-376.

[18] A. Haeberlen, P. Kouznetsov, and P. Druschel. The case for Byzantine fault detection. In HotDep, 2006.

[19] J. Hendricks, S. Sinnamohideen, G. Ganger, and M. Reiter. Zzyzx: scalable fault tolerance through Byzantine locking. In DSN, 2010: 363-372.

[20] M. Hurfin and M. Raynal. A simple and fast asynchronous consensus protocol. In Distributed Computing 12(4): 209-223, 1999.

[21] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong. Zyzzyva: speculative Byzantine fault tolerance. In SOSP, 2007: 45-58.

[22] L. Lamport. The part-time parliament. Technical Report 49, DEC Systems Research Center, 1989.

[23] L. Lamport. Fast Paxos. In Distributed Computing 19(2): 79-103, 2006.

[24] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. In ACM Trans. Program. Lang. Syst. 4(3), 1982.

[25] D. Malkhi and M. Reiter. Byzantine quorum systems. In Distributed Computing 11(4), 1998.

[26] J. Martin and L. Alvisi. Fast Byzantine consensus. In IEEE Trans. Dependable Sec. Comput. 3(3): 202-215, 2006.

[27] Y. Mao, F. Junqueira, and K. Marzullo. Towards low latency state machine replication for uncivil wide-area networks. In HotDep, 2009.

[28] J. Knight and N. Leveson. An experimental evaluation of the assumption of independence in multiversion programming. In IEEE Trans. Software Eng. 12(1): 96-109, 1986.

[29] G. Veronese, M. Correia, A. Bessani, and L. Lung. Spin one's wheels? Byzantine fault tolerance with a spinning primary. In SRDS, 2009: 135-144.

[30] R. Rodrigues, M. Castro, and B. Liskov. BASE: using abstraction to improve fault tolerance. In ACM Trans. Comput. Syst. 21(3): 236-269, 2003.

[31] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. In OSDI, 2002: 255-270.

[32] J. Yin, J. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin. Separating agreement from execution for Byzantine fault tolerant services. In SOSP, 2003: 253-267.

[33] P. Zielinski. Low-latency atomic broadcast in the presence of contention. In DISC, 2006: 505-519.

[34] P. Zielinski. Optimistically terminating consensus: all asynchronous consensus protocols in one framework. In ISPDC, 2006: 24-33.

Sisi Duan is a Ph.D. candidate in the security lab of the Computer Science Department, University of California, Davis. Her research interests include fault tolerance, diagnosis, and recovery in distributed systems, intrusion detection systems, and publish/subscribe systems.

Sean Peisert received his Ph.D. in computer science from the University of California, San Diego in 2007. He is currently jointly appointed at Lawrence Berkeley National Laboratory and the University of California, Davis, and was previously an I3P Research Fellow. His research spans a broad cross section of computer security. He is a senior member of the IEEE.

Karl Levitt received his Ph.D. in electrical engineering from New York University in 1966. He has been a professor of computer science at UC Davis since 1986 and was previously a program director at the National Science Foundation and director of the Computer Science Laboratory at SRI International. He conducts research in the areas of computer security, automated verification, and software engineering.

