Distributed Algorithms
Practical Byzantine Fault Tolerance
Alberto Montresor
University of Trento, Italy
2017/01/06
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
References

M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’05, Oct. 2005.

M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proc. of the 3rd Symposium on Operating systems design and implementation, OSDI’99, pages 173–186, New Orleans, Louisiana, USA, 1999. USENIX Association. http://www.disi.unitn.it/~montreso/ds/papers/PbftOsdi.pdf

M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20:398–461, Nov. 2002. http://www.disi.unitn.it/~montreso/ds/papers/PbftTocs.pdf

A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’09, Oct. 2009. http://www.disi.unitn.it/~montreso/ds/papers/UpRight.pdf

A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX symposium on Networked systems design and implementation, NSDI’09, pages 153–168. USENIX Association, 2009. http://www.disi.unitn.it/~montreso/ds/papers/Aardvark.pdf

J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proc. of the Symposium on Operating systems design and implementation, OSDI’06, Nov. 2006.

S. Gaertner, M. Bourennane, C. Kurtsiefer, A. Cabello, and H. Weinfurter. Experimental demonstration of a quantum protocol for byzantine agreement and liar detection. Physical Review Letters, 100(7), Feb. 2008.

R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP’07), Stevenson, WA, Oct. 2007. ACM. http://www.disi.unitn.it/~montreso/ds/papers/Zyzzyva.pdf

L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982. http://www.disi.unitn.it/~montreso/ds/papers/ByzantineGenerals.pdf
Contents
1 Introduction
2 Byzantine generals
3 Practical Byzantine Fault Tolerance
4 Beyond PBFT
  - Overview
5 Zyzzyva
  - Introduction
  - Three cases
  - The case of the missing phase
  - View changes
6 Aardvark
7 UpRight
Introduction
Motivation
- Processes may exhibit arbitrary (Byzantine) behavior
  - Malicious attacks
    - They lie
    - They collude
  - Software error
    - Arbitrary states, messages

Examples
- Amazon outage (2008), “Root cause was a single bit flip in internal state messages” [1]
- Shuttle Mission STS-124 (2008), 3-1 disagreement on sensors during fuel loading (on Earth!) [2]

[1] http://status.aws.amazon.com/s3-20080720.html
[2] https://c3.nasa.gov/dashlink/resources/624/
Alberto Montresor (UniTN) DS - BFT 2017/01/06 1 / 80
History

- State of the art at the end of the 90’s
  - Theoretically feasible algorithms to tolerate Byzantine failures, but inefficient in practice
  - Assume synchrony: known bounds for message delays and processing speed
  - Most importantly: the synchrony assumption is needed for correctness. What about DoS?

Bibliography
L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982. http://www.disi.unitn.it/~montreso/ds/papers/ByzantineGenerals.pdf
Byzantine generals
[Figure: generals exchanging conflicting orders: “Attack!”, “Wait…”, “Attack! No, wait! Surrender!”. From cs4410 fall 08 lecture.]
Specification

- A commanding general must send an order to his n − 1 lieutenant generals such that:
  - IC1: All loyal lieutenants obey the same order
  - IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends
- Assumptions (“Oral” messages):
  - Every message that is sent is received correctly
  - The receiver of a message knows who sent it
  - The absence of a message can be detected
Impossibility results

Under the “Oral” messages assumption, no solution with three generals can handle even a single traitor

[Figure: two scenarios with a commanding general and Lieutenants 1 and 2. In the first, the commanding general sends “Attack!” to both lieutenants and a lieutenant reports “He said ‘Retreat’!”. In the second, the commanding general sends “Retreat!” to one lieutenant and “Attack!” to the other, and a lieutenant reports “He said ‘Retreat’!”.]
“Oral Message” algorithm OM(m)

Algorithm OM(0)
1. The commander sends its value to every lieutenant
2. Each lieutenant uses the value he received from the commander, or uses retreat if he received no value

Algorithm OM(m)
1. The commander sends its value to every lieutenant
2. ∀i, let v_i be the value lieutenant i receives from the commander, or retreat if it has received no value. Lieutenant i acts as the commander of algorithm OM(m − 1) to send the value v_i to each of the n − 2 other lieutenants
3. ∀j ≠ i, let v_j be the value received by i from j in Step 2 of algorithm OM(m − 1), or retreat if no value. Lieutenant i uses the value majority(v_1, ..., v_{n−1}) (deterministic function)
The recursive
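The recursion in OM(m) can be easier to follow as code. The following is a minimal simulation sketch, not a networked implementation: the names `om`, `majority`, and the `lie` hook (which decides what a traitor sends to each receiver) are hypothetical, missing messages are not modelled, and `majority` falls back to retreat when there is no strict majority, which is one possible deterministic choice.

```python
from collections import Counter

RETREAT = "retreat"

def majority(values):
    """Deterministic majority; returns RETREAT when no strict majority exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else RETREAT

def om(m, commander, lieutenants, value, traitors, lie):
    """Simulate OM(m) and return {lieutenant: value it ends up using}.

    traitors: set of disloyal generals (assumption of this sketch).
    lie(sender, receiver, value): hypothetical hook giving what a traitor sends.
    """
    # Step 1: the commander sends a value to every lieutenant.
    received = {
        i: lie(commander, i, value) if commander in traitors else value
        for i in lieutenants
    }
    if m == 0:
        return received
    # Step 2: each lieutenant i relays v_i, acting as commander of OM(m - 1).
    relayed = {
        i: om(m - 1, i, [j for j in lieutenants if j != i],
              received[i], traitors, lie)
        for i in lieutenants
    }
    # Step 3: lieutenant i takes the majority of v_i and the relayed v_j.
    return {
        i: majority([received[i]] +
                    [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }
```

With four generals and one traitor (either a lieutenant or the commander), the loyal lieutenants end up with the same value; with three generals and one traitorous lieutenant, the loyal lieutenant cannot reliably follow a loyal commander.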
“Oral Message” algorithm example: OM(1)

[Figure: OM(1) message exchange between commander C and lieutenants L1, L2, L3; the messages carry the values A and R.]
Oral messages

Theorem
For any m, Algorithm OM(m) satisfies conditions IC1 and IC2 if there are more than 3m generals and at most m traitors

- Problems:
  - message paths of length up to m + 1 (expensive)
  - absence of messages must be detected via time-out (vulnerable to DoS)
An attacker may compromise the safety of a service by delaying non-faulty nodes or the communication between them until they are tagged as faulty and excluded from the replica group. Such a denial-of-service attack is generally easier than gaining control over a non-faulty node.
Practical Byzantine Fault Tolerance
A Byzantine “renaissance”

Bibliography
M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20:398–461, Nov. 2002. http://www.disi.unitn.it/~montreso/ds/papers/PbftTocs.pdf

Contributions
- First state machine replication protocol that survives Byzantine faults in asynchronous networks
- Live under weak Byzantine assumptions: Byzantine Paxos!
- Implementation of a Byzantine fault-tolerant distributed FS
- Experiments measuring the cost of the replication technique
Assumptions

- System model
  - Asynchronous distributed system with N processes
  - Unreliable channels
- Unbreakable cryptography
  - Message m is signed by its sender i, and we write 〈m〉σ(i), through:
    - Public/private key pairs
    - Message authentication codes (MAC)
  - A digest d(m) of message m is produced through collision-resistant hash functions
MACs (message authentication codes) are based on secret keys (client-server, server-server)
Assumptions

- Failure model
  - Up to f Byzantine servers
  - N > 3f total servers
  - (Potentially Byzantine clients)
- Independent failures
  - Different implementations of the service
  - Different operating systems
  - Different root passwords, different administrators
Specification

- State machine replication
  - Replicated service with a state and deterministic operations operating on it
  - Clients issue a request and block waiting for the reply
- Safety
  - The system satisfies linearizability, provided that N ≥ 3f + 1
  - Regardless of “faulty clients”...
    - all operations performed by faulty clients are observed in a consistent way by non-faulty clients
  - The algorithm does not rely on synchrony to provide safety...
- Liveness
  - It relies on synchrony to provide liveness
  - Assumes delay(t) does not grow faster than t indefinitely
  - Weak assumption: it holds if network faults are eventually repaired
  - Circumvents the impossibility result of FLP
Optimality

Theorem
To tolerate up to f malicious nodes, N must be at least 3f + 1

Proof.
- It must be possible to proceed after communicating with N − f replicas, because the faulty replicas may not respond
- But the f replicas not responding may be just slow, so f of those that responded might be faulty
- The correct replicas who responded (N − 2f) must outnumber the faulty replicas, so N − 2f > f ⇒ N > 3f
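The counting argument of the proof can be written out step by step; the labels on the left are informal names introduced here, not notation from the slides.

```latex
\begin{align*}
\text{replies awaited} &= N - f, \\
\text{correct replies in the worst case} &= (N - f) - f = N - 2f, \\
N - 2f > f &\iff N > 3f \iff N \ge 3f + 1 .
\end{align*}
```

For example, with f = 1 the smallest admissible configuration is N = 4: a replica proceeds after 3 replies, of which at least 2 come from correct replicas, outnumbering the single faulty one.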
Optimality

- So, N > 3f is needed to ensure that at least one correct replica is present in the reply set
- N = 3f + 1; more is useless
  - more and larger messages
  - without improving resiliency
Processes and views

- Replica IDs: 0 . . . N − 1
- Replicas move through a sequence of configurations called views
- During view v:
  - Primary replica is i: i = v mod N
  - The others are backups
- View changes are carried out when the primary appears to have failed
The algorithm

- To invoke an operation, the client sends a request to the primary
- The primary multicasts the request to the backups
- Quorums are employed to guarantee ordering on operations
- When an order has been agreed, replicas execute the request and send a reply to the client
- When the client receives at least f + 1 identical replies, it is satisfied

[Figure: a client, the primary, and Backups 1, 2 and 3.]
Problems

- The primary could be faulty!
  - could ignore commands; assign same sequence number to different requests; skip sequence numbers; etc.
  - backups monitor primary’s behavior and trigger view changes to replace faulty primary
- Backups could be faulty!
  - could incorrectly store commands forwarded by a correct primary
  - use dissemination Byzantine quorum systems
- Faulty replicas could incorrectly respond to the client!
  - Client waits for f + 1 matching replies before accepting response
The general idea

- Algorithm steps are justified by certificates
  - Sets (quorums) of signed messages from distinct replicas proving that a property of interest holds
- With quorums of size at least 2f + 1
  - Any two quorums intersect in at least one correct replica
  - There is always one quorum that contains only non-faulty replicas

[Figure: a client sends “write A” to four servers; one of the four writes is marked with an X. Server states: 1: …A, 2: …A, 3: …A, 4: ….]
[Figure, next step: a client sends “write B” to the four servers; one of the four writes is marked with an X. The state labels read “…A …A B …B …B”.]
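Both quorum properties follow from counting: with N = 3f + 1 replicas, two quorums of size 2f + 1 overlap in at least 2(2f + 1) − (3f + 1) = f + 1 replicas, so at least one of them is correct, and the N − f = 2f + 1 non-faulty replicas form a quorum by themselves. The brute-force check below is a small sketch of this; the function name and the choice of which replicas are faulty are assumptions made for illustration.

```python
from itertools import combinations

def quorum_properties(f):
    """Check both quorum properties for N = 3f + 1 and quorum size 2f + 1."""
    n, q = 3 * f + 1, 2 * f + 1
    # Assumption: any f replicas may be faulty; we arbitrarily pick the first f.
    faulty = set(range(f))
    quorums = [set(c) for c in combinations(range(n), q)]
    # Any two quorums share at least one correct replica.
    intersect_ok = all((a & b) - faulty for a, b in combinations(quorums, 2))
    # At least one quorum contains only non-faulty replicas.
    clean_ok = any(not (a & faulty) for a in quorums)
    return intersect_ok, clean_ok
```

The enumeration grows quickly with f, so this is only meant for small values such as f = 1 or f = 2.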
Protocol schema

- Normal operation
  - How the protocol works in the absence of failures
  - hopefully, the common case
- View changes
  - How to depose a faulty primary and elect a new one
- Garbage collection
  - How to reclaim the storage used to keep certificates
- Recovery
  - How to make a faulty replica behave correctly again (not here)
State

- The internal state of each replica includes:
  - the state of the actual service
  - a message log containing all the messages the replica has accepted
  - an integer denoting the replica’s current view
Client request

[Figure: message timeline for the Primary and Backups 1, 2 and 3; phase shown: Request.]

〈request, o, t, c〉σ(c)
- o: state machine operation
- t: timestamp (used to ensure exactly-once semantics)
- c: client id
- σ(c): client signature
Pre-prepare phase

[Figure: message timeline for the Primary and Backups 1, 2 and 3; phases shown: Request, Pre-prepare.]

〈〈pre-prepare, v, n, d(m)〉σ(p), m〉
- v: current view
- n: sequence number
- d(m): digest of client message
- σ(p): primary signature
- m: client message
Requests are not included in pre-prepare messages to keep them small. This is important because pre-prepare messages are used as a proof that the request was assigned sequence number n in view v in view changes.
Pre-prepare phase

〈〈pre-prepare, v, n, d(m)〉σ(p), m〉

- Correct replica i accepts pre-prepare if:
  - the pre-prepare message is well-formed
  - the current view of i is v
  - i has not accepted another pre-prepare for v, n with a different digest
  - n is between two water-marks L and H (to avoid sequence number exhaustion caused by faulty primaries)
- Each accepted pre-prepare message is stored in the accepting replica’s message log (including the primary’s)
- Non-accepted pre-prepare messages are just discarded
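The acceptance conditions for a pre-prepare can be sketched as a single check on the replica's state. This is an illustrative sketch only: the class and field names are hypothetical, signature verification is reduced to a boolean flag, and treating the water-mark interval as L < n ≤ H is an assumption, since the slide only says "between".

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class PrePrepare:
    view: int
    seq: int
    digest: str

@dataclass
class Replica:
    view: int
    low: int     # water-mark L
    high: int    # water-mark H
    log: dict = field(default_factory=dict)  # (view, seq) -> accepted digest

    def accept_pre_prepare(self, msg, well_formed=True):
        """Return True and log the message if all acceptance conditions hold."""
        if not well_formed:                        # malformed or badly signed
            return False
        if msg.view != self.view:                  # current view must be v
            return False
        if not (self.low < msg.seq <= self.high):  # n outside the water-marks
            return False
        previous = self.log.get((msg.view, msg.seq))
        if previous is not None and previous != msg.digest:
            return False                           # conflicting digest for (v, n)
        self.log[(msg.view, msg.seq)] = msg.digest
        return True
```

A rejected message leaves the log untouched, which matches the rule that non-accepted pre-prepare messages are simply discarded.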
Prepare phase

[Figure: message timeline for the Primary and Backups 1, 2 and 3; phases shown: Request, Pre-prepare, Prepare.]

〈prepare, v, n, d(m)〉σ(i)

- Accepted by correct replica j if:
  - the prepare message is well-formed
  - current view of j is v
  - n is between two water-marks L and H
- Replicas that send prepare accept the sequence number n for m in view v
- Each accepted prepare message is stored in the accepting replica’s message log
Prepare certificate (P-certificate)

- Replica i produces a prepare certificate prepared(m, v, n, i) iff its log holds:
  - The request m
  - A pre-prepare for m in view v with sequence number n
  - 2f prepare messages from different backups that match the pre-prepare
- prepared(m, v, n, i) means that a quorum of (2f + 1) replicas agrees with assigning sequence number n to m in view v

Theorem
There are no two non-faulty replicas i, j such that prepared(m, v, n, i) and prepared(m′, v, n, j), with m ≠ m′

Proof?
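The prepared predicate is essentially a query over the message log. The sketch below assumes a hypothetical log layout, a list of `(kind, view, seq, digest, sender)` tuples, and identifies the request m by its digest, so the "log holds the request m" condition is not checked separately.

```python
def prepared(log, digest, view, seq, f):
    """True if the log holds a pre-prepare and 2f matching prepares.

    log: iterable of (kind, view, seq, digest, sender) tuples (assumed layout).
    """
    key = (view, seq, digest)
    has_pre_prepare = any(
        kind == "pre-prepare" and (v, n, d) == key
        for kind, v, n, d, _ in log
    )
    # Count distinct senders, so a replica repeating itself is counted once.
    backups = {
        sender
        for kind, v, n, d, sender in log
        if kind == "prepare" and (v, n, d) == key
    }
    return has_pre_prepare and len(backups) >= 2 * f
```

Counting distinct senders matters: a faulty backup that repeats its prepare must not be able to complete the certificate on its own.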
Commit phase

[Figure: message timeline for the Primary and Backups 1, 2 and 3; phases shown: Request, Pre-prepare, Prepare, Commit.]

〈commit, v, n, d(m), i〉σ(i)

- After having collected a P-certificate prepared(m, v, n, i), replica i sends a commit message
- Accepted if:
  - The commit message is well-formed
  - Current view of i is v
  - n is between two water-marks L and H
Commit certificate (C-certificate)

- Commit certificates ensure total order across views
  - we guarantee that we can’t miss prepare certificates during a view change
- A replica has a certificate committed(m, v, n, i) if:
  - it had a P-certificate prepared(m, v, n, i)
  - its log contains 2f + 1 matching commit messages from different replicas (possibly including its own)
- A replica executes a request after it gets a commit certificate for it, and has cleared all requests with smaller sequence numbers
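The execution rule (a request runs only once it has a commit certificate and every smaller sequence number has been cleared) can be sketched as follows. The function name and the representation of certificates as a set of sequence numbers are assumptions made for illustration.

```python
def executable(committed_seqs, last_executed):
    """Return, in order, the sequence numbers that can be executed now.

    committed_seqs: set of sequence numbers holding a commit certificate.
    last_executed: highest sequence number already executed.
    """
    ready = []
    nxt = last_executed + 1
    # Stop at the first gap: later requests must wait for it to be filled.
    while nxt in committed_seqs:
        ready.append(nxt)
        nxt += 1
    return ready
```

For instance, with certificates for 1, 2 and 4 and nothing executed yet, only 1 and 2 can run; 4 waits until 3 is committed.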
Reply phase

[Figure: message timeline for the Primary and Backups 1, 2 and 3; phases shown: Request, Pre-prepare, Prepare, Commit, Reply.]

〈reply, v, t, c, i, r〉σ(i)
- r is the reply
- Client waits for f + 1 replies with the same t, r
- If the client does not receive replies soon enough, it broadcasts the request to all replicas
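On the client side, the f + 1 rule amounts to counting matching (t, r) pairs from distinct replicas. A minimal sketch, assuming signatures have already been verified and using a hypothetical tuple layout for replies:

```python
from collections import Counter

def accept_reply(replies, f):
    """Return the result backed by f + 1 distinct replicas, or None.

    replies: iterable of (replica_id, timestamp, result) tuples (assumed
    layout); results must be hashable.
    """
    votes = {}
    for replica_id, timestamp, result in replies:
        votes[replica_id] = (timestamp, result)  # one vote per replica
    if not votes:
        return None
    (_, result), count = Counter(votes.values()).most_common(1)[0]
    return result if count >= f + 1 else None
```

Since at most f replicas are faulty, any f + 1 matching replies include at least one from a correct replica, which is why the client can trust the result.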
View change

- An unsatisfied backup replica i mutinies:
  - stops accepting messages (except view-change and new-view)
  - multicasts 〈view-change, v + 1, P, i〉σ(i)
  - P contains a P-certificate P_m for each request m (up to a given number, see garbage collection)
- Mutiny succeeds if the new primary collects a new-view certificate V:
  - a set containing 2f + 1 view-change messages
  - indicating that 2f + 1 distinct replicas (including itself) support the change of leadership
View change

The “primary elect” p′ (replica (v + 1) mod N):
- extracts from the new-view certificate V the highest sequence number h of any message for which V contains a P-certificate
- creates a new pre-prepare message for any client message m with sequence number n ≤ h and adds it to the set O
  - if there is a P-certificate for n, m in V: O ← O ∪ 〈pre-prepare, v + 1, n, d_m〉σ(p′)
  - otherwise: O ← O ∪ 〈pre-prepare, v + 1, n, d_null〉σ(p′)
- p′ multicasts 〈new-view, v + 1, V, O〉σ(p′)
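The construction of O by the primary elect can be sketched as below. This is a simplified illustration under stated assumptions: each view-change message is reduced to a hypothetical mapping from sequence number to the digest of the prepared request, signatures are omitted, certificates for the same sequence number are assumed not to conflict, and `low` stands for the sequence number from which O starts (0 before any garbage collection).

```python
NULL_DIGEST = "null"  # placeholder for the digest of the null request

def primary_of(view, n_replicas):
    """Primary of a view: i = v mod N."""
    return view % n_replicas

def build_new_view(new_view, p_sets, low=0):
    """Return the set O as an ordered list of pre-prepare tuples.

    p_sets: one dict {seq: digest} per view-change message in V (assumed layout).
    """
    prepared = {}
    for p_set in p_sets:
        prepared.update(p_set)
    if not prepared:
        return []
    h = max(prepared)  # highest sequence number with a P-certificate
    return [
        ("pre-prepare", new_view, n, prepared.get(n, NULL_DIGEST))
        for n in range(low + 1, h + 1)
    ]
```

Filling gaps with a null request keeps the sequence numbers contiguous, so replicas can execute past them without changing the service state.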
View change

- Backup accepts a 〈new-view, v + 1, V, O〉σ(p′) message for v + 1 if
  - it is signed properly by p′
  - V contains valid view-change messages for v + 1
  - the correctness of O can be locally verified (repeating the primary’s computation)
- Actions:
  - Adds all entries in O to its log (so did p′!)
  - Multicasts a prepare for each message in O
  - Adds all prepares to the log and enters new view
Garbage collection

- A correct replica keeps in its log messages about request o until:
  - o has been executed by a majority of correct replicas, and
  - this fact can be proven during a view change
- Truncate log with stable checkpoints
  - Each replica i periodically (after processing k requests) checkpoints state and multicasts 〈checkpoint, n, d, i〉
    - n: last executed request
    - d: state digest
- A set S containing 2f + 1 equivalent checkpoint messages from distinct processes is a proof of the checkpoint’s correctness (stable checkpoint certificate)
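Detecting a stable checkpoint means finding a pair (n, d) announced by 2f + 1 distinct replicas. The sketch below uses hypothetical names and a plain tuple layout for checkpoint messages, and leaves out signature checks.

```python
from collections import defaultdict

def stable_checkpoint(messages, f):
    """Return the highest (seq, digest) backed by 2f + 1 replicas, or None.

    messages: iterable of (seq, state_digest, replica_id) tuples (assumed layout).
    """
    support = defaultdict(set)
    for seq, digest, replica in messages:
        support[(seq, digest)].add(replica)
    stable = [key for key, senders in support.items()
              if len(senders) >= 2 * f + 1]
    return max(stable, default=None)

def truncate(log, stable_seq):
    """Drop log entries (keyed by sequence number) covered by the checkpoint."""
    return {seq: entry for seq, entry in log.items() if seq > stable_seq}
```

Once a checkpoint at sequence number n is stable, every log entry with a sequence number up to n can be discarded, since the certificate itself proves the state.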
View Change, revisited

- Message 〈view-change, v + 1, n, S, C, P, i〉σ(i)
  - n: the sequence number of the last stable checkpoint
  - S: the last stable checkpoint
  - C: the checkpoint certificate (2f + 1 checkpoint messages)
- Message 〈new-view, v + 1, n, V, O〉σ(p′)
  - n: the sequence number of the last stable checkpoint
  - V, O: contain only requests with sequence number larger than n
Optimizations

- Reducing replies
  - One replica designated to send reply to client
  - Other replicas send digest of the reply
- Lower latency for writes (4 messages)
  - Replicas respond at Prepare phase (tentative execution)
  - Client waits for 2f + 1 matching responses
- Fast reads (one round trip)
  - Client sends to all; they respond immediately
  - Client waits for 2f + 1 matching responses
Optimizations: cryptography

- Reducing overhead
  - Public-key cryptography only for view changes
  - MACs (message authentication codes) for all other messages
- To give an idea (Pentium 200 MHz)
  - Generating 1024-bit RSA signature of an MD5 digest: 43 ms
  - Generating a MAC of the same message: 10 µs
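To make the digest and MAC primitives concrete, here is a small sketch using Python's standard `hashlib` and `hmac` modules. It uses SHA-256 rather than the MD5 mentioned above, purely as an illustrative substitution, and the function names are hypothetical; a MAC requires a secret key shared by the two parties.

```python
import hashlib
import hmac

def digest(message: bytes) -> str:
    """Digest d(m) of a message (SHA-256 here, as an illustrative choice)."""
    return hashlib.sha256(message).hexdigest()

def make_mac(key: bytes, message: bytes) -> bytes:
    """MAC of a message under a secret key shared by sender and receiver."""
    return hmac.new(key, message, hashlib.sha256).digest()

def check_mac(key: bytes, message: bytes, tag: bytes) -> bool:
    """Verify a MAC using a constant-time comparison."""
    return hmac.compare_digest(make_mac(key, message), tag)
```

Unlike a signature, a MAC can only be checked by a holder of the same key, so a message sent to several replicas needs one MAC per receiver.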
Application: Byzantine NFS server
generate them are refreshed very frequently.

There are no published performance numbers for SecureRing [16] but it would be slower than Rampart because its algorithm has more message delays and signature operations in the critical path.

7.3 Andrew Benchmark

The Andrew benchmark [15] emulates a software development workload. It has five phases: (1) creates subdirectories recursively; (2) copies a source tree; (3) examines the status of all the files in the tree without examining their data; (4) examines every byte of data in all the files; and (5) compiles and links the files.

We use the Andrew benchmark to compare BFS with two other file system configurations: NFS-std, which is the NFS V2 implementation in Digital Unix, and BFS-nr, which is identical to BFS but with no replication. BFS-nr ran two simple UDP relays on the client, and on the server it ran a thin veneer linked with a version of snfsd from which all the checkpoint management code was removed. This configuration does not write modified file system state to disk before replying to the client. Therefore, it does not implement NFS V2 protocol semantics, whereas both BFS and NFS-std do.

Out of the 18 operations in the NFS V2 protocol only getattr is read-only because the time-last-accessed attribute of files and directories is set by operations that would otherwise be read-only, e.g., read and lookup. The result is that our optimization for read-only operations can rarely be used. To show the impact of this optimization, we also ran the Andrew benchmark on a second version of BFS that modifies the lookup operation to be read-only. This modification violates strict Unix file system semantics but is unlikely to have adverse effects in practice.

For all configurations, the actual benchmark code ran at the client workstation using the standard NFS client implementation in the Digital Unix kernel with the same mount options. The most relevant of these options for the benchmark are: UDP transport, 4096-byte read and write buffers, allowing asynchronous client writes, and allowing attribute caching.

We report the mean of 10 runs of the benchmark for each configuration. The sample standard deviation for the total time to run the benchmark was always below 2.6% of the reported value but it was as high as 14% for the individual times of the first four phases. This high variance was also present in the NFS-std configuration. The estimated error for the reported mean was below 4.5% for the individual phases and 0.8% for the total.

Table 2 shows the results for BFS and BFS-nr. The comparison between BFS-strict and BFS-nr shows that the overhead of Byzantine fault tolerance for this service is low: BFS-strict takes only 26% more time to run the complete benchmark. The overhead is lower than what was observed for the micro-benchmarks because the client spends a significant fraction of the elapsed time computing between operations, i.e., between receiving the reply to an operation and issuing the next request, and operations at the server perform some computation. But the overhead is not uniform across the benchmark phases. The main reason for this is a variation in the amount of time the client spends computing between operations; the first two phases have a higher relative overhead because the client spends approximately 40% of the total time computing between operations, whereas it spends approximately 70% during the last three phases.

phase   BFS strict     BFS r/o lookup   BFS-nr
1        0.55 (57%)     0.47 (34%)       0.35
2        9.24 (82%)     7.91 (56%)       5.08
3        7.24 (18%)     6.45 (6%)        6.11
4        8.77 (18%)     7.87 (6%)        7.41
5       38.68 (20%)    38.38 (19%)      32.12
total   64.48 (26%)    61.07 (20%)      51.07

Table 2: Andrew benchmark: BFS vs BFS-nr. The times are in seconds.

The table shows that applying the read-only optimization to lookup improves the performance of BFS significantly and reduces the overhead relative to BFS-nr to 20%. This optimization has a significant impact in the first four phases because the time spent waiting for lookup operations to complete in BFS-strict is at least 20% of the elapsed time for these phases, whereas it is less than 5% of the elapsed time for the last phase.

phase   BFS strict     BFS r/o lookup   NFS-std
1        0.55 (-69%)    0.47 (-73%)      1.75
2        9.24 (-2%)     7.91 (-16%)      9.46
3        7.24 (35%)     6.45 (20%)       5.36
4        8.77 (32%)     7.87 (19%)       6.60
5       38.68 (-2%)    38.38 (-2%)      39.35
total   64.48 (3%)     61.07 (-2%)      62.52

Table 3: Andrew benchmark: BFS vs NFS-std. The times are in seconds.

Table 3 shows the results for BFS vs NFS-std. These results show that BFS can be used in practice: BFS-strict takes only 3% more time to run the complete benchmark. Thus, one could replace the NFS V2 implementation in Digital Unix, which is used daily by many users, by BFS without affecting the latency perceived by those users. Furthermore, BFS with the read-only optimization for the lookup operation is actually 2% faster than NFS-std.

The overhead of BFS relative to NFS-std is not the
Reality Check

Examples of systems that have adopted Byzantine Fault Tolerance:
- Boeing 777 Aircraft Information Management System
- Boeing 777/787 flight control system
- SpaceX Dragon flight control system
- Bitcoin
Acknowledgments: Lorenzo Alvisi
Beyond PBFT Overview
Overview
After PBFT, several other papers started to appear:
- HQ: J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proc. of the Symposium on Operating systems design and implementation, OSDI’06, Nov. 2006
- Q/U: M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’05, Oct. 2005

The end result has been to complicate the adoption of Byzantine solutions.
Overview

- “In the regions we studied (up to f = 5), if contention is low and low latency is the main issue, then if it is acceptable to use 5f + 1 replicas, Q/U is the best choice, else HQ is the best since it outperforms PBFT with a batch size of 1.”
- “Otherwise, PBFT is the best choice in this region: It can handle high contention workloads, and it can beat the throughput of both HQ and Q/U through its use of batching.”
- “Outside of this region, we expect HQ will scale best: HQ’s throughput decreases more slowly than Q/U’s (because of the latter’s larger message and processing costs) and PBFT’s (where eventually batching cannot compensate for the quadratic number of messages).”
Zyzzyva Introduction
Zyzzyva [3]

SOSP’07
R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP’07), Stevenson, WA, Oct. 2007. ACM. http://www.disi.unitn.it/~montreso/ds/papers/Zyzzyva.pdf

- One protocol to rule them all!
- Zyzzyva is the last word on BFT!
- (Is it?)

[Figure credit: http://www.flickr.com/photos/matthewfch/2478230533/]
[3] Zyzzyva is the last word of the English dictionary, apart from Zyzzyzus
Replica coordination

- All correct replicas execute the same sequence of commands
- For each received command c, correct replicas:
  - Agree on c’s position in the sequence
  - Execute c in the agreed upon order
  - Reply to the client
How it is done now

[Figure: PBFT message timeline for the Primary and Backups 1, 2 and 3; phases shown: Request, Pre-prepare, Prepare, Commit, Reply.]
The engineer’s rule of thumb

Citation
“Handle normal and worst case separately as a rule, because the requirements for the two are quite different: the normal case must be fast; the worst case must make some progress”
Butler Lampson, “Hints for Computer System Design”
How Zyzzyva does it

[Figure: message timeline for the Primary and Replicas 1, 2 and 3; phase shown: Request.]
Zyzzyva Introduction
Specification for State Machine Replication
Stability
A command is stable at a replica once its position in the sequence cannot change
Safety
Correct clients only process replies to stable commands
Liveness
All commands issued by correct clients eventually become stable and elicit a reply
Enforcing safety
Safety requires:
- Correct clients only process replies to stable commands
...but RSM implementations enforce instead:
- Correct replicas only execute and reply to commands that are stable
Service performs an output commit with each reply
Speculative BFT (Trust, but verify)
Replicas execute and reply to a command without knowing whether it is stable
- trust the order provided by the primary
- no explicit replica agreement!
Correct client, before processing a reply, verifies that it corresponds to a stable command
- if not, the client takes action to ensure liveness
Verifying stability
Necessary condition for stability in Zyzzyva:
- A command c can become stable only if a majority of correct replicas agree on its position in the sequence
Client can process a response for c iff:
- a majority of correct replicas agrees on c’s position
- the set of replies is incompatible, for all possible future executions, with a majority of correct replicas agreeing on a different command holding c’s current position
History
History $H_{i,k}$ is the sequence of the first k commands executed by replica i
On receipt of a command c from the primary, a replica appends c to its command history
Replica reply for c includes:
- the application-level response
- the corresponding command history
Additional details:
- The history can be hashed through incremental hashing
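One way to picture incremental hashing of a history is a hash chain, where the digest after k commands is computed from the digest after k − 1 commands and the k-th command. The sketch below is only an illustration under that assumption: the choice of SHA-256, the all-zero digest for the empty history, and the function names are hypothetical, not taken from the Zyzzyva paper.

```python
import hashlib


def extend_history(prev_digest: bytes, command: bytes) -> bytes:
    """Return the digest of the history after appending one command."""
    return hashlib.sha256(prev_digest + hashlib.sha256(command).digest()).digest()


def history_digest(commands) -> bytes:
    """Fold a whole command sequence into a single fixed-size digest."""
    digest = b"\x00" * 32  # assumed digest of the empty history
    for command in commands:
        digest = extend_history(digest, command)
    return digest


# Two replicas that executed the same sequence report the same digest.
h_replica_1 = history_digest([b"put x 1", b"put y 2", b"get x"])
h_replica_2 = history_digest([b"put x 1", b"put y 2", b"get x"])
# A replica that ordered the first two commands differently does not.
h_replica_3 = history_digest([b"put y 2", b"put x 1", b"get x"])
```

With this construction a replica does constant work per command, and a reply can carry a 32-byte digest instead of the full history.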
Zyzzyva Three cases
Case 1: Unanimity
[Figure: the client sends c to the Primary; the Primary forwards <c,k> to Replicas 1-3; each of the four replicas i answers the client with $\langle r_i, H_{i,k} \rangle$]
Client processes response if all replies match:
$r_1 = \dots = r_4 \land H_{1,k} = \dots = H_{4,k}$
Some comments:
Note that although the client has a proof that the request's position in the command history is irrevocably set, no server has such a proof
Comparison of histories may be based on incremental hashes
Three message hops to complete the request in the good case
Is it safe to accept the reply in this case?
All processes have agreed on the ordering
Correct processes cannot change their mind later
New primary can ask n − f replicas for their histories
Case 2: A majority of correct replicas agree
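[Figure: the client sends c to the Primary; the Primary forwards <c,k> to Replicas 1-3; the client receives only three replies, $\langle r_1, H_{1,k} \rangle$, $\langle r_2, H_{2,k} \rangle$, $\langle r_3, H_{3,k} \rangle$]
Is it safe to accept such a message?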
[Figure: same exchange, but only two <c,k> messages are shown leaving the Primary; the client still receives three replies, $\langle r_1, H_{1,k} \rangle$, $\langle r_2, H_{2,k} \rangle$, $\langle r_3, H_{3,k} \rangle$]
Consider this case...
[Figure: after collecting the replies $\langle r_i, H_{i,k} \rangle$, the client sends the commit certificate $CC = \langle H_{1,k}, H_{2,k}, H_{3,k} \rangle$ to all replicas]
Client sends to all a commit certificate containing 2f + 1 matching histories
[Figure: the replicas answer the commit certificate $CC = \langle H_{1,k}, H_{2,k}, H_{3,k} \rangle$ with ack <c,k>]
Client processes response if it receives at least 2f + 1 acks
Safe?
Certificate proves that a majority of correct processes agree on the command's position in the sequence
Incompatible with a majority backing a different command for that position
Stability
Stability depends on matching command histories
Stability is prefix-closed:
- If a command with sequence number k is stable, then so is every command with sequence number k′ < k
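The counting argument behind the commit certificate can be spelled out as follows. This is a sketch under the usual assumption of $n = 3f + 1$ replicas with at most $f$ faulty ones; $Q_1$ and $Q_2$ are names introduced here for two arbitrary sets of $2f + 1$ replicas.

$$
\begin{aligned}
\text{correct replicas in a certificate} &\ge (2f + 1) - f = f + 1,\\
\text{correct replicas overall} &\ge n - f = 2f + 1, \qquad f + 1 > \tfrac{2f + 1}{2},\\
|Q_1 \cap Q_2| &\ge 2(2f + 1) - (3f + 1) = f + 1 .
\end{aligned}
$$

With exactly $f$ faulty replicas, the $f + 1$ correct signers are a majority of the $2f + 1$ correct replicas. Any two sets of $2f + 1$ replicas share at least $f + 1$ members, hence at least one correct replica; since a correct replica holds a single history, two certificates cannot bind different commands to the same position.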
Case 3: None of the above
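[Figure: the client sends c to the Primary; only one <c,k> message is shown leaving the Primary; the client receives only two replies, $\langle r_1, H_{1,k} \rangle$ and $\langle r_2, H_{2,k} \rangle$]
Fewer than 2f + 1 replies match
Client retransmits c to all replicas, hinting that the primary may be faulty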
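The three cases can be summarized as a single client-side decision. The sketch below assumes n = 3f + 1 replicas and at most one reply per replica; the function name and the returned labels are hypothetical, and each reply is modeled as a `(response, history_digest)` tuple.

```python
from collections import Counter


def classify_replies(replies, f):
    """Decide the client's next step from the replies received so far.

    replies: list of (response, history_digest) tuples, one per replica
    that answered. Returns "complete", "commit-certificate", or "retransmit".
    """
    n = 3 * f + 1  # assumed total number of replicas
    if not replies:
        return "retransmit"
    # Size of the largest group of identical (response, history) pairs.
    _, matching = Counter(replies).most_common(1)[0]
    if matching == n:
        return "complete"            # Case 1: unanimity
    if matching >= 2 * f + 1:
        return "commit-certificate"  # Case 2: send CC, wait for 2f + 1 acks
    return "retransmit"              # Case 3: fewer than 2f + 1 replies match


ok = ("result", "h-abc")
bad = ("result", "h-xyz")
case_1 = classify_replies([ok, ok, ok, ok], f=1)
case_2 = classify_replies([ok, ok, ok], f=1)
case_3 = classify_replies([ok, ok, bad], f=1)
```

In case 2 the client would still have to collect 2f + 1 acks for the commit certificate before processing the response; that step is left out of the sketch.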
Zyzzyva The case of the missing phase
The case of the missing phase
[Figure, top: PBFT message flow among the Primary and Backups 1-3: Request, Pre-prepare, Prepare, Commit, Reply]
[Figure, bottom: Zyzzyva message flow among the Primary and Replicas 1-3: c, <c,k>, replies $\langle r_i, H_{i,k} \rangle$, commit certificate $CC = \langle H_{1,k}, H_{2,k}, H_{3,k} \rangle$, ack <c,k>]
Where did the third phase go?
Why was it there to begin with?
The missing phase – commit
Consider this scenario:
f malicious replicas, including the primary
The primary stops communicating with f correct replicas
They go on strike: they stop accepting messages in this view and ask for a view change
f + f replicas stop accepting messages; f + 1 replicas keep working
The remaining f + 1 replicas are not enough to conclude the pre-prepare and prepare phases
The f correct processes that are asking for a view change are not enough to conclude one, so there is no opportunity to regain liveness by electing a new primary
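The counts in this scenario can be checked with a few lines of arithmetic. The sketch assumes n = 3f + 1 replicas and a quorum of 2f + 1 both for ordering a request and for completing a view change (the usual PBFT thresholds, which the slide does not state explicitly); the function name is hypothetical.

```python
def stalemate_counts(f):
    """Replica counts for the two-phase stalemate scenario with f >= 1."""
    n = 3 * f + 1           # assumed total number of replicas
    silent = f + f          # f malicious + f correct replicas on strike
    working = n - silent    # correct replicas still active in the view
    quorum = 2 * f + 1      # assumed threshold for ordering and for a view change
    return {
        "n": n,
        "working": working,
        "quorum": quorum,
        "can_order_requests": working >= quorum,
        "view_change_requests": f,
        "can_change_view": f >= quorum,
    }


counts = stalemate_counts(1)
```

For f = 1 this gives 4 replicas, 2 still working and a quorum of 3: the working replicas cannot order requests, and the single replica asking for a view change cannot complete one.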
The third phase of PBFT breaks this stalemate:
The remaining f + 1 replicas
- either gather the evidence necessary to complete the request,
- or determine that a view change is necessary
Commit phase needed for liveness
Zyzzyva View changes
Where did the third phase go?
In PBFT
What compromises liveness in the previous scenario is that the PBFT view change protocol lets correct replicas commit to a view change and become silent in a view without any guarantee that their action will lead to the view change
In Zyzzyva
A correct replica does not abandon view v unless it is guaranteed that every other correct replica will do the same, forcing a new view and a new primary
View change
Two phases:
- Processes unsatisfied with the current primary send a message 〈i-hate-the-primary, v〉 to all
- If a process collects f + 1 i-hate-the-primary messages, it sends a message to all containing such messages and starts a new view change (similar to the traditional one)
Extra phase of the agreement protocol is moved to the view change protocol
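The first phase can be sketched as a small per-replica tracker. This is an illustration only: the class name, the message tuple layout, and the `"view-change"` label are hypothetical, signatures are omitted, and the subsequent view change itself is not modeled.

```python
class ViewChangeTracker:
    """Collects <i-hate-the-primary, v> messages for the current view."""

    def __init__(self, replica_id, f, view=0):
        self.replica_id = replica_id
        self.f = f
        self.view = view
        self.accusers = set()   # senders of i-hate-the-primary for this view
        self.abandoned = False  # True once this replica has left the view

    def on_hate(self, sender, view):
        """Record one accusation; return a proof message once f + 1 are held."""
        if view != self.view or self.abandoned:
            return None
        self.accusers.add(sender)  # duplicates from one sender count once
        if len(self.accusers) >= self.f + 1:
            self.abandoned = True
            # Broadcasting the f + 1 accusations lets every other correct
            # replica reach the same conclusion and abandon the view too.
            return ("view-change", self.view, tuple(sorted(self.accusers)))
        return None


tracker = ViewChangeTracker(replica_id=0, f=1)
first = tracker.on_hate(sender=1, view=0)
repeat = tracker.on_hate(sender=1, view=0)
proof = tracker.on_hate(sender=2, view=0)
```

Because f + 1 accusations include at least one from a correct replica, and the proof is forwarded to all, a correct replica only leaves the view when the others are able to follow.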
Optimizations
Checkpoint protocol to garbage collect histories
Replacing digital signatures with MAC
Replicating application state at only 2f + 1 replicas
Batching
Performance
[Figure 4, from R. Kotla et al.: throughput (Kops/sec) vs. number of clients for Unreplicated, Zyzzyva, Zyzzyva5, PBFT, HQ, and Q/U max throughput; Zyzzyva, Zyzzyva5 and PBFT are also shown with batching (B=10)]
Fig. 4. Realized throughput for the 0/0 benchmark as the number of clients varies for systems configured to tolerate f = 1 faults.
[Figure 5: latency per request (ms) vs. throughput (Kops/sec) for Zyzzyva and PBFT with B = 1, 10, 20, 40]
Fig. 5. Latency vs. throughput for systems with increasing batch sizes.
[...] compared to Zyzzyva. However, as Figure 5 shows, further increases in batch size do not significantly improve Zyzzyva’s performance. Conversely, PBFT’s performance peaks with a batch size of 20, where Zyzzyva’s throughput advantage reduces to 23%.
ACM Transactions on Computer Systems, Vol. 27, No. 4, Article 7, Publication date: December 2009.
Discussion
What have you learned?
Do you agree on the principles?
Aardvark
Aardvark [4]
NSDI’09
A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX symposium on Networked systems design and implementation, NSDI’09, pages 153–168. USENIX Association, 2009.
http://www.disi.unitn.it/~montreso/ds/papers/Aardvark.pdf
A new beginning!
http://en.wikipedia.org/wiki/File:Porc_formiguer.JPG
[4] Aardvark is the first word of the English dictionary (Oritteropo in Italian)
From the article
Surviving vs tolerating
Although current BFT systems can survive Byzantine faults without compromising safety, we contend that a system that can be made completely unavailable by a simple Byzantine failure can hardly be said to tolerate Byzantine faults.
Conventional wisdom
Handle normal and worst case separately
- remain safe in worst case
- make progress in normal case
Maximize performance when
- the network is synchronous
- all clients and servers behave correctly
The paper's verdict on this approach:
Misguided
- it encourages systems that fail to deliver BFT
Dangerous
- it encourages fragile optimizations
Futile
- it yields diminishing returns on the common case
Blueprint
Build the system around an execution path that:
- provides acceptable performance across the broadest set of executions
- is easy to implement
- is robust against Byzantine attempts to push the system away from it
Revisiting conventional wisdom
Signatures are expensive – use MACs
- Faulty clients can use MACs to generate ambiguity
- Aardvark requires clients to sign requests
View changes are to be avoided
- Aardvark uses regular view changes to maintain high throughput despite faulty primaries
Hardware multicast is a boon
- Aardvark uses separate work queues for clients and individual replicas
- Aardvark uses fully connected topology among replicas (separate NICs)
MAC Attack
[Figure: the client sends c to the Primary; the Primary forwards <c,k> to Replicas 1-3; four check marks (✔ ✔ ✔ ✔)]
[Figure: same exchange; one check mark and three crosses (✔ ✗ ✗ ✗)]
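The ambiguity that MACs allow can be illustrated with Python's standard `hmac` module. The sketch assumes the client shares one secret key with each replica and attaches one MAC per replica to its request (an authenticator); a faulty client fills in a valid entry for a single replica only. The key values, function names, and the choice of index 0 for the primary are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, message: bytes) -> bytes:
    """MAC of the message under one client-replica shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def make_authenticator(keys, message, honest=True, only_valid_for=0):
    """Build one MAC per replica; a faulty client corrupts all but one entry."""
    tags = []
    for i, key in enumerate(keys):
        tag = mac(key, message)
        if not honest and i != only_valid_for:
            tag = bytes(len(tag))  # garbage entry: all zero bytes
        tags.append(tag)
    return tags


def verify(keys, message, authenticator):
    """Replica i can check only entry i, using its own shared key."""
    return [hmac.compare_digest(mac(k, message), t)
            for k, t in zip(keys, authenticator)]


keys = [f"client-replica-{i}".encode() for i in range(4)]  # index 0 = primary
request = b"c"
honest_result = verify(keys, request, make_authenticator(keys, request))
attack_result = verify(keys, request,
                       make_authenticator(keys, request, honest=False))
```

Since no replica can check another replica's entry, the replica holding the valid entry cannot prove to the others that the request is authentic. A digital signature, which any replica can verify, removes this ambiguity, which is consistent with Aardvark requiring clients to sign requests.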
Throughput
           Best case   Faulty client   Client flood   Faulty primary   Faulty replica
PBFT       62K         0               crash          1k               250
Q/U        24K         0               crash          NA               19k
HQ         15K         NA              4.5K           NA               crash
Zyzzyva    80K         0               crash          crash            0
Aardvark   39K         39K             7.8K           37K              11K
UpRight
UpRight
Bibliography
A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’09, Oct. 2009.
http://www.disi.unitn.it/~montreso/ds/papers/UpRight.pdf
A new (B)FT replication library
Minimal intrusiveness for existing apps
Adequate performance
Goal:
- ease BFT deployment
- make explicit the incremental cost of BFT
- switching to BFT: simple change in a config file
u = max number of failures to ensure liveness
r = max number of commission failures to preserve safety
[Figure: failure classes: Crash, Omission, Commission, Byzantine]
r = u = f : BFT
r = 0 : CFT
Exposes incremental cost of BFT
- Byzantine agreement
- if r << u, BFT ≈ CFT in replication cost
Allows richer design options
- Byzantine faults are rare: u > r
- Safety more critical than liveness: r > u
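To make the replication-cost claim concrete, the sketch below evaluates a replica count of 2u + r + 1. That formula is an assumption taken from the UpRight paper's agreement stage, not something stated on these slides, and the function name is hypothetical.

```python
def agreement_replicas(u, r):
    """Replica count 2u + r + 1 (assumed formula, not stated on the slides)."""
    if u < 0 or r < 0:
        raise ValueError("u and r must be non-negative")
    return 2 * u + r + 1


bft = agreement_replicas(u=1, r=1)      # r = u = f gives the familiar 3f + 1
cft = agreement_replicas(u=1, r=0)      # r = 0 gives the crash-tolerant 2f + 1
hybrid = agreement_replicas(u=3, r=1)   # tolerate 3 failures, 1 of them Byzantine
```

Under this assumption, tolerating three failures of which at most one is a commission failure needs 8 replicas, against 7 for pure crash tolerance and 10 for full BFT with f = 3.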
Reality Check
UpRight [5] (Java; latest update Oct. 2009)
ArchiStar-BFT [6] (Java; latest update May 2015)
Bft-SMaRt [7] (Java; latest update Apr. 2016)
[5] https://code.google.com/archive/p/upright/
[6] https://github.com/archistar/archistar-bft
[7] http://bft-smart.github.io/library/
For (far in the) future lectures
S. Gaertner, M. Bourennane, C. Kurtsiefer, A. Cabello, and H. Weinfurter. Experimental demonstration of a quantum protocol for byzantine agreement and liar detection. Physical Review Letters, 100(7), Feb. 2008.
Reading material
M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proc. of the 3rd Symposium on Operating systems design and implementation, OSDI’99, pages 173–186, New Orleans, Louisiana, USA, 1999. USENIX Association.
http://www.disi.unitn.it/~montreso/ds/papers/PbftOsdi.pdf
R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP’07), Stevenson, WA, Oct. 2007. ACM.
http://www.disi.unitn.it/~montreso/ds/papers/Zyzzyva.pdf
A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX symposium on Networked systems design and implementation, NSDI’09, pages 153–168. USENIX Association, 2009.
http://www.disi.unitn.it/~montreso/ds/papers/Aardvark.pdf