Practical Byzantine Fault Tolerance

- by Sudha Elavarti

Introduction

• Industry and government increasingly rely on online information services.
• Successful malicious attacks are becoming more serious.
• Software errors are becoming more common as software grows in size and complexity.
• Both can cause faulty nodes to exhibit Byzantine behavior.
• The paper presents a practical algorithm for state machine replication that works in asynchronous systems like the Internet.

Introduction contd…

• The paper makes the following contributions:
– Describes a state machine replication protocol that survives Byzantine faults.
– Describes a number of optimizations that allow the algorithm to perform well in real systems.
– Describes the implementation of a Byzantine-fault-tolerant distributed file system.
– Provides experimental results that quantify the cost of the replication technique.

System Model

• Assumptions:
– Asynchronous distributed system where nodes are connected by a network.
– The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order.
– Byzantine failure model: faulty nodes may behave arbitrarily.
– Node failures are independent.
– The adversary cannot delay correct nodes indefinitely and cannot subvert the cryptographic techniques.

System model contd…

• Cryptographic techniques:
– Public-key signatures.
– Message authentication codes.
– Message digests produced by collision-resistant hash functions.

Service properties

• The algorithm can be used to implement any deterministic replicated service with a state and some operations.
• The algorithm provides both safety and liveness assuming no more than ⌊(n-1)/3⌋ replicas are faulty, where n is the total number of replicas.
• Safety holds regardless of how many faulty clients use the service.
• Liveness means that clients eventually receive replies to their requests, provided at most ⌊(n-1)/3⌋ replicas are faulty.

Service properties contd…

• 3f+1 is the minimum number of replicas that allows an asynchronous system to provide safety and liveness, where f is the number of faulty replicas.
• It must be possible to proceed after communicating with n-f replicas, since f replicas might be faulty and not responding.
• But the f replicas that did not respond might be non-faulty, in which case up to f of the n-f that responded might be faulty.
• The non-faulty responders must outnumber the faulty ones: n-2f > f, therefore n > 3f.
• The algorithm does not address the problem of fault-tolerant privacy.
– A faulty replica may leak information to an attacker.
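
The sizing argument above can be checked numerically. The following Python sketch is only an illustration; the function names `max_faulty`, `min_replicas`, and `honest_outnumber_faulty` are made up for this example and do not come from the paper.

```python
def max_faulty(n: int) -> int:
    """Largest f with n >= 3f + 1, i.e. floor((n - 1) / 3)."""
    if n < 1:
        raise ValueError("need at least one replica")
    return (n - 1) // 3


def min_replicas(f: int) -> int:
    """Minimum number of replicas needed to tolerate f Byzantine faults."""
    return 3 * f + 1


def honest_outnumber_faulty(n: int, f: int) -> bool:
    """After hearing from n - f replicas, up to f of them may be faulty.

    The remaining n - 2f non-faulty responders must outnumber those f.
    """
    return n - 2 * f > f


if __name__ == "__main__":
    for f in range(4):
        n = min_replicas(f)
        print(f"f={f}: n={n}, proceed after {n - f} replies, "
              f"at least {n - 2 * f} of them from non-faulty replicas")
```

With n = 3 and f = 1 the check fails (1 > 1 is false), which is why three replicas are not enough to tolerate one Byzantine fault.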

Algorithm

• The algorithm works roughly as follows:
– A client sends a request to invoke a service operation to the primary.
– The primary multicasts the request to the backups.
– Replicas execute the request and send a reply to the client.
– The client waits for f+1 replies from different replicas with the same result; this is the result of the operation.

• The set of replicas is R.
• Each replica is identified by an integer in {0, 1, …, |R|-1}.
• |R| = 3f+1, where f is the maximum number of faulty replicas.
• Replicas move through a succession of configurations called views.
• In a view, one replica is the primary and the others are backups. Views are numbered consecutively.
• The primary of a view is replica p such that p = v mod |R|, where v is the view number.
• View changes are carried out when it appears that the primary has failed.
• All non-faulty replicas agree on a total order for the execution of requests despite failures.
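
The rule p = v mod |R| means the primary role rotates through the replicas as views advance. A minimal Python sketch, where the function name `primary_of` is chosen for this example only:

```python
def primary_of(view: int, num_replicas: int) -> int:
    """Replica id of the primary in a view: p = v mod |R|."""
    if num_replicas < 1:
        raise ValueError("need at least one replica")
    return view % num_replicas


if __name__ == "__main__":
    f = 1
    num_replicas = 3 * f + 1  # |R| = 3f + 1
    for view in range(6):
        print(f"view {view}: primary is replica {primary_of(view, num_replicas)}")
```

Because the role rotates, a faulty replica can be the primary for at most f consecutive views.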

The Client

• Client c requests the execution of state machine operation o by sending a <REQUEST,o,t,c> message to the primary.
• Timestamp t is used to ensure exactly-once semantics.
• Timestamps for c's requests are totally ordered such that later requests have higher timestamps than earlier ones.
• The primary atomically multicasts the request to all the backups.
• Each replica sends the reply <REPLY,v,t,c,i,r> directly to the client, where:
– v = current view number
– t = timestamp of the corresponding request
– i = replica number
– r = result of executing the requested operation
• The client waits for f+1 replies with valid signatures from different replicas, with the same t and r, before accepting the result r.
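
The f+1 rule works because at most f replicas are faulty, so any f+1 matching replies include at least one from a non-faulty replica. The sketch below shows one way a client could count replies; it assumes signatures were already verified, and the `Reply` tuple layout and the name `accept_result` are inventions of this example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

# (replica id i, timestamp t, result r); signature assumed already verified
Reply = Tuple[int, int, Hashable]


def accept_result(replies: Iterable[Reply], timestamp: int, f: int) -> Optional[Hashable]:
    """Return r once f + 1 different replicas report the same r for this t."""
    voters: Dict[Hashable, Set[int]] = defaultdict(set)
    for replica_id, t, result in replies:
        if t != timestamp:
            continue  # reply belongs to a different request
        voters[result].add(replica_id)  # a set ignores duplicate replies
        if len(voters[result]) >= f + 1:
            return result
    return None  # not enough matching replies yet


if __name__ == "__main__":
    replies = [(0, 7, "ok"), (1, 7, "bad"), (0, 7, "ok"), (2, 7, "ok")]
    print(accept_result(replies, timestamp=7, f=1))
```

Counting replica ids in a set rather than counting messages keeps a single replica from voting twice for the same result.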

Client contd…

• If the client does not receive replies soon enough, it broadcasts the request to all replicas.
• If the request has already been processed, the replicas simply re-send the reply; replicas remember the last reply message they sent to each client.
• If the primary does not multicast the request to the group, it will eventually be suspected to be faulty by enough replicas to cause a view change.

Normal-Case Operation

• The state of each replica includes the state of the service, a message log, and the replica's current view.
• When the primary p receives a client request m, it starts a three-phase protocol.
• The three phases are pre-prepare, prepare, and commit.
• The pre-prepare and prepare phases are used to totally order requests.
• In the pre-prepare phase, the primary:
– Assigns a sequence number n to the request.
– Multicasts a pre-prepare message, with m piggybacked, to all backups and appends the message to its log.
– The message has the form <<PRE-PREPARE,v,n,d>,m>, where d is the digest of m.

• If backup i accepts the pre-prepare message, it enters the prepare phase by multicasting a <PREPARE,v,n,d,i> message to all other replicas and adds both messages to its log. Otherwise it does nothing.
• A replica (including the primary) accepts prepare messages and adds them to its log, provided:
– Their signatures are correct.
– Their view number equals the replica's current view number.
– Their sequence number is between the low and high water marks h and H.
• The predicate prepared(m,v,n,i) is true iff replica i has inserted in its log the request m, a pre-prepare for m in view v with sequence number n, and 2f prepares from different backups that match the pre-prepare.
• When prepared(m,v,n,i) becomes true, replica i multicasts a <COMMIT,v,n,D(m),i> message to the other replicas.
• Replicas accept commit messages and insert them in their log provided they are properly signed, their view number equals the replica's current view, and their sequence number is between h and H.

• The committed and committed-local predicates are defined as follows:
– committed(m,v,n) is true iff prepared(m,v,n,i) is true for all i in some set of f+1 non-faulty replicas.
– committed-local(m,v,n,i) is true iff prepared(m,v,n,i) is true and replica i has accepted 2f+1 commit messages (possibly including its own) from different replicas that match the pre-prepare for m.
• Replica i executes the operation requested by m after committed-local(m,v,n,i) is true and i's state reflects the sequential execution of all requests with lower sequence numbers.
• This ensures that all non-faulty replicas execute requests in the same order, as required to provide the safety property.
• The algorithm provides safety if all non-faulty replicas agree on the sequence numbers of requests that commit locally.
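
The prepared and committed-local predicates come down to counting distinct senders in a replica's log. The following Python sketch is a simplification under stated assumptions: messages are assumed to be already authenticated and within the water marks, the request m is assumed to be logged together with its pre-prepare, and the `ReplicaLog` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# (view v, sequence number n, digest d of request m)
Key = Tuple[int, int, str]


@dataclass
class ReplicaLog:
    f: int  # maximum number of faulty replicas
    pre_prepares: Set[Key] = field(default_factory=set)
    prepares: Dict[Key, Set[int]] = field(default_factory=dict)  # key -> backup ids
    commits: Dict[Key, Set[int]] = field(default_factory=dict)   # key -> replica ids

    def add_pre_prepare(self, key: Key) -> None:
        self.pre_prepares.add(key)

    def add_prepare(self, key: Key, backup_id: int) -> None:
        self.prepares.setdefault(key, set()).add(backup_id)

    def add_commit(self, key: Key, replica_id: int) -> None:
        self.commits.setdefault(key, set()).add(replica_id)

    def prepared(self, key: Key) -> bool:
        """Pre-prepare logged plus 2f matching prepares from different backups."""
        return (key in self.pre_prepares
                and len(self.prepares.get(key, set())) >= 2 * self.f)

    def committed_local(self, key: Key) -> bool:
        """Prepared plus 2f + 1 matching commits from different replicas."""
        return (self.prepared(key)
                and len(self.commits.get(key, set())) >= 2 * self.f + 1)


if __name__ == "__main__":
    log = ReplicaLog(f=1)
    key = (0, 1, "digest-of-m")
    log.add_pre_prepare(key)
    for backup in (1, 2):
        log.add_prepare(key, backup)
    for replica in (0, 1, 2):
        log.add_commit(key, replica)
    print(log.prepared(key), log.committed_local(key))
```

Keying every message by (v, n, d) is what makes a prepare or commit "match" the pre-prepare in this sketch.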

Garbage Collection

• Garbage collection (GC) is the mechanism used to discard messages from the log.
• For the safety condition to hold, messages must be kept in a replica's log until it knows that the requests they concern have been executed by at least f+1 non-faulty replicas.
• This is achieved with checkpoints, which occur when a request whose sequence number n is divisible by some constant is executed.
• When replica i produces a checkpoint, it multicasts a <CHECKPOINT,n,d,i> message to the other replicas.
• Each replica collects checkpoint messages in its log until it has 2f+1 of them for sequence number n with the same digest d.
• These 2f+1 messages make the checkpoint stable, and the replica discards all pre-prepare, prepare, and commit messages with sequence number less than or equal to n.
• The checkpoint protocol is used to advance the low and high water marks. The low water mark h is the sequence number of the last stable checkpoint, and the high water mark is H = h+k, where k is large enough that replicas do not stall waiting for a checkpoint to become stable.
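
A small Python sketch of how a stable checkpoint could advance the water marks. The class `CheckpointTracker` and its methods are hypothetical, checkpoint messages are assumed to be already authenticated, and treating the window as h < n <= H is an assumption of this example rather than something the slides specify.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class CheckpointTracker:
    def __init__(self, f: int, k: int) -> None:
        self.f = f
        self.k = k
        self.low = 0  # low water mark h: last stable checkpoint
        self.votes: Dict[Tuple[int, str], Set[int]] = defaultdict(set)

    @property
    def high(self) -> int:
        """High water mark H = h + k."""
        return self.low + self.k

    def in_window(self, n: int) -> bool:
        """Assumed window for acceptable sequence numbers: h < n <= H."""
        return self.low < n <= self.high

    def on_checkpoint(self, n: int, digest: str, replica_id: int) -> bool:
        """Record <CHECKPOINT,n,d,i>; return True if n just became stable."""
        if n <= self.low:
            return False  # already covered by a stable checkpoint
        self.votes[(n, digest)].add(replica_id)
        if len(self.votes[(n, digest)]) < 2 * self.f + 1:
            return False
        self.low = n
        # Discard bookkeeping for sequence numbers at or below the new h.
        self.votes = defaultdict(
            set, {key: ids for key, ids in self.votes.items() if key[0] > n})
        return True


if __name__ == "__main__":
    tracker = CheckpointTracker(f=1, k=200)
    for replica in (0, 1, 3):
        tracker.on_checkpoint(100, "state-digest", replica)
    print(tracker.low, tracker.high)
```

A real replica would also drop the pre-prepare, prepare, and commit messages up to n at the point where `low` is updated.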

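View Changes

• The view-change protocol provides liveness by allowing the system to make progress when the primary fails.
• View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute.
• If the timer of backup i expires in view v, the backup starts a view change to move the system to view v+1. It stops accepting messages (other than checkpoint, view-change, and new-view messages) and multicasts a <VIEW-CHANGE,v+1,n,C,P,i> message to all replicas.
– n is the sequence number of the last stable checkpoint known to i, and C is a set of 2f+1 valid checkpoint messages proving that checkpoint.
– P contains, for each request that prepared at i with a sequence number higher than n, the pre-prepare and the 2f matching prepares.
• When the primary p of view v+1 receives 2f valid view-change messages for view v+1 from other replicas, it multicasts a <NEW-VIEW,v+1,V,O> message to all other replicas.
– V is the set of valid view-change messages the primary received plus its own, and O is a set of pre-prepare messages for the new view.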

Liveness

• To provide liveness, replicas must move to a new view if they are unable to execute a request.
• To avoid starting a view change too soon, a replica that multicasts a view-change message for view v+1 waits for 2f+1 view-change messages for view v+1 and then starts its timer to expire after some time T.
• If the timer expires before the replica receives a valid new-view message, it starts the view change for view v+2, but this time it waits 2T before starting a view change for view v+3.
• If a replica receives f+1 valid view-change messages from other replicas for views greater than its current view, it sends a view-change message for the smallest view in the set, even if its timer has not expired.
• Faulty replicas cannot cause a view change on their own by sending view-change messages: a view change happens only if at least f+1 replicas send view-change messages.
• The above three techniques guarantee liveness unless message delays grow faster than the timeout period indefinitely.
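
Two of these rules are easy to state as code: the timeout that doubles with each failed view change, and the f+1 rule for joining a view change early. This is a rough Python sketch; the function names `view_change_timeout` and `view_to_join` and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, Optional


def view_change_timeout(base_timeout: float, failed_attempts: int) -> float:
    """Wait T for the first new-view message, 2T for the next, then 4T, ..."""
    if failed_attempts < 0:
        raise ValueError("failed_attempts must be non-negative")
    return base_timeout * (2 ** failed_attempts)


def view_to_join(current_view: int, requested: Dict[int, int], f: int) -> Optional[int]:
    """requested maps replica id -> view named in its valid view-change message.

    If at least f + 1 other replicas ask for views above the current one,
    return the smallest such view; otherwise return None.
    """
    higher = [view for view in requested.values() if view > current_view]
    if len(higher) >= f + 1:
        return min(higher)
    return None


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, view_change_timeout(1.0, attempt))
    print(view_to_join(3, {1: 5, 2: 4}, f=1))
```

Requiring f+1 senders means at least one of them is non-faulty, so faulty replicas alone cannot pull a correct replica into a view change.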

Optimizations

Reducing Communication

• Three optimizations are used to reduce the cost of communication:
– The first avoids sending most large replies, which reduces bandwidth consumption and CPU overhead.
– The second reduces the number of message delays for an operation invocation.
– The third improves the performance of read-only operations that do not modify the service state.

Cryptography

• Digital signatures are used only for view-change and new-view messages. All other messages are authenticated using message authentication codes (MACs).
• MACs can be computed three orders of magnitude faster than digital signatures.
• Other public-key cryptosystems generate signatures faster but verify them more slowly, and in this algorithm each signature is verified many times.
• Each node shares a 16-byte secret session key with each replica.
• The digital signature in a reply message is replaced by a single MAC; signatures in all other messages are replaced by vectors of MACs called authenticators.
• The time to verify an authenticator is constant; its size grows linearly with the number of replicas, but slowly.
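
An authenticator can be pictured as one MAC per receiving replica, each computed with the session key shared with that replica. The Python sketch below uses HMAC-SHA256 from the standard library as a stand-in; that choice, like the names `make_authenticator` and `verify_entry`, is an assumption of this example and not the MAC construction used in the paper.

```python
import hashlib
import hmac
import os
from typing import Dict


def make_authenticator(message: bytes, session_keys: Dict[int, bytes]) -> Dict[int, bytes]:
    """One MAC per replica: cost and size grow linearly with the replica count."""
    return {replica_id: hmac.new(key, message, hashlib.sha256).digest()
            for replica_id, key in session_keys.items()}


def verify_entry(message: bytes, authenticator: Dict[int, bytes],
                 replica_id: int, key: bytes) -> bool:
    """A replica checks only its own entry, so verification is constant time."""
    tag = authenticator.get(replica_id)
    if tag is None:
        return False
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


if __name__ == "__main__":
    # 16-byte session keys shared between the sender and each of 4 replicas
    keys = {replica_id: os.urandom(16) for replica_id in range(4)}
    auth = make_authenticator(b"PREPARE|v=0|n=1|d=abc", keys)
    print(verify_entry(b"PREPARE|v=0|n=1|d=abc", auth, 2, keys[2]))
```

Unlike a signature, a MAC entry convinces only the replica that holds the matching session key.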

Implementation

The Replication Library

• The client interface to the replication library consists of a single procedure, invoke, with one argument: an input buffer containing a request to invoke a state machine operation.
• On the server side, the replication code makes a number of upcalls to procedures that the server part of the application must implement.
• The procedures are execute, make_checkpoint, delete_checkpoint, get_digest, get_checkpoint, and set_checkpoint.
• Point-to-point communication between nodes is implemented using UDP, and multicast to the group of replicas is implemented using UDP over IP multicast.
• The algorithm tolerates out-of-order delivery and rejects duplicates.

Byzantine-Fault-tolerant File System

• BFS is implemented using the replication library.
• Application processes run unmodified and interact with the file system through the NFS client in the kernel.
• User-level relay processes mediate communication between the standard NFS client and the replicas.
• A relay receives NFS requests, calls the invoke procedure of the replication library, and sends the result back to the NFS client.
• Each replica runs a user-level process with the replication library and an NFS V2 daemon, referred to as snfsd.
• The replication library receives requests from the relay and interacts with snfsd by making upcalls.

Performance Evaluation

• Experimental setup:
– Experiments measure normal-case behavior (no view changes).
– All experiments run with one client running two relay processes, and four replicas. Four replicas can tolerate one Byzantine fault.
• A micro-benchmark provides a service-independent evaluation of the performance of the replication library.
• The Andrew benchmark is used to compare BFS with two other file systems:
– The NFS V2 implementation in Digital UNIX.
– BFS without replication.

Conclusion

• The algorithm works correctly in asynchronous systems like the Internet.
• The performance of BFS is only 3% worse than the standard NFS implementation.
– The good performance is due to replacing public-key signatures with message authentication codes, reducing the size and number of messages, and the incremental checkpoint management technique.
• One reason why Byzantine-fault-tolerant algorithms will be important in the future is that they allow systems to work correctly even when there are software errors.
– Not all errors are survivable: the approach cannot mask a software error that occurs at all replicas.
– It can mask errors that occur independently at different replicas, including non-deterministic software errors and persistent errors.

