Practical Byzantine Fault Tolerance
- by Sudha Elavarti
                 Introduction
• The growing reliance of industry and government on
  online information services.
• Successful malicious attacks are becoming more serious.
• Software errors are increasingly common due to the growth
  in the size and complexity of software.
• These cause faulty nodes to exhibit Byzantine
  behavior.
• The paper presents a practical algorithm for state-machine
  replication that works in asynchronous systems like
  the Internet.
               …continued
• The paper makes the following contributions:
   – Describes a state-machine replication protocol that
     survives Byzantine faults.
   – Describes a number of optimizations that allow the
     algorithm to perform well in real systems.
   – Describes the implementation of a Byzantine-fault-
     tolerant distributed file system.
   – Provides experimental results that quantify the cost
     of the replication technique.
System Model
• Assumptions:
   – Asynchronous distributed system where nodes
     are connected by a network.
   – The network may fail to deliver messages, delay them,
     duplicate them, or deliver them out of order.
   – Byzantine failure model: faulty nodes may behave
     arbitrarily.
   – Independent node failures.
   – The adversary cannot delay correct nodes
     indefinitely and cannot subvert the cryptographic
     techniques.
        System model contd…
• Cryptographic techniques:
  – Public-key signatures.
  – Message authentication codes.
  – Message digests produced by collision-resistant
    hash functions.
Service properties
• The algorithm can be used to implement any
  deterministic replicated service with a state and
  some operations.
• The algorithm provides both safety and liveness
  assuming no more than ⌊(n−1)/3⌋ replicas are faulty.
• Safety is provided regardless of the number of faulty
  clients using the service.
• Liveness is guaranteed, i.e. clients eventually receive
  replies to their requests, provided at most ⌊(n−1)/3⌋
  replicas are faulty.
Service properties contd..
• 3f+1 is the minimum number of replicas that allow an
  asynchronous system to provide safety and liveness.
   – Where f is the maximum number of replicas that may
     be faulty.
• n = 3f+1 replicas are needed because it must be possible
  to proceed after communicating with n−f replicas, since f
  replicas might be faulty and not responding.
• But the f replicas that did not respond may be non-faulty,
  and therefore f of those that responded may be faulty.
• The non-faulty responses must outnumber the faulty ones:
  n−2f > f, therefore n > 3f (see the sketch below).
• The algorithm does not address the problem of fault-tolerant
  privacy.
   – A faulty replica may leak information to an attacker.
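A minimal sketch (not from the paper) of this counting argument in
Python; the quorum size 2f+1 = n−f and the helper name are
illustrative:

    # Check the quorum arithmetic behind n = 3f + 1 for small f.
    def quorum_intersection_ok(f: int) -> bool:
        n = 3 * f + 1        # total replicas
        quorum = 2 * f + 1   # replies a replica can wait for (n - f)
        # Any two quorums overlap in at least 2*quorum - n replicas;
        # the overlap must contain at least one non-faulty replica,
        # i.e. it must be larger than the f possibly-faulty ones.
        return 2 * quorum - n >= f + 1

    for f in range(1, 6):
        assert quorum_intersection_ok(f)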
Algorithm
• The algorithm works roughly as follows:
   – A client sends a request to invoke a service
     operation to the primary.
   – The primary multicasts the request to the
     backups.
   – Replicas execute the request and send a reply to
     the client.
   – The client waits for f+1 replies from different
     replicas with the same result; this is the result of
     the operation (sketched below).
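A hedged sketch of the client's acceptance rule; the function name
and the reply representation are assumptions, not from the paper:

    from collections import Counter

    def accept_result(replies, f):
        """replies: iterable of (replica_id, result) pairs. Returns a
        result once f + 1 different replicas report the same value,
        else None."""
        votes, seen = Counter(), set()
        for replica_id, result in replies:
            if replica_id in seen:      # count each replica only once
                continue
            seen.add(replica_id)
            votes[result] += 1
            if votes[result] >= f + 1:  # at least one is non-faulty
                return result
        return None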
• Set of replicas – R.
• Each replica is identified by an integer in {0, 1, …, |R|−1}.
• |R| = 3f+1, where f is the maximum number of faulty replicas.
• Replicas move through a succession of configurations
  called views.
• In a view, one replica is the primary and the others are
  backups. Views are numbered consecutively.
• The primary of a view is the replica p such that
  p = v mod |R|, where v is the view number (sketched below).
• View changes are carried out when it appears that the
  primary has failed.
• All non-faulty replicas agree on a total order for the
  execution of requests despite failures.
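The primary rotation is simple enough to state in code; a one-line
sketch (the function name is illustrative):

    def primary(v: int, num_replicas: int) -> int:
        return v % num_replicas   # p = v mod |R|

    assert primary(0, 4) == 0 and primary(5, 4) == 1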
The Client
• Client c requests the execution of state machine operation o by
  sending a <REQUEST, o, t, c> message to the primary.
• Timestamp t is used to ensure exactly-once semantics.
• Timestamps for c's requests are totally ordered such that later
  requests have higher timestamps than earlier ones.
• The primary atomically multicasts the request to all the backups.
• Each replica sends a reply <REPLY, v, t, c, i, r> directly to the client.
   – Where v = current view number
           t = timestamp of the corresponding request
           i = replica number
           r = result of executing the requested operation.
• The client waits for f+1 replies with valid signatures from different
  replicas, with the same t and r, before accepting the result r
  (message shapes sketched below).
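A sketch of the message shapes just described; the field names follow
the slide, everything else is an assumption:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Request:
        o: str     # operation to invoke
        t: int     # client timestamp, monotonically increasing
        c: int     # client id

    @dataclass(frozen=True)
    class Reply:
        v: int     # current view number
        t: int     # timestamp of the corresponding request
        c: int     # client id
        i: int     # replica number
        r: bytes   # result of executing the operation

The client matches replies on (t, r) and requires f+1 distinct values
of i before accepting.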
                    Client contd…
• If the client does not receive replies soon enough, it broadcasts the
  request to all replicas. If the request has already been processed,
  the replicas simply re-send the reply; replicas remember the last
  reply message they sent to each client.
• If the primary does not multicast the request to the group, it will
  eventually be suspected to be faulty by enough replicas to cause a
  view change.
Normal-Case Operation
• The state of each replica includes a message log.
• When primary p receives a client request m, it starts a
  three-phase protocol.
• The three phases are: pre-prepare, prepare, and commit.
• The pre-prepare and prepare phases are used to totally order
  requests sent in the same view.
• In the pre-prepare phase:
   – The primary assigns a sequence number n to the request.
   – It multicasts a pre-prepare message with m piggybacked to all
     the backups and appends the message to its log.
   – Msg = <<PRE-PREPARE, v, n, d>, m>, where d is message m's
     digest.
• If backup i accepts the pre-prepare message, it enters the prepare
  phase by multicasting a <PREPARE, v, n, d, i> message to all other
  replicas and adds both messages to its log. Otherwise it does
  nothing (acceptance checks are sketched below).
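A hedged sketch of the checks a backup might apply before accepting a
pre-prepare; the log layout and digest helper are assumptions, and
signature verification is elided:

    import hashlib, json

    def digest(m) -> str:
        return hashlib.sha256(
            json.dumps(m, sort_keys=True).encode()).hexdigest()

    def accept_pre_prepare(v, n, d, m, replica) -> bool:
        if v != replica["view"]:          # must be in the current view
            return False
        if not (replica["h"] < n <= replica["H"]):  # within water marks
            return False
        if d != digest(m):                # digest must match the request
            return False
        # Never accept two different digests for the same (v, n).
        prior = replica["pre_prepares"].get((v, n))
        return prior is None or prior == d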
• A replica (including the primary) accepts prepare messages and
  adds them to its log, provided:
   – Their signatures are correct.
   – Their view number equals the replica’s current view number.
   – Their sequence number is between the water marks h and H.
• We define the predicate prepared(m,v,n,i) to be true iff replica i
  has logged the request m, a pre-prepare for m in view v with
  sequence number n, and 2f prepares from different backups that
  match the pre-prepare.
• When prepared(m,v,n,i) becomes true, replica i multicasts a
  <COMMIT, v, n, D(m), i> message to the other replicas.
• Replicas accept commit messages and insert them in their log
  provided their signatures are correct, their view number equals the
  replica's current view, and their sequence number is between h and H.
• We define the committed and committed-local predicates as follows
  (sketched below):
   – Committed(m,v,n) = true iff prepared(m,v,n,i) is true for all i in
     some set of f+1 non-faulty replicas.
   – Committed-local(m,v,n,i) = true iff prepared(m,v,n,i) is true and
     the replica has accepted 2f+1 commit messages from different
     replicas that match the pre-prepare for m.
• Replica i executes the operation requested by m after committed-
  local(m,v,n,i) = true and i's state reflects the sequential execution of
  all requests with lower sequence numbers.
• This ensures that all non-faulty replicas execute requests in the
  same order, as required to provide the safety property.
• The algorithm provides safety if all non-faulty replicas agree on the
  sequence numbers of requests that commit locally.
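A minimal sketch of the two local predicates over a replica's log; the
log layout (a digest per (v, n) for pre-prepares, sets of sender ids
keyed by (v, n, digest) for prepares and commits) is an assumption:

    def prepared(log, d, v, n, f) -> bool:
        # The matching pre-prepare is logged, plus 2f prepares from
        # different backups that match it.
        if log["pre_prepares"].get((v, n)) != d:
            return False
        return len(log["prepares"].get((v, n, d), set())) >= 2 * f

    def committed_local(log, d, v, n, f) -> bool:
        # prepared(...) holds and 2f + 1 matching commits (possibly
        # including the replica's own) have been accepted.
        return (prepared(log, d, v, n, f)
                and len(log["commits"].get((v, n, d), set())) >= 2 * f + 1)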
Garbage Collection
• Garbage collection is the mechanism used to discard messages from
  the log.
• For the safety condition to hold, messages must be kept in a
  replica's log until it knows that the requests they concern have been
  executed by at least f+1 non-faulty replicas.
• This is achieved by checkpoints, which occur when a request whose
  sequence number n is divisible by some constant is executed.
• When a replica i produces a checkpoint, it multicasts a
  <CHECKPOINT, n, d, i> message to the other replicas, where d is
  the digest of the state.
• Each replica collects checkpoint messages in its log until it has 2f+1
  of them for sequence number n with the same digest d.
• This creates a stable checkpoint, and the replica discards all pre-
  prepare, prepare and commit messages with sequence number less
  than or equal to n.
• The checkpoint protocol is used to advance the low and high water
  marks (sketched below). The low water mark h = the sequence
  number of the last stable checkpoint, and the high water mark
  H = h + k, where k is large enough that replicas do not stall waiting
  for a checkpoint to become stable.
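A hedged sketch of this bookkeeping; the interval, window size, and
replica-state layout are assumptions:

    CHECKPOINT_INTERVAL = 100   # checkpoint every 100 requests
    K = 200                     # log window size, so H = h + K

    def maybe_checkpoint(replica, n, state_digest, multicast):
        if n % CHECKPOINT_INTERVAL == 0:
            multicast(("CHECKPOINT", n, state_digest, replica["id"]))

    def on_checkpoint(replica, n, d, sender, f):
        senders = replica["checkpoints"].setdefault((n, d), set())
        senders.add(sender)
        if len(senders) >= 2 * f + 1 and n > replica["h"]:
            # Stable checkpoint: advance the water marks and discard
            # protocol messages with sequence numbers <= n.
            replica["h"], replica["H"] = n, n + K
            replica["pre_prepares"] = {k: v for k, v in
                                       replica["pre_prepares"].items()
                                       if k[1] > n}
            for kind in ("prepares", "commits"):
                replica[kind] = {k: v for k, v in replica[kind].items()
                                 if k[1] > n}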
View Changes
• The view-change protocol provides liveness by allowing the
  system to make progress when the primary fails.
  View changes are triggered by timeouts that prevent backups
  from waiting indefinitely for requests to execute.
• If the timer of a backup expires in view v, the backup starts a
  view change to move the system to view v+1. It stops
  accepting messages (other than checkpoint, view-change,
  and new-view messages) and multicasts a <VIEW-CHANGE,
  v+1, n, C, P, i> message, where n is the sequence number of
  the last stable checkpoint, C proves that checkpoint, and P
  describes the requests prepared since it (sketched below).
• When the primary p of view v+1 receives 2f valid view-change
  messages from other replicas, it multicasts a
  <NEW-VIEW, v+1, V, O> message to all other replicas.
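A minimal sketch of the trigger at a backup, assuming a timer
callback and the replica-state fields used in the earlier sketches:

    def on_request_timeout(replica, multicast):
        # Move to view v+1 and stop normal-case message processing.
        new_view = replica["view"] + 1
        replica["status"] = "view-change"
        n = replica["h"]                # last stable checkpoint
        C = replica["stable_proof"]     # 2f+1 checkpoints proving n
        P = replica["prepared_proofs"]  # requests prepared after n
        multicast(("VIEW-CHANGE", new_view, n, C, P, replica["id"]))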
Liveness
• To provide liveness, replicas must move to a new view if they are
  unable to execute a request.
• To avoid starting a view change too soon, a replica that multicasts a
  view-change message for view v+1 waits for 2f+1 view-change
  messages and then starts a timer T.
• If the timer T expires before it receives a new-view message, it starts
  the view change for view v+2, and will wait 2T before starting a
  view change from v+2 to v+3 (sketched below).
• If a replica receives f+1 valid view-change messages from other
  replicas for views greater than its current view, it sends a view-
  change message for the smallest view in the set, even if its timer
  has not expired.
• Faulty replicas alone cannot cause a view change by sending view-
  change messages: a view change will happen only if at least f+1
  replicas send view-change messages.
• These three techniques guarantee liveness unless message
  delays grow faster than the timeout period indefinitely.
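A sketch of the doubling timeout; the base value and function name
are assumptions:

    BASE_T = 1.0   # seconds; the initial timeout T

    def view_change_timeout(attempt: int) -> float:
        """Attempt 1 (view v+1) waits T, attempt 2 (v+2) waits 2T,
        attempt 3 waits 4T, and so on."""
        return BASE_T * (2 ** (attempt - 1))

    assert view_change_timeout(1) == 1.0
    assert view_change_timeout(2) == 2.0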
Optimizations
Reducing Communication

• Three optimizations are used to reduce
  the cost of communication.
  – The first avoids sending most large replies: the client
    request designates one replica to send the full result,
    while the others send only digests.
     • Reduces bandwidth consumption.
     • Reduces CPU overhead.
  – The second optimization reduces the number of
    message delays for an operation invocation.
  – The third optimization improves the performance
    of read-only operations that do not modify the
    service state.
Cryptography
• Digital signatures are used only for view-change and new-view
  messages. All other messages are authenticated using message
  authentication codes (MACs).
• MACs can be computed three orders of magnitude faster than digital
  signatures.
• Other public-key cryptosystems generate signatures faster, but
  verification is slower, and in this algorithm each signature is verified
  many times.
• Each node shares a 16-byte secret session key with each replica.
• The digital signature in a reply message is replaced by a single MAC;
  signatures in all other messages are replaced by vectors of MACs
  called authenticators (sketched below).
• The time to verify an authenticator is constant; its size grows linearly
  with the number of replicas, but slowly.
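A minimal sketch of an authenticator built from pairwise session
keys; the use of Python's hmac module (over MD5, period-appropriate
for the paper) and the key-table layout are assumptions:

    import hmac, hashlib

    def make_authenticator(message: bytes, session_keys: dict) -> dict:
        """session_keys maps replica id -> 16-byte shared key; the
        result holds one MAC per replica."""
        return {rid: hmac.new(key, message, hashlib.md5).digest()
                for rid, key in session_keys.items()}

    def verify_own_entry(message: bytes, auth: dict,
                         my_id: int, my_key: bytes) -> bool:
        # Each receiver checks only its own entry, so verification
        # time is constant regardless of the number of replicas.
        mine = auth.get(my_id)
        expected = hmac.new(my_key, message, hashlib.md5).digest()
        return mine is not None and hmac.compare_digest(mine, expected)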
Implementation
The Replication Library

• The client interface to the replication library consists of a single
  procedure, invoke, with one argument: an input buffer containing a
  request to invoke a state machine operation.
• On the server side, the replication code makes a number of upcalls
  to procedures that the server part of the application must implement.
• The procedures are execute, make_checkpoint, delete_checkpoint,
  get_digest, get_checkpoint, and set_checkpoint (sketched below).
• Point-to-point communication between nodes is implemented using
  UDP, and multicast to the group of replicas is implemented using
  UDP over IP multicast.
• The algorithm tolerates out-of-order delivery and rejects duplicates.
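A sketch of that upcall interface; the method names come from the
slide, while the signatures are assumptions:

    from abc import ABC, abstractmethod

    class ReplicatedService(ABC):
        """Upcalls the server side of an application must provide."""
        @abstractmethod
        def execute(self, request: bytes, client_id: int) -> bytes: ...
        @abstractmethod
        def make_checkpoint(self, seqno: int) -> None: ...
        @abstractmethod
        def delete_checkpoint(self, seqno: int) -> None: ...
        @abstractmethod
        def get_digest(self, seqno: int) -> bytes: ...
        @abstractmethod
        def get_checkpoint(self, seqno: int) -> bytes: ...
        @abstractmethod
        def set_checkpoint(self, state: bytes) -> None: ...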
Byzantine-Fault-Tolerant File System
• BFS is implemented using the replication library.
• Application processes run unmodified and interact through the NFS
  client in the kernel.
• User-level relay processes mediate communication between the
  standard NFS client and the replicas.
• The relay receives NFS requests, invokes the invoke procedure of the
  replication library, and sends the result back to the NFS client.
• Each replica runs a user-level process with the replication library and
  an NFS V2 daemon, referred to as snfsd.
• The replication library receives requests from the relay and interacts
  with snfsd by making upcalls.
Performance Evaluation
• Experimental setup:
   – Experiments measure normal-case behavior (no view changes).
   – All experiments ran with one client running two relays and four
     replicas. Four replicas can tolerate one Byzantine fault.
• A micro-benchmark provides a service-independent evaluation of the
  performance of the replication library.
• The Andrew benchmark is used to compare BFS with two other file
  systems:
   – The NFS V2 implementation in Digital UNIX.
   – BFS without replication.
Conclusion
• The algorithm works correctly in asynchronous systems like the
  Internet.
• The performance of BFS is only 3% worse than the standard NFS
  implementation.
    – The good performance is due to replacing public-key signatures with
      message authentication codes, reducing the size and number of
      messages, and the incremental checkpoint-management techniques.
• One reason why Byzantine-fault-tolerant algorithms will be important
  in the future is that they allow systems to work correctly even when
  there are software errors:
    – Not all software errors: errors that occur at all replicas cannot
      be masked.
    – It can mask errors that occur independently at different replicas.
    – Non-deterministic software errors are the best candidates, since
      they are the hardest to detect and tend to be the most persistent.
