Fault Tolerance (PowerPoint download) by dandanhuanghuang



Lecture # 10
• Recovery refers to restoring a system to its normal
  operational state
• Generally, it is a very complicated process
• Following are the issues involved in recovery:
  – Basic causes that lead to failures and the types
  – How a process can recover from failure when it does not
    interact with another process
  – Effects of a process failing on other processes in
    concurrent systems
  – Techniques to recover cooperating processes, and
  – Recovery in distributed database systems
• A system consists of a set of hardware and software
  components and is designed to provide a specified service.
• Failure of a system occurs when the system does not
  perform its services in the manner specified.
• An erroneous state of the system is a state which could lead
  to a system failure by a sequence of valid state transitions
• A fault is an anomalous physical condition.
• An error is a manifestation of a fault in a system, which can
  lead to system failure
• Recovery
  – Failure recovery is a process that restores an
    erroneous state to an error-free state. (after a
    failure, restoring system to its "normal" state)
• Failure Classification
  – process failure
  – system failure
  – secondary storage failure
  – communication medium failure
• Tolerating Process Failures
  – signal process to recover internally
  – restart process from a prior state
  – abort process
• What are some situations where each is appropriate?
• Recovering from System Failures
  –   amnesia -- restart in predefined state
  –   partial amnesia -- reset part of the state to a predefined state
  –   pause -- roll back to before failure
  –   halting -- give up
• Tolerating Secondary Storage Failures
  – archiving (periodic backup)
  – mirroring (continuous)
  – activity logging
• Tolerating Communication Medium Failures
  – ack & resend
  – more complex fault-tolerant algorithms
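
The "ack & resend" idea can be sketched as follows. This is a minimal illustration, not a real network protocol: the `make_lossy_channel` helper and its `drop_first_n` parameter are hypothetical stand-ins for an unreliable medium, and a returned `True` stands for an ACK arriving before the timeout.

```python
def send_with_retry(payload, channel, max_attempts=5):
    """Resend payload until the channel reports an ACK or attempts run out.

    Returns the number of transmissions that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        if channel(payload):  # True models "ACK received in time"
            return attempt
    raise TimeoutError("no ACK after %d attempts" % max_attempts)


def make_lossy_channel(drop_first_n):
    """Hypothetical channel that loses the first n transmissions."""
    state = {"sent": 0, "delivered": []}

    def channel(payload):
        state["sent"] += 1
        if state["sent"] <= drop_first_n:
            return False  # message lost, so no ACK comes back
        state["delivered"].append(payload)
        return True

    return channel, state
```

In this sketch only the message can be lost. If the ACK itself could be lost, the receiver would see duplicates, which real protocols typically filter with sequence numbers.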
• Backward versus Forward Error Recovery
• forward error recovery: skip forward to a new correct state
   – requires contextual knowledge of what "forward" is
• backward error recovery: go back to a previous correct state
   – overhead: takes time to save and restore state
   – fault may repeat (∞ cycling)
   – recovery may be impossible
• Backward Error Recovery
• based on recovery points
• two approaches:
   – operation-based recovery
   – state-based recovery
System Model
     Two Approaches to Fault Tolerance
• operation based
  – record a log (audit trail) of the operations performed
  – restore previous state by reversing steps
• state based
  – record a snapshot of the state (checkpoint)
  – restore state by reloading snapshot (rollback)
• Practical systems employ a combination of the two
  approaches, e.g., logging with periodic full-DB
  snapshots for archive.
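
The two approaches can be contrasted with a small sketch over an in-memory dictionary. The class names `OperationLog` and `Checkpointer` are illustrative assumptions, not taken from the lecture; a real system would keep the log and the snapshot on stable storage.

```python
import copy

_MISSING = object()  # marks "key did not exist before this write"


class OperationLog:
    """Operation-based recovery: log old values, undo in reverse order."""

    def __init__(self, state):
        self.state = state
        self.log = []  # audit trail of (key, previous value)

    def write(self, key, value):
        self.log.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def undo_all(self):
        while self.log:
            key, old = self.log.pop()  # reverse the most recent step first
            if old is _MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old


class Checkpointer:
    """State-based recovery: save a snapshot, reload it on rollback."""

    def __init__(self, state):
        self.state = state
        self.snapshot = None

    def checkpoint(self):
        self.snapshot = copy.deepcopy(self.state)

    def rollback(self):
        self.state.clear()
        self.state.update(copy.deepcopy(self.snapshot))
```

The log grows with the number of operations but records only what changed, while the checkpoint costs a full copy each time it is taken; this trade-off is one reason for combining the two.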
     Fundamental Issues in Crash Recovery
•   disk writes are only atomic by sector
•   updates and commits require multiple writes
•   a crash may occur between writes
•   log contains record of updates, commits, and aborts
•   data is written to disk asynchronously
    – DB is cached
    – log is buffered
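
One way to see why the log records updates, commits, and aborts is a redo-only recovery pass, sketched below. The tuple record format and the `recover` function are assumptions made for illustration, and the sketch rebuilds the database from an empty state rather than from a cached copy.

```python
def recover(log):
    """Rebuild the database from a log, redoing only committed transactions.

    Assumed record formats (hypothetical):
      ("update", txn, key, value), ("commit", txn), ("abort", txn)
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    db = {}
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _txn, key, value = rec
            db[key] = value  # redo the update in log order
    return db


# A crash occurs after T2's second update but before T2 commits.
crash_log = [
    ("update", "T1", "x", 10),
    ("update", "T2", "y", 7),
    ("commit", "T1"),
    ("update", "T2", "x", 99),
]
recovered = recover(crash_log)
```

Here `recovered` contains only T1's write, because T2 never reached its commit record before the crash.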
• The textbook jumps right into the problem of supporting
  crash recovery, without first reviewing any basic transaction
  models. The following are two more basic models than those
  mentioned in the text.
• Problems in Distributed/Concurrent Systems
• communicating processes must coordinate
  checkpoints & rollbacks
• lost messages
• orphan messages
• livelocks
Orphan Messages and Domino Effect
Lost Messages
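
Orphan and lost messages can be identified mechanically once each process has rolled back to a checkpoint. The sketch below assumes a hypothetical representation in which every process numbers its own events and a checkpoint is the last event number that survives the rollback.

```python
def classify(messages, checkpoint):
    """Split messages into orphans and lost messages after a rollback.

    messages:   list of (sender, send_time, receiver, recv_time), where
                times are local event numbers at the respective process
    checkpoint: dict mapping process -> last local event kept after rollback
    """
    orphans, lost = [], []
    for msg in messages:
        sender, send_time, receiver, recv_time = msg
        send_kept = send_time <= checkpoint[sender]
        recv_kept = recv_time <= checkpoint[receiver]
        if recv_kept and not send_kept:
            orphans.append(msg)  # receipt is recorded, but the send was undone
        elif send_kept and not recv_kept:
            lost.append(msg)     # send is recorded, but the receipt was undone
    return orphans, lost


msgs = [("P", 5, "Q", 3), ("Q", 1, "P", 6), ("P", 1, "Q", 2)]
orphans, lost = classify(msgs, {"P": 4, "Q": 4})
```

An orphan forces the receiver to roll back further, which may create new orphans in turn; that cascade is the domino effect.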
      Fault Tolerance

 Avoiding disruptions due to failures
and improving availability
                     An Overview
• A system can be designed to be fault-tolerant in two ways:
  – Masking
     • It continues to perform its specified function in the event of a
       failure
  – Automated Responding
     • A system designed for well-defined behavior may or may not
       perform the specified function; however, it can facilitate actions
       suitable for recovery
• One key approach used to tolerate failures is redundancy
  – A system may employ multiple processes,
    multiple hardware components, etc.
                 Governing Issues
• The implications of types of failure
   – Process Deaths
      • The resources allocated to a dead process must be recouped.
      • If a server fails to reply, clients should be informed; otherwise
        they may wait indefinitely
   – Machine Failure
      • All the processes running will die.
      • An absence of any kind of message indicates either process
        death or machine failure
   – Network Failure
      • It can partition a network into subnets
      • A process cannot distinguish between machine failure and
        network failure
      • Underlying communication network recognizes a machine failure.
     Atomic Actions & Committing
• A machine-level instruction is indivisible,
  instantaneous, and cannot be interrupted.
• It is desirable to be able to group such instructions to make
  the group an atomic operation.
• Atomic actions are the basic building blocks in constructing
  fault-tolerant operations.
• Process interactions are prevented to maintain the integrity
  of the system.
• A transaction groups a sequence of actions and the group is
  treated as an atomic action to maintain the consistency of a
• At some point during the execution, the transaction decides
  whether to commit or abort its actions.
   – Commit: an unconditional guarantee that the transaction will be
     completed
   – Abort: an unconditional guarantee to back out of the transaction, and
     none of the effects of its actions will persist.
An Example
     Process P1 Process P2
            -          -
            -          -
       Lock(X);    Lock(X);
       X := X+Z;  X := X+Y;
       Unlock(X); Unlock(X);
            -          -
            -          -
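
The example above can be reproduced with threads standing in for processes P1 and P2. This is a sketch under that assumption; the `Shared` class and the `run` helper are illustrative names, and the `with shared.lock:` block plays the role of Lock(X) ... Unlock(X).

```python
import threading


class Shared:
    """Holds the shared variable X and the lock that guards it."""

    def __init__(self):
        self.x = 0
        self.lock = threading.Lock()


def add_repeatedly(shared, amount, times):
    for _ in range(times):
        with shared.lock:  # Lock(X) on entry, Unlock(X) on exit
            shared.x = shared.x + amount


def run(z, y, times):
    """Run P1 (X := X+Z) and P2 (X := X+Y) concurrently and return X."""
    shared = Shared()
    p1 = threading.Thread(target=add_repeatedly, args=(shared, z, times))
    p2 = threading.Thread(target=add_repeatedly, args=(shared, y, times))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    return shared.x
```

Because each read-modify-write of X happens while the lock is held, the final value is the same as if the two updates had run one after the other in some order.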
 Characteristics of Atomic Actions
• Processes are not aware of the activity of the
  other process during the time a process
  performs an action.
• A process in action does not communicate
  with other processes.
• A process in action can detect no state
  changes except those performed by itself.
• Actions are indivisible and instantaneous,
  such that the effects on the system are as if
  they were interleaved as opposed to
  concurrent.
              Commit Protocol
• Phase I. At the coordinator:
  – The coordinator sends a COMMIT-REQUEST
    message to every cohort, requesting the cohorts
    to commit.
  – The coordinator waits for replies from all the
    cohorts.
• Phase I. At the cohorts:
  – On receiving the COMMIT-REQUEST message,
    a cohort takes the following actions.
    • If the transaction executing at the cohort is successful,
      it writes UNDO and REDO logs to stable storage and
      sends an AGREED message to the coordinator.
      Otherwise, it sends an ABORT message to the
      coordinator.

 The Two-Phase Commit Protocol
• Phase II. At the coordinator:
  – If all cohorts reply AGREED and the coordinator also
    agrees, then the coordinator writes a COMMIT record into
    the log. Then it sends a COMMIT message to all the
    cohorts. Otherwise, the coordinator sends an ABORT
    message to all the cohorts.
  – The coordinator then waits for ACK from each cohort.
  – If an ACK is not received from a cohort within a timeout
    period, the coordinator resends the commit/abort
    message to that cohort.
  – If all the ACKs are received, the coordinator writes a
    COMPLETE record to the log (to indicate the completion
    of the transaction).
 The Two-Phase Commit Protocol
• Phase II. At the cohort:
  – On receiving a COMMIT message, a cohort
    releases all the resources and locks held by it
    for executing the transaction and sends an ACK.
  – On receiving an ABORT message, a cohort
    undoes the transaction using the UNDO log
    record, releases all the resources and locks held
    by it for performing the transaction and sends an
    ACK.
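
The message flow described above can be traced in a small failure-free simulation. This is only a sketch: the function name `two_phase_commit` and the vote dictionary are assumptions, message passing is replaced by direct function logic, and timeouts and resends are left out.

```python
def two_phase_commit(coordinator_agrees, cohort_votes):
    """Simulate one failure-free round of two-phase commit.

    cohort_votes: dict mapping cohort name -> True (AGREED) or False (ABORT)
    Returns (decision, coordinator_log, cohort_states).
    """
    coordinator_log = []

    # Phase I: the coordinator sends COMMIT-REQUEST and collects the replies.
    replies = dict(cohort_votes)

    # Phase II: commit only if the coordinator and every cohort agree.
    if coordinator_agrees and all(replies.values()):
        decision = "COMMIT"
        coordinator_log.append("COMMIT")  # logged before COMMIT is sent
    else:
        decision = "ABORT"

    cohort_states = {}
    for name in replies:
        if decision == "COMMIT":
            cohort_states[name] = "committed"  # release locks, send ACK
        else:
            cohort_states[name] = "aborted"    # undo via UNDO log, send ACK

    # Every cohort has acknowledged, so the transaction is complete.
    coordinator_log.append("COMPLETE")
    return decision, coordinator_log, cohort_states
```

For example, `two_phase_commit(True, {"c1": True, "c2": False})` yields an ABORT decision, because a single ABORT reply is enough to abort the whole transaction.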
 The Two-Phase Commit Protocol (state transition diagrams)
• Coordinator (states q1, w1, a1, c1)
  – q1 → w1: Commit_Request msg sent to all cohorts
  – w1 → a1: one or more cohorts replied abort; Abort msg sent
    to all cohorts
  – w1 → c1: all cohorts agreed; Commit msg sent to all cohorts
• Cohort i (states qi, wi, ai, ci)
  – qi → wi: Commit_Request msg received; Agreed msg sent to
    coordinator
  – qi → ai: Commit_Request msg received; Abort msg sent to
    coordinator
  – wi → ai: Abort msg received from coordinator
  – wi → ci: Commit msg received from coordinator
 The Three-Phase Commit Protocol (state transition diagrams)
• Coordinator (states q1, w1, a1, p1, c1)
  – q1 → w1: Commit_Request msg sent to all cohorts
  – w1 → a1: one or more cohorts replied abort; Abort msg sent
    to all cohorts
  – w1 → p1: all cohorts agreed; Prepare msg sent to all cohorts
  – p1 → c1: all cohorts sent ACK msg; Commit msg sent to all
    cohorts
• Cohort i (states qi, wi, ai, pi, ci)
  – qi → wi: Commit_Request msg received; Agreed msg sent to
    coordinator
  – qi → ai: Commit_Request msg received; Abort msg sent to
    coordinator
  – wi → ai: Abort msg received from coordinator
  – wi → pi: Prepare msg received; ACK msg sent to coordinator
  – pi → ci: Commit msg received from coordinator
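
The cohort side of the three-phase protocol can be written as a small state machine. This is a sketch for illustration only: the function names, the single-letter state names ("q", "w", "a", "p", "c"), and the `local_ok` flag are assumptions, and failure and timeout transitions are not modeled.

```python
def cohort_step(state, msg, local_ok=True):
    """Return (next_state, reply) for one message received by a cohort.

    States: "q" initial, "w" waiting, "a" aborted, "p" prepared, "c" committed.
    """
    if state == "q" and msg == "COMMIT_REQUEST":
        # The cohort votes according to how its local transaction went.
        return ("w", "AGREED") if local_ok else ("a", "ABORT")
    table = {
        ("w", "ABORT"): ("a", None),
        ("w", "PREPARE"): ("p", "ACK"),
        ("p", "COMMIT"): ("c", None),
    }
    if (state, msg) not in table:
        raise ValueError("no transition from %r on %r" % (state, msg))
    return table[(state, msg)]


def run_cohort(messages, local_ok=True):
    """Feed a sequence of coordinator messages to a cohort starting in "q"."""
    state, replies = "q", []
    for msg in messages:
        state, reply = cohort_step(state, msg, local_ok)
        if reply is not None:
            replies.append(reply)
    return state, replies
```

A cohort that receives COMMIT_REQUEST, PREPARE, and COMMIT in that order ends in the committed state after replying AGREED and ACK; the extra prepared state is what distinguishes this from the two-phase cohort.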

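The Nonblocking Commit Protocol for Single Site Failure
(coordinator state transition diagram)
• Coordinator (states q1, w1, a1, p1, c1)
  – q1 → w1: Commit_Request msg sent to all cohorts
  – w1 → a1: one or more cohorts replied abort; Abort msg sent
    to all cohorts
  – w1 → p1: all cohorts agreed; Prepare msg sent to all cohorts
  – p1 → c1: all cohorts sent ACK msg; Commit msg sent to all
    cohorts
• The diagram also marks failure and timeout transitions (labels
  F and F,T), including one labeled "Abort msg sent to all cohorts"
• T = Timeout Transition
• F = Failure Transition
• F,T = Failure/Timeout Transition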
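The Nonblocking Commit Protocol for Single Site Failure
(cohort state transition diagram)
• Cohort i (states qi, wi, ai, pi, ci)
  – qi → wi: Commit_Request msg received; Agreed msg sent to
    coordinator
  – qi → ai: Commit_Request msg received; Abort msg sent to
    coordinator
  – wi → ai: Abort msg received from coordinator
  – wi → pi: Prepare msg received; ACK msg sent to coordinator
  – pi → ai: Abort msg received from coordinator
  – pi → ci: Commit msg received from coordinator
• An F,T label appears next to the Commit msg transition
• T = Timeout Transition
• F = Failure Transition
• F,T = Failure/Timeout Transition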
