Fault-Tolerance and Availability

Background Concepts

Concept of availability: Some systems have high availability requirements. Availability is the percentage of time a system is up. 99% availability means that the system may be unable to perform operations 1% of the time – about 15 minutes each day. This is generally a fairly low availability figure: an ATM system that was down for 15 minutes each day would not be considered very good.

Scheduled vs. unscheduled downtime: There is a distinction between scheduled downtime (e.g. the ATM is down when a technician comes by to refill money) and unscheduled downtime – unexpected crashes, possibly during busy times. Availability generally includes both scheduled and unscheduled downtime. Unscheduled downtime has a much bigger impact, since it may occur at peak load times and cause much more inconvenience (as opposed to scheduled downtime, which we can carefully arrange only at acceptable times).

Availability targets: Achieving 99.9% availability is difficult and requires a fair amount of careful design, extensive exception handling and some simple fault-tolerance techniques (e.g. checksums on messages, retry for lost messages etc.). Achieving 99.999% availability (about 5 minutes of downtime per year) is extraordinarily difficult and requires exceptional engineering, redundant hardware and very sophisticated architecture and fault-tolerance techniques. Space systems and telecom switches are examples of systems that have such ultra-high availability requirements. A web server for a very high traffic site, e.g. Google, may have 99.99% – 99.999% availability [often referred to as 4 NINES and 5 NINES respectively]. A typical business website may have 99% – 99.9% availability requirements. Obviously, it is very significantly more expensive to achieve 99.99% availability than 99.9% availability.

Error, fault and failure: The foundation of reliability and availability theory is the distinction between error, fault and failure. An error is a problem in the design, implementation or realization of a system that causes faults, e.g. software bugs, line noise, mechanical component wear and tear. A fault is some unexpected and undesired behavior during the operation of a system, e.g. a null pointer being dereferenced, an attempt to divide by zero, a method returning an error code or incorrect answer, a lost or corrupted message, a device failing. A failure is when the system as a whole is not able to complete operations in a way that is acceptable to the user. Errors (bugs) lead to faults (exceptions). Faults lead to failures (crashes, erroneous system behavior).

Reliability, fault-tolerance and availability: Reliability is the percentage of attempted operations that complete successfully (i.e. do not result in failure). Reliability is mostly about avoiding faults in the first place, ensuring that the system always behaves correctly. Availability is the percentage of time the system is up to perform transactions. It depends not only on how often the system fails, but also on how long it takes to recover from failures once they occur. Fault-tolerance is about making sure that faults don't lead to failures. A fault-tolerant system reacts to a fault after it has occurred, in such a way as to keep the whole system from failing. With fault-tolerance, the operation currently being performed may not finish correctly, but the system is kept from failing totally, so that future operations may continue to work correctly. Thus, fault-tolerance helps reliability, but its primary impact is on availability.
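Converting an availability target into a downtime budget is simple arithmetic. Here is a minimal Python sketch of that conversion, using the targets quoted above (the function name is ours, invented for the example):

```python
# Convert an availability percentage into the downtime it permits.
# A small illustration of the figures quoted above, not tied to any system.

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY

def downtime_budget(availability_pct):
    """Return the allowed downtime as (minutes per day, minutes per year)."""
    down_fraction = 1.0 - availability_pct / 100.0
    return down_fraction * MINUTES_PER_DAY, down_fraction * MINUTES_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.999):
    per_day, per_year = downtime_budget(pct)
    print(f"{pct:7.3f}% up -> {per_day:8.2f} min/day, {per_year:10.2f} min/year")
```

This works out to about 14.4 minutes of downtime per day at 99%, and roughly 5.3 minutes per year at 99.999%, matching the figures above.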
To build highly reliable systems, we may include some fault detection (so that if the answer is wrong, we don't produce the wrong answer but instead warn the user) and fault-tolerance, but the primary emphasis is on avoiding/removing bugs in the first place. To build highly available systems, the primary emphasis is on fault-tolerance and on minimizing recovery times, so that problems don't cause the system to go down. It is possible to build highly available systems that are not ultra-reliable, e.g. copiers and telecom systems. Individual sheets may be copied incorrectly or jam, and individual calls may get false busy signals (the called number is free but the system gives a busy signal), but the system as a whole remains up nearly all the time. Thus availability and reliability are related, but not the same thing.

Diagramming for fault-tolerance

The primary diagramming approach for fault-tolerance is to draw fault trees. Examples may be seen at
http://www.feedforward.com.au/fault_tree_analysis.htm
http://www.riskspectrum.com/docs/methods_ft.htm
A more complete description of the fault tree approach may be found at
http://www.reliasoft.com/BlockSim/examples/rc6/
If you would like proper course notes that give you more detail about this, see
http://lasar.cesnef.polimi.it/lasar/files/dispZio/faulttree.pdf
Or you can try Google yourself.

Fault trees: Fault trees are used to identify possible causes of failures. A fault tree is a hierarchical decomposition of possible sources of faults, with the leaves representing root causes and the root node being "system failure". It is possible to draw separate fault trees for different types of system failure (e.g. transaction produces no response, transaction produces incorrect response, transaction never completes, system stops processing transactions). A simple hierarchy of faults would contain only OR-nodes. Fault trees are enhanced with AND-nodes, which indicate that a given failure happens only if both causal factors exist. Most often, nodes that represent a fault are AND-ed with nodes representing an attempt to handle the fault, i.e. the fault-tolerance approaches are added onto the fault tree as AND-nodes. Each node of a fault tree can have a probability associated with it, indicating the probability of that type of failure [to be more exact, this is a failure intensity – see the note at the end of this section]. OR-nodes sum up the failure probabilities of the individual causes, while AND-nodes cause probabilities to be multiplied. Nodes with high probability are worth breaking down into sub-nodes, while for nodes with already low probabilities this would be a waste of time. It might also be a waste of time to break down nodes if you cannot do anything about them – if you have no idea how to detect and recover from those types of faults.

Reliability Block Diagrams

Once we have failure probability values for components, we can derive failure probabilities for systems using a reliability block diagram. This is equivalent to the fault tree approach described above – which one we use just depends on convenience. This is a block diagram view of the system, from the viewpoint of transaction flow. If a transaction requires all of three blocks (i.e. the transaction is processed by component A, then by component B, then by component C, or by the three working together), they are represented in series. On the other hand, if it can use any one of three blocks (e.g. there are three servers, and the transaction can use any one of these), then these are represented in parallel.
We may end up with a diagram like this:

[Figure: reliability block diagram. A Communication Module in series with three parallel Authentication Servers (A, B, C), followed in series by a Transaction Processing Module and a parallel Master Database / Backup Database pair.]

The probability of failure adds up for modules in series (to a close approximation, when the individual probabilities are small), and multiplies for modules in parallel. Let us assume that the three authentication servers have failure probabilities of 0.02, 0.03 and 0.04 respectively. The combined failure probability would be 0.000024. If the master database and backup database have failure probabilities of 0.05 each, their combined failure probability would be 0.0025. If the communication module and transaction processing module have failure probabilities of 0.002 each, the total failure probability for the system is 0.002 + 0.002 + 0.000024 + 0.0025 = 0.006524.

Note on failure "probabilities" -> failure intensities and MTBF: For brevity and understandability, we refer to "failure probability" above. In fact, we can only talk about the probability of something failing within a given period of time, or per operation – "This component has a statistical expected failure rate of 1 every 10000 hours", or "This software module has an expected failure rate of 1 every 50000 operations, given the typical workload". This is referred to as "failure intensity", and it is these values that we would use in the analysis. A related value is the MTBF (mean time between failures), which is 1 / failure intensity. Thus a component with an MTBF of 10000 hours has a failure intensity of 1 per 10000 hours = 0.0001 per hour. Be careful to use compatible units (operations, hours, seconds, whatever) for all the nodes.

Cutting high and cutting low – fault-tolerance vs. exception-handling: It is possible to "cut" fault trees (i.e. cut the link between faults and the failures they cause) by adding AND-nodes that reduce the probability of the failure. It is worth adding AND-nodes for high probability faults. Cutting high in the tree (near the root) has the advantage that it can handle problems due to a wide variety of causes, e.g. a system reset will keep the system from an extended failure irrespective of what caused it to crash in the first place. The problem with cutting high is that we do not know much about the problem, so we must adopt general approaches that may have sweeping effects on the system. The alternative is to cut low down, where we are dealing with a very specific fault, e.g. a message fails to get through, so we retransmit. This has the advantage that it can resolve the specific failure, so that we avoid drastic solutions like a reboot, but the problem is that it only handles that very specific problem, and might only work in some cases. In general, cutting low has the flavor of exception handling, and cutting high has the flavor of fault-tolerance. The recommendation is that fault-tolerance approaches should cut as high as possible, so that they are effective at dealing with a wide range of problems, irrespective of the source. Fault-tolerance is expensive, so we want to use just a few techniques that will cover all the different types of problems. In addition, as a matter of good implementation style, we should build exception handling into all software we create (whether or not it requires ultra-high reliability/availability), so that problems that occur at leaf nodes are handled then and there at the coding level. In summary: always do good exception handling, and add a few high-impact general fault-tolerance techniques when the cost is justified.
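The series/parallel arithmetic above – which is the same OR/AND arithmetic used in fault trees – is easy to script. A minimal sketch, using the failure probabilities assumed in the example; note that the sum used in the text for series blocks is the small-probability approximation of the exact formula:

```python
import math

def parallel(probs):
    """Redundant blocks (fault-tree AND-node): all must fail, so multiply."""
    return math.prod(probs)

def series(probs):
    """Blocks in series (fault-tree OR-node): any failure counts.
    Exact form: 1 - product(1 - p); the text's sum approximates this
    when the individual probabilities are small."""
    return 1.0 - math.prod(1.0 - p for p in probs)

auth_servers = parallel([0.02, 0.03, 0.04])  # -> 0.000024
databases = parallel([0.05, 0.05])           # -> 0.0025
system = series([0.002, 0.002, auth_servers, databases])
print(round(system, 6))  # ~0.00651 exact; the sum approximation gives 0.006524
```

The small gap between 0.00651 and 0.006524 shows why the simple sum is an acceptable shortcut at these probability levels.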
Analysis with fault trees: Associate failure probabilities with each leaf node. Use addition/multiplication for OR/AND nodes to determine failure probabilities at intermediate and root nodes. Then modify the fault tree by adding fault-tolerance techniques (AND-nodes) until the root failure probability matches the desired target values. Thus fault trees can be used to assess the likely effectiveness of fault-tolerance techniques, and to compare approaches – add them into the tree and assess the impact on failure probabilities.

FMEA: We can use a technique called FMEA (failure modes and effects analysis) to draw up the fault tree. FMEA identifies the various possible problems, and for each analyzes its causes and consequences. It is a technique from general engineering (mechanical, electrical systems etc.); fault trees are also a general engineering approach.
http://www.nepss.org/presentations/dfr9.pdf is one example of a presentation that explains FMEA (skip over the parts you are not interested in).
http://www.maintenanceresources.com/ReferenceLibrary/FailureAnalysis/FailureModes.htm is a pretty good explanation, intended for non-engineers.

Designing for Fault-Tolerance

Fault tolerance includes three aspects:

Detecting faults – This requires that we add a number of fault detection mechanisms to our system. It can be done even after the basic system has been designed and implemented – we can retrofit these mechanisms for detecting and reporting problems. Fault detection is useful even if we don't plan to deal with the faults, so that the user does not think that the operation completed successfully, e.g. if we submit an update transaction and it fails, we would like to know about it.

Isolating faulty modules – This is needed to keep errors from propagating. If the faulty module is allowed to continue operating, it will send bad outputs to the rest of the system and corrupt its state, so that we have no choice but to bring the whole system down. The key to fault isolation is to divide the system into a number of fault zones, so that faults in one part of the system don't propagate to others. Fault zones must be built into the system architecture, so it is necessary to do this analysis and design at the very beginning of system design.

Recovery from faults – This requires adding fault-tolerance techniques. Most of these can be added after the original system is designed and implemented, though some may require infrastructural support at the architectural level (particularly hardware redundancy).

Fault detection

Fault detection approaches are usually relatively simple. They involve looking for symptoms – unusual behaviors that should not occur if the system is behaving normally. Some fault detection techniques (a small sketch of the heartbeat and timeout mechanisms follows this list):

Queue buildup: If a transaction queue (e.g. network queue, processor queue) is building up indefinitely, that indicates either overload or a system failure. System designers set queue "watermarks", and if these are exceeded, the system is assumed to be under overload.

Throughput: The rate of completed work indicates whether the system has become blocked in some way. If transactions continue to arrive, but completion rates are near-zero, that indicates something is wrong. If completion rates are not zero, but much less than arrival rates for an extended period of time, that indicates overload/backlog.

Consistency checks: Many operations have consistency constraints that must hold if the operation is completing correctly, e.g. the balance in a bank account must be > 0 unless overdraft is permitted. Similarly, messages must have valid checksums, database fields must have valid values, documents must be well-formed (match representation rules) etc. Programs have valid and invalid states. When a system is found in an invalid state, we realize that a fault has occurred. For example, there may be only 4 valid kinds of messages that a system should be producing. If an outgoing message is none of these, it may indicate some internal failure.

Heartbeats: One way to check whether a system is up is to use "heartbeats" – the system is expected to send messages periodically (e.g. every 5 secs) to some other system that listens for this heartbeat message. If the heartbeat is missed, the other system explicitly queries to check if the system is up, and if no response is received, assumes that it has failed.

Fault rate: Some kinds of faults occur occasionally, e.g. packet transmission errors, and are simply handled. However, if the rate of errors suddenly spikes, that could indicate a problem.

Timeouts: The most basic mechanism – try an operation, and if we do not get a response within a reasonable amount of time, declare a fault.
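As an illustration of the heartbeat and timeout ideas, here is a minimal Python sketch of a monitor that watches for missed heartbeats. The class name, interval and threshold are invented for the example; as described above, a real system would query the silent peer before declaring it failed:

```python
import time

HEARTBEAT_INTERVAL = 5.0   # peer promises a message every 5 seconds (assumed)
MISS_THRESHOLD = 3         # suspect a fault after ~3 missed beats (assumed)

class HeartbeatMonitor:
    """Tracks the last heartbeat from a peer and flags a suspected failure."""

    def __init__(self):
        self.last_beat = time.monotonic()

    def beat(self):
        # Called whenever a heartbeat message arrives from the peer.
        self.last_beat = time.monotonic()

    def peer_suspected_down(self):
        # Timeout-style check: has it been too long since the last beat?
        silence = time.monotonic() - self.last_beat
        return silence > MISS_THRESHOLD * HEARTBEAT_INTERVAL

monitor = HeartbeatMonitor()
# ... on each received heartbeat message: monitor.beat()
if monitor.peer_suspected_down():
    print("peer silent; query it directly before declaring it failed")
```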
Fault Isolation and Containment: Fault Zones

A critical aspect of fault-tolerant design is to divide the system into fault zones. A fault zone is a block/module such that we can detect whether it is faulty, keep the fault from spreading to other modules, and do something to recover from the fault and restore the module to an operational state. All inputs to the module are validated to avoid accidentally corrupting its internal state. Outputs from the module are subjected to consistency checks, so that if it gets into a bad state and starts producing invalid messages (e.g. invalid transaction responses), we can quickly realize the module is faulty and prevent it from contaminating the rest of the system. Once we realize a module is faulty, we reset its state or replace it with an alternative module that is functioning OK. Until then, we must prevent the module from producing any outputs to other modules (or mark the outputs as untrustworthy). We must also notify other modules that this module is faulty, so that they won't send it any more work until it is back in an operational state. This way we avoid propagating failures. As part of the recovery procedure, we may end up killing some or all of the transactions currently underway that involve the module, and holding up other transactions that need to use the faulty module.

A fault zone may be a class, a package, a process or a component. The key characteristic is that we should be able to check its inputs and outputs, and replace it without having to replace the modules it is interacting with. For example, an authentication database can be a fault zone. We can validate its inputs (valid queries, valid new usernames and passwords), and validate its outputs (authentication either accepted or rejected). If it produces any output other than these, it is clearly in a bad state. One of the ways of resolving the problem [as discussed below] is to replace it with a "last known good" authentication database.

Fault recovery techniques

Redundancy: May be either hardware redundancy (multiple components that do the same thing) or software redundancy (multiple processes/threads performing the same job). Redundancy in hardware is expensive, and redundant software adds a bit of system overhead. It is extremely effective at fault-tolerance.
There may be overheads and challenges in managing the redundant units. The following are some terms you may hear in connection with hardware redundancy:

o Cold backup: A redundant processor that we can start up if the primary processor fails.
o Warm backup: A redundant processor with the right processes preloaded, so that we can start sending work to it straight away if the primary fails.
o Hot backup: A redundant system that processes the same input stream as the primary, so that it can take over seamlessly if the primary fails. Until then, the backup produces all the same outputs, but doesn't get to send them out to external systems.
o Modular redundancy: Multiple systems that perform the same transaction, and whose results are compared with each other. If they differ, the majority wins and that result is sent out.

Checkpoint-rollback: The system may periodically store snapshots of its current state. If a failure occurs, the system is restored to the last "known good" state. If we maintain a log of incoming transactions, we may be able to "redo" the transactions since that last checkpoint, e.g. if we are building a bank account management system, we may take checkpoints every hour or every day. If there is a failure, we may roll back to the last checkpoint, then re-execute all incoming transactions that involve updates to the accounts (to do this, we must keep a log of incoming transactions). [A small sketch of this technique follows the process restart list below.]

Reset: Resetting the system to some predefined "reasonable" state will solve a lot of problems. For example, if we are controlling an elevator, and there is some fault in our tracking of the elevator position, we may be able to reset the elevator so that it returns to the "ground floor" position, and reset the software to match. Even if the problem is in software, resetting helps if the problem is related to the specific state the system happens to be in prior to the transaction/operation. Many failures are "transient", i.e. they are only seen after a particular series of transactions. [These are the hardest bugs to find through testing – bugs that result in consistently bad execution of single transactions will quickly be found by thorough systematic testing based on transaction equivalence classes.] If these transient failures cause a module to crash, we can reset it and continue processing the transaction stream, and it is unlikely to crash again. We see this all the time when we edit files or browse – the browser may crash when we view a page, but we restart the browser and view the page again, and it works (though there are other kinds of bugs where the page predictably crashes the browser – once those are reported, they can be quickly traced and fixed).

Process restart: It may be possible to kill runaway or defunct processes, and restart them so that they can continue to provide services to the other processes they communicate with. The whole system must be designed in such a way that process restart works. This includes:

o Notifications to other processes that this process is faulty, and that they should not communicate with it.
o Killing ongoing transactions involving this process that have been affected by the fault.
o Holding up or rejecting other transactions that need the faulty process until it has been restarted.
o Once the process has been restarted, being able to re-create its communication mechanisms to other processes – shared memory, sockets, mutexes etc.
o Notifications to other processes of the changes in socket ids, pointers etc. due to the restart; they must be able to update their addressing mechanisms.
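The checkpoint-rollback idea above can be sketched in a few lines. This is a minimal illustration, not a production recovery scheme; the in-memory account state and all names are invented for the example:

```python
import copy

# Minimal checkpoint-rollback sketch: snapshot state periodically, log
# incoming transactions, and on failure restore the snapshot and redo
# the logged transactions.

accounts = {"alice": 100, "bob": 50}    # invented example state
txn_log = []                            # transactions since last checkpoint
checkpoint = copy.deepcopy(accounts)    # last "known good" snapshot

def apply_txn(txn):
    account, amount = txn
    accounts[account] += amount

def process(txn):
    txn_log.append(txn)                 # log first, then apply
    apply_txn(txn)

def take_checkpoint():
    global checkpoint, txn_log
    checkpoint = copy.deepcopy(accounts)
    txn_log = []                        # the log restarts at each checkpoint

def recover():
    """Roll back to the last checkpoint, then redo the logged transactions."""
    global accounts
    accounts = copy.deepcopy(checkpoint)
    for txn in txn_log:
        apply_txn(txn)

process(("alice", -30))                 # alice: 70
take_checkpoint()                       # snapshot taken, log cleared
process(("bob", 25))                    # bob: 75
recover()                               # after a crash: state is rebuilt
print(accounts)                         # {'alice': 70, 'bob': 75}
```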
Retry / retransmit / redo operation: If the source of a fault is transient, just attempting the operation again often removes the problem. We do this all the time too – the browser produces an error message, and we try reloading the page.

Obviously, there are more techniques, but these are some common ones. The various techniques for fault detection and fault-tolerance could have been documented as design patterns, as we did earlier, with context, problem, forces (tradeoffs), and solution description. In most cases, the tradeoff is improvement of availability/reliability at the cost of performance and/or system & software development cost.

Testing fault-tolerance

To test the fault-tolerance software, we must do fault injection. This involves introducing or simulating a fault, and observing the behavior. Fault injection requires instrumenting the code with additional pieces that create the problems. This may be as simple as an if statement:

    if (fault injection flag set) then
        modify system state to simulate occurrence of the fault

The action may involve deleting received messages (to simulate lost messages), corrupting variable values and data structures (to simulate bugs), modifying the return code from operations (to simulate failed system calls or internal procedures), skipping operations (such as writes to a file or message sending, to simulate misbehavior) etc. Overloads and bad inputs can be simulated by changing the inputs being pumped into the software – this is easy to do if we are using a test harness to automate the testing of the system. The needs for fault injection must be identified at design time and communicated to the developers, so that they can build it into their implementation. Like built-in self-test code, fault injection code may also be left in when shipping the system, or it may be compiled out using compiler flags.
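As a concrete illustration of this if-statement style of fault injection, here is a minimal Python sketch. The flag name, the environment-variable convention and the send_message/transport functions are invented for the example; the same pattern applies to corrupting values or faking error return codes:

```python
import os
import random

# Fault injection flag, e.g. controlled by an environment variable so the
# test harness can switch it on without code changes (an assumed convention).
INJECT_MSG_LOSS = os.environ.get("INJECT_MSG_LOSS") == "1"

def send_message(msg, transport):
    # Fault injection hook: silently drop the message to simulate a lost
    # message, so we can observe whether detection/retry logic kicks in.
    if INJECT_MSG_LOSS and random.random() < 0.1:  # drop ~10% of messages
        return
    transport.send(msg)

class StdoutTransport:
    """Stand-in transport for the demo."""
    def send(self, msg):
        print("sent:", msg)

send_message("hello", StdoutTransport())
```

In production builds the hook costs one flag test per call; as noted above, it can also be compiled out (or, in Python, stripped or left disabled) when shipping.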