Fault tolerance and disaster recovery by ProQuest


Tips and Strategies for Systems Integrators | channel talk

Fault tolerance and disaster recovery
By Michael Whitt

In a perfect world, our control systems would install and operate indefinitely with no faults. Experience teaches us otherwise.

Control system specifications should deal with fault tolerance and failure recovery issues.

A Hazard and Operability (HazOp) study is a good vehicle for determining the level of risk acceptable for a particular production process. If a formal HazOp is not practical, then it is up to the specification writer to do his own analysis.

According to the National Institute of Standards and Technology (NIST), there are four approaches to achieving dependability:

1. Fault avoidance – Design phase: Use care to eliminate potential causes of faults in the design phase. This is where a HazOp or some other detailed method of analysis is most useful. The HazOp process, when coupled with the proper application of industry standards and accepted practices, followed by a thorough design check, should help filter out most, but never all, of the [...]

[...] without an intolerable interruption in production even in the presence of ongoing hardware or software faults. A Hot-Standby set of processors that switch primaries upon fault detection would be an example of a fault-tolerant system, as would an Ethernet ring topology that allows a device to fail without interrupting the operation of the remaining devices on the network.

4. Fault evasion – Post-commissioning phase: Sense the trend toward a problem situation, and initiate corrective action before the problem occurs. Setting an alarm if network data throughput deviates beyond a set percentage from the norm, giving the operator time to take corrective action, would be an example of fault evasion.

Weighing failure recovery

Regardless of how bulletproof the design is, how fault-tolerant the system, or how well trained the operators and technicians, system failures are still possible. Though [...]

[...] other documents. It is important to maintain these documents in order to quickly reconstruct sections of the facility if needed.

Data security: This depends on a mix of automatic and manual processes. There are two primary considerations for data security in this context, backup and restore. Several of the more common options for managing data follow.

1. No offsite data, daily backups to hard drive: This is the least secure. If the computer fails, the backups are likely to be lost. The likelihood of being able to restore is remote.

2. Daily backup to hard drive with a weekly backup to tape or other media. Onsite and offsite storage of tapes: This is better. In this case, the most data that will be lost is a week. Restoration is manual and can happen as soon as retrieval of tapes. Until recently, this was probably the most common method of archival.

3. Daily backup to hard drive with a weekly backup to tape or other media. Onsite and offsite storage of tapes. Redundant [...]
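
As an illustration of the fault-evasion example given earlier (an alarm when network data throughput deviates beyond a set percentage from the norm), the check might be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the class name ThroughputMonitor, the rolling-average definition of "the norm," the window size, and the 20 percent threshold are all hypothetical choices made for illustration and do not come from the article or from any particular control-system product.

```python
from collections import deque
from statistics import mean


class ThroughputMonitor:
    """Hypothetical helper: flags throughput samples that deviate from a
    rolling baseline ("the norm") by more than a set percentage."""

    def __init__(self, window=10, max_deviation_pct=20.0):
        # Assumption: "the norm" is the mean of the last `window` normal samples.
        self.samples = deque(maxlen=window)
        self.max_deviation_pct = max_deviation_pct

    def check(self, value):
        """Return True if `value` should raise an alarm for the operator."""
        # Until the window is full there is no baseline, so only collect data.
        if len(self.samples) < self.samples.maxlen:
            self.samples.append(value)
            return False

        baseline = mean(self.samples)
        if baseline == 0:
            # Avoid dividing by zero: any nonzero reading is an infinite deviation.
            deviation_pct = 0.0 if value == 0 else float("inf")
        else:
            deviation_pct = abs(value - baseline) / baseline * 100.0

        alarm = deviation_pct > self.max_deviation_pct
        if not alarm:
            # Only normal readings update the baseline, so a developing
            # problem does not quietly become the new "norm".
            self.samples.append(value)
        return alarm


if __name__ == "__main__":
    monitor = ThroughputMonitor(window=3, max_deviation_pct=20.0)
    for reading in [100, 100, 100, 110, 150]:
        if monitor.check(reading):
            print(f"ALARM: throughput {reading} deviates from the norm")
```

Excluding alarming samples from the baseline is one possible design choice. A real installation would also need to decide how the baseline recovers after a legitimate, permanent change in network load.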