Fault Tolerance by zhouwenjuan


LSST DMS Fault Tolerance
1. Introduction and Goals
The LSST Data Management System (DMS) will process, store, and retrieve vast quantities of data and
metadata. It will deploy large quantities of hardware and many instances of software configured into
systems spanning continents. In order to meet the telescope's overall science requirements, the DMS
must continue to operate even when some of these components have failed. This document describes
the architecture for achieving the necessary levels of fault tolerance for the overall DMS.
The DMS design must meet two key goals and one important constraint. First, it must preserve all raw
images, calibration data, metadata, and associations between these items that are received from the
telescope and its related instruments. This information is the fundamental scientific product and must
be treated as the "crown jewels" of the project.
Second, the design must ensure that the system runs as intended by its developers in order to meet
requirements for availability of the data. The stringency of these requirements differs substantially
across the various subsystems within the DMS. The most stringent are the alert publication
requirements. Catalog query and nightly archive processing requirements will be less rigorous, and data
release processing and ad hoc science pipeline processing requirements are likely to be the least stringent.
The primary constraint is cost. The fault tolerance design must do as much as possible to achieve the
above goals within budgetary limits for equipment, system development, and operations. Automation
will be required to keep the cost of operating personnel low. This constraint also means that certain
failure scenarios that inherently involve large recovery costs or have a very low probability of
occurrence will not be planned out in detail and may not be addressed at all.
This document continues by laying out the multi-dimensional availability requirements for the various
subsystems, whether derived from the science and functional requirements or implied by other system
expectations. It describes the interfaces between the hardware infrastructure, middleware, application
software, and science data quality analysis (SDQA) designs and teams with regard to fault tolerance. It
then gives examples of expected fault types and specific use cases showing how faults may occur. A set
of design patterns, comprising detection strategies, response strategies, and software frameworks
implementing those strategies, is then described. The document concludes with a description of how
subsystem developers will apply these design patterns, customizing them for unique situations.

2. Requirements
2.1. Subheading

3. Interfaces for Fault Tolerance
When discussing fault tolerance, it is necessary to define the dividing lines between aspects handled by
the hardware infrastructure, the middleware, and the application code. The following sections outline
the general characteristics and specific requirements of these interfaces.
3.1. Middleware/Infrastructure Interface
This section defines the interface between the middleware and infrastructure teams with regard to fault
tolerance. The chosen model is for middleware to define the architecture for handling fault tolerance
and derive hardware reliability requirements from that architecture. Those requirements are given in
this document. The infrastructure team implements the hardware requirements in the infrastructure
reference design, a separate document.

3.1.1. Computation/Storage Nodes
The hardware infrastructure must provide adequately reliable computation and storage nodes. Even
architectures that presume only “commodity” levels of reliability still cannot provide the needed
quality of service on top of nodes with high failure rates.
The fault tolerance strategies described below and the overall DMS budget require nodes that have a
mean time between failures of at least X hours, with a mean time to repair or replace of at most X
hours. No more than X% of the nodes in any cluster may be down (for any unscheduled reason
including disk failure) at the same time, unless the entire data center is down. Those X% may include
one or more entire racks. No center shall be down for unscheduled reasons for more than X% of the
time or more than once per month, except in the event of a catastrophe such as earthquake, volcano, or
war. Certain cluster nodes (e.g. database servers or fault tolerance monitors) will be designated critical;
they will not all be down more than 0.1% of the time or more than once per month, unless the entire
center is down.
With regard to disk storage, the use of the data replication and checksumming strategies implies that
the probability of loss or corruption due to hardware error of any disk-based data megabyte must be
less than 1 in 10^6 per year.
The infrastructure reference design uses these requirements to determine more detailed specifications for
components such as:
       Disk
       CPU
       Power supplies
       Network interfaces
       Memory
The infrastructure acquisition and deployment plan also includes mechanisms for improving system
reliability by means such as insisting on choosing components from multiple production batches or
even multiple vendors to prevent common-mode failures.
The infrastructure reference design also specifies sufficient power and cooling systems, including
multiple power feeds and backup generators, to meet the node availability requirements above.

3.1.2. Network
The DMS is also highly dependent on its network infrastructure. The network must provide adequate
bandwidth for both normal and failure recovery modes of operation.
The middleware-defined fault tolerance strategies require intra-cluster LANs that are reliable to enable
long-running computations to make progress. Node-level networking faults are treated as overall node
faults. Intra-cluster networks may have failures of 10 millisecond duration or less up to once per hour;
longer failures must not occur more than X times per year, with a maximum outage duration of X hours.
Intra-center LANs will have longer retry capabilities, so failures of 1 second duration or less may be
tolerated up to once per day. Longer failures must not occur more than X times per year, with a
maximum outage duration of X hours.
Inter-center WANs may have failures of 1 minute duration or less up to once per week. Longer failures
must not occur more than X times per year, with a maximum outage duration of 48 hours.
The network reference design translates these requirements into specifications for such items as
physically separate redundant WAN paths, switch provisioning and reliability, cabling and connector
reliability, etc.

3.2. Middleware/Applications Interface
This section defines the interface between the middleware and application software teams with regard
to fault tolerance.
Middleware has defined a set of strategies for handling fault tolerance that provide a range of
reliabilities. Middleware will provide software frameworks that implement these strategies. The
frameworks will in turn provide certain services to applications, including a cluster computation model,
a definition of the quantum of restartability, inter-node communications mechanisms, and defined
mechanisms for handling failures. These services are documented with each strategy in section X.
Application software developers are responsible for choosing an appropriate strategy and associated
framework to meet their overall subsystem reliability requirements. The algorithms available to be used
by application developers may be constrained by the services provided by the framework. The
strategies chosen for each critical component of the DMS are given in section X below.

3.2.1. Application Faults
Certain faults are best detected by the application code. These include permanent application faults
(usually due to algorithmic conditions) and some types of resource faults. The middleware-provided
strategies distinguish between permanent faults that cannot be resolved by retrying and temporary
faults that may be resolved by retrying. The application may report faults in either of these categories to
the middleware.
Application-level faults that require more sophisticated handling than provided by the middleware
framework must be both detected and handled by the application itself.

3.3. Middleware/SDQA Interface
SDQA plays an important role in detecting faults that affect the scientific usefulness of LSST results
and that may not be observable in any other way. Since SDQA involves scientific analysis and
judgment, the failures it finds, whether discovered by automated or manual processes, will inevitably
require a human in the loop for diagnosis and repair. This feedback is considered to be out of the scope
of fault tolerance and is addressed in separate SDQA documents.

4. Design Patterns and Strategies
Fault tolerance can be enforced in many different ways. Different strategies have different
characteristics, such as response time, complexity, required hardware, or cost. We expect there will not
be a single strategy that will cost-effectively meet the fault tolerance needs of the entire LSST DMS.
Instead, we expect to support several different types of strategies and apply the most appropriate
strategy to each part of the system.
The middleware group has devised several general-purpose design patterns and fault tolerance
strategies that can be applied to components of the DMS to enable them to respond to fault causes and
to meet reliability requirements. Each pattern or strategy is in turn implemented in a software
framework or subsystem that provides services to the application layer. The fault tolerance software
will typically handle large classes of fault causes in the same way in order to ensure that all significant
fault causes have been considered; customized handling for specific causes will be up to the application.
There are two general services used in multiple patterns: monitoring and checksumming. The pattern
options identified so far include data replication, process replication, reducing data quality, and various
options for rescheduling tasks. In addition, there are certain design practices that improve the ability of
the middleware to handle faults.

4.1. General Services
4.1.1. Monitoring
Services, managers, and processes in the data management middleware will be monitored for activity
and proper operation through a set of established practices and strategies. Important services, daemons,
and threads will have associated high availability services and watchdogs to detect failures and start the
system on a path to recovery.
High availability services will serve as external monitors of the state of critical services; they will
detect errors of various types and provide the framework for failovers, thus enabling recovery from
even catastrophic errors that terminate the service instances. The high availability services will
replicate the state of the critical services across multiple auxiliary resources to allow for failover to a
backup service or daemon. Hence, redundancy will be a key aspect of the use of high availability
services. The monitoring service for a given system service will be configured with detailed knowledge
of the target service so as to identify quickly the occurrence of a failed state or degraded process. Upon
detection of a problem, the monitoring service will take the system out of the standard operating state
and place it into a class of recovery state with a prescribed strategy to take the system back to
operation. One of several distinct recovery states may be selected, depending on detected conditions.
For example, the service may repair the existing instance of the target daemon if this is feasible; if not,
it may initiate a failover to a backup instance where the state is rolled back to a recent checkpoint.
Examples of services that will be protected with failover mechanisms are pipeline managers, event
messaging brokers, and databases.
Long-lived processes that execute application codes and perform computation in the pipeline
framework will be monitored by watchdog daemons that check on their activity and state. The
watchdog daemons will listen on messages from the running processes (for example, through
subscription to appropriate event topics or channels), and will also employ heartbeat monitoring to
periodically register if a process is still active or not. In addition to checking if the process is alive, the
heartbeat monitor will examine the state of the outputs and properties of the process to assess whether
it is performing adequately or has started operating at a level that is inordinately subpar. Failure modes
that a process could exhibit include: 1) exiting abnormally, 2) operating slowly or inefficiently, 3)
hanging, or 4) exhibiting runaway behavior that consumes exorbitant resources. Such failing processes
will be halted as needed and restarted according to the selected fault tolerance strategy.
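As a minimal sketch of such heartbeat monitoring (the class and method names here are hypothetical illustrations, not part of the DMS design), a watchdog might track last-heartbeat times and output counts per process and classify each process as hung, subpar, or ok:

```python
import time

class HeartbeatWatchdog:
    """Illustrative watchdog sketch: tracks heartbeats and output counters
    for monitored processes and classifies their state on each check."""

    def __init__(self, timeout_s, min_outputs_per_check):
        self.timeout_s = timeout_s
        self.min_outputs = min_outputs_per_check
        self.last_beat = {}   # process id -> last heartbeat time
        self.outputs = {}     # process id -> outputs since last check

    def heartbeat(self, pid, now=None):
        self.last_beat[pid] = time.time() if now is None else now
        self.outputs.setdefault(pid, 0)

    def record_output(self, pid, n=1):
        self.outputs[pid] = self.outputs.get(pid, 0) + n

    def check(self, pid, now=None):
        """Return 'hung', 'subpar', or 'ok' for a monitored process."""
        now = time.time() if now is None else now
        if now - self.last_beat.get(pid, 0.0) > self.timeout_s:
            return "hung"      # no heartbeat: halt/restart per strategy
        if self.outputs.get(pid, 0) < self.min_outputs:
            return "subpar"    # alive but operating at a subpar level
        self.outputs[pid] = 0  # reset the counter for the next interval
        return "ok"

wd = HeartbeatWatchdog(timeout_s=30.0, min_outputs_per_check=5)
wd.heartbeat("proc-a", now=100.0)
wd.record_output("proc-a", 8)
print(wd.check("proc-a", now=110.0))  # within timeout, enough output: ok
print(wd.check("proc-a", now=145.0))  # 45 s of silence: hung
```

A real implementation would subscribe to event topics instead of being driven by explicit calls, but the classification logic would be similar.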
Repairs or transitioning of parallel communicators will be an important part of the recovery of the
system when failed processes are removed and/or restarted. Communicators may proceed with a subset
of running processes after failed threads have been pared away (a case of degraded service or
capability), or they may be fully reconstituted with fresh processes added to replace failed ones. The
fault tolerance strategy selected for a given subsystem will determine which path is taken. For real time
processing, communicators will not be reconstituted if the time required will cause the system to miss a
deadline and fall unacceptably behind, whereas large scale archive center reprocessing will operate in a
more thorough manner and rebuild the communicators.

4.1.2. Checksums
Detection of errors in the transfer of image files across networks and into archive spaces and file
systems will be accomplished using checksum validation after each operation. The process of ensuring
the integrity of the raw data will begin with the creation of a checksum at an early juncture, before the
arrival of the image into the context of the LSST DM system proper. Redundant data will be generated
at a very early stage as well and maintained throughout the operation of the system.
The data access framework will manage the transfer of image files from site to site within the
distributed LSST DM archive and guarantee reliable transfers through the use of the checksum
validation. This verification will occur after long haul transfers between geographically disparate sites,
and also as part of pipeline processing when data is staged to disk on computing platforms. As data
arrives at each site, new checksums will be calculated and compared to the preexisting checksums for
the data, with comparisons done at multiple levels to ensure correctness. To localize errors and
optimize overall performance, the system will instruct the checksum algorithm to attempt the recovery
of errors by calculating where the expected error might be, and if possible, perform the repair without
the extra overhead of invoking or scheduling a retransmission of the data. If a repair is not successful,
the framework will initiate a retransmission of the image associated with the checksum error.
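The verify-and-retransmit loop can be sketched as follows. This is an illustrative assumption, not the framework's actual interface: the function names are invented, and an in-memory "channel" stands in for a real long-haul transfer.

```python
import hashlib

def transfer_with_validation(source_bytes, channel, max_retries=3):
    """Sketch: send bytes through a (possibly corrupting) channel, verify
    a SHA-256 checksum on arrival, and retransmit on mismatch."""
    expected = hashlib.sha256(source_bytes).hexdigest()  # computed at origin
    for attempt in range(1, max_retries + 1):
        received = channel(source_bytes)
        if hashlib.sha256(received).hexdigest() == expected:
            return received, attempt
    raise IOError("checksum mismatch persisted after %d attempts" % max_retries)

# Simulated channel: corrupts the first transfer, then behaves correctly.
state = {"calls": 0}
def flaky_channel(data):
    state["calls"] += 1
    return b"garbled" if state["calls"] == 1 else data

data, attempts = transfer_with_validation(b"raw image pixels", flaky_channel)
print(attempts)  # the second attempt succeeds after one retransmission
```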
Beyond the checks on individual transfers, data scrubbing of image file collections on devices in the
final archive storage on a seasonal basis will be crucial to maintaining data integrity. This is necessary
because hardware-level data integrity checks cannot ensure full reliability, particularly of file metadata
and the association of that metadata with the data it describes.

4.2. Data Replication
The primary fault tolerance mechanism for data is to replicate it. Storage in multiple geographically-
dispersed locations is mandatory for all critical data.
Two types of software components are needed to fulfill this strategy: file replicators and database
replicators. Both need to ensure efficient transmission of the data in question, integrity of the data
during transmission, and verified reception and storage of the data at its destination. Replication
destinations may include systems similar to the source, as for file storage or active database replicas, or
differing systems, particularly for backup purposes.
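The verified-reception requirement might look roughly like the following sketch, where in-memory dictionaries stand in for storage sites; all names are illustrative assumptions rather than actual replicator interfaces.

```python
import hashlib

def replicate(payload, destinations):
    """Sketch: store a payload at several destinations and verify each
    copy against the source checksum before declaring success."""
    digest = hashlib.sha256(payload).hexdigest()
    verified = []
    for dest in destinations:  # each dest: a dict acting as a storage site
        dest["blob"] = payload
        ok = hashlib.sha256(dest["blob"]).hexdigest() == digest
        verified.append(ok)
    return all(verified), digest

# Geographically dispersed copies, e.g. base facility and archive center.
base, archive = {}, {}
ok, digest = replicate(b"calibration frame", [base, archive])
print(ok)
```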

4.3. Reducing Data Quality
Not all fault tolerance strategies produce the same results as the non-faulty case. One strategy for
dealing with corrupt or inaccessible data involves falling back to the most recent working version. The
data involved may be scientific data or software. As an example, a snapshot of the object database will
be taken before beginning nightly alert processing. If database corruption is detected during the night,
we will replace the live copy with the known-good snapshot, sacrificing recent updates but enabling
continuing processing. This strategy is only applicable when degraded results are acceptable and
preferable to no results at all.
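The snapshot-and-fall-back idea can be sketched as follows; `SnapshotStore` is a hypothetical stand-in for the object database, not an actual DMS component.

```python
import copy

class SnapshotStore:
    """Sketch of fall-back-to-snapshot: snapshot before nightly processing,
    restore the known-good copy if corruption is detected."""

    def __init__(self, records):
        self.live = records
        self.snapshot = None

    def take_snapshot(self):
        self.snapshot = copy.deepcopy(self.live)

    def restore(self):
        # Recent updates are sacrificed; processing can continue.
        self.live = copy.deepcopy(self.snapshot)

db = SnapshotStore({"obj1": 1.0})
db.take_snapshot()                 # before nightly alert processing
db.live["obj1"] = float("nan")     # simulated corruption during the night
db.restore()
print(db.live)                     # back to the known-good state
```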

4.4. Process Replication
Certain critical systems will use process replication to handle faults. In this strategy, multiple instances
of a given process are active simultaneously. Each instance maintains identical state, either by
executing the same state update operations or by replicating the state data. If one instance fails, another
will provide the same result with minimal latency. This hot/hot mode of operation is preferred to a
hot/warm or hot/cold mode of operation in which a replica instance is not actively used by the system.
It ensures that the replica instance is actually operable, provides resource headroom to manage
unexpected spikes in usage, and encourages automated failover mechanisms in place of slower and
more error-prone human-driven mechanisms. The cost, of course, is the need for at least twice the
resources. Accordingly, this strategy will be used only for key subsystems that do not utilize large
numbers of machines. Clients will either need to select one of the active instances to contact or will
need to select one response (perhaps the first to arrive) if all active instances are contacted.
Software frameworks incorporating server selection, state replication, monitoring, and alerting will be
developed. Database management systems and their application interfaces typically already contain
large portions of this functionality.
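Client-side selection among hot replicas might be sketched as below; the names are hypothetical, and a real framework would add state replication, monitoring, and alerting around this core.

```python
def query_replicas(replicas, request):
    """Hot/hot sketch: contact active replica instances in order and take
    the first successful response; a failed instance adds only the small
    latency of moving on to the next one."""
    for name, instance in replicas:
        try:
            return name, instance(request)
        except Exception:
            continue  # fail over to the next hot replica
    raise RuntimeError("all replica instances failed")

def primary(req):
    raise ConnectionError("instance down")

def secondary(req):
    return req.upper()  # identical state yields the identical result

served_by, result = query_replicas([("a", primary), ("b", secondary)], "status")
print(served_by, result)
```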

4.5. Independent Task Rescheduling
In large-scale parallel computations, failures of individual nodes are unavoidable. Replication of the
entire computation is infeasible. One major strategy for handling faults is to continue with the
remainder of the calculation and reschedule the failed portion using spare capacity, either on spare
nodes or on the non-failed nodes after the remainder of the calculation is complete. This provides on-
time delivery of most of the data; failed data may be delivered on time if the failure occurs early
enough in the calculation but will otherwise be delayed. If the algorithm does not require
communication between nodes, and hence tasks are independent, rescheduling may be performed
easily. In addition, independent tasks allow the scheduler to provide load balancing when tasks take
differing amounts of time.
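The requeue-on-spare-capacity loop can be sketched as follows; the function names and the simulated single-failure worker are illustrative assumptions.

```python
def run_with_rescheduling(tasks, worker, max_passes=2):
    """Sketch: run independent tasks, collect failures, and reschedule
    the failed subset on spare capacity, keeping completed results."""
    results, pending = {}, list(tasks)
    for _ in range(max_passes):
        failed = []
        for t in pending:
            try:
                results[t] = worker(t)
            except Exception:
                failed.append(t)  # requeue on spare capacity
        pending = failed
        if not pending:
            break
    return results, pending

flaky = {"t2"}  # t2 fails once (simulated node failure), then succeeds
def worker(t):
    if t in flaky:
        flaky.discard(t)
        raise RuntimeError("node failure")
    return t + ":done"

results, still_failed = run_with_rescheduling(["t1", "t2", "t3"], worker)
print(sorted(results), still_failed)
```

Note that t1 and t3 are delivered by the first pass, while t2 arrives only after rescheduling, mirroring the on-time-for-most, delayed-for-failed behavior described above.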

4.6. Dependent Task Rescheduling With Dropped Communication
If tasks are not independent and require inter-node communication, one alternative is to ignore failed
nodes during the computation, treating them as if they did not exist. In effect, no communications
would be sent to or received from the failed nodes. The tasks on those nodes would be rescheduled on
spare capacity as in the previous strategy, but again without communication with the rest of the computation.
The advantage of this strategy is that some data with reduced quality is available on time, and the rest
of the data, again with reduced quality, is available later. For example, we parallelize the nightly alert
calculations by amplifier over a 200 machine cluster with 16 cores per machine. In this scenario, a
failure of one machine results in those 16 amplifiers’ calculations being rescheduled onto a spare
machine. The other machines’ results are delivered on time, although degraded due to lack of
information from the failed machine. The rescheduled results, also degraded, will be available a short
time after the desired alert deadline.
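The dropped-communication idea can be sketched as follows, where a failed neighbor's contribution is simply omitted and the result is flagged as degraded; all names here are hypothetical.

```python
def process_amplifier(amp, neighbor_data):
    """Sketch: compute a result from local plus neighbor measurements,
    ignoring neighbors on failed nodes (degraded but delivered on time)."""
    available = [v for v in neighbor_data.values() if v is not None]
    degraded = len(available) < len(neighbor_data)
    background = sum(available) / len(available) if available else 0.0
    return {"amp": amp, "background": background, "degraded": degraded}

# The node holding amplifier "B" has failed; its data is treated as absent.
neighbors = {"A": 2.0, "B": None, "C": 4.0}
result = process_amplifier("D", neighbors)
print(result["background"], result["degraded"])
```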

4.7. Dependent Task Rescheduling With Checkpoints
The alternative to dropping inter-node communication is to checkpoint the intermediate state of each
node’s computation, or its intended communications, to a location off the node. Then, a failed node’s
processing can be resumed from the checkpoint, allowing the overall computation to proceed, albeit
with a delay. The frequency of checkpoints can be adjusted to trade between reduced restart latency and
increased checkpoint storage and bandwidth requirements.
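The checkpoint-and-resume trade-off might be sketched like this, with a dictionary standing in for off-node checkpoint storage; names and structure are illustrative assumptions.

```python
def run_with_checkpoints(items, step, checkpoint, interval=2):
    """Sketch: fold `step` over `items`, writing intermediate state to
    off-node storage every `interval` items so a failed node's work can
    be resumed from the last checkpoint rather than from scratch."""
    start = checkpoint.get("index", 0)
    total = checkpoint.get("total", 0)
    for i in range(start, len(items)):
        total = step(total, items[i])
        if (i + 1) % interval == 0:
            checkpoint.update(index=i + 1, total=total)  # off-node store
    return total

ckpt = {}  # stands in for storage on another node
items = [1, 2, 3, 4, 5]
run_with_checkpoints(items[:3], lambda acc, x: acc + x, ckpt)  # node "fails"
# A spare node resumes from the last checkpoint instead of restarting;
# only the work since that checkpoint (item 3 here) is recomputed.
result = run_with_checkpoints(items, lambda acc, x: acc + x, ckpt)
print(ckpt["index"], result)
```

A shorter `interval` recomputes less after a failure but costs more checkpoint storage and bandwidth, which is exactly the tuning knob described above.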
4.8. Design Practices
There are many design practices that application developers need to be aware of when choosing
algorithms and implementing them. The DMS consistently uses these practices to dramatically
decrease the complexity and cost of building a robust and fault tolerant system.
Avoid overwrites. It is much easier to provide fault tolerance in a system where data is appended. To
recover from failure in a system where data is overwritten requires making a snapshot of data before
each update. Accordingly, the DMS will append records wherever possible.
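An append-only update discipline can be sketched as follows; recovery reduces to truncating the tail of the log, with no pre-update snapshot needed. The store is a hypothetical illustration, not a DMS interface.

```python
class AppendOnlyStore:
    """Sketch of avoid-overwrites: updates are appended as new records,
    so rolling back means dropping the tail written after the last
    known-good point rather than restoring a snapshot."""

    def __init__(self):
        self.log = []  # (key, value) records, append-only

    def put(self, key, value):
        self.log.append((key, value))

    def get(self, key):
        for k, v in reversed(self.log):  # the latest append wins
            if k == key:
                return v
        raise KeyError(key)

    def recover_to(self, length):
        del self.log[length:]  # roll back by truncating the appended tail

store = AppendOnlyStore()
store.put("obj1", 10)
good = len(store.log)     # last known-good point
store.put("obj1", 99)     # a bad update: appended, never overwritten
store.recover_to(good)
print(store.get("obj1"))  # the known-good value survives
```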
Segregate mutable and immutable data. The DMS will contain catalogs that are on the order of a
petabyte in size where only a few thousand individual rows will be updated during any given night.
Keeping these rows separate from the remaining billions of rows opens many powerful ways to
optimize and speed up recovery from a failure during an update.
No I/O by application stages. LSST DMS application stages do not make direct disk accesses. This
includes operations like operating on files or issuing database queries. Funneling disk I/O through the
middleware simplifies capturing faults, retrying and recovering from hardware failures, as well as
properly capturing provenance.
Minimize inter-node communication. Algorithms that require data from multiple nodes will in
general force checkpointing of intermediate results or cause degradation of data quality if missing data
is ignored. The application algorithms selected for the DMS baseline design use information from other
nodes only where necessary to produce results meeting the science requirements.

5. Development Plan Impact
At this point in the LSST DMS design process we have specified the fault tolerance strategies to be
used by system components and made preliminary assignments of these strategies to each component.
This level of specification is sufficient to estimate cost and schedule impacts of the fault tolerance design.
The fault tolerance software frameworks, as described above, are being developed as part of the
middleware layer. Their areas of applicability, detection strategies, and recovery strategies are being
clearly documented so that future application pipelines can choose appropriate strategies. Tests of the
common failure classes and their handling by the frameworks will be included in not only framework-
specific test suites but also integration test suites.
A preliminary fault tolerance design pattern has been chosen for each of the DMS components so that
the overall system can meet its availability requirements, as specified in the following table:
System                                    | Reduced Data Quality | Redundancy | Independent Task Rescheduling | Dependent Tasks - Dropped | Dependent Tasks - Checkpoint
Nightly Data Pipelines:
  Image Processing Pipeline               | X                    |            |                               | At Base Facility          | At Archive Center
  Detection Pipeline                      |                      |            |                               | At Base Facility          | At Archive Center
  Association Pipeline                    | X                    |            |                               | At Base Facility          | At Archive Center
  Moving Object Pipeline                  |                      |            |                               | At Base Facility          | At Archive Center
  Alert Processing                        |                      |            |                               | At Base Facility          | At Archive Center
Data Release Pipelines                    |                      |            |                               |                           | X
Calibration Products Pipeline             |                      |            |                               |                           | X
Science Data Archive Catalogs and MetaData:
  Image Meta-data                         |                      | X          |                               |                           |
  Source Catalog                          |                      | X          | Complex Queries               |                           |
  Object Catalog                          |                      | X          | Complex Queries               |                           |
  Orbit Catalog                           |                      | X          |                               |                           |
  Alert Archive                           |                      | X          |                               |                           |
  Engineering & Facility Data Archive     |                      | X          |                               |                           |
  Deep Object Catalog                     |                      | X          | Complex Queries               |                           |
  Image Archive                           |                      | X          |                               |                           |
Data User Interfaces                      | X                    | X          |                               |                           |
Operational Control and Monitoring System |                      | X          |                               |                           |

In the next segment of the R&D phase, prior to Critical Design Review, the DMS system developers
will document the detailed configuration of the fault tolerance design for each component. They will
show that the availability requirements will be met. Preventive maintenance procedures will also need
to be documented.
System-unique failure modes will be incorporated into system test plans.
The fault tolerance design will be validated in Data Challenges 3 and 4.
The end result will be a common set of fault tolerance concepts and practices, implemented
consistently throughout the DMS in software and hardware.

A. Use Cases
The subsections below present use cases for faults in the LSST DMS. These use cases answer the
question: what are the possible things that could or might happen to cause a fault in the LSST DMS? In
UML and the ICONIX process, a fault-tolerance use case essentially comprises the "alternate course"
of a basic use case.
Of course, it is impossible to make a complete list of everything that could possibly go wrong.
However, listing the problems from past experience on prior ground-based astronomical pipeline-
processing projects seems to be a good starting point for building up a set of credible fault-tolerance
use cases for the LSST DMS. And delving into the details will ensure that the LSST DMS will be
robust, as it is commonly held that well over half of a project's complexity is caused by dealing with
alternate courses of action or, simply, faults.
These use cases are primarily formulated from the perspective of what could or might go wrong to
interfere with the transmission and/or preservation of the raw data (images and metadata), and/or
production of the processed data products (nightly alerts, and science catalog data). Prevention of loss
of raw image data and metadata is an important and absolutely essential job of the LSST DMS.
Another class of use cases of deep concern covers the generation of nightly alerts and the preservation of
science catalog data. This class is aimed at meeting the LSST science requirements, which are the most
important, although attention to meeting the functional requirements is also critical.
The LSST DMS has subsystems in the summit observatory, base facility, archive center, and data
access centers. These subsystems are data-connected via a short-haul network between summit and
base, and a long-haul network from base to archive center and data access centers. DMS subsystems
include facilities, staff, hardware, software, database, and data. Furthermore, it is assumed that
acquisition of the raw data is outside of the purview of the DMS, but, once the raw data is acquired, it
falls into the DMS domain.
Any loss of raw images and their metadata is absolutely not allowed under system requirements and,
hence, any such loss discussed below in the context of use cases is really only temporary loss. The data
backup plan therefore includes highly reliable storage media, aggressive checksum validation, file
storage cross-validation with database records, and geographically distributed redundant copies of the
data. Redundant copies of the data will also be validated. Both the raw-image-data files and raw-image-
metadata database tables are maintained redundantly at the Base Facility and the Archive Center.
Finally, any generic software fault could also apply to the SDQA software, and to fault-tolerance-
related software, such as watchdog monitors, etc. Note that the terminology "SDQA fault" used below
basically refers to mistakes made by the SDQA subsystem in missing problems with the data, and
falsely identifying problems with the data. Since no detection algorithm is perfect, it is expected that
neither SDQA-detection completeness nor reliability will be 100%. Nevertheless, the SDQA system
will be tuned to achieve the best possible compromise between completeness and reliability for the
LSST DMS overall.
A.1. Prioritization of Faults
We classify LSST DMS faults in terms of their priority and, to cover all cases, designate three levels of priority.
Priority-1 faults require the highest level of attention and resources, and are those that:
    1. Delay transmission of the raw images and their metadata from the summit to the base facility
    2. Prevent reliable storage of raw images and their metadata
    3. Result in only temporary unavailability of raw images or their metadata, rather than complete loss
If faults of this priority occur, the data will be irrecoverably lost unless the data backup plan is
comprehensive, reliable and bullet-proof.
Priority-2 faults require lower levels of attention and resources than Priority-1 faults, and are those
associated with
    1. Inability to meet the 60-s requirement for nightly alert generation
    2. Loss of science catalog data
One rationale for the content of the Priority-2 level is that the nightly alerts and science catalogs,
derived from the raw data, are the primary processed data products of the LSST DMS. These derived
data products can be recomputed from the raw data, but at some dollar cost, as well as failing to meet
the 60-second requirement. A robust plan for minimizing possible faults that hinder meeting the time
constraint (for item 1) and reliably replacing lost science catalog data from redundant data backups (for
item 2) is of paramount importance.
Priority-3 faults require still lower levels of attention and resources than Priority-1 and Priority-2
faults. They include problems during data release processing that can lead to loss of processed images,
their metadata, science catalog data, and other database metadata (until the associated raw data are
reprocessed), especially from recent processing history, as well as temporary reductions in
data-processing throughput.

1.2. Use Cases for Priority-1 Faults
The following are use cases that cover loss of raw image data and metadata.
    1. Faults in temporary summit storage
            Raw image data are lost
            Raw image metadata are lost
            Raw image data/metadata associations are lost
    2. Faults in temporary base storage
            Raw image data are lost
            Raw image metadata are lost
            Raw image data/metadata associations are lost
    3. Faults in primary archive storage
            Raw image data are lost
            Raw image metadata are lost
            Raw image data/metadata associations are lost
    4. Faults in redundant archive storage
            Raw image data are lost
            Raw image metadata are lost
            Raw image data/metadata associations are lost
    5. Uncorrected errors in TCP network data transfer
Metadata about the raw-image data include, but are not limited to, all copies of database records
indicating where the primary and redundant copies are stored.
Data loss includes data corruption, which effectively renders the data useless.
Data corruption includes unrecoverable errors found by disk ECC and silent data corruption (not
detected by disk ECC).
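Silent corruption, by definition, is not caught by disk ECC, so it can only be detected by comparing an end-to-end checksum against a digest recorded when the data were first ingested. The following sketch (Python; the digest-at-ingest workflow is an assumption, not a specified DMS interface) illustrates the idea using SHA-256:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256. An end-to-end digest like this
    catches silent corruption that disk ECC misses, provided the
    reference digest was recorded at ingest time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: str, recorded_digest: str) -> bool:
    """Compare the file's current digest against the digest stored
    with the raw-image metadata (hypothetical workflow)."""
    return sha256_of(path) == recorded_digest
```

The same digests can be re-verified after every network transfer, covering the TCP use case above as well.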

1.3. Use Cases for Priority-2 Faults
The following are use cases that cover faults that prevent nightly alert generation within 60 seconds and
loss of science catalog data.
    1. Nightly alerts are not generated because of facility fault
    2. Nightly alerts are not generated because of human fault
    3. Nightly alerts are not generated because of hardware fault
    4. Nightly alerts are not generated because of resource fault
    5. Nightly alerts are not generated because of software fault
    6. Nightly alerts are not generated because of database fault
    7. Nightly alerts are not generated because of data fault
    8. Nightly alerts are not generated because of SDQA fault
    9. Sources and/or objects are misidentified or inaccurate because of SDQA fault
    10. Sources and/or objects database records are lost from primary storage
    11. Sources and/or objects database records are lost from redundant storage
Possible facility, human, hardware, resource, software, database and data faults are detailed separately
below. In some cases, the specific fault leading to processing failure can be classified in multiple
categories. Note that database faults are put in a separate category because of their special nature and
the specialization required to address them.

1.4. Use Cases for Priority-3 Faults
The following are use cases that cover loss of processed image data, especially in recent processing
history.
    1. Data release processing fails because of facility fault
    2. Data release processing fails because of human fault
    3. Data release processing fails because of hardware fault
    4. Data release processing fails because of a resource fault
    5. Data release processing fails because of software fault
    6. Data release processing fails because of database fault
    7. Data release processing fails because of data fault
    8. Data release processing fails because of SDQA fault
    9. Sources and/or objects are misidentified or inaccurate because of SDQA fault
    10. Sources and/or objects database records are lost from primary storage
    11. Sources and/or objects database records are lost from redundant storage
Possible facility, human, hardware, resource, software, database and data faults are detailed separately
below. In some cases, the specific fault leading to processing failure can be classified in multiple
categories. Note that database faults are put in a separate category because of their special nature and
the specialization required to address them.
1.5. Underlying Causes of Faults
1.5.1. Facility Faults
       Natural disaster (fire, earthquake, flood, tornado, etc.)
       Man-made catastrophe (radioactive contamination, airline crash, poisonous gas, etc.)
       Act of war (attack, siege, sabotage, etc.)
       Security
             Computer firewall breach (hacker, virus, etc.)
             Unauthorized computer-room access
       System resets (checksum mismatches have been observed to correlate with these)
       Air-conditioning malfunction
       Electrical fuse blown

1.5.2. Human Faults
       Staff problems (malicious intent, negligence, retention/turnover, labor strike, slow-down or
        sick-out, etc.)
       Pipeline-operator procedural error
       Specialist unavailability (e.g., DBA or MySQL expert during crisis)
       Slow turnaround in fixing/delivering software bugs

1.5.3. Hardware Faults
       Summit-base fiber link/interfaces failure ("short haul")
       Global data transfer link/interfaces failure ("long haul")
       CPU failure
       RAM failure
       Local disk failure
       Power supply failure
       Network switch failure
       Network disk problems
               Catastrophic failure (disk media, disk controller, etc.)
               Corrupted data
                     Latent sector errors caught by disk ECC
                     Silent corruption (checksum mismatches)
       Hardware upgrade not compatible with software (portability issues, backward incompatibility,
        etc.)
       Unsuccessful machine reboot

1.5.4. Resource Faults
       Power failure (black out, brown out, etc.)
       Insufficient disk space
       Disk performance degradation (can occur for disks > 90% full, fragmentation, etc.)
       Disk thrashing caused by insufficient memory
       Disk/network speed mismatch (bandwidth, maximum number of reads/writes per second, etc.)
       Database resource faults
             Bandwidth limitations caused by resource over-allocation
             Too many database connections
             Performance degradation caused by:
                      Large tables filling up
                     Too many queries running
                     Large queries running
                     Usage statistics not updated
                     Insufficient table-space allocation
                     Progressive index computation slowdown
                     Transaction logging disk space filling up
                     Transaction rollback taking too long
                     Miscellaneous mistunings
      Insufficient disk-space allocation
      Network bandwidth limitation (sustained or peak specifications exceeded)
      Memory segment fault (stack size exceeded, insufficient heap allocation, misassignment of
       large-memory process to small-memory machine, etc.)
      OS limits exceeded (queue length for file locking, number of open files per process, etc.)
      Bottleneck migration (e.g., increase in processor throughput hammers database harder)
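Several of the resource faults above (insufficient disk space, degradation on disks more than 90% full) are preventable with routine monitoring before each pipeline run. A minimal sketch of such a check follows; the 80% warning and 90% failure levels are illustrative thresholds, not specified DMS values, with 90% chosen to mirror the degradation point noted above.

```python
import shutil

def classify_usage(frac: float, warn_at: float = 0.80, fail_at: float = 0.90) -> str:
    """Map a used-space fraction to a status. The 0.90 level mirrors
    the >90%-full point at which disk performance can degrade."""
    if frac >= fail_at:
        return "FAIL"
    if frac >= warn_at:
        return "WARN"
    return "OK"

def check_disk(path: str) -> str:
    """Check the filesystem holding `path` before launching a run."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used / usage.total)
```

A scheduler could refuse to dispatch new processing jobs to a node whose archive volume reports WARN or FAIL, converting a would-be Priority-2 or Priority-3 fault into an operations alert.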

1.5.5. Software Faults
      Software inadequacies and bugs flushed out by data-dependent processing
      Incorrect software version installed
      Incompatibility with operating system software
      OS, library, database software, or third-party-software upgrade problem
      Cron job, client, or daemon inadvertently stopped
      Environment misconfiguration or loss (binary executable or third-party software not in path,
       dynamic library not found, etc.)
      Processing failures due to algorithmic faults
            Division by zero
            No convergence of iterative algorithm
            Insufficient input data
      Processing failures related to files
            Can't open file
            File not found
      Processing failures related to sockets
            Port number not available
            Socket connection broken
      Processing failures related to database (also see section on database faults below)
            Can't connect to database
            Missing stored function
      Faults associated with user-contributed software
      Problems with user retrieving data from archive
      Problems reverting to previous build (incomplete provenance of software and builds)
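A useful distinction among the processing failures listed above is between transient faults (broken sockets, lost database connections) that are worth retrying and permanent faults (division by zero, missing files) that should fail fast so the operator sees the real cause. The sketch below illustrates this with an assumed, illustrative mapping of exception types to the two classes; it is not the DMS error-handling design.

```python
import time

# Hypothetical classification of fault types; illustrative only.
TRANSIENT = (ConnectionError, TimeoutError)          # e.g., broken sockets
PERMANENT = (ZeroDivisionError, FileNotFoundError)   # e.g., algorithmic faults

def run_with_retries(stage, max_attempts: int = 3, delay_s: float = 0.0):
    """Retry transient faults; re-raise permanent ones immediately
    so the pipeline operator sees the underlying cause."""
    for attempt in range(1, max_attempts + 1):
        try:
            return stage()
        except PERMANENT:
            raise                       # fail fast: retrying cannot help
        except TRANSIENT:
            if attempt == max_attempts:
                raise                   # exhausted the retry budget
            time.sleep(delay_s)
```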

1.5.6. Database Faults
      Database server goes down
      Database client software incompatible with database server software
      Bugs in upgraded versions of database server software
      Can't connect to database
      Can't set database role
      Can't execute query
      Can't execute stored function
      Missing stored function
      Queries take too long
      Table locking
      Transaction rollback error
      Transaction logging out of disk space
      Record(s) missing
      More than one record unexpectedly returned
      Inserting record with primary key violation or missing foreign key
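The last item, a primary-key violation, is one database fault that ingest code can trap locally rather than letting it abort a whole batch. The sketch below uses the standard-library sqlite3 module purely as a stand-in for the production database, and the `source` table schema is hypothetical:

```python
import sqlite3

# sqlite3 stands in for the production database; the schema is
# hypothetical and purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, flux REAL)")

def insert_source(source_id: int, flux: float) -> bool:
    """Insert one record; on a duplicate primary key, roll back the
    statement and report failure instead of aborting the batch."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO source VALUES (?, ?)", (source_id, flux))
        return True
    except sqlite3.IntegrityError:
        return False

print(insert_source(1, 3.5))  # True
print(insert_source(1, 9.9))  # False: primary-key violation trapped
```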

1.5.7. Data Faults
      Uncorrected errors in TCP communications
      Missing or bad input data
            Bad images (missing, noisy data, or instrument-artifact-contaminated pixels; not enough
              good sources for sufficient astrometric and/or photometric calibration; etc.)
            Missing/unavailable database data (e.g., PM and operations activities not synchronized)
            Bad or wrong calibration data used in processing
             Unavailability of calibration images (missing observations, calibration-pipeline error, etc.)
                     Use lower quality fallback calibration data (affects SDQA)
                     Missing fallback calibration data
            Unavailability of configuration or policy data files
      Failure to flag dead, dying, or hot pixel-detectors in data mask
      Publicly released data are found to have problems after release

1.5.8. SDQA Faults
      Incorrect or mistuned QA-metric threshold setting(s) for automatic SDQA
      Failure to do sufficient manual SDQA on a particular data set
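The first fault above can be made concrete with a sketch of threshold-based automatic SDQA. The metric names and acceptable ranges below are hypothetical, not actual LSST settings; a mistuned entry in such a table is exactly the fault described: too tight and good data are flagged (low reliability), too loose and bad data pass (low completeness).

```python
# Hypothetical QA-metric thresholds for automatic SDQA; the names
# and (min, max) ranges are illustrative, not LSST settings.
THRESHOLDS = {
    "seeing_arcsec":  (0.0, 2.0),
    "sky_background": (0.0, 500.0),
    "astrom_rms_mas": (0.0, 50.0),
}

def sdqa_flags(metrics: dict) -> list:
    """Return the names of metrics that fall outside their
    acceptable range; unknown metrics pass unflagged."""
    flags = []
    for name, value in metrics.items():
        lo, hi = THRESHOLDS.get(name, (float("-inf"), float("inf")))
        if not (lo <= value <= hi):
            flags.append(name)
    return flags

print(sdqa_flags({"seeing_arcsec": 2.7, "astrom_rms_mas": 12.0}))
# ['seeing_arcsec']
```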
