Reliable Distributed Systems



  How and Why Complex Systems Fail
How and Why Systems Fail
   We’ve talked about
       Transactional reliability
       And we’ve mentioned replication for high availability
   But does this give us “fault-tolerant systems”?
   How and why do real systems fail?
   Do real systems offer the hooks we’ll need to detect and respond to failures?
   Failure is just one of the aspects of reliability,
    but it is clearly an important one
   To make a system fault-tolerant we need to
    understand how to detect failures and plan
    an appropriate response if a failure occurs
   This lecture focuses on how systems fail, how
    they can be “hardened”, and what still fails
    after doing so
Systems can be built in many ways
   Reliability is not always a major goal when
    development first starts
   Most systems evolve over time, through
    incremental changes with some rewriting
   Most reliable systems are entirely rewritten
    using clean-room techniques after they reach
    a mature stage of development
Clean-room concept
   Based on goal of using “best available” practice
   Requires good specifications
   Design reviews in teams
   Actual software also reviewed for correctness
   Extensive stress testing and code coverage testing,
    use tools like “Purify”
   Use of formal proof tools where practical
But systems still fail!
   Gray studied failures in Tandem systems
   Hardware was fault-tolerant and rarely
    caused failures
   Software bugs, environmental factors,
    human factors (user error), and incorrect
    specification were all major sources of failure
Bohrbugs and Heisenbugs
   Classification proposed by Bruce Lindsay
   Bohrbug: like the Bohr atom: solid,
    easily reproduced, can track it down and fix it
   Heisenbug: like Heisenberg’s uncertainty principle: a diffuse
    cloud, very hard to pin down and hence fix
   Anita Borg and others have studied life-cycle bugs in
    complex software using this classification
    [Figure: a programmer facing bugs — the Heisenbug is fuzzy,
     hard to find/fix; the Bohrbug is solid, easy to recognize and fix]
Lifecycle of Bohrbug
   Usually introduced in some form of code
    change or in original design
   Often detected during thorough testing
   Once seen, easily fixed
   Remain a problem over life-cycle of software
    because of need to extend system or to
    correct other bugs.
   Same input will reliably trigger the bug!
Lifecycle of Bohrbug

       A Bohrbug is boring.
Lifecycle of a Heisenbug
   These are often side-effects of some other problem
   Example: bug corrupts a data structure or
    misuses a pointer. Damage is not noticed
    right away, but causes a crash much later
    when structure is referenced
   Attempting to detect the bug may shift
    memory layout enough to change its behavior
How programmers fix a Bohrbug
   They develop a test scenario that triggers it
   Use a form of binary search to narrow in on it
   Pin down the bug and understand precisely
    what is wrong
   Correct the algorithm or the coding error
   Retest extensively to confirm that the bug is gone
How they fix Heisenbugs
   They fix the symptom: periodically scan the
    structure that is usually corrupted and clean
    it up
   They add self-checking code (which may itself
    be a source of bugs)
   They develop theories of what is wrong and
    fix the theoretical problem, but lack a test to
    confirm that this eliminated the bug
   These bugs are extremely sensitive to event ordering
Bug-free software is unachievable
   Heavily used software may become extremely
    reliable over its life (the C compiler rarely
    crashes, UNIX is pretty reliable by now)
   Large, complex systems depend upon so
    many components, many complex, that bug
    freedom is an unachievable goal
   Instead, adopt view that bugs will happen
    and we should try and plan for them
Bugs in a typical distributed system
   Usual pattern: some component crashes or
    becomes partitioned away
   Other system components that depend on it
    freeze or crash too
   Chains of dependencies gradually cause more
    and more of the overall system to fail or freeze
Tools can help
   Everyone should use tools like “Purify”
    (detects stray pointers, uninitialized variables
    and memory leaks)
   But these tools don’t help at the level of a
    distributed system
   Benefit of a model, like transactions or virtual
    synchrony, is that the model simplifies
    developer’s task
Leslie Lamport
“A distributed system is one in which the failure
  of a machine you have never heard of can
  cause your own machine to become unusable.”

   Issue is dependency on critical components
   Notion is that state and “health” of system at
    site A is linked to state and health at site B
Component Architectures Make it Worse
   Modern systems are structured using object-
    oriented component interfaces:
       CORBA, COM (or DCOM), Jini
       XML
   In these systems, we create a web of
    dependencies between components
   Any faulty component could cripple the system
Reminder: Networks versus
Distributed Systems
   Network focus is on connectivity but
    components are logically independent:
    program fetches a file and operates on it, but
    server is stateless and forgets the interaction
       Less sophisticated but more robust?
   Distributed systems focus is on joint behavior
    of a set of logically related components. Can
    talk about “the system” as an entity.
       But needs fancier failure handling!
Component Systems?
   Includes CORBA and Web Services
   These are distributed in the sense of our definition:
       If a component fails, replacing it with a new
        version may be hard
       Replicating the state of a component: an
        appealing option…
            Deceptively appealing, as we’ll see
Thought question
   Suppose that a distributed system was built
    by interconnecting a set of extremely reliable
    components running on fault-tolerant hardware
   Would such a system be expected to be reliable?
Thought question
   Suppose that a distributed system was built by
    interconnecting a set of extremely reliable
    components running on fault-tolerant hardware
   Would such a system be expected to be reliable?
       Perhaps not. The pattern of interaction, the need to match
        rates of data production and consumption, and other
        “distributed” factors all can prevent a system from operating correctly
   The Web components are individually reliable
   But the Web can fail by returning inconsistent or
    stale data, can freeze up or claim that a server is not
    responding (even if both browser and server are
    operational), and it can be so slow that we consider it
    faulty even if it is working
   For stateful systems (the Web is stateless) this issue
    extends to joint behavior of sets of programs
   The Ariane rocket is designed in a modular fashion:
       Guidance system
       Flight telemetry
       Rocket engine control
       … etc.
   When they upgraded some rocket
    components in a new model, working
    modules failed because hidden assumptions
    were invalidated.
Ariane Rocket

   [Figure, repeated over three slides: Ariane modules — attitude,
    telemetry, guidance, and altitude]
   Correctness depends very much on the environment
       A component that is correct in setting A may be
        incorrect in setting B
       Components make hidden assumptions
       Perceived reliability is in part a matter of
        experience and comfort with a technology base
        and its limitations!
Detecting failure
   Not always necessary: there are ways to
    overcome failures that don’t explicitly detect them
   But situation is much easier with detectable failures
   Usual approach: process does something to
    say “I am still alive”
   Absence of proof of liveness taken as
    evidence of a failure
Example: pinging with timeouts
   Programs P and B are the primary and
    backup of a service
   Programs X, Y, Z are clients of the service
   All “ping” each other for liveness
   If a process doesn’t respond to a few
    pings, consider it faulty.
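The ping-with-timeout rule above can be sketched in a few lines (a minimal illustration; the `PingDetector` name and the interval/threshold defaults are invented for this example, not part of any real system):

```python
import time

class PingDetector:
    """Declare a peer "suspected faulty" after it misses several expected
    pings. Minimal sketch: names and default thresholds are invented."""

    def __init__(self, interval=1.0, max_missed=3):
        self.interval = interval      # expected seconds between ping replies
        self.max_missed = max_missed  # replies we tolerate missing
        self.last_heard = {}          # peer -> time of last reply

    def heard_from(self, peer, now=None):
        """Record a ping reply (or any other message) from peer."""
        self.last_heard[peer] = time.time() if now is None else now

    def is_suspected(self, peer, now=None):
        """True once the silence exceeds max_missed ping intervals."""
        now = time.time() if now is None else now
        last = self.last_heard.get(peer)
        if last is None:
            return False              # not yet monitored: no evidence either way
        return (now - last) > self.interval * self.max_missed
```

Note that this yields only suspicion, not certainty: silence may mean a crash, a partition, or just a slow network.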
Consistent failure detection
   Impossible in an asynchronous network that
    can lose packets: partitioning can mimic failure
       Best option is to track membership
       But few systems have group membership (GMS) services
   Many real networks suffer from this problem,
    hence consistent detection is impossible “in
    practice” too!
   Can always detect failures if risk of mistakes
    is acceptable
Component failure detection
   An even harder problem!
   Now we need to worry
       About programs that fail
       But also about modules that fail
   Unclear how to do this or even how to define it
       Recall that RPC makes component use
        rather transparent…
Vogels: the Failure Investigator
   Argues that we would not consider someone to have
    died because they don’t answer the phone
   Approach is to consult other data sources:
       Operating system where process runs
       Information about status of network routing nodes
       Can augment with application-specific solutions
   Won’t detect program that looks healthy but is
    actually not operating correctly
Further options: “Hot” button
   Usually implemented using shared memory
   Monitored program must periodically update
    a counter in a shared memory region.
    Designed to do this at some frequency, e.g.
    10 times per second.
   Monitoring program polls the counter,
    perhaps 5 times per second. If counter stops
    changing, kills the “faulty” process and
    notifies others.
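The counter scheme might look like this (a single-process sketch: an ordinary attribute stands in for the word in the shared memory region, and the class and parameter names are invented for illustration):

```python
class HotButton:
    """Sketch of the "hot button" watchdog. The monitored process bumps a
    counter (here an ordinary attribute standing in for a word in a shared
    memory region); the monitor polls it and reports a stall once the
    counter stops advancing."""

    def __init__(self, max_stalled=3):
        self.counter = 0          # written by the monitored process
        self.last_seen = 0        # monitor's copy from the previous poll
        self.stalled_polls = 0    # consecutive polls with no progress
        self.max_stalled = max_stalled

    def heartbeat(self):
        """Called ~10x/sec by the monitored process."""
        self.counter += 1

    def poll(self):
        """Called ~5x/sec by the monitor; True means 'declare it faulty'
        (a real monitor would then kill the process and notify others)."""
        if self.counter == self.last_seen:
            self.stalled_polls += 1
        else:
            self.stalled_polls = 0
            self.last_seen = self.counter
        return self.stalled_polls >= self.max_stalled
```

A real implementation would place the counter in memory shared between the two processes (e.g. via `mmap`) so the monitor survives the monitored program's crash.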
Friedman’s approach
   Used in a telecommunications co-processor
   Can’t wait for failures to be sensed, so his
    protocol reissues requests as soon as
    the reply seems late
   Issue of detecting failure becomes a
    background task; need to do it soon enough
    so that overhead won’t be excessive or
    realtime response impacted
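The reissue-on-late-reply idea can be approximated as below. This is a hypothetical sketch, not Friedman's actual protocol: `send` is an assumed transport callback that delivers the reply by putting it on a queue, and reissuing this way is only safe when requests are idempotent.

```python
import queue
import threading

def request_with_early_reissue(send, replicas, timeout=0.05):
    """Reissue a request to the next replica as soon as the reply looks
    late, instead of waiting for failure detection; the first reply wins.
    Sketch only; safe only for idempotent requests."""
    replies = queue.Queue()
    for replica in replicas:
        threading.Thread(target=send, args=(replica, replies),
                         daemon=True).start()
        try:
            return replies.get(timeout=timeout)  # reply arrived in time
        except queue.Empty:
            continue                             # looks late: reissue
    return replies.get()                         # last resort: block
```

Failure detection still runs in the background, but it is off the critical path: a slow or dead replica costs one timeout rather than a full detection round.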
Broad picture?
   Distributed systems have many components,
    linked by chains of dependencies
   Failures are inevitable, hardware failures are
    less and less central to availability
   Inconsistency of failure detection will
    introduce inconsistency of behavior and could
    freeze the application
Suggested solution?
   Replace critical components with group of
    components that can each act on behalf of
    the original one
   Develop a technology by which states can be
    kept consistent and processes in system can
    agree on status (operational/failed) of components
   Separate handling of partitioning from
    handling of isolated component failures if possible
Suggested Solution

   [Figure: a module and the components it uses; the critical module is
    replaced by a transparently replicated group coordinated by multicast]
Replication: the key technology
   Replicate critical components for availability
   Replicate critical data: like coherent caching
   Replicate critical system state: control
    information such as “I’ll do X while you do Y”
   In limit, replication and coordination are
    really the same problem
Basic issues with the approach
   We need to understand client-side
    software architectures better to
    appreciate the practical limitations on
    replacing a server with a group
   Sometimes, this simply isn’t practical
Client-Server issues
   Suppose that a client observes a failure
    during a request
   What should it do?
Client-server issues
   What should the client do?
       No way to know if request was finished
       We don’t even know if the server really crashed
       But suppose it genuinely crashed…
Client-server issues
   What should client “say” to backup?
       Please check on the status of my last request?
            But perhaps backup has not yet finished the fault-handling
       Reissue request?
            Not all requests are idempotent
   And what about any “cached” server state? Will it
    need to be refreshed?
   Worse still: what if the RPC throws an exception, e.g. a
    “demarshalling error”?
       A risk if failure breaks a stream connection
Client-server issues
   Client is doing a request that might be
    disrupted by failure
       Must catch this case
   Client needs to reconnect
       Figure out who will take over
       Wait until it knows about the crash
       Cached data may no longer be valid
       Track down outcome of pending requests
   Meanwhile must synchronize wrt any new
    requests that application issues
Client-server issues
   This argues that we need to make
    server failure “transparent” to client
       But in practice, doing so is hard
       Normally, this requires deterministic servers
            But not many servers are deterministic
       Techniques are also very slow…
Client-server issues
   Transparency
       On client side, “nothing happens”
       On server side
            There may be a connection that backup needs
             to take over
            What if server was in the middle of sending a reply?
            How can backup exactly mimic actions of the primary?
Other approaches to consider
   N-version programming: use more than one
    implementation to overcome software bugs
       Explicitly uses some form of group architecture
       We run multiple copies of the component
       Compare their outputs and pick majority
            Could be identical copies, or separate versions
            In limit, each is coded by a different team!
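The output-comparison step can be as simple as a strict majority vote (a sketch; it assumes replica outputs are hashable values that can be compared for equality):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas,
    or None when no value has a majority."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None
```

A `None` result means the versions disagree too badly to mask the fault, which is itself a useful failure signal.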
Other approaches to consider
   Even with n-version programming, we get
    limited defense against bugs
       ... studies show that Bohrbugs will occur in all
        versions! For Heisenbugs we won’t need multiple
        versions; running one version multiple times
        suffices if versions see different inputs or different
        order of inputs
Logging and checkpoints
   Processes make periodic checkpoints, log messages
    sent in between
   Rollback to consistent set of checkpoints after a
    failure. Technique is simple and costs are low.
   But method must be used throughout system and is
    limited to deterministic programs (everything in the
    system must satisfy this assumption)
   Consequence: useful in limited settings.
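A toy version of checkpoint-plus-message-log recovery (all names are invented; the integer state and the `apply` rule stand in for a real server, and replay works only because `apply` is deterministic, which is exactly the restriction noted above):

```python
class LoggedProcess:
    """Toy checkpoint-and-message-log recovery: checkpoint the state
    periodically, log messages received since, and on failure roll back
    to the checkpoint and replay the log."""

    def __init__(self):
        self.state = 0        # volatile state, lost on a crash
        self.checkpoint = 0   # last stable checkpoint
        self.log = []         # messages received since the checkpoint

    def apply(self, msg):
        """Deterministic state transition, logged for replay."""
        self.log.append(msg)
        self.state += msg

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log = []         # earlier messages are now subsumed

    def recover(self):
        """Roll back to the checkpoint, then replay the logged messages."""
        self.state = self.checkpoint
        for msg in self.log:
            self.state += msg
```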
Byzantine approach
   Assumes that failures are arbitrary and may be malicious
   Uses groups of components that take actions by
    majority consensus only
   Protocols prove to be costly
       3t+1 components needed to overcome t failures
       Takes a long time to agree on each action
   Currently employed mostly in security settings
Hard practical problem
   Suppose that a distributed system is built from
    standard components with application-specific code
    added to customize behavior
   How can such a system be made reliable without
    rewriting everything from the ground up?
   Need a plug-and-play reliability solution
   If reliability increases complexity, will reliability
    technology actually make systems less reliable?
