Reliable Distributed Systems

How and Why Complex Systems Fail
How and Why Systems Fail
   We’ve talked about
       Transactional reliability
       And we’ve mentioned replication for high
        availability
   But does this give us “fault-tolerant
    solutions?”
   How and why do real systems fail?
   Do real systems offer the hooks we’ll need to
    intervene?
Failure
   Failure is just one of the aspects of reliability,
    but it is clearly an important one
   To make a system fault-tolerant we need to
    understand how to detect failures and plan
    an appropriate response if a failure occurs
   This lecture focuses on how systems fail, how
    they can be “hardened”, and what still fails
    after doing so
Systems can be built in many ways
   Reliability is not always a major goal when
    development first starts
   Most systems evolve over time, through
    incremental changes with some rewriting
   The most reliable systems are entirely rewritten
    using clean-room techniques after they reach
    a mature stage of development
Clean-room concept
   Based on goal of using “best available” practice
   Requires good specifications
   Design reviews in teams
   Actual software also reviewed for correctness
   Extensive stress testing and code-coverage testing,
    using tools like “Purify”
   Use of formal proof tools where practical
But systems still fail!
   Jim Gray studied failures in Tandem
    systems
   Hardware was fault-tolerant and rarely
    caused failures
   Software bugs, environmental factors,
    human factors (user error), incorrect
    specification were all major sources of
    failure
Bohrbugs and Heisenbugs
   Classification proposed by Bruce Lindsay
   Bohrbug: like the Bohr model of the atom: solid,
    easily reproduced, can track it down and fix it
   Heisenbug: like Heisenberg’s uncertainty principle: a
    diffuse cloud, very hard to pin down and hence fix
   Anita Borg and others have studied life-cycle bugs in
    complex software using this classification
    Programmer-facing bugs:
    [Figure: a Heisenbug is fuzzy, hard to find and fix; a Bohrbug is solid, easy to recognize and fix]
Lifecycle of a Bohrbug
   Usually introduced in some form of code
    change or in original design
   Often detected during thorough testing
   Once seen, easily fixed
   They remain a problem over the life-cycle of the
    software because of the need to extend the system
    or to correct other bugs.
   Same input will reliably trigger the bug!
Lifecycle of a Bohrbug


       A Bohrbug is boring.
Lifecycle of a Heisenbug
   These are often side-effects of some other
    problem
   Example: bug corrupts a data structure or
    misuses a pointer. Damage is not noticed
    right away, but causes a crash much later
    when structure is referenced
   Attempting to detect the bug may shift
    memory layout enough to change its
    symptoms!
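As a hedged illustration (not from the lecture), the Python sketch below shows the “damage now, crash later” pattern: a rare condition silently corrupts an entry, and the crash surfaces only when that entry is read much later, far from the faulty code.

    # Hypothetical example: the corrupting write and the eventual crash are far apart.
    table = {}

    def buggy_update(key, value):
        if key % 1000 == 13:              # rare, input-dependent condition
            table[key] = None             # silent corruption: wrong type stored
        else:
            table[key] = {"value": value}

    def lookup(key):
        return table[key]["value"]        # much later: TypeError if the entry was corrupted

    for k in range(2000):
        buggy_update(k, k * k)

    print(lookup(5))                      # works
    print(lookup(1013))                   # crashes here, nowhere near the real bug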
How programmers fix a Bohrbug
   They develop a test scenario that triggers it
   Use a form of binary search to narrow in on it
   Pin down the bug and understand precisely
    what is wrong
   Correct the algorithm or the coding error
   Retest extensively to confirm that the bug is
    fixed
How they fix Heisenbugs
   They fix the symptom: periodically scan the
    structure that is usually corrupted and clean
    it up (see the sketch below)
   They add self-checking code (which may itself
    be a source of bugs)
   They develop theories of what is wrong and
    fix the theoretical problem, but lack a test to
    confirm that this eliminated the bug
   These bugs are extremely sensitive to event
    orders
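A minimal sketch of the “scan and clean up” symptom fix, assuming a shared table whose values are supposed to be integers; the auditor thread and its repair policy are illustrative only. It treats the symptom, not the underlying bug.

    import threading
    import time

    table = {"a": 1, "b": 2}          # structure that some Heisenbug occasionally corrupts
    lock = threading.Lock()

    def audit_and_repair():
        """Self-checking code: periodically verify an invariant and repair violations."""
        while True:
            time.sleep(1.0)           # scrub once per second
            with lock:
                for key, value in list(table.items()):
                    if not isinstance(value, int):   # invariant: values are ints
                        print(f"repairing corrupted entry {key!r}")
                        table[key] = 0               # restore a safe default

    threading.Thread(target=audit_and_repair, daemon=True).start()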
Bug-free software is uncommon
   Heavily used software may become extremely
    reliable over its life (the C compiler rarely
    crashes, UNIX is pretty reliable by now)
   Large, complex systems depend upon so many
    components, many of them complex, that bug
    freedom is an unachievable goal
   Instead, adopt the view that bugs will happen
    and plan for them
Bugs in a typical distributed system
   Usual pattern: some component crashes or
    becomes partitioned away
   Other system components that depend on it
    freeze or crash too
   Chains of dependencies gradually cause more
    and more of the overall system to fail or
    freeze
Tools can help
   Everyone should use tools like “Purify”
    (which detect stray pointers, uninitialized
    variables, and memory leaks)
   But these tools don’t help at the level of a
    distributed system
   Benefit of a model, like transactions or virtual
    synchrony, is that the model simplifies
    developer’s task
Leslie Lamport
“A distributed system is one in which the failure
  of a machine you have never heard of can
  cause your own machine to become
  unusable”

   Issue is dependency on critical components
   Notion is that state and “health” of system at
    site A is linked to state and health at site B
Component Architectures Make it Worse
   Modern systems are structured using object-
    oriented component interfaces:
       CORBA, COM (or DCOM), Jini
       XML
   In these systems, we create a web of
    dependencies between components
   Any faulty component could cripple the
    system!
Reminder: Networks versus Distributed Systems
   Network focus is on connectivity but
    components are logically independent:
    program fetches a file and operates on it, but
    server is stateless and forgets the interaction
       Less sophisticated but more robust?
   Distributed systems focus is on joint behavior
    of a set of logically related components. Can
    talk about “the system” as an entity.
       But needs fancier failure handling!
Component Systems?
   Includes CORBA and Web Services
   These are distributed in the sense of our
    definition
       Often, they share state between components
       If a component fails, replacing it with a new
        version may be hard
       Replicating the state of a component: an
        appealing option…
            Deceptively appealing, as we’ll see
Thought question
   Suppose that a distributed system was built by
    interconnecting a set of extremely reliable
    components running on fault-tolerant hardware
   Would such a system be expected to be reliable?
       Perhaps not. The pattern of interaction, the need to match
        rates of data production and consumption, and other
        “distributed” factors all can prevent a system from operating
        correctly!
Example
   The Web’s components are individually reliable
   But the Web can fail by returning inconsistent or
    stale data, can freeze up or claim that a server is not
    responding (even if both browser and server are
    operational), and it can be so slow that we consider it
    faulty even if it is working
   For stateful systems (the Web is stateless) this issue
    extends to joint behavior of sets of programs
Example
   The Ariane rocket is designed in a modular
    fashion
       Guidance system
       Flight telemetry
       Rocket engine control
       … etc.
   When the rocket was upgraded to a new model,
    previously working modules (reused from the
    older design) failed because hidden
    assumptions were invalidated.
Ariane Rocket
[Figure, three slides: a module diagram showing Attitude Control, Guidance, Thrust Control, Telemetry, Altitude, and Accelerometer; the second slide marks an overflow on the Guidance/Altitude path.]
Insights?
   Correctness depends very much on the
    environment
       A component that is correct in setting A may be
        incorrect in setting B
       Components make hidden assumptions
       Perceived reliability is in part a matter of
        experience and comfort with a technology base
        and its limitations!
Detecting failure
   Not always necessary: there are ways to
    overcome failures that don’t explicitly detect
    them
   But situation is much easier with detectable
    faults
   Usual approach: process does something to
    say “I am still alive”
   Absence of proof of liveness taken as
    evidence of a failure
Example: pinging with timeouts
   Programs P and B are the primary,
    backup of a service
   Programs X, Y, Z are clients of the
    service
   All “ping” each other for liveness
   If a process doesn’t respond to a few
    pings, consider it faulty.
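A minimal sketch of this pinging scheme; the interval and the “few pings” threshold are arbitrary choices, and the actual wire protocol and the processes P, B, X, Y, Z are left out.

    import time

    PING_INTERVAL = 1.0       # seconds between pings (assumed)
    MISSED_LIMIT = 3          # missed pings before a peer is suspected

    last_heard = {}           # peer -> time of last ping reply

    def record_reply(peer):
        """Called by the networking code whenever a ping reply arrives from 'peer'."""
        last_heard[peer] = time.monotonic()

    def suspected_faulty(peer):
        """True once 'peer' has missed roughly MISSED_LIMIT consecutive pings."""
        last = last_heard.get(peer)
        if last is None:
            return False      # not pinged yet
        return time.monotonic() - last > MISSED_LIMIT * PING_INTERVAL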
Consistent failure detection
   Impossible in an asynchronous network that
    can lose packets: partitioning can mimic
    failure
       Best option is to track membership
       But few systems have a group membership service (GMS)
   Many real networks suffer from this problem,
    hence consistent detection is impossible “in
    practice” too!
   Can always detect failures if risk of mistakes
    is acceptable
Component failure detection
   An even harder problem!
   Now we need to worry
       About programs that fail
       But also about modules that fail
   Unclear how to do this or even how to
    tell
       Recall that RPC makes component use
        rather transparent…
Vogels: the Failure Investigator
   Argues that we would not consider someone to have
    died because they don’t answer the phone
   Approach is to consult other data sources:
       Operating system where process runs
       Information about status of network routing nodes
       Can augment with application-specific solutions
   Won’t detect program that looks healthy but is
    actually not operating correctly
Further options: “Hot” button
   Usually implemented using shared memory
   Monitored program must periodically update
    a counter in a shared memory region.
    Designed to do this at some frequency, e.g.
    10 times per second.
   Monitoring program polls the counter,
    perhaps 5 times per second. If counter stops
    changing, kills the “faulty” process and
    notifies others.
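A hedged sketch of the “hot button” using Python’s multiprocessing.Value as the shared counter. The update and polling rates follow the slide (about 10 and 5 per second); the simulated hang and the kill policy are invented for the demo.

    import multiprocessing as mp
    import time

    def monitored(counter):
        """Updates the shared counter ~10 times/second, then simulates a hang."""
        start = time.monotonic()
        while True:
            with counter.get_lock():
                counter.value += 1            # "I am still alive"
            time.sleep(0.1)
            if time.monotonic() - start > 2.0:
                time.sleep(3600)              # hang: the counter stops changing

    def monitor(counter, proc):
        """Polls ~5 times/second; kills the process if the counter stops moving."""
        last = counter.value
        while True:
            time.sleep(0.2)
            current = counter.value
            if current == last:
                print("counter frozen: killing the faulty process, notifying others")
                proc.terminate()
                return
            last = current

    if __name__ == "__main__":
        counter = mp.Value("i", 0)
        p = mp.Process(target=monitored, args=(counter,))
        p.start()
        monitor(counter, p)
        p.join()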
Friedman’s approach
   Used in a telecommunications co-processor
    mockup
   Can’t wait for failures to be sensed, so his
    protocol reissues requests as soon as the reply
    seems late (see the sketch below)
   Issue of detecting failure becomes a
    background task; need to do it soon enough
    so that overhead won’t be excessive or
    realtime response impacted
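A rough sketch of this eager-retry idea; the function names and the 50 ms soft deadline are assumptions, not details of Friedman’s protocol. The request is reissued to the backups as soon as the reply looks late, and whichever answer arrives first is used.

    import queue
    import threading
    import time

    def issue(request, replicas, soft_deadline=0.05):
        replies = queue.Queue()

        def ask(replica):
            try:
                replies.put(replica(request))
            except Exception:
                pass                          # failure detection happens elsewhere

        threading.Thread(target=ask, args=(replicas[0],), daemon=True).start()
        try:
            return replies.get(timeout=soft_deadline)   # primary answered in time
        except queue.Empty:
            # Reply seems late: eagerly reissue to the backups without declaring
            # the primary faulty; take the first answer that comes back.
            for backup in replicas[1:]:
                threading.Thread(target=ask, args=(backup,), daemon=True).start()
            return replies.get()

    # Example: a fast backup masks a slow (possibly hung) primary.
    slow = lambda req: (time.sleep(1.0), f"primary:{req}")[1]
    fast = lambda req: f"backup:{req}"
    print(issue("op-17", [slow, fast]))       # prints "backup:op-17" almost immediately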
Broad picture?
   Distributed systems have many components,
    linked by chains of dependencies
   Failures are inevitable, but hardware failures
    are less and less central to availability
   Inconsistency of failure detection will
    introduce inconsistency of behavior and could
    freeze the application
Suggested solution?
   Replace critical components with group of
    components that can each act on behalf of
    the original one
   Develop a technology by which states can be
    kept consistent and processes in the system can
    agree on the status (operational/failed) of
    components (see the sketch below)
   Separate handling of partitioning from
    handling of isolated component failures if
    possible
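As a toy illustration of this suggested structure, the sketch below (plain Python, with invented class names, assuming updates reach every member reliably and in order) replaces one critical module with a small replica group: updates go to all members, and any surviving member can answer reads. Agreeing on membership and ordering is the hard part that real group-communication systems must solve.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.state = {}
            self.alive = True

        def apply(self, key, value):
            self.state[key] = value

    class ReplicatedService:
        def __init__(self, replicas):
            self.replicas = replicas

        def update(self, key, value):
            # stand-in for an ordered multicast to the full group
            for r in self.replicas:
                if r.alive:
                    r.apply(key, value)

        def read(self, key):
            for r in self.replicas:
                if r.alive:               # any operational member can answer
                    return r.state.get(key)
            raise RuntimeError("no operational replica")

    group = ReplicatedService([Replica("r1"), Replica("r2"), Replica("r3")])
    group.update("x", 42)
    group.replicas[0].alive = False       # one member fails
    print(group.read("x"))                # 42, served by a surviving replica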
Suggested Solution
[Figure, two slides: a program calling a single module directly, then the same program multicasting to a replicated set of modules (transparent replication).]
Replication: the key technology
   Replicate critical components for availability
   Replicate critical data: like coherent caching
   Replicate critical system state: control
    information such as “I’ll do X while you do Y”
   In the limit, replication and coordination are
    really the same problem
Basic issues with the approach
   We need to understand client-side
    software architectures better to
    appreciate the practical limitations on
    replacing a server with a group
   Sometimes, this simply isn’t practical
Client-Server issues
   Suppose that a client observes a failure
    during a request
   What should it do?
Client-server issues
[Figure: the client’s request to the server times out.]
Client-server issues
   What should the client do?
       No way to know if request was finished
       We don’t even know if server really
        crashed
       But suppose it genuinely crashed…
Client-server issues
[Figure: the request times out and the client must now interact with the backup.]
Client-server issues
   What should client “say” to backup?
       Please check on the status of my last request?
            But perhaps backup has not yet finished the fault-handling
             protocol
       Reissue request?
            Not all requests are idempotent
   And what about any “cached” server state? Will it
    need to be refreshed?
   Worse still: what if the RPC throws an exception,
    e.g. a “demarshalling error”?
       A risk if failure breaks a stream connection
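One standard answer to the “reissue request?” problem, sketched below with invented names (this is not from the lecture): tag each request with a unique id and have the server, and its backup via replicated state, remember completed ids, so that retrying a non-idempotent request does not execute it twice.

    import uuid

    class Server:
        def __init__(self):
            self.completed = {}          # request id -> cached reply
            self.balance = 100

        def withdraw(self, request_id, amount):
            if request_id in self.completed:     # duplicate: return the cached reply
                return self.completed[request_id]
            self.balance -= amount               # not idempotent by itself
            reply = self.balance
            self.completed[request_id] = reply
            return reply

    server = Server()
    rid = str(uuid.uuid4())
    print(server.withdraw(rid, 30))   # 70
    print(server.withdraw(rid, 30))   # still 70: the retry is absorbed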
Client-server issues
   Client is executing a request that might be
    disrupted by the failure
       Must detect and handle this case
   Client needs to reconnect
       Figure out who will take over
       Wait until it knows about the crash
       Cached data may no longer be valid
       Track down outcome of pending requests
   Meanwhile must synchronize wrt any new
    requests that application issues
Client-server issues
   This argues that we need to make
    server failure “transparent” to client
       But in practice, doing so is hard
       Normally, this requires deterministic
        servers
            But not many servers are deterministic
       Techniques are also very slow…
Client-server issues
   Transparency
       On client side, “nothing happens”
       On server side
            There may be a connection that backup needs
             to take over
            What if server was in the middle of sending a
             request?
            How can backup exactly mimic actions of the
             primary?
Other approaches to consider
   N-version programming: use more than one
    implementation to overcome software bugs
       Explicitly uses some form of group architecture
       We run multiple copies of the component
       Compare their outputs and pick majority
            Could be identical copies, or separate versions
             In the limit, each is coded by a different team!
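A small sketch of the voting layer this describes; the three “versions” below are trivial stand-ins written only to illustrate majority selection.

    from collections import Counter

    def version_a(x): return x * x
    def version_b(x): return x ** 2
    def version_c(x): return x * x + (1 if x == 7 else 0)   # buggy for one input

    def majority(x, versions):
        outputs = [v(x) for v in versions]
        answer, votes = Counter(outputs).most_common(1)[0]
        if votes <= len(versions) // 2:
            raise RuntimeError("no majority among versions")
        return answer

    print(majority(7, [version_a, version_b, version_c]))    # 49: the buggy version is outvoted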
Other approaches to consider
   Even with n-version programming, we get
    limited defense against bugs
       Studies show that Bohrbugs will occur in all
        versions! For Heisenbugs we won’t need multiple
        versions: running one version multiple times
        suffices if the copies see different inputs or a
        different order of inputs
Logging and checkpoints
   Processes make periodic checkpoints, log messages
    sent in between
   Rollback to consistent set of checkpoints after a
    failure. Technique is simple and costs are low.
   But method must be used throughout system and is
    limited to deterministic programs (everything in the
    system must satisfy this assumption)
   Consequence: useful in limited settings.
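A minimal sketch of the checkpoint-plus-log scheme for a deterministic process; the state, the message sequence, and the checkpoint schedule are made up for the example.

    import copy

    class DeterministicProcess:
        def __init__(self):
            self.state = {"count": 0, "total": 0}

        def deliver(self, msg):                  # must be deterministic
            self.state["count"] += 1
            self.state["total"] += msg

    checkpoint, log = None, []

    p = DeterministicProcess()
    for i, msg in enumerate([5, 7, 3, 8, 2]):
        p.deliver(msg)
        log.append(msg)
        if i == 2:                               # periodic checkpoint: snapshot, truncate log
            checkpoint, log = copy.deepcopy(p.state), []

    # --- crash ---  recover a fresh replica from the checkpoint plus the log
    r = DeterministicProcess()
    r.state = copy.deepcopy(checkpoint)
    for msg in log:
        r.deliver(msg)                           # deterministic replay of logged messages
    print(r.state == p.state)                    # True: the state is reconstructed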
Byzantine approach
   Assumes that failures are arbitrary and may be
    malicious
   Uses groups of components that take actions by
    majority consensus only
   Protocols prove to be costly
       3t+1 components needed to overcome t failures
       Takes a long time to agree on each action
   Currently employed mostly in security settings
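A back-of-the-envelope sketch of the sizing rule above: tolerating t arbitrary failures takes n = 3t + 1 replicas, and a client can trust a reply once t + 1 replicas agree on it (at least one of them must be non-faulty). The helper below is only a voting illustration, not a Byzantine agreement protocol.

    from collections import Counter

    def replicas_needed(t):
        return 3 * t + 1

    def trusted_reply(replies, t):
        """replies: list of values returned by the replicas."""
        value, votes = Counter(replies).most_common(1)[0]
        return value if votes >= t + 1 else None

    t = 1
    print(replicas_needed(t))                              # 4
    print(trusted_reply(["ok", "ok", "ok", "bad"], t))     # 'ok' despite one lying replica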
Hard practical problem
   Suppose that a distributed system is built from
    standard components with application-specific code
    added to customize behavior
   How can such a system be made reliable without
    rewriting everything from the ground up?
   Need a plug-and-play reliability solution
   If reliability increases complexity, will reliability
    technology actually make systems less reliable?

								