Crash-Only Web Services Failure Semantics in an SOA Environment


     Crash-Only Web Services:
    Failure Semantics in an SOA
          Chris Hobbs and Abbie Barbir
                    Presented by

                   Paul Knight
           OASIS Symposium 2007, San Diego

      The crash-only model
                                    Software design
                                    Easier to restart
                                     quickly in a known
                                     state than to clean up
                                     and rebuild to
                                     recover from an error

    George Candea and Armando Fox are
    key proponents of crash-only software
    Two themes of this talk
       Discuss issues of the behaviors of individual
        and composed services and their part in Web
        Services Service Level Agreements (WSLA)
           Based on the behaviors of the individual services
           Need a taxonomy or ontology of service behaviors
           Need an approach to calculating behaviors of
            composed services
       The “crash-only” model of operation as a
        simple failure behavior for a Web Service
           Failure is one of many identified behaviors
Background: Orchestration as a New
Programming Paradigm
       SOA promotes the concept of combining services
        through orchestration - invoking services in a defined
        sequence to implement a business process
       Orchestration compounds the difficulties of testing and
        managing the quality of the deployed services
       Testing composite services in SOA environment is a
        discipline which is still at an early stage of study
       Describing and usefully modeling the individual and
        combined behaviors - needed to offer Service Level
        Agreements (SLA) - is at an even earlier stage
       We hope to stimulate additional research on these topics

    Testing Composed Services
       It’s fairly straightforward to test the operation of
        a device or system if we control all the parts.
       When we start offering orchestrated services
        as a product, the services we are using may be
        outside our control.
       For example consider well-known components:
           Google mapping service
           Amazon S3 storage service
           Mobile operator’s location service

    Testing Composed Services (2)
       With orchestrated services, there is never a
        complete “box” we can test
       With orchestration as the new programming
        paradigm, testing becomes a much bigger challenge
       Failures of orchestrated services are often
        “Heisenbugs” - impervious to conventional
        debugging, generally non-reproducible
       Offering a WSLA based on testing alone,
        without reliable knowledge of component
        service behaviors, may be risky

    Web Services SLA (WSLA)
    [Figure: a Client exchanges packets with the Web Service of Service Provider Z; message flows connect it to Service X (Provider X) and Service Y (Provider Y)]

   Concerned with behaviors of the message flows and
    services spanning the end-to-end business transaction
   Clients can develop testing strategies that stress the
    service to ensure that the service provider has met the
    contracted WSLA commitment
   Composed services make offering a WSLA more risky
    How can WSLAs be derived from
    behaviors of component services?
       Need to develop a model of the
        behavioral attributes of the individual
        component Web Services which
        contribute to the overall behavior of
         an orchestrated or composed Web Service
       Need to model the combination of
        individual service behavioral models
        Web Services behaviors
       Behaviors may be described and quantified for each Web
        Service
       May be combined by a “calculus of behaviors” when
        multiple services are composed
       Behavior parameters may become a part of the service
        description, perhaps in WSDL.
       Behaviors include:
           Availability and Reliability
           Performance
           Management
           Failure
           Security
           Privacy, confidentiality and integrity
           Scalability
           Execution
           Internationalization
           Synchronization
           Etc., …

     Web Services behaviors (2)
        To develop a Service Level
         Agreement (SLA) for a composed
         service (Z), we need to have relevant
         behavior descriptions for the individual
         services (X and Y)
                                  X   Z    Y

        We also need a deep understanding
         of how to combine the descriptions of
         X and Y to calculate results for Z
     Web Services behaviors (3)
        For each behavior, the challenges include the following:
        1. How may service X’s and service Y’s behavior
         be characterized?
        2. How may those characterizations be formalized
         and advertised by X and Y?
        3. How may Z incorporate X’s and Y’s
         characterizations and then advertise the result?

        Z itself might become a component of an even
         larger service and therefore needs to advertise its
         own characteristics. It also needs this
         characterization to offer an SLA to consumers.
     Web Services behaviors (4)
        Each behavior may have its own ontology,
         measures, and calculus of combining those
         measures when services are composed.
         [Figure: ontology X and ontology Y (local ontologies), an abstracted ontology, a “?”, and a Z-specific local ontology for Z]
         Need this analysis for each behavior of services X, Y and Z
     Web Services behaviors (5)
         Ten behavior examples
               Availability and Reliability
               Performance
               Management
               Failure (Crash-only is one mode)
               Security
               Privacy, confidentiality and integrity
               Scalability
               Execution
               Internationalization
               Synchronization
         Let’s focus on a few of these behaviors…

         Source: “Advertising Service Properties,” unpublished paper by C. Hobbs, J. Bell, P. Sanchez
     Availability and Reliability
        “Availability” is the percentage of client
         requests to which the server responds
         within the time it advertised.
        “Reliability” is the percentage of such
         server responses which return the correct answer.
        In some applications availability is more
         important than reliability
            Many protocols used within the Internet, for
             example, are self-correcting and an
             occasional wrong answer is unimportant. The
             failure to give any answer, however, can
             cause a major network upheaval.
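
The two definitions above can be sketched as a small calculation over a log of request outcomes. This is only an illustration: the record fields (`responded_in_time`, `correct`) and the function names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RequestOutcome:
    responded_in_time: bool  # server answered within its advertised time
    correct: bool            # the answer was correct (only meaningful if answered)


def availability(log: Sequence[RequestOutcome]) -> float:
    """Percentage of client requests answered within the advertised time."""
    if not log:
        raise ValueError("empty log")
    answered = sum(1 for r in log if r.responded_in_time)
    return 100.0 * answered / len(log)


def reliability(log: Sequence[RequestOutcome]) -> float:
    """Percentage of in-time responses that returned the correct answer."""
    answered = [r for r in log if r.responded_in_time]
    if not answered:
        raise ValueError("no in-time responses to assess")
    return 100.0 * sum(1 for r in answered if r.correct) / len(answered)
```

Note that the two figures use different denominators: availability is measured over all requests, reliability only over the requests that were answered in time.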
     Availability and Reliability (2)
        In other applications reliability is
         more important than availability
            If the service which calculates a
             person’s annual tax return does not
             respond occasionally it’s not a major
             problem - the user can try again
            If that service does respond but with
             the wrong answer which is submitted to
             the tax authorities, then it could be
     Availability and Reliability (3)
        Services are built with either availability or
         reliability in mind, with clients accepting that
         no service can ever be 100% available or
         100% reliable.
        In combining services X and Y into a
         composite service Z, it is necessary to
         combine the underlying availability and
         reliability models and predict Z’s model.
        To do so without manual intervention, X’s
         and Y’s models must be exposed.
     Availability and Reliability (4)
        Availability and reliability models are
         often expressed as Markov Models or
         Petri Nets, which are easy to combine
         in a hierarchical way.
        Major issues:
            Agreeing upon the semantics of the states
             in the Markov model or places in the Petri Net.
            Finding a way for X and Y to publish the
             models in a standard form.
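
As a deliberately simplified illustration of combining models, suppose each service is described by a two-state (up/down) Markov model with constant failure and repair rates, that X and Y fail independently, and that Z needs both of them. These are assumptions made for this sketch only; the talk does not prescribe a particular model, and the function names are hypothetical.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run probability of the 'up' state in a two-state Markov model.

    failure_rate: transitions per hour from up to down
    repair_rate:  transitions per hour from down to up
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("rates must be non-negative, repair_rate positive")
    return repair_rate / (failure_rate + repair_rate)


def series_availability(*availabilities: float) -> float:
    """Availability of a composite that needs every component to be up,
    assuming the components fail independently of one another."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures: X fails about once per 1000 h and is repaired in 1 h,
# Y fails about once per 500 h and is repaired in 2 h.
a_x = steady_state_availability(failure_rate=1 / 1000, repair_rate=1.0)
a_y = steady_state_availability(failure_rate=1 / 500, repair_rate=0.5)
a_z = series_availability(a_x, a_y)  # never higher than the weaker component
```

Even this toy calculation only works if X and Y publish their rates with agreed meanings for "up" and "down", which is the first of the issues listed above.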
     Availability and Reliability (5)
        Currently, apart from raw percentage figures,
         there is no method for describing these
            Percentage time when the server is unavailable?
            Percentage of requests to which it does not respond?
            Different clients may experience these differently
            A server which is unavailable from 00:00 to 04:00
             every day can be 100% available to a client that
             only tries to access it in the afternoons.
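
The last example can be made concrete with a small sketch that contrasts a time-based figure with what a particular client experiences. The hour-level granularity and the function names are simplifying assumptions for illustration.

```python
from typing import Iterable


def server_is_up(hour: int) -> bool:
    """Server from the example above: unavailable from 00:00 to 04:00 daily."""
    return not (0 <= hour < 4)


def time_based_availability() -> float:
    """Percentage of the hours in a day during which the server is up."""
    up_hours = sum(1 for h in range(24) if server_is_up(h))
    return 100.0 * up_hours / 24


def perceived_availability(request_hours: Iterable[int]) -> float:
    """Percentage of one client's requests that arrive while the server is up."""
    hours = list(request_hours)
    if not hours:
        raise ValueError("client made no requests")
    return 100.0 * sum(1 for h in hours if server_is_up(h)) / len(hours)
```

With these definitions the time-based figure is 20/24, roughly 83.3%, while a client that only calls between 12:00 and 18:00 sees 100% and a client that only calls between 00:00 and 04:00 sees 0%.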

     Availability and Reliability (6)
        If X and Y are distributed, then it is
         possible, following network failures,
         that for some customers, Z can
         access X but not Y and for others Y
         but not X.
        The assessment of Z’s availability
         may be hard to quantify, so it may
         be difficult for Z to offer a meaningful SLA.
     Failure
        The failure models of X and Y may be very different
            X fails cleanly and may, because of its idempotency,
             immediately be called again
            Y has more complex failure modes
            Z will add its own failure modes to those of X and Y
            Predicting the outcome could be very difficult
        The complexity is increased because many
         developers do not understand failure modeling
         and, even were models to be published, their
         combination would be difficult due to their
         stochastic nature.

     Failure (2)
        One approach to describing a service’s failure
             Service publishes the exceptions that it can raise
             and associates the required consumer behavior
             with each
            “Exception D may be thrown when the database is
             locked by another process. Required action is to
             try again after a random backoff period of not less
             than 34ms.”
        “Crash-only” failure model is a simple starting
         point for building a taxonomy of failure
         behavior. This work is just beginning.
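
A consumer honoring the published rule for "Exception D" might look roughly like the following sketch. The exception class, the retry limit, and the upper bound on the backoff are hypothetical; only the 34 ms minimum comes from the example above.

```python
import random
import time


class DatabaseLockedError(Exception):
    """Hypothetical stand-in for 'Exception D': database locked by another process."""
    min_backoff_s = 0.034  # published requirement: wait at least 34 ms


def call_with_published_policy(operation, max_attempts=5, max_backoff_s=0.1,
                               sleep=time.sleep):
    """Invoke operation(), retrying on DatabaseLockedError after a random backoff.

    max_attempts and max_backoff_s are assumptions of this sketch; the
    published rule only fixes the lower bound of the backoff.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DatabaseLockedError as exc:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            # random.uniform(a, b) never returns less than a, so the delay
            # respects the published minimum.
            sleep(random.uniform(exc.min_backoff_s, max_backoff_s))
```

The point of publishing the required behavior alongside the exception is that a composing service Z can apply it mechanically, without knowing why the database was locked.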

     Scalability
        A behavioral description and WSLA for the
         composite service Z must include its scalability
        How many simultaneous service instances can it support?
        What service request rate does it handle? etc.
        These parameters will almost certainly differ
         between the component services X and Y, and will
         need to be published by those services.
        X and Y are presumably not dedicated solely to Z,
         so the actual load being applied to X and Y at any
         given time is unknown to the provider of Z, making
         the scalability of Z even harder to determine.
     Web Services behaviors (again)
        Ten behavior examples
            Availability and Reliability
            Performance
            Management
            Failure (Crash-only is one mode)
            Security
            Privacy, confidentiality and integrity
            Scalability
            Execution
            Internationalization
            Synchronization
        We described a few of these behaviors…
        Can we use them to build WSLAs?
 Web Service Level Agreement (WSLA)
        Based on behaviors and descriptors for
         these behaviors.
        Example: Failure model
            Is transaction half-performed?
            Is it re-wound?
        These behaviors and descriptors are not
         available in the WS description, in WSDL
            No performance info
            Not even price!
 Web Service Level Agreements (2)
        Business acceptance of composed services for
         business-critical operations depends on a
         service provider’s ability to offer WSLA
            Uptime, response time, etc.
            Offering a WSLA depends on ability to compose the
             WSLA-related behaviors of the individual services
            This information needs to be available via WSDL or
             similar source
            Should include test vectors to test the SLA claims
        The ability to determine and offer a WSLA
         commitment is a limiting factor for widespread
         acceptance of services based on orchestration
 Web Service Level Agreements –
        Need a more precise way to express the
         parameters of behaviors
            Availability – What is 99.97% uptime?
                 Several milliseconds outage each minute?
                 Several minutes planned downtime each month?
            Failure model – Crash-only as the simplest, lowest
             layer or level of failure in a future full failure model.
            Eight other SLA-related behaviors listed here – each
             has a complex semantic for description and
        More questions than answers now - many PhDs
         still to be earned in this area!
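
The ambiguity in "99.97% uptime" is easy to see with a little arithmetic. The helper below is a hypothetical illustration, and the 30-day month is an assumption.

```python
def allowed_downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Downtime permitted within one period by a given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1.0 - uptime_percent / 100.0) * period_seconds


per_minute = allowed_downtime_seconds(99.97, 60)                  # about 0.018 s
per_month = allowed_downtime_seconds(99.97, 30 * 24 * 60 * 60)    # about 777.6 s
```

The same percentage therefore permits roughly 18 ms of outage in every minute or a single outage of about 13 minutes each month, and those two patterns affect clients very differently.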
     Back to the crash-only
     software model
                    Can it simplify

                     service composition,
                     testing, development
                     of WSLA, and end
                     world hunger?

     Crash-only software (1)
        Historically, developers have spent a lot of
         effort making software resilient
            Put borders around it so it will not affect other
             things if it fails
            Try to close it down cleanly
            Save state
            Reload the software component
            Restart and replay
        Trying to keep the client from becoming
         aware that a failure occurred
     Crash-only software (2)
        Much work over the last ten years on
         resilient software - which stays up all
         the time, and recovers from problems
            For example, tutorials by Bev Littlewood
        Crash-only software is the exact opposite
            Client accepts that the server may crash
                 Power failure, network down, hardware, etc.
                 Client must be able to recover or restart the
                  process by itself

         Crash-only software (3)
        Crash-only principles
             Forget recovery - more trouble than it’s worth
             When the server senses a problem, it will “crash” as
              cleanly as possible and may perform a “micro-reboot”
              to return to original state
                  Sometimes recover to a well-defined checkpoint
                  Client may initiate the crash
             The server is back working sooner than if it tried to
              recover via logs and journals, etc.
        Principles fit the Web Services paradigm nicely!
             Loose coupling of services
             Little state shared among services
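
The principles above can be sketched as a toy component whose only reaction to a problem is to discard its state and start again. The class and method names are hypothetical, and the "negative value" check merely stands in for whatever fault a real server might sense.

```python
class ServiceCrashed(Exception):
    """Tells the consumer that the component crashed and restarted."""


class CrashOnlyComponent:
    def __init__(self) -> None:
        self.reboots = 0
        self._start()

    def _start(self) -> None:
        # The only way into service: build the known initial state from scratch.
        self._session_state = {}

    def crash(self) -> None:
        # No orderly shutdown and no attempt to repair state: drop it and restart.
        # Public, so a consumer or supervisor can trigger it from outside.
        self.reboots += 1
        self._start()

    def handle(self, key: str, value: int) -> str:
        if value < 0:  # hypothetical stand-in for "the server senses a problem"
            self.crash()
            raise ServiceCrashed("restarted in initial state; consumer must retry")
        self._session_state[key] = value
        return "ok"

    def state_size(self) -> int:
        return len(self._session_state)
```

There is a single path back into service (`_start`), so the restart path is exercised on every startup rather than only after rare failures.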
       Crash-Only Software (4)
    Crash-only semantic has several advantages:
       Simpler macroscopic behavior with fewer externally
        visible states
       Reduces outage time by removing all shutting-down time

       Simplifies failure model by reducing recovery state table
       Crashing can be invoked from outside the software of
        the provider
          Recovery from a failed state is notoriously difficult
           and the crash-only paradigm coerces the system into
           a known state without attempting recovery
          Reduce the complexity of the provider code

       Simplifies testing by reducing the failure combinations
        that have to be verified. Consumer is assumed to be
        able to initiate the crash.

         Crash-Only Web Services

    Candea’s list of properties required for a crash-only
     system can be abstracted to match properties of Web Services:
        Components have externally enforced boundaries. This is
         supported by the virtual machine concept used on many Web
         Service systems
        All interactions between components have a timeout. This is
         implicit in any loosely-coupled Web Services interaction.
        All resources are leased to the service rather than being
         permanently allocated. This is particularly useful in Web Services.
        Requests are entirely self-describing. For crash-only services this
         requires that the request carries information about time-to-live and
         idempotency – will it return the same result if invoked again?
        All important non-volatile state is managed by dedicated state stores.
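
A self-describing request of the kind listed above might carry its time-to-live and idempotency alongside the payload, so that a consumer can decide on its own whether to resend after a crash. The field names and the decision rule are assumptions of this sketch, not part of any Web Services standard.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelfDescribingRequest:
    operation: str
    payload: dict
    expires_at: float  # absolute time (seconds since epoch) when the time-to-live ends
    idempotent: bool   # True if invoking again returns the same result


def may_reinvoke(request: SelfDescribingRequest, now: Optional[float] = None) -> bool:
    """Decide whether the request can simply be sent again after a crash.

    Assumed rule for this sketch: resend only if the request is idempotent
    and its time-to-live has not yet expired.
    """
    current = time.time() if now is None else now
    return request.idempotent and current < request.expires_at
```

Because the decision needs nothing beyond the request itself, neither the consumer nor a restarted server has to reconstruct earlier conversation state.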
Crash-Only Reliable Web Service

     [Figure: Web Service Consumer, Internet, Web Services Endpoint, Stall Proxy, WSM, Recovery, and a Server with Crash-Only Backends, connected by a Reliable SOAP Protocol]
     For systems with hardware redundancy, by using crash only techniques,
      SOAP & WS-RM can be extended in order to produce an always available
      Web Service from the provider’s and consumer’s point of view
     WSLA response time may be at risk if a service is forced to crash
    Testing Web Services in an SOA environment is
     a discipline that is still in its infancy
    There are no standard models to describe or
     combine Web Services behavior information
     across various services and providers
    Web Services SLAs (WSLAs) for composed
     services are problematic
        Testing is only a partial solution
        Behavioral composition needs work, but is promising
    Crash-only Web Services can address some of
     these difficulties
    There are many related areas for further work
     Q &A