Composing workflows in the environmental sciences using Web


         Web Services in Scientific Applications

                 Jon Blower
           Reading e-Science Centre

   Web Services provide significant benefits of platform neutrality
    and wide adoption
   Highly suitable for simple request-response interactions
        e.g. get the current temperature at a given location
   However, “plain” Web Services are, on their own, not always
    appropriate for:
        cases where the request-response time has to be very short
        long-running services (e.g. performing a calculation that takes
         minutes or more to run)
        services that consume or produce large amounts of data
   Unfortunately this describes many scientific applications!
   So what can we do about this?

     Problem 1: I need a fast request-response time

   Web Services work by exchanging XML documents.
   XML parsing is relatively slow.
   Therefore there is not a lot we can do to increase the
    responsiveness (i.e. decrease the latency) of a Web Service!
   WS only really suitable when the XML parsing time can be tolerated
   You would not normally use Web Services to exchange
    messages in a tightly-coupled compute cluster!
        Use MPI (or similar) instead
   (But you might provide a Web Service job submission interface
    to the whole cluster)

     Problem 2: My service will take a long time to run

   If the service is long-running, can run into problems with
    connection timeouts etc
        Also probably not good practice to block the calling program until
         the WS completes
   One solution is for the WS to return immediately with a “ticket”
    number (or job ID).
        The client can call the WS again with this ticket to get progress
         updates
   Or the user could supply an email address and the remote
    system sends an email when the service finishes.
   The future (probably) lies in WS-Notification, an emerging
    standard for notifying WS clients of progress and other state
    changes
        Part of WS-RF (WS Resource Framework)
        Potential problem with firewalls!
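
The ticket approach above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `JobService`, `slow_sum` and `wait_for` are hypothetical names, and a background thread stands in for the remote service. A real deployment would expose `submit` and `status` as two Web Service operations.

```python
import threading
import time
import uuid


class JobService:
    """Toy stand-in for a long-running Web Service (all names hypothetical)."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, func, *args):
        """Start the work in the background and return a ticket at once."""
        ticket = uuid.uuid4().hex
        with self._lock:
            self._jobs[ticket] = {"state": "running", "result": None}
        worker = threading.Thread(target=self._run, args=(ticket, func, args))
        worker.daemon = True
        worker.start()
        return ticket

    def _run(self, ticket, func, args):
        result = func(*args)
        with self._lock:
            self._jobs[ticket] = {"state": "finished", "result": result}

    def status(self, ticket):
        """Second call: the client presents its ticket to ask for progress."""
        with self._lock:
            return dict(self._jobs[ticket])


def slow_sum(values):
    time.sleep(0.2)  # pretend this is a calculation that takes minutes
    return sum(values)


def wait_for(service, ticket, poll_interval=0.05, timeout=5.0):
    """Client-side polling loop; gives up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = service.status(ticket)
        if info["state"] == "finished":
            return info["result"]
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish in time")


service = JobService()
ticket = service.submit(slow_sum, [1, 2, 3])  # returns immediately
print(wait_for(service, ticket))
```

The caller is never blocked inside a single long request, so connection timeouts are not an issue; the cost is that the client has to poll.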

     Problem 3: My service uses large amounts of data

   Web Services communicate via XML (SOAP) messages
   Input data and parameters are turned into plain-text strings and
    included in the message
   Arrays become very large in the XML message
        e.g. array of 3 integers in binary = 12 bytes
        in XML the array is represented as a long string
         (“<data>1</data><data>2</data><data>3</data>” = 42 bytes)
   XML takes time and RAM to create and parse
   It is sometimes suggested that XML documents should not be
    >4MB for these reasons
   Conclusion: We shouldn’t put large datasets in the SOAP message
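
The 12-byte versus 42-byte comparison above can be checked with a short Python sketch. It assumes 32-bit integers for the binary form and the bare `<data>` elements shown on the slide; a real SOAP message would add an envelope on top of this.

```python
import struct

values = [1, 2, 3]

# Binary: three 32-bit integers, 4 bytes each ("<3i" = little-endian, 3 ints).
binary = struct.pack("<3i", *values)

# XML: each value written as text inside a <data> element, as on the slide.
xml = "".join("<data>%d</data>" % v for v in values).encode("ascii")

print(len(binary))  # 12
print(len(xml))     # 42
```
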
     Large datasets in workflows

    Furthermore, in a workflow environment, we want large datasets
     to travel in the most direct way possible:

    [Figure: two services, "Extract data" and "Create picture", with
     data passing between them, and a client (workflow engine).
     Legend: path of SOAP messages; desired path of data.]
         Large datasets: solutions

   SOAP with attachments:
        (rather like attachments to emails)
        Data don’t “bloat” by being translated to XML (but do increase by
         ~33% in translation to MIME attachment)
        But data are still transported with the SOAP message
        Often used for passing around image files
   Pass pointers to datasets
        e.g. GADS (data extraction service) prepares the extracted data,
         puts it on a separate HTTP server and then returns a URL to the
         client
        Have to run another server
        Also, have to manage the cache of extracted data
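
The ~33% growth quoted above for attachments is what base64 encoding produces, since every 3 input bytes become 4 output characters. The sketch below assumes the attachment is base64-encoded and uses random bytes as a stand-in for a binary dataset.

```python
import base64
import os

payload = os.urandom(3000)           # stand-in for a binary dataset
encoded = base64.b64encode(payload)  # 3 input bytes -> 4 output characters

overhead = len(encoded) / len(payload) - 1
print(len(payload), len(encoded), round(overhead * 100, 1))
```

MIME line wrapping would add a little more on top of this figure.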
    Large datasets: solutions (2)
   Perhaps a better solution is to stream data directly between
    services
        Similar to Unix filter commands:
          ➢   extract | process | render
        Would not require data to be cached on, say, an HTTP server
        Services could run concurrently; for example, the first chunk of
         data extracted could be processed while the second chunk is being
         extracted
   No standards-based solution to this AFAIK
   ReSC have developed a data-streaming framework based on the
    Styx protocol for distributed systems
        Allows workflow engine to pass around pointers to streams
        Built into current release of Taverna workflow system
        Very easy to wrap existing binary code (direct access to streams)
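
The `extract | process | render` idea can be illustrated with Python generators. This is a single-process sketch of chunked flow only; it does not use the Styx protocol or any ReSC code, and the stage functions and the `events` log are hypothetical. Each chunk moves through all three stages before the next chunk is extracted, so no stage ever holds the whole dataset.

```python
events = []  # records the order in which stages touch each chunk


def extract(n_chunks):
    """Produce the dataset one chunk at a time."""
    for i in range(n_chunks):
        events.append("extract %d" % i)
        yield [i, i + 1, i + 2]


def process(chunks):
    """Transform each chunk as soon as it arrives."""
    for i, chunk in enumerate(chunks):
        events.append("process %d" % i)
        yield [2 * x for x in chunk]


def render(chunks):
    """Consume processed chunks; here just sum each one."""
    totals = []
    for i, chunk in enumerate(chunks):
        events.append("render %d" % i)
        totals.append(sum(chunk))
    return totals


totals = render(process(extract(2)))
print(totals)
print(events)
```

The event log shows chunk 0 being processed and rendered before chunk 1 is extracted. With services on separate machines the stages could additionally run at the same time, which generators in one process do not do.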

     Data streaming between remote services

   Have developed “Styx Grid Services” that can be composed in such a
    way that data streams directly between them
   Can wrap SGSs in Web Services wrapper
   Can send streams to multiple locations
   Also gives method for progress monitoring in a way that won’t be
    defeated by firewalls
   Health warning - work in progress!

    [Figure caption: This is the Triana workflow system: this framework
     will work with any WS-based WF system but currently integrates best
     with Taverna (but still work-in-progress)]

   Web Services provide a platform-neutral layer for accessing remote
    resources
   But “plain” Web Services don’t address many of the problems faced in
    the scientific world
   WS-RF is probably the horse to back for a widely-accepted, standards-
    based method of addressing some of these issues
        No actual, widely-available implementation yet (except WSRF::Lite)
        Globus Toolkit 4 will be based on WS-RF (in alpha release)
        But remember what happened to GT3!
   ReSC has a useful solution (still immature) that solves many such
    problems
        plans to build into CDAT and provide better tooling
        Python scripting interface
