Web Services in Scientific Applications
Jon Blower, Reading e-Science Centre

Intro
- Web Services provide significant benefits: platform neutrality and wide adoption
- Highly suitable for simple request-response interactions, e.g. getting the current temperature at a given location
- However, "plain" Web Services are, on their own, not always appropriate for:
  - cases where the request-response time must be very short
  - long-running services (e.g. performing a calculation that takes minutes or more to run)
  - services that consume or produce large amounts of data
- Unfortunately, this describes many scientific applications! So what can we do about it?

Problem 1: I need a fast request-response time
- Web Services work by exchanging XML documents, and XML parsing is relatively slow
- Therefore there is not a lot we can do to increase the responsiveness (i.e. decrease the latency) of a Web Service
- Web Services are only really suitable when the XML parsing time can be tolerated
- You would not normally use Web Services to exchange messages within a tightly coupled compute cluster; use MPI (or similar) instead
  - But you might provide a Web Service job-submission interface to the whole cluster

Problem 2: My service will take a long time to run
- If the service is long-running, you can run into problems with connection timeouts, etc.
- It is also probably bad practice to block the calling program until the Web Service completes
- One solution is for the Web Service to return immediately with a "ticket" number (or job ID); the client can call the service again with this ticket to get progress
- Alternatively, the user could supply an email address, and the remote system sends an email when the service finishes
- The future (probably) lies in WS-Notification, an emerging standard for notifying Web Service clients of progress and other state
  - Part of WS-RF (the WS Resource Framework)
  - Potential problem with firewalls!
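The "ticket" pattern above can be sketched in plain Python, outside any particular SOAP toolkit. The names here (submit, poll, _run_job) are illustrative, not part of any real Web Services framework:

```python
import threading
import time
import uuid

# Sketch of the "ticket" (job ID) pattern: submit() returns immediately
# with a ticket; the client then polls with that ticket for progress.
# Names and structure are illustrative only.

_jobs = {}  # ticket -> {"status": ..., "progress": ..., "result": ...}

def _run_job(ticket, data):
    # Stands in for a long-running scientific calculation.
    for i in range(5):
        time.sleep(0.01)
        _jobs[ticket]["progress"] = (i + 1) * 20
    _jobs[ticket]["result"] = sum(data)
    _jobs[ticket]["status"] = "finished"

def submit(data):
    ticket = str(uuid.uuid4())
    _jobs[ticket] = {"status": "running", "progress": 0, "result": None}
    threading.Thread(target=_run_job, args=(ticket, data)).start()
    return ticket  # returned immediately; the caller is never blocked

def poll(ticket):
    job = _jobs[ticket]
    return job["status"], job["progress"], job["result"]

if __name__ == "__main__":
    t = submit([1, 2, 3])
    while poll(t)[0] != "finished":
        time.sleep(0.01)
    print(poll(t))  # ('finished', 100, 6)
```

The same shape underlies both the polling and the email-notification variants: the service holds per-job state, and only a small token crosses the wire on each call.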
Problem 3: My service uses large amounts of data
- Web Services communicate via XML (SOAP) messages
- Input data and parameters are turned into plain-text strings and included in the message
- Arrays become very large in the XML message
  - e.g. an array of 3 integers in binary = 12 bytes; in XML the array is represented as a long string ("<data>1</data><data>2</data><data>3</data>" = 42 bytes)
- XML takes time and RAM to create and parse
- It is sometimes suggested that XML documents should not exceed 4 MB for these reasons
- Conclusion: we shouldn't put large datasets in the SOAP message

Large datasets in workflows
- Furthermore, in a workflow environment, we want large datasets to travel by the most direct route possible
- [Figure: SOAP messages pass via the client (the workflow engine), whereas the desired path of the data is directly between the Extract data, Process data, and Create picture services]

Large datasets: solutions
- SOAP with attachments (rather like attachments to emails)
  - Data don't "bloat" by being translated to XML (but do grow by ~33% in translation to a MIME attachment)
  - But the data still travel with the SOAP message
  - Often used for passing around image files
- Pass pointers to datasets
  - e.g. GADS (a data extraction service) prepares the extracted data, puts it on a separate HTTP server, and returns a URL to the data
  - Have to run another server
  - Also have to manage the cache of extracted data

Large datasets: solutions (2)
- Perhaps a better solution is to stream data directly between services
- Similar to Unix filter commands: extract | process | render
- Would not require data to be cached on, say, an HTTP server
- Services could run concurrently; for example, the first chunk of extracted data could be processed while the second chunk is being extracted
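The Unix-style filter pipeline above can be sketched with Python generators. The stage names (extract, process, render) are hypothetical stand-ins; the real Styx Grid Services stream bytes between remote machines rather than objects within one process:

```python
# Sketch of extract | process | render as a chain of lazy generators.
# Each stage consumes chunks as they arrive, so chunk 1 can be processed
# while chunk 2 is still being extracted -- no intermediate cache needed.
# Stage names are illustrative, not a real Styx Grid Services API.

def extract(n_chunks):
    for i in range(n_chunks):
        yield list(range(i * 3, i * 3 + 3))  # stands in for a data chunk

def process(chunks):
    for chunk in chunks:
        yield [x * 2 for x in chunk]         # stands in for a computation

def render(chunks):
    return [sum(chunk) for chunk in chunks]  # stands in for making a picture

if __name__ == "__main__":
    print(render(process(extract(2))))  # [6, 24]
```

Because each stage pulls data on demand, nothing is materialised in full between stages, which is the property that lets the distributed version avoid caching data on an HTTP server.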
Data streaming between remote services
- No standards-based solution to this, as far as I know
- ReSC has developed a data-streaming framework based on the Styx protocol for distributed systems
  - Allows the workflow engine to pass around pointers to streams
  - Built into the current release of the Taverna workflow system
  - Very easy to wrap existing binary code (direct access to streams)
- We have developed "Styx Grid Services" (SGSs) that can be composed so that data stream directly between them
  - SGSs can be wrapped in a Web Services wrapper
  - Streams can be sent to multiple locations
  - Also gives a method for progress monitoring that won't be defeated by firewalls

Health warning - work in progress!
- [Screenshot: the Triana workflow system]
- This framework will work with any WS-based workflow system, but currently integrates best with Taverna (still work in progress)

Summary
- Web Services provide a platform-neutral layer for accessing remote resources
- But "plain" Web Services don't address many of the problems faced in the scientific world
- WS-RF is probably the horse to back for a widely accepted, standards-based method of addressing some of these issues
  - No widely available implementation yet (except WSRF::Lite)
  - Globus Toolkit 4 will be based on WS-RF (in alpha release), but remember what happened to GT3!
- ReSC has a useful (still immature) solution that solves many such issues
  - Plans: build it into CDAT, provide better tooling and a Python scripting interface
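As a rough illustration of wrapping existing binary code with direct access to streams, two ordinary Unix tools can be chained so that bytes flow straight from one to the other. The commands here stand in for wrapped scientific codes; this is not the actual SGS wrapping mechanism:

```python
import subprocess

# Sketch: chain two existing binaries so that output streams directly
# from one into the next, as in extract | process. Plain Unix tools
# (printf, sort) stand in for wrapped scientific codes.

extract = subprocess.Popen(
    ["printf", "3\n1\n2\n"], stdout=subprocess.PIPE)
process = subprocess.Popen(
    ["sort", "-n"], stdin=extract.stdout, stdout=subprocess.PIPE)
extract.stdout.close()  # let 'sort' see EOF when 'printf' finishes

output = process.communicate()[0]
print(output.decode())  # prints 1, 2, 3 on separate lines
```

The second process starts consuming as soon as the first produces output, which is the same concurrency property the SGS framework provides between remote machines.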