Taverna: lessons in creating a workflow environment for the life sciences


Tom OINN1, Matthew ADDIS2, Justin FERRIS2, Kevin GLOVER3, Mark
GREENWOOD4, Darren MARVIN2, Peter LI5, Matthew R. POCOCK5, Martin
SENGER1, Anil WIPAT5 and Chris WROE4

1 EMBL European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
2 IT Innovation Centre, University of Southampton, SO16 7NP, UK
3 School of Computer Science and Information Technology, University of Nottingham, NG8 1BB, UK
4 Department of Computing Science, University of Manchester, M13 9PL, UK
5 School of Computing Science, University of Newcastle, NE1 7RU, UK


*Authors and order needs to be checked
SUMMARY


Life sciences research is based on individuals, often with diverse skills, assembled
into research groups. These diverse research groups combine their specialist expertise
and skills to address scientific problems. Their in silico experiments involve the co-
ordinated use of analysis and information resources that may be globally distributed.
In Grid terms the interest is in sharing these analysis and information resources rather
than sharing computational resources. The Taverna project has developed a toolkit for
the composition and enactment of workflows (in silico experiments) for the life
sciences community. This experience paper describes lessons learnt during
Taverna’s development, in particular the areas highlighted by initial use cases, and
also where translating technological solutions into benefits for scientists has proved
harder than expected. A common theme in these lessons is the importance of
understanding how the workflows fit into the scientists’ experimental context. The
lessons reflect a developing understanding of life scientists’ requirements for a
workflow environment, which is relevant to other areas of data-intensive, exploratory
science.

1 Introduction

The GGF10 workshop [1], and the formation of a workflow management research
group (WFM_RG) illustrate the significant and growing interest in workflows within
the GRID community [2]. Workflow techniques, methods and systems are being
successfully used by e-scientists. There are two related motivations. First, the use of
workflows to explicitly describe e-science experiments so that they can be shared,
modified, and reused. Second, the use of workflow engines so that scientists can “run
experiments” without detailed knowledge of the underlying GRID infrastructure.

The Taverna project has developed a toolkit for the design and execution of
exploratory, data-intensive in silico experiments in the life sciences [3][4]. Figure 1
shows the Taverna workbench, which is the toolkit component most visible to users.
Taverna and the associated Freefluo workflow engine are the workflow components of the
myGrid UK e-science project [5][6], which is developing a range of e-science
middleware components for bioinformatics. As with all myGrid components there is a
focus on the overall support provided to the virtual organisations formed by
collaborating scientists. In short, the key question is not, “Is Taverna a good workflow
workbench?” but “How well does Taverna support life scientists, either on its own or
in collaboration with other components?”

myGrid is a semantic grid project: it involves combining semantic web and grid
technologies. In particular, semantic descriptions can be attached as metadata to all
scientists’ experimental holdings, in order that relevant items can be located based on
semantic as well as syntactic types.

This paper discusses experiences with the development of the Taverna workflow
system, covering design, implementation and the interaction with other components.
Section 2 covers the background: the requirements of a workflow system for in silico
experiments in the life sciences. Section 3 presents related work, comparing Taverna
with other e-science workflow systems. The development experiences are described
in terms of challenges in making Taverna both useful and useable, along with current
and prospective solutions. Section 4 concentrates on issues in workflow design;
section 5 on workflow execution; and section 6 on metadata and provenance issues.
Finally, section 7 discusses the lessons from these experiences, highlighting that in
many cases the issues and solutions involve related technical and non-technical
aspects.


2 Background

Much of biology is based on comparative and speculative reasoning: predicting how
something might behave based on its relationship to "similar" things seen and studied previously.
Discovery involves combining and collating results obtained from a number of
analyses and data resources. These in silico experiments complement experiments
performed in the laboratory by synthesising new information from available data and
generating hypotheses for confirmation in the laboratory. From the early 1990s, the
biological community has enthusiastically adopted web technology to disseminate
data and analysis methods. Bioinformaticians can carry out simple, "low volume" in
silico experiments by cutting and pasting data between web pages. However, the
complexity of potential in silico experiments together with the volumes of data
produced by high throughput technologies is now threatening to overwhelm standard
web technology. Analysis methods are constantly and rapidly evolving. As more
resources become available, more in silico experiments can be done, thus generating
more resources and knowledge for designing further experiments.

The myGrid project is an e-science pilot project funded by the Engineering and
Physical Sciences Research Council in the UK. The team responsible for the
workflow aspect of myGrid’s work has been expanded by significant volunteer effort
both from within the myGrid project and also from other institutes who have an
interest in seeing the technology achieve its potential. As a result of this, the Taverna
project is a “subproject” of myGrid, allowing rapid dissemination of our work and
facilitating easy communication and collaboration with agents outside of our primary
project framework. As a useful simplification the Taverna “subproject” can be divided
into a number of parts. There is the Freefluo workflow enactment engine, developed
by IT Innovation, which provides the enactment capability [7]. Freefluo is a separate
“subproject” as the core Freefluo engine is workflow language independent. Taverna
provides the relevant extensions that specialise Freefluo to its Scufl language (Simple
conceptual unified flow language) [4]. In addition, Taverna provides a workbench
application that enables users to create, edit and view Scufl workflows. This Taverna
workbench includes a built-in Freefluo enactment engine for Scufl, so users can run
workflows from their editing environment. The overall effect is that the distinction
between Taverna and Freefluo is barely visible to users. This is consistent with the
general Taverna philosophy that users should not have to know technical
implementation details unless it is relevant to their work. Both Taverna and Freefluo
are open source projects licensed under the LGPL, implemented in Java, and have
proved to be portable without any major issues. Code and further documentation on
the specifics of the project are available at the respective web sites [3] and [7].

The Taverna workbench, as shown in Figure 1, has some of the characteristics of a
portal. Users can load a workflow, either from the local filestore or from the web, and
execute (enact) the workflow using the built-in Freefluo engine. The workbench
provides facilities for users to supply inputs, observe status information during
execution, and to browse and save results. It also provides a palette of known services
that can be used when creating or amending a workflow. This palette is based on
service descriptions that have been retrieved from the web. Overall the Taverna
workbench provides its users with their view of the web for creating, amending and
executing workflows [8].

The myGrid project and the Taverna (sub-)project have sought to align research
efforts with the requirements of current work in the life sciences domain. The aim is
for evolution not revolution: delivering new technology that can be adopted now and
helps scientists to concentrate on the science, not worry about computational
infrastructure. To align research and current scientific practice there has been close
collaboration with two groups performing research into the genetics of human disease.
These groups are based in the Institute for Human Genetics at Newcastle’s Centre for
Life, and in St Mary’s Hospital at the University of Manchester in the UK. They are
working on the genetic basis for, respectively, Graves’ disease [9][10] and
Williams-Beuren syndrome [11]. These associations have been invaluable both in
providing feedback on the efficacy of Taverna and myGrid, and in demonstrating that
workflow and associated technologies are, while still nascent, already at a level of
stability where they can provide a genuine advantage to the life science domain.

An example Taverna workflow is shown in Figure 2. The Taverna workbench
provides a number of views on a Scufl workflow model. This diagrammatic view
displays a limited amount of information, as the aim is to give the user an overview of
the workflow. (Users become quite adept at recognising workflows from such
diagrammatic views.) This view is generated using the Dot tool from GraphViz to
render the workflow as a PNG image [12]. The details are not important for this paper;
further detailed explanation of two workflows from the Graves’ disease study is
available in [4], and several demonstration examples are available from the Taverna
web site [3]. The general scheme is that a workflow represents the transformation
from a set of inputs to a set of outputs. This transformation is achieved by a network
of processors. The connections between the processors are almost all data links from
an output of one processor to the input of another. A processor waits for its inputs,
invokes its “transformation” service, typically a web service operation, and sends
results on its outputs. Data flow is the basic computational paradigm: the workflow
processors work concurrently unless data availability, or an explicit coordination
constraint, forces one processor to wait for another.
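
The data flow scheme described above can be illustrated with a minimal sketch. It is not Taverna or Freefluo code: the Processor tuple, the links mapping and the enact function are hypothetical names introduced only for illustration, and plain Python functions stand in for web service operations. Explicit coordination constraints are left out.

```python
from collections import namedtuple

# Hypothetical model: a processor has a name, named input ports, and a
# function standing in for its web service operation.
Processor = namedtuple("Processor", ["name", "inputs", "operation"])


def enact(processors, links, workflow_inputs):
    """Run each processor once all of its inputs are available.

    links maps (processor_name, input_port) to the name of the data source:
    either a workflow input or an upstream processor.
    Returns every value produced, keyed by source name.
    """
    available = dict(workflow_inputs)
    pending = list(processors)
    while pending:
        ready = [p for p in pending
                 if all(links[(p.name, port)] in available for port in p.inputs)]
        if not ready:
            raise RuntimeError("workflow is stuck: some inputs can never arrive")
        # Processors in `ready` do not depend on each other, so they could
        # run concurrently; this sketch runs them one after another.
        for p in ready:
            args = {port: available[links[(p.name, port)]] for port in p.inputs}
            available[p.name] = p.operation(**args)
            pending.remove(p)
    return available


processors = [
    Processor("reverse", ["seq"], lambda seq: seq[::-1]),
    Processor("length", ["seq"], lambda seq: len(seq)),
    Processor("report", ["text", "size"], lambda text, size: f"{text}:{size}"),
]
links = {
    ("reverse", "seq"): "sequence",
    ("length", "seq"): "sequence",
    ("report", "text"): "reverse",
    ("report", "size"): "length",
}
results = enact(processors, links, {"sequence": "GATTACA"})
```

Here "reverse" and "length" become ready in the same round because both depend only on the workflow input, while "report" has to wait for both of them.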
2.1 Lifecycle of in silico experimentation
The current typical approach to bioinformatics experimentation involves the scientist
using a number of web resources with form-based interfaces. The coordination of
these resources involves the scientist cutting and pasting between web pages, and
storing results using local filestore. Using a workflow makes it easier for scientists to
describe and run their experiments in a structured, repeatable and verifiable way.
However, it is important to recognise that for the scientist, the execution of a
workflow is just one step within the wider experimental context:

1. discovering the resources that could be used, and discovering previous
   experiments that illustrate the use of those resources

2. forming experiments through the composition of resources, or through the
   adaptation of existing experiments

3. personalising a general experimental approach based on a scientist’s current
   context and particular preferences

4. executing an experiment, including monitoring its progress

5. managing the results of an experiment, recording the provenance and any
   appropriate annotation (also managing other related data: e.g. inputs used,
   workflow descriptions, service descriptions)

6. sharing resources, workflows and results with colleagues, and publishing the
   experiment and results to the wider community

Clearly, there is a cycle here. Scientists can only reuse existing resources that have
been made available within their virtual organisation, and future experiments will
exploit the experience from current and previous ones.


2.2 Workflows in an open world
One area in which myGrid, and in particular Taverna, differ from other projects is the
avoidance of a ‘closed world’ model for the underlying grid and service architecture.
A major aim for both the software and the theory is that they should be able to cope
with the diverse, heterogeneous services, organisations and methodologies found in
the real world. This aim has brought with it challenges that are generally not issues for
systems where everything is under one single authority, yet it allows Taverna,
where successful, to provide a genuinely useful set of tools and best practices to our
target audience, the working scientists.

Life scientists are used to making use of a wide variety of web-based resources. It has
been recognised that composing together resources with interfaces designed for
humans is difficult and error-prone [13]. The emergence of Web Services [14], along
with the availability of suitable tool support, has seen a significant number of
bioinformatics web resources becoming publicly available through SOAP over HTTP,
and described with a WSDL interface. Early examples included XEMBL [15] and
openBQS [16], hosted by the European Bioinformatics Institute (EBI), and the
services provided by XML Central of DDBJ [17]. More recently the range of
available web services is growing: the BioMoby project is gathering an expanding
range of services [18], the EBI have made the EMBOSS suite of 100+ sequence
analysis tools [19] available using Soaplab [20], pathway data are available from the
KEGG API [21], and a range of analysis services are offered by the PathPort project
[22].

The key factor for Taverna is that its potential users want to combine these public
services, and perhaps their own private ones, as easily as possible. Taverna does not have a
monopoly on the use of these services: they will be used by a range of different clients
because of the diverse nature of the life sciences.


2.3 Workflows in e-science and e-business
The move to service oriented architectures, in both the scientific and business
domains, has generated significant interest in workflow as an approach to the flexible
and loose coupling of services to address a specific goal. In the business domain
typical scenarios include the coordination of a number of business services to address
a customer, and the rapid assembly of a business supply chain to address a market
niche. In the scientific domain the generic scenario is a virtual organisation of
scientists assembling services (resources) into experiments investigating a scientific
problem. The nature of these services and experiments can differ significantly
between scientific disciplines. The general concept of improved flexibility through the
ability to rapidly establish appropriate virtual organisations (VOs) is common to
e-science and e-business. This often involves the use of explicit workflows to support
and manage the VO processes. Although there are many similarities, scientific
workflows have several challenges that are not present for business workflows [23].
Scientific data are often large, complex and heterogeneous, and can be expensive to
recreate, so scientific workflows have to cope with such data as both inputs and
outputs that must be archived for future use. In addition,
many scientific workflows are computationally intensive.


3 Related Work
To accomplish many tasks in bioinformatics, different data access and analysis tools
are used over different data elements in a particular cascade of data transformation
steps. Particular difficulties arise from the heterogeneous nature of the data, and the
fact that many of the most respected tools use their own peculiar input and output
formats [13][24].

The potential benefits of integrating bioinformatics resources (tools) to match users’
experiments have led to the development of a range of integration systems. There are
two basic approaches: the standard interface approach and the workflow approach. As
biology is characterised by a huge number of research groups, each with their own
speciality, the integration and comparison of data often leads to these groups being
both the providers and consumers of bioinformatics resources. A useful simplification
is that the standard interface approach focuses on making resource delivery easy for
the providers, while the workflow approach focuses on making resource use easy for
scientists, aiming to contribute to an integrated experimental environment.

In the standard interface approach a range of resources are given a standard interface
and input/output mechanisms. This enables scientists to write programs or scripts that
combine these into experiments. Examples of this standard interface approach are
BioMoby [18] and Soaplab [20]. In both of these the standard interfaces include not
only operational interfaces for the data analyses but also meta-data interfaces that
provide information about the nature and types of the analyses.

In the workflow approach the emphasis is less on the resources and more on an
explicit representation of how they are composed. The workflow is seen as an
experimental entity, the in silico equivalent of a laboratory protocol. Because
workflows capture information about how experiments are performed, they are valuable
in their own right. This is emphasised by the creation of abstract workflows, which describe
an experiment generically, rather than tying it down to specific tools. Examples of
these bioinformatics-based workflow systems include the PLAN programmable
integrator [25], structural genomic workflows [26], BioOpera [27] and Taverna. One
common abstract workflow in biology is the annotation pipeline. In an annotation
pipeline there is an initial input, often a DNA or protein sequence, and then a series of
analyses tell the scientist as much as possible about the initial input sequence, based
on other experiments and literature referring to that input sequence or biologically
similar sequences.

The standard interface and workflow approaches can be combined. This combined
approach gives an effective system over a closed world of resources and has been
used by commercial systems in the bioinformatics area: e.g. Incogen VIBE (Visually
Integrated Bioinformatics Environment) [28], TurboWorx [29], PipelinePilot [30].
The mine-it bioinformatics modelling tool specialising in the advanced analysis of
gene-expression and micro-array data is an example of the combined approach being
exploited by a specific community [31].

Taverna is just one of several scientific workflow systems which share a common
theme of an XML-based workflow representation, of the composition of e-science
resources into experiments, and a workflow engine for enacting (executing) those
experiments. Just as Taverna has been shaped by general workflow ideas and a focus
on the bioinformatics domain and web services, other scientific workflow systems
were initially developed in the context of specific domains. The Kepler
workflow system [32] has been developed for scientists with a range of interests:
ecoinformatics, geoinformatics and bioinformatics. It builds on Ptolemy II, a mature
application from the electrical engineering domain [33]. Like Taverna, Kepler is
dataflow oriented, with the core description being the processing of data through a set
of connected actors. Triana [34] is another scientific workflow system, which was
originally developed as a quick data analysis problem solving environment for a
gravitational wave detection project. It is also dataflow oriented, and is particularly
strong at exploiting the structure available in signal processing data. Triana is aimed
at CPU intensive engineering and scientific applications, primarily allowing scientists
to compose their local applications and distribute the computation on a set of Triana
servers. Taverna, Kepler and Triana are all generic in the sense that their workflow
languages are not domain specific, so they could be applied in other domains, but they
have libraries of services and workflows that represent significant domain knowledge.

Scientific workflow systems can clearly be categorised in terms of their use within a
scientific domain. However, they can also be categorised in terms of the “services”
that are composed into workflows. Taverna is based on web services. A Taverna
processor corresponds to a web service operation, and the workbench can introspect
over a WSDL file to expand its knowledge of available services (see section 4.1).
Taverna is extensible so that it can exploit web services that conform to a known
interface style, such as Soaplab services or BioMoby services. Kepler and Triana were
originally based on composing local applications, though both have recently been
extended to also include web services using a similar method of introspection. Triana
now uses the Grid Application Toolkit (GAT) to insulate its users from the underlying
middleware technology.

There is significant work in workflows that are not service-based, but explicitly deal
with the moving of data and execution of applications on a heterogeneous and
changing set of computational resources. Pegasus is such a system for planning and
scheduling the execution of large numbers of related jobs on a computational grid
[35]. Although Taverna’s development has been more influenced by systems
composing bioinformatics services than workflows on a computational grid, it is
interesting to note significant similarities. One common theme is the importance of
the workflows within the lifecycle of an in silico experiment: understanding the
context, taking decisions, monitoring progress, and the feedback of experience.
Another common theme is the value to the scientists of workflows at different levels
of abstraction.

The similarities between scientific and business workflows have led to considerable
interest in adopting robust, standards-based products that have the support of
commercial vendors. However, the immaturity of the workflow area is currently
shown by rapidly moving technology, wide and varied user requirements and shifting
standards.
In the business domain, Web Services is the dominant service technology and the
market-leading workflow standard is BPEL (Business Process Execution Language)
[36]. This was formed by the synthesis of IBM’s WSFL (Web Services Flow
Language) and Microsoft’s XLANG. It combines the graph execution model of the
former with the process algebraic model of the latter [37]. With its commercial
backing, BPEL is well known, and a number of researchers
have investigated adapting it to a Grid context [refs]. It should be noted that even in
the business domain, there are alternative standards: BPML [38] and WSCI [39] are
two of the more prominent. Van der Aalst [40] provides a comparison of workflow
systems and standards; work which has led to the development of YAWL (Yet
Another Workflow Language) [41].

In the scientific domain there is significant uncertainty over the best approach to
OGSA (Open Grid Services Architecture) [42], which combines Web Services with
the experience from Grid technologies. Proposals include OGSI [43], WS-RF (WS Resource
Framework) [44], and WS-GAF (Web Services Grid Application Framework) [45].

It has been observed that although there are significant similarities, business
and scientific workflows often have different characteristics. In business the emphasis
is on control flow, while in science it is on data flow. The predominance of dataflow
approaches in scientific workflows suggests that this is an
appropriate abstraction for many scientists. This fits with the observation that the
designers of scientific workflows are often the domain scientists themselves, while
business workflows are often programmed by IT specialists on behalf of the business
experts. The ICENI system has a slightly different approach to this issue: contrasting
the higher-level spatial expression of a workflow, with the lower-level temporal
expression [46]. The spatial view provides a high-level view of a set of concurrent
interacting parts, while the temporal view refines this with temporal relationships.


4 Workflow design
Workflow construction should ideally be placed in the hands of the domain expert.
This corresponds to the role of the researcher in, say, designing a suitable laboratory
protocol for their investigation. In general, however, and particularly in the life
sciences, our domain experts are not intimately familiar with the concepts of
service-based architectures, let alone the gory details. This immediately creates a gap
between, on the one hand, a potentially powerful tool and, on the other hand, the tool’s
target audience. Taverna’s workflow design methodology therefore has the role of bridging
this gap.

Creating an experimental protocol, whether e-science or bench science, can be broken
down into various stages. In the first stage, the researcher determines the overall
intention of the experiment. This informs a top-level design: the overall
‘shape’ of the workflow, including its inputs and desired outputs. This design must
then be translated somehow into a concrete plan. In the lab, this translation would
consist of choosing appropriate reagents, temperatures and analysis protocols. In an
e-science workflow, this maps to the choice and configuration of data and analysis
services.


4.1 Service selection
The most immediate problem that users face is therefore one of service selection. In
the lab, the scientist can rely upon experience and past work to choose experimental
parameters and protocols, and may also resort to looking up protocols in laboratory
‘cookbooks’ or consulting with colleagues. The challenge here is to facilitate a similar
process with service selection when composing or modifying workflows.

The first option, that of experience, is obviously only available when the scientist
concerned is familiar with the task of workflow construction. In order to use this prior
knowledge, users require a way in which previous uses of a service can be located and
reused, along with any associated configuration.

A very common task will be to locate a new service based on some conceptual
description of the service semantics. For example, the scientist may require a
component capable of performing a multiple sequence alignment (an operation
whereby similarities across a group of genomic or proteomic sequences may be
identified). Our task is therefore to allow this kind of semantic querying across
all known service components.

Many issues associated with semantic service discovery are significantly
non-technical. There are serious research challenges to do with the construction of
ontologies to adequately capture task information for a domain as diverse as
bioinformatics [47], let alone all science. However, the main issue observed in
Taverna’s initial use is a sociological one. The myGrid project has developed a
prototype that enriches the UDDI registry service [48] with the ability to store semantic
metadata about services, and has experimented with semantic searches over this store
driven by reasoning engine technology. This technology is functional and
theoretically capable of performing the task of locating services based on a more or
less precise description of their function. The issue here is therefore one of adoption
and tooling. While this technology remains in the research stage its benefits to the
scientists are limited. Although myGrid is a major UK project, it cannot force the
world at large to deploy a myGrid registry service and then register their services with
appropriate metadata. In the future this may change: pioneering work in this area,
including myGrid’s, will feed through to drive the next generation of standards such
as UDDI, and the ‘major players’ in the industry will provide appropriate tools to
make semantic service registration easy enough that people will exploit it. However,
at the present time semantic search technology is insufficiently widespread to be used
‘in anger’ by our domain experts.

Some partial solutions
In Taverna we have taken the simplest possible options for service discovery.
Observing that the vast majority of our target services are not registered with any kind
of programmatically accessible registry system, let alone one enabled with semantic
discovery, we have resorted to a system based on simple web pages. Users can create
a page accessible via HTTP containing links to WSDL documents. When pointed at
this page, Taverna will explore all available WSDL documents, extract services and
make them available within the workbench. While crude, this works remarkably well
so long as the gamut of available services is sufficiently small, or the scientist has
already learnt how to filter the list manually. As the standard set of services available
by default in Taverna has reached 200-300, a simple search-by-name facility has been
provided (see Figure 3).
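
The web page approach and the search by name can be sketched in a few lines. This is an illustration rather than Taverna's implementation: the function names are hypothetical, and the rule used to recognise a WSDL link (a path ending in ".wsdl" or a "?wsdl" query string) is an assumption made for the example. The sketch works on page content already fetched, so it makes no network calls.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_wsdl_links(page_html, base_url):
    """Return absolute URLs of links that look like WSDL documents.

    Assumption: a link counts as WSDL if its path ends in '.wsdl' or its
    query string is 'wsdl'.
    """
    collector = _LinkCollector()
    collector.feed(page_html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.path.lower().endswith(".wsdl") or parsed.query.lower() == "wsdl":
            links.append(absolute)
    return links


def search_by_name(service_names, query):
    """Case-insensitive substring search over known service names."""
    needle = query.lower()
    return [name for name in service_names if needle in name.lower()]
```

Each URL returned by find_wsdl_links would then be handed to a WSDL parser to extract the individual operations, a step omitted here.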

In order to facilitate sharing and reuse of useful configured services, future work
includes allowing users to load a workflow definition into the service selection panel. In
this mode all services in the workflow become available; if selected, the service
component will be cloned out of the workflow, leaving metadata and configuration
information intact.

It should be stressed that we are heavily in favour of both registry and semantic search
technologies. The only reason we are not using them is the lack of widespread
deployment, and we look forward to a future where this is not an issue.

4.2 Service composition
Once the appropriate service components have been located, the user requires an
interface allowing them to compose these services into a workflow. Thankfully this is
a simpler and more tractable task than that of locating the components in the first
place. There are a variety of tools available that allow users to work with a graphical
representation of the workflow, a form that has a very good fit with the underlying
technology and should therefore be relatively intuitive even to a non-expert user.

There is a representation issue to address here. Most if not all workflow design
packages have adopted a view analogous to electric circuit layout, with services
represented as ‘chips’ with pins for input and output [27][28][29][30][33][34]. This
electronic circuit arrangement is almost identical to Petri Nets, with chips being
transitions and wires the places between them, so many support Petri Nets as a
standard workflow representation [40]. However, this arrangement can become
intractable above a certain level of complexity, a result known for standard Petri Nets
but addressed in more sophisticated variants. If the layout of service components on
screen is left under the user’s control then the user can tailor the workflow appearance.
However, the disadvantage is that this can result in an increasingly large amount of
time being spent effectively doing graph layout rather than e-science, a clearly
undesirable effect. In Taverna, the graphical view of a workflow is read-only, as it is
generated from the underlying workflow model. One advantage of this is that it is
easy to generate different graphical views of the workflow showing more or less
detail as required.
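
Generating the diagram from the model, rather than storing a user-drawn layout, can be sketched as follows. The to_dot function and its arguments are hypothetical and are not taken from the Taverna code base; the output is ordinary GraphViz DOT text, and the show_ports flag illustrates how one model can yield views with more or less detail.

```python
def to_dot(processors, links, show_ports=False):
    """Build GraphViz DOT source for a workflow model.

    processors: iterable of processor names.
    links: iterable of (source, output_port, sink, input_port) tuples.
    show_ports: when True, label each edge with the ports it connects.
    """
    lines = ["digraph workflow {"]
    for name in processors:
        lines.append(f'  "{name}" [shape=box];')
    for source, out_port, sink, in_port in links:
        label = f' [label="{out_port} -> {in_port}"]' if show_ports else ""
        lines.append(f'  "{source}" -> "{sink}"{label};')
    lines.append("}")
    return "\n".join(lines)


overview = to_dot(["fetch", "align"], [("fetch", "sequence", "align", "input")])
detailed = to_dot(["fetch", "align"], [("fetch", "sequence", "align", "input")],
                  show_ports=True)
```

The resulting text can be saved to a file and rendered with the dot command line tool, for example with the -Tpng option, leaving all layout decisions to GraphViz.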

When composing workflows in an open world, we have no control over the data types
used by the component services. It is entirely likely that a service identified by a
scientist as being suitable does not use the same type as the preceding service in the
workflow, even if the data matches at a conceptual level. The simplest approach is to
only allow exact matches, but this is overly restrictive and makes it very hard to do
anything useful. A serious consideration, therefore, is how a workflow system can
reconcile mismatched types; how much of this can be automatically performed based
on service metadata as opposed to user intervention?

An additional requirement that has become clear in the course of our work with large
workflows is the ability to annotate workflows, service components and other entities
within the workflow (control and data link constructs for example) with arbitrary
notes. This requirement derives from the contextual nature of service functionality.
The same service may perform two different roles in the same workflow from the
point of view of the scientist. This is especially the case with very general services
such as, for example, regular expression based substitution, where the configuration
of the service can alter its semantics.

Service composition solutions in Taverna
The Taverna workbench has two main editing views. The first is a read-only graphical
view, for which we use the dot tool from AT&T [12]. This provides a well laid out
graph with the side effect that the view may be exported in a high quality vector
format for publication. The real work of editing the workflow, however, is performed
from a tree view, where the user can see a quick overview of all the entities in the
workflow. This is not ideal, but has proved usable even with relatively large flows
(over fifty individual components). We have plans in the future to experiment with
more sophisticated representations.

We have addressed the issue of data typing by making a clear decision as to what we
want to know about the types. Internally Taverna uses a carrier data type that is used
to wrap up the real data. This carrier exposes any collection structure and allows the
data to be annotated with both MIME types and full semantic markup using terms
from ontologies [47]. The exposure of the collection structure allows us to do various
type coercion operations, namely implicit wrapping and implicit iteration.

It is frequently the case that services in the bioinformatics domain will process lists of
items. Often, however, we have a single item of the same type, and Taverna will
recognize this disparity and wrap the single item in a list type for processing. The
inverse case is also catered for, with Taverna building implicit iterators. The initial
implementation was for the implicit iterator to iterate the service over the cross
product of all its inputs. In most cases this behaviour corresponds to what users
intuitively expect to happen. (In the case of a single list it corresponds to the higher-
order function map, applying the service to every list element and producing a
corresponding list of results.) However, in some cases, especially where there were
multiple lists from a common source, the required behaviour was for the service to be
applied to the dot product of the input lists. This flexibility is supported through the
use of configurable iterators for each service (see Figure 4) which enable definition of
a tree of cross and dot products to be used in combining the inputs for iteration.
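
The difference between the two iteration strategies, and the effect of implicit wrapping, can be illustrated with a minimal Python sketch. This is not Taverna's implementation: the function names (wrap, cross, dot, iterate) and the toy align service are assumptions introduced purely for illustration.

```python
from itertools import product


def wrap(value):
    """Implicit wrapping: promote a single item to a one-element list."""
    return value if isinstance(value, list) else [value]


def cross(*inputs):
    """Cross product: every combination of elements from the input lists."""
    return [tuple(combo) for combo in product(*(wrap(i) for i in inputs))]


def dot(*inputs):
    """Dot product: combine elements that share the same list position."""
    lists = [wrap(i) for i in inputs]
    if len({len(items) for items in lists}) > 1:
        raise ValueError("dot product needs input lists of equal length")
    return [tuple(row) for row in zip(*lists)]


def iterate(service, combinations):
    """Implicit iteration: apply the service once per input combination."""
    return [service(*args) for args in combinations]


def align(sequence, database):
    """Hypothetical two-input service standing in for a real web service."""
    return f"{sequence}@{database}"


sequences = ["s1", "s2"]
databases = ["dbA", "dbB"]

cross_results = iterate(align, cross(sequences, databases))
dot_results = iterate(align, dot(sequences, databases))
# A single item is wrapped into a list before being combined.
single = iterate(align, cross("s1", databases))

print(cross_results)
print(dot_results)
print(single)
```

With one list input, cross reduces to the map behaviour described above. A configurable iterator of the kind shown in Figure 4 would nest such cross and dot nodes into a tree; that nesting is not shown in this sketch.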

The use of implicit iteration rather than an explicit looping construct illustrates the
declarative approach to workflow definition in Taverna. This contrasts with a
procedural approach with explicit programming-language-style constructs for choice
and loops. Several other scientific workflow systems adopt a similar declarative
approach [32][34].


5 Workflow enactment
Once the workflow has been composed, it must be possible to enact the process it
describes. In contrast to the design process, which must by definition be exposed to
the user, it is desirable to hide as much as possible of the enactment machinery. This
in no way suggests that the workflow enactment should appear as a 'black box'
process; informing the user of the progress of any given workflow enactment is a
critical component, but details such as federation and fail-over between workflow
engines, where possible, should not be exposed.

We see workflow enactment as a distinct service, whether actually implemented as
such or by some software API. A critical requirement of this service approach is that
workflow invocation behaviour should be independent of the workflow enactment
service used. This is absolutely vital if properties such as reproducibility and
verifiability are to be maintained; these are critical for the scientific process,
especially because they facilitate peer review of any novel results. This requirement is
of less importance in a business context where a workflow will typically be carefully
negotiated and agreed by the businesses involved, and executed in a known context.
In contrast a scientific workflow will be shared and evolved by a community and
executed by many individuals using their favoured workflow enactment service.

E-Science is a highly diversified field with respect to the requirements it places on
workflow enactment. At one extreme, particle physics experiments produce vast data
sets and corresponding computational loads, with a corresponding requirement to deal
with the machinery of classical high performance computing and networking (HPC /
HPN). The bioinformatics domain, by contrast, is characterized by massive variety in
terms of data types and the resources to operate on them. Workflow enactment
systems in this domain must address different concerns to their HPC counterparts,
with a greater emphasis on flexible service invocation, collaboration and user
interaction within workflows. While any given workflow solution can in theory
handle any given workflow-oriented problem, there is merit in specializing the
solutions to the anticipated problem domain. There is an interesting difference
between the primary sources of variability addressed by Taverna and by Pegasus [35],
which deals with more computationally intensive problems. In Taverna the
operational services, which are fixed at particular locations, can vary. Users may wish
to modify their workflows because new “better” services have become available, or
previous ones are no longer on-line. In Pegasus, the “operational services”, being
applications owned by the experimental scientists, do not vary. However, they are not
fixed to a specific location and deciding which services should run where, in the
context of a changing pool of computational resources, is a major issue.

Workflows in e-science may also vary hugely in terms of expected invocation
duration. Taverna workflows implemented up to this point vary between two seconds
and two weeks of runtime, with the longest designed (not implemented due to an
absence of services) having an anticipated runtime of almost a year. The potential to
invoke workflows over this kind of time scale imposes a strong requirement on any
such system to keep the user informed as to the progress of their enactment, to allow
the user to interact with the running workflow in terms of inspecting intermediate
results, manually cancelling substructures within the workflow or similar operations,
and to do all this from any physical location with preferably minimal network
connectivity. Ideally these operations would be possible from a wireless enabled PDA:
accessing data and workflow progress from a relatively low cost hand held device
would be an attractive prospect in many life sciences laboratories.


5.1 Fault tolerance and resilience
Any component-based architecture, where the components are not under a single
controlling authority, is going to contain components that will, at some point, fail. It is
therefore the responsibility of the aggregating system, in Taverna's case the workflow
enactment engine, to handle such failures, retaining data integrity and making a
reasonable 'best effort' to proceed with the invocation. Should this be impossible, it is
critical that the reporting functionality is sufficient to inform the user of exactly why
the workflow was unable to complete.

We can classify failure modes of a workflow system broadly into the following
categories: failure of the enactment engine itself, failure of component services and
failure of network fabric. These could be further subdivided but these categories are
sufficient for an initial analysis.


Failure of the enactment engine
By introducing a single point to which workflows are submitted, and from which
status reports and results may be obtained, we introduce a single point of failure. This
may or may not be a problem; if the workflow engine is running on some massively
redundant hardware the chance of it failing might be so low as to be insignificant. The
most common case, however, is that the workflow service will be either running on a
workstation or some other piece of commodity hardware, and may have a significant
chance of failure, particularly during the invocation of long running workflows. This
places a requirement on the system to be able to deal with problems varying from
software failures (the enactor or other) to complete system failure (the computer
running the enactor catches fire).

Possible solutions
By using a peer-to-peer architecture for the enactment engine service it is possible to
replicate state intelligently between enactors within a peer group. This renders the
enactment process almost immune to single point failures at the engine level. In our
case our future implementation of this technology relies on the simple serialization of
the workflow state into XML, a capability we originally built in to allow workflow
checkpointing.
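
As an illustration of checkpointing by serialization, the following sketch writes a toy workflow state to XML and restores it. The state layout and the element and attribute names (workflowState, processor, output, status) are assumptions made for this example and are not the format used by Taverna or Freefluo.

```python
import xml.etree.ElementTree as ET


def checkpoint(state):
    """Serialize a (hypothetical) workflow state dictionary to an XML string."""
    root = ET.Element("workflowState", id=state["id"])
    for name, proc in state["processors"].items():
        element = ET.SubElement(root, "processor", name=name, status=proc["status"])
        if proc["output"] is not None:
            ET.SubElement(element, "output").text = proc["output"]
    return ET.tostring(root, encoding="unicode")


def restore(xml_text):
    """Rebuild the state dictionary from a checkpoint produced by checkpoint()."""
    root = ET.fromstring(xml_text)
    processors = {}
    for element in root.findall("processor"):
        output = element.find("output")
        processors[element.get("name")] = {
            "status": element.get("status"),
            "output": output.text if output is not None else None,
        }
    return {"id": root.get("id"), "processors": processors}


state = {
    "id": "wf-1",
    "processors": {
        "fetch": {"status": "COMPLETE", "output": "ACGT"},
        "blast": {"status": "RUNNING", "output": None},
    },
}

xml_text = checkpoint(state)
restored = restore(xml_text)
print(xml_text)
```

Because the checkpoint is plain text, the same document could in principle be written to disk for recovery or sent to a peer enactor for replication.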

Failure of services
Service failure is more complex than enactor failure. For example, if a service is down
because the machine it runs on has failed, it is probably worth trying again later. If the
service failed because your input data was invalid it certainly isn‟t. While some
transports can signal this distinction, in general the toolkits used to generate the
services do not provide this facility.

In addition to simply retrying the service invocation, it may be possible to locate or
have specified an alternate service to try should the original fail. In an ideal
world this could be inserted automatically; however, the same issues that affect
semantic service discovery apply here, namely the lack of available service metadata.
In addition, initial Taverna users have expressed scepticism at the prospect of
automatic selection of services that they would consider equivalent, except in the
simplest cases. In the life sciences there are many web resources that have very similar functionality,
but people distinguish between them based on knowledge of the experimental context
and their personal preferences. At the present time a simpler practical approach is a
mechanism in the workflow designer application, and corresponding feature in the
language, which allows the scientist to explicitly state that one service may act as an
alternate for another.

Possible Solutions
We believe that current technology is unable to automatically locate alternate services.
In Taverna, users can specify an alternate or alternates explicitly for any given Scufl
processor. Because of the context sensitive nature of some services it may be
impossible in the general case to perform this substitution automatically, and at the
very least would be extremely difficult.

Standard fault tolerance techniques such as retry and exponential back-off of retry
times were implemented in response to user feedback requesting greater resilience in
the context of unreliable services.
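
The combination of retry, exponential back-off and explicitly nominated alternates can be sketched as follows. This is an illustrative sketch only: the function names, the parameters (max_retries, initial_delay, backoff) and the ServiceError exception are assumptions introduced here, not the Scufl or Taverna configuration options.

```python
import time


class ServiceError(Exception):
    """Raised by a (hypothetical) service wrapper when an invocation fails."""


def invoke_with_retry(service, data, max_retries=3, initial_delay=1.0,
                      backoff=2.0, sleep=time.sleep):
    """Call the service, waiting longer after each failure before retrying."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return service(data)
        except ServiceError:
            if attempt == max_retries:
                raise
            sleep(delay)
            delay *= backoff  # exponential back-off of the retry time


def invoke_with_alternates(services, data, **retry_options):
    """Try the primary service, then each user-nominated alternate in order."""
    errors = []
    for service in services:
        try:
            return invoke_with_retry(service, data, **retry_options)
        except ServiceError as error:
            errors.append(error)
    raise ServiceError(f"all {len(services)} services failed: {errors}")


calls = {"count": 0}


def flaky(data):
    """Toy service that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ServiceError("temporarily unavailable")
    return data.upper()


def always_down(data):
    """Toy service that never responds."""
    raise ServiceError("host unreachable")


delays = []  # record the waits instead of actually sleeping
result = invoke_with_alternates([always_down, flaky], "acgt",
                                max_retries=2, sleep=delays.append)
print(result, delays)
```

Note that the sketch retries on every failure. As discussed above, a failure caused by invalid input data is not worth retrying, but the enactor can only make that distinction where the transport signals it.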

Failure of network fabric
This is actually similar to failure of services, but with the additional aspect that it may
be possible to probe the network connectivity to a service host using standard internet
protocols. This is in contrast to a typical web service, which provides no facility to check
whether the service is live or not. One common network fabric failure would be the
case of the enactor running on a laptop which is moved away from its wireless
network – the enactor is not always running on a server machine.
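
One way such a probe might look is a simple TCP connection attempt to the service host, sketched below. The function name host_reachable and its parameters are assumptions made for illustration; a successful connection shows only that the host and port are reachable, not that the service behind them is functioning.

```python
import socket


def host_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

An enactor could call such a function before scheduling a retry, for example to distinguish a laptop that has lost its wireless connection from a service that is itself failing.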

5.2 Reporting
Given the variety of failure modes and potential remedies, reporting on the progress
of a workflow is a complex task. Various metadata about service invocation is simply
unavailable in the general case; for example, there is no standard interface that allows
a service to report how far through a given invocation it has reached, so a progress
display, something users are used to seeing in desktop application software, is either
non-trivial or impossible.

There is significant information available, however, and therefore a presentation issue
of exactly how to show this to the user. When a workflow may contain fifty or more
processing components, and each of these components can be retrying, using alternate
implementations etc., the complete workflow state is highly complex, and yet we
require a visualization of it that allows the user to see at a glance what is happening,
acquire intermediate results where appropriate and control the workflow progress
manually should that be required.

Reporting solutions in Taverna
Related to our provenance collection, the reporting mechanism consists of a stream of
events for each processing entity, with these events corresponding to state transitions
of the service component. For example, we emit a message when the service is first
scheduled, when it has failed for the third time and is waiting to retry etc. These
message streams are collated into an XML document format and the results presented
to the user in tabular form, e.g. Figure 5.
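
A minimal sketch of this pipeline is given below: a stream of state-transition events is collated into an XML document, from which a status table is derived. The event names and the XML vocabulary (workflowReport, processor, event) are hypothetical and chosen only for this example; they are not Taverna's report format.

```python
import xml.etree.ElementTree as ET

# Hypothetical event stream: (processor name, new state), in order of arrival.
events = [
    ("blast", "SCHEDULED"),
    ("blast", "INVOKING"),
    ("blast", "SERVICE_FAILURE"),
    ("blast", "WAITING_TO_RETRY"),
    ("fetch", "SCHEDULED"),
    ("fetch", "COMPLETE"),
]


def collate(events):
    """Group the event stream by processor into a single XML report."""
    root = ET.Element("workflowReport")
    processors = {}
    for name, state in events:
        if name not in processors:
            processors[name] = ET.SubElement(root, "processor", name=name)
        ET.SubElement(processors[name], "event", state=state)
    return root


def status_table(report):
    """One row per processor: name, latest state, number of events seen."""
    rows = []
    for processor in report.findall("processor"):
        states = [event.get("state") for event in processor.findall("event")]
        rows.append((processor.get("name"), states[-1], len(states)))
    return rows


xml_text = ET.tostring(collate(events), encoding="unicode")
table = status_table(ET.fromstring(xml_text))
for name, last_state, count in table:
    print(f"{name:<10}{last_state:<20}{count}")
```

Keeping the full event list per processor, rather than only the latest state, means the same document can serve both the at-a-glance table and a more detailed process log.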

6 Metadata and Provenance
Scientists are, reasonably, interested in the results of any given experiment. This
interest, however, goes beyond examining the results themselves and extends to the
context within which those results exist. Specifically, the scientist will wish to know
from where a particular result was derived, which process was used and what
parameters were applied to that process. One can therefore distinguish between
provenance of the data and provenance of the process, although obviously the two are
linked.

The primary task for data provenance is to allow the exploration of some novel result
and the determination of the derivation path for the result itself in terms of input data
and intermediate results en route to the final value. This requirement motivates
an architecture whereby the 'side effect' information generated during a workflow
invocation may be stored and queried. We define side effect information as anything
that could be recorded by some agent observing the workflow invocation, and
implicitly or explicitly links the inputs and outputs of each operation within the
workflow in some meaningful fashion.

Process provenance is somewhat simpler than data provenance, and is more similar to
traditional event logging. It is complicated somewhat by the requirements for fault
tolerance and the correspondingly larger range of possible events that may occur. It is
critical that in the event of a failure of some kind the user can be informed in such a
way that they can comprehend the failure and take appropriate action where required.

Provenance collection with semantic web and LSIDs
One area of active investigation is the combination of the Life Science ID (LSID)
[Ref] scheme with semantic web technologies such as RDF to describe the
provenance of data resulting from a workflow invocation. The basic scheme is that the
workflow engine communicates with a data store that is also an LSID authority. All
inputs, outputs and intermediate results are stored in the data store and given an LSID.
As LSIDs uniquely identify a data value, this gives a simple method of talking about
the data in RDF statements, and making the data available to other software that can
resolve LSIDs (e.g. IBM's LSID launchpad plugin for Internet Explorer). Figure 6
illustrates this with an example of the combination of Taverna (including Freefluo) with
other components. Side effect knowledge in the form of RDF statements is collected
every time a service is invoked. At the base level, the statements will express that
'result a was produced by process b on inputs c, d, and e'. At the data provenance
level, for each output-input pair there can be a statement, e.g. 'b was created using a'.
Further work involves extending workflows with additional provenance templates that
will be filled with values, creating RDF statements as the workflow runs. For
example, it would be possible to collect additional, more specialized, information
such as 'result b is the predicted structure of input a'.
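
The basic scheme can be sketched in a few lines of Python. Here statements are held as plain (subject, predicate, object) tuples rather than in an RDF library, and the LSID authority (example.org), namespace (data) and predicate names (producedBy, usedInput, createdUsing) are all assumptions made for illustration, not the vocabulary used by myGrid.

```python
from itertools import count


class DataStore:
    """Toy data store that also acts as an LSID authority."""

    def __init__(self, authority="example.org", namespace="data"):
        self.authority = authority
        self.namespace = namespace
        self._ids = count(1)
        self.values = {}

    def put(self, value):
        """Store a value and return a freshly minted LSID for it."""
        lsid = f"urn:lsid:{self.authority}:{self.namespace}:{next(self._ids)}"
        self.values[lsid] = value
        return lsid


def record_invocation(store, statements, process, service, input_ids):
    """Run a service on stored inputs and record provenance statements."""
    inputs = [store.values[lsid] for lsid in input_ids]
    output_id = store.put(service(*inputs))
    # Process level: which process produced the result, and on which inputs.
    statements.append((output_id, "producedBy", process))
    for input_id in input_ids:
        statements.append((process, "usedInput", input_id))
        # Data level: one statement per output-input pair.
        statements.append((output_id, "createdUsing", input_id))
    return output_id


store = DataStore()
statements = []
seq = store.put("ACGT")
rev = record_invocation(store, statements, "reverse", lambda s: s[::-1], [seq])

for statement in statements:
    print(statement)
```

Because every input, output and intermediate result has its own LSID, the output of one invocation can be passed as an input identifier to the next, and the derivation path of a final result can be recovered by following the createdUsing statements backwards.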

The sheer volume of side effect data collected in this fashion will create its own issues;
there are research avenues opening in intelligent summarizing systems that could
convert these data back into natural language, for example.

7 Discussion
The GGF10 Grid workflow workshop [1] identified a number of issues shared by
several Grid workflow initiatives. The experiences of Taverna's development and
initial use strike chords with the reports from others.

It is almost impossible to overestimate the importance of ease of use. Scientists are
interested in their scientific problems, and rightly have little interest in grid
technology for its own sake. They prefer solutions that fit with their working
environment but provide them extra benefits. It is worth noting that many of the initial
requests from Taverna users were for ease of use improvements rather than cutting
edge technology: better support for browsing results, handling graphical data, user
rather than developer-oriented documentation etc.

The semantic grid, which combines the grid and semantic web [49], is a useful base
technology but scientists need the right methods and approaches to exploit it. Grid
technologies are strong in easier and more flexible access to resources, while the
semantic web offers capabilities for the better management of the information and
knowledge gained from a set of experiments. Within myGrid and Taverna semantic
grid techniques are used to provide explicit recognition and support for the e-science
process. The foundation of this is an experimental lifecycle that recognises the
importance of metadata about experiments, including provenance. In developing an
environment for scientists, key questions include:

1. What service metadata is needed for discovering resources?

2. What metadata can be used to guide combining services into experiments?

3. How can user or organisational metadata be used to personalise previous/generic
   experiments to new contexts?

4. What monitoring metadata (provenance) is required for managing ongoing
   experiments?

5. What provenance should be available to explain the origin of results, and the
   usage of resources in the generation of results?

6. How can provenance and other metadata promote more effective sharing with
   colleagues (and what security metadata is needed to manage this)?

Taverna provides some initial solutions, but there is considerable further work in
developing those and experimenting with new ones. In myGrid there has been
additional work that concentrates on service and workflow metadata, i.e. questions 1, 2, 3
and 6 above [50], and on metadata for managing experiments and results, i.e. questions 4
and 5 above [11]. This further work lies both within the context of the life sciences and in
adapting the solutions to other domains.

The architecture or component model of a workflow system is significant, not just
technically but in how it fits with users' conceptual model of their domain. The
openness of Taverna is based on the number and variety of research groups in the life
sciences, and the culture of many groups providing open source tools/services to
access their data and applications. The emphasis on web rather than grid services is
based on the current available services that users want to exploit.

There is initial evidence of a link between a scientific workflow system that is
effective in a scientific domain and the experimental model of that domain. In biology
much reasoning is by comparison, and there is no time component to many of the in
silico experiments. This means that Taverna's lack of explicit support for composing
components that are continuous time simulations is not a significant weakness.
Furthermore, in biology there is no continuous mathematical model for most areas and
techniques such as parameter sweeps are not widespread.

The Williams-Beuren Syndrome workflow [11] is an exemplar exploratory workflow.
It tries to locate and characterise unknown genes, and for this produces a range of
results that need expert interpretation. Without Taverna the “workflow” took about
2-3 days by hand, with the scientist analysing the results as they went along. With
Taverna a more thorough and extensive generation of results takes 3-5 hours, but the
expert analysis of these results still takes 1-2 days. In these circumstances the
application of further HPC techniques to reduce the 3-5 hours down to 15-25 minutes
would have limited overall impact. This style of experiment is common in the life
sciences; it contrasts with computationally expensive analysis of large, structured
input data to produce comparatively smaller and less diverse results. The latter
computational style has been characteristic of workflows (e.g. searching for
gravitational waves or pulsars) that have successfully exploited computational grids
[34][35]. A diversity of experimental styles is inevitable across different scientific
disciplines; the challenge is how to effectively share experience from workflow
development and use in different contexts.

In comparing the nature of workflows in different domains the initial dimensions
proposed are typically: execution time, data size and number of composed services.
This is just part of a larger set of dimensions. Others include: whether the data has any
predictable structure (time series, spatial coordinates, matrix structures), how
expensive the input (or intermediate) data is to recreate, how the outputs will be
analysed, the nature of the services themselves, and the relationships between the
workflow users and the service providers. In the nature of services the factors of
interest include: experimental or well-established, available at one fixed location or
distributed, local to a research group or available to a wide community, conforming to
any formal or informal standards. It is clear that the Taverna experience only covers a
small part of the range of possibilities, and the initial impression is that other systems'
experience is of similarly small, but different, parts.

Notes

semantic web service composition (here rather than in related work because it was not
a big feature of the GGF workshop)
ACKNOWLEDGEMENTS


The authors would like to acknowledge the help of the Taverna users: in particular
Keith Hayward, Claire Jennings, Kate Owen and Simon Pearce from the Institute of
Human Genetics, University of Newcastle upon Tyne, International Centre for Life,
NE1 3BZ, and Hannah Tipney and May Tassabehji from University of Manchester
Academic Unit of Medical Genetics, St Mary's Hospital.

This work is supported by the UK e-Science programme EPSRC GR/R67743. The
authors would like to acknowledge the myGrid team: Nedim Alpdemir, Rich Cawley,
Tracy Craddock, Neil Davis, Alvaro Fernandes, Robert Gaizaukaus, Carole Goble,
Chris Greenhalgh, Yikun Guo, Keith Hayward, Anath Krishna, Phil Lord, Simon
Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Norman Paton,
Milena Radenkovic, Peter Rice, Nick Sharman, Robert Stevens, Victor Tan, and Paul
Watson; and our industrial partners: IBM, Sun Microsystems, GlaxoSmithKline,
AstraZeneca, Merck KgaA, geneticXchange, Epistemics Ltd, and Network Inference.
REFERENCES



[1] GGF10 Workflow workshop, March 2004, http://www.extreme.indiana.edu/groc/ggf10-ww/ [03
    June 2004] (could be replaced with reference to special issue overview article?)

[2] UK National e-Science Centre (NeSC) workshop on “e-Science Workflow Services”, Edinburgh,
    Dec 2003, http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=303 [04 February 2004]

[3] Taverna project. http://taverna.sourceforge.net [03 June 2004]

[4] Oinn T, Addis M, Ferris J, Marvin D, Greenwood M, Carver T, Pocock MR, Wipat A, Li P.
    Taverna: A tool for the composition and enactment of bioinformatics workflows accepted for
    Bioinformatics, 2004

[5] Goble,C.A., Pettifer,S., Stevens,R. and Greenhalgh,C. (2003) Knowledge Integration: In silico
    Experiments in Bioinformatics, in The Grid 2: Blueprint for a New Computing Infrastructure
    Second Edition eds. Ian Foster and Carl Kesselman, November 2003

[6] myGrid - directly supporting the e-scientist http://www.mygrid.org.uk [02 June 2004]

[7] Freefluo workflow enactment engine. http://freefluo.sourceforge.net [03 June 2004]

[8] Oinn,T., Addis,M., Ferris,J., Marvin,D., Greenwood,M., Wipat,A., Li,P. and Carver,T. (2004)
    Delivering Web service coordination capability to users. Accepted as short note and poster for
    WWW2004.

[9] Stevens,R., Glover,K., Greenhalgh,C., Jennings,C., Pearce,S., Li,P., Radenkovic,M. and Wipat,A.
    (2003) Performing in silico experiments on the Grid: a users perspective. Proc UK e-Science All
    Hands Meeting 2003, 43-50.

[10] Addis,M., Ferris,J., Greenwood,M., Li,P., Marvin,D., Oinn,T. and Wipat,A. (2003) Experiences
    with e-Science workflow specification and enactment in bioinformatics. Proc UK e-Science All
    Hands Meeting 2003, pp. 459-466.

[11] Stevens R, Tipney HJ, Wroe C, Oinn T, Senger M, Lord P, Goble C, Brass A, Tassabehji M.
    Exploring Williams-Beuren Syndrome Using myGrid, accepted for proceeding ISMB 2004,
    Glasgow, July 2004.

[12] Gansner ER, North SC, (1999) An open graph visualization system and its applications to software
    engineering. Software Practice Experience, 2000(S1), 1-5.

[13] Stein,L. (2002) Creating a bioinformatics nation. Nature, 417, 119-120

[14] Booth D, Haas H, McCabe F, Newcomer E, Champion M, Ferris C, Orchard D. (2003) Web
    Services Architecture. W3C http://www.w3.org/TR/ws-arch/ [06 January 2004]

[15] Wang,L., Riethoven,J.J. and Robinson,A. (2002) XEMBL: distributing EMBL data in XML
    format. Bioinformatics, 18, 1147-8.

[16] Senger,M. (2002) Bibliographic query service. http://industry.ebi.ac.uk/openBQS/

[17] Miyazaki,S. and Sugawara,H. (2000) Development of DDBJ-XML and its application to a
    database of cDNA. Genome Informatics. Universal Academy Press, Inc (Tokyo), pp. 380-381.

[18] Wilkinson,M.D. and Links,M. (2002) BioMOBY: An open source biological web services
    proposal. Briefings in Bioinformatics. 3, 331-341

[19] Rice,P., Longden,I. and Bleasby,A. (2000) EMBOSS: the European Molecular Biology Open
    Software Suite. Trends Genet., 16, 276-7.

[20] Senger,M., Rice,P. and Oinn,T. (2003) SoapLab - a unified Sesame door to analysis tools. Proc
    UK e-Science All Hands Meeting 2003

[21] Kawashima,S., Katayama,T., Sato,Y. and Kanehisa,M. (2003) KEGG API.
    http://www.genome.ad.jp/kegg/soap/

[22] Eckart,J.D. and Sobral,B.W. (2003) A life scientist's gateway to distributed data management and
    computing: the PathPort/ToolBus framework. OMICS, 7, 79-88.

[23] Greenwood, M., Wroe, C., Stevens, R., Goble, C., Addis, M. "Are bioinformaticians doing e-
    Business?" in "The Web and the GRID: from e-science to e-business", proceedings of Euroweb
    2002, Oxford, UK, 17-18 Dec 2002, Eds. Brian Matthews, Bob Hopgood, Michael Wilson,
    Electronic Workshops in Computer Science, British Computer Society ISBN 1-902505-50-6
    http://www.bcs.org/ewic

[24] R.D. Stevens, C.A. Goble, P. Baker, and A. Brass. A Classification of Tasks in Bioinformatics.
    Bioinformatics, 17(2):180-188, 2001

[25] M. Chagoyen, M. E. Kurul, P. A. De-Alarcón, J. M. Carazo, and A. Gupta, Designing and
    executing scientific workflows with a programmable integrator, accepted for Bioinformatics 2004;
    doi:10.1093/bioinformatics/bth209

[26] Cavalcanti, M., Baião, F., Rössle, S., Bisch, P., Targino, R., Pires, P., Campos, M., Mattoso, M.,
    “Structural Genomic Workflows Supported by Web Services”, In: Proceedings of the 14th
    International Conference on Database and Expert Systems Applications (DEXA 2003),
    International Workshop on Biological Data Management (BIDM'03), Prague, Czech Republic, Sep
    2003, pp. 45-49

[27] BioOpera. Process Support for Bioinformatics.
    http://ikplab12.inf.ethz.ch:8888/bioopera/website/main.html [04 June 2004]

[28] INCOGEN VIBE (Visual Integrated Bioinformatics Environment). http://www.incogen.com/VIBE
    [04 June 2004]

[29] TurboWorx Enterprise – http://www.turboworx.com, [03 June 2004]

[30] Pipeline Pilot - http://www.scitegic.com/products_services/pipeline_pilot.htm, [03 June 2004-06]

[31] Stefan Frank, Josh Moore, and Roland Eils, A question of scale: Bringing an existing bio-science
    workflow engine to the grid, GGF10 workflow workshop March 2004,
    http://www.extreme.indiana.edu/groc/ggf10-ww/bringing_existing_bio-science_workflow_engine_to_grid__stefan_frank/GGF10_frank-moore_questionofscale.pdf
    [03 June 2004]

[32] Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludaescher, Steve Mock,
    Kepler: Towards a Grid-Enabled System for Scientific Workflows, GGF10 workflow workshop
    March 2004,
    http://www.extreme.indiana.edu/groc/ggf10-ww/kepler_towards_grid-enabled_system_for_scientific_workflows__bertram_ludaescher/kepler-GGF10.doc
    [03 June 2004]

[33] The Ptolemy Project, http://ptolemy.eecs.berkeley.edu [11 May 2004]

[34] Shields, M. and Taylor, I. Programming Scientific and Distributed Workflow with Triana Services,
    GGF10 workflow workshop March 2004,
    http://www.extreme.indiana.edu/groc/ggf10-ww/programming_scientific_and_distributed_workflow_with_triana_services/TrianaWorkflow.pdf
    [03 June 2004]

[35] Gil Y, Deelman E, Blythe J, Kesselman C, Tangmunarunkit H. Artificial Intelligence and Grids:
    Workflow Planning and Beyond, IEEE Intelligent Systems special issue on e-science, 19:1 (2004)
    26-33

[36] Andrews, T., Curbera, F., Dholakia, H., Goland, Y., Klein, J., Leymann, F., Lui, K., Roller, D.,
    Smith, D., Thatte, S., Trickovic, I., Weerawarana, S. Business Process Execution Language for
    Web Services. http://www-106.ibm.com/developerworks/library/ws-bpel/

[37] Curbera C, Khalaf R. Implementing BPEL4WS: The Architecture of a BPEL4WS Implementation,
    GGF10 workflow workshop March 2004,
    http://www.extreme.indiana.edu/groc/ggf10-ww/implementing_bpel4ws__rania_khalaf/Implementing%20BPEL4WS.pdf
    [03 June 2004]

[38] Arkin, A. Business Process Modeling Language. http://www.bpmi.org/bpml.esp

[39] Arkin, A., Askary, S., Fordin, S., Jekeli, W., Kawaguchi, K., Orchard, D., Pogliani, S., Riemer, K.,
    Struble, S., Takacsi-Nagy, P., Trickovic, I., Zimek, S. Web Service Choreography Interface.
    http://wwws.sun.com/software/xml/developers/wsci/

[40] van der Aalst,W. (2003) Don't Go with the Flow: Web Services Composition Standards Exposed.
    IEEE Intelligent Systems. Jan/Feb 2003, 72-76

[41] W.M.P. van der Aalst and A.H.M. ter Hofstede. YAWL: Yet Another Workflow Language (Revised version).
    QUT Technical report, FIT-TR-2003-04, Queensland University of Technology, Brisbane, 2003, accepted for
    publication in Information Systems, available at http://www.citi.qut.edu.au/yawl/yawldocs.jsp, [08 June 2004]

[42] Foster, I. et al. (2002). The physiology of the Grid: An open Grid services architecture for
    distributed systems integration. Technical report of the Global Grid Forum.

[43] OGSI working group. Final OGSI Specification V1.0, February 2003, available at
    https://forge.gridforum.org/docman2/ViewProperties.php?group_id=43&category_id=392&document_content_id=347
    [08 June 2004]

[44] The WS-Resource Framework, http://www.globus.org/wsrf/ [08 June 2004]

[45] Savas Parastatidis, Jim Webber, Paul Watson, Thomas Rischbeck. WS-GAF: A Framework for
    Building Grid Applications using Web Services, submitted to Concurrency and Computation:
    Practice and Experience, March 2004.

[46] Anthony Mayer, Steve McGough, Nathalie Furmento, William Lee, Murtaza Gulamali, Steven
    Newhouse and John Darlington. Workflow Expression: Comparison of Spatial and Temporal
    Approaches, GGF10 workflow workshop March 2004,
    http://www.extreme.indiana.edu/groc/ggf10-ww/workflow_expression_comparison_of_spatial_and_temporal_approaches__anthony_mayer/LondonWflowBerlin.pdf [03 June 2004]

[47] Wroe, C., Stevens, R., Goble, C., Roberts, A. and Greenwood, M. (2003). A Suite of DAML+OIL
    Ontologies to Describe Bioinformatics Web Services and Data. International Journal of
    Cooperative Information Systems 12: 597-624.

[48] Lord, P., Wroe, C., Stevens, R., Goble, C., Miles, S., Moreau, L., Decker, K., Payne, T. and Papay, J.
    (2003) Semantic and Personalised Service Discovery. In W. K. Cheung and Y. Ye, editors,
    Proceedings of Workshop on Knowledge Grid and Grid Intelligence (KGGI'03), in conjunction
    with 2003 IEEE/WIC International Conference on Web Intelligence/Intelligent Agent Technology,
    pp. 100-107, Halifax, Canada, October 2003.

[49] De Roure D, Hendler JA. E-Science: The Grid and the Semantic Web, IEEE Intelligent Systems
    special issue on e-science, 19:1 (2004) 65-71

[50] Wroe C, Goble C, Greenwood M, Lord P, Miles S, Papay J, Payne T, Moreau L. Automating
    Experiments Using Semantic Data on a Bioinformatics Grid, IEEE Intelligent Systems special
    issue on e-science, 19:1 (2004) 48-55




FIGURES
Figure 1 The Taverna Workbench.
Figure 2 Example Scufl workflow. This workflow uses a number of different Scufl processor types, e.g.
WSDL Web service operations (green boxes) and Soaplab services (yellow boxes). The overall workflow
input is displayed at the top and the outputs along the bottom of the diagram.
Figure 3 Example palette of available services, illustrating the search facility.
Figure 4 Configurable iteration strategy.
Figure 5 Status information.
Figure 6 Architecture overview of Taverna and Freefluo working in conjunction with the Ouzo/MIR
storage server, which provides an LSID authority and supports the creation and storage of provenance
information.
Figure 7 Example workflow model explorer view

				