ICR Workflow Working Group, 2008

Document Change History

Version  Date         Contributor           Description
1.0                   WFWG 2008 Team        Initial Draft
1.1      November 24  Scott Oster, Brian    Comments / Suggestions
1.2      December 4   WFWG 2008 Team        Modifications based on comments from
1. Contents
2. Introduction
3. Summary
4. Lessons Learned and Recommendations
5. Workflow Implementation
6. Group Members
This document is the result of direct activities within the Integrative Cancer Research
Workflow Working Group (ICR WFWG) and serves as artifact 2 as defined in the ICR
Workflow Working Group 2008 charter. For reference, this charter can be found at
After conducting a thorough analysis of existing services (both data and analytical)
within caBIG [1], the group considered workflows within the proteomics domain. There
was some discussion of creating a workflow in the microarray domain, but to avoid
duplicating previous efforts [2], this was ruled out by the group.
The WFWG grouped the main issues encountered when trying to create a proteomics
workflow [3] into five categories. A combination of these issues forced the group to
revise the originally proposed workflow [4].
These categories are:
1. Lack of Analytical Services.
2. Service Stability.
3. Services Using Older Versions of caGrid.
4. Need for Shim/Translation Services.
5. Statistics for Data-Type Reuse and User Experiences (when using services).
4. Lessons Learned and Recommendations
Lack of Analytical Services
The Workflow Working Group chose a proteomics workflow for implementation, since
the group, in its previous efforts, had successfully implemented a genomic
microarray-based workflow. The proteomic workflow was designed to analyze mass
spec profiles of samples and produce a list of protein IDs with extensive genomic and
proteomic annotation. Therefore, the workflow required both analytical and data
services, including a proteomic data repository for experimental data, an analytical tool
to analyze mass spec profiles of different samples, and an ID conversion tool to
extensively annotate the protein IDs with genomic and protein information. The tool
cataloging project [1] was the starting point for identifying grid-enabled services with
the functionality described above.
However, one of the obstacles in implementing the workflow was the lack of grid-
enabled, actively maintained services, particularly analytical services. Within the
proteomic domain, there was a lack of robust tools similar to GenePattern or
GeWorkbench in the microarray workspace. RProteomics was the only analytical tool
available; yet, it is no longer under development and is not caGrid 1.x enabled. In
contrast, data services such as caBIO and Grid PIR are actively
supported and provide comprehensive functionality. A number of additional proteomic
use cases were considered, none of which could be implemented due to a lack of
analytical services.

[1] caBIG Tool Cataloging Project,
[2] 2007 Workflow Services Used, https://gforge.nci.nih.gov/docman/view.php/332/10843/workflow services.doc
[3] Proposed Proteomics Workflow,
[4] Actual Proteomics Workflow,
Recommendation: A gap analysis should be performed to document the domains and
tools that should be considered for future caBIG development. These domains and
tools should be identified and driven by the ICRi working group.
Subsequent to implementation of this workflow, there have been discussions between
caBIG and myGrid (UK) to port selected myGrid services to caGrid. Since Taverna
workflows can be created using services from myGrid and caBIG, the pros and cons of
porting myGrid services to caBIG should be further explored. For example, some of the
issues that should be considered are:
- Is it redundant to port already Taverna-compatible myGrid services to caGrid?
- Can existing myGrid services be used without porting them to caGrid?
- Are there standards within myGrid, and how do they fit into the caBIG model?
- Would stringing together myGrid and caGrid services necessitate creating an
  excessive number of shim services? What are the technical difficulties in crossing grids?
Currently, there appear to be a large number of services in myGrid and numerous
workflows have been implemented using Taverna; therefore, a study of myGrid and
caBIG compatibility may identify specific gaps in caBIG analytical services that could be
filled by leveraging services in myGrid.
Service Stability

One issue that complicates the development of workflows is the instability of caGrid
services. Services are often down or crash. This requires the workflow developer to
contact the service maintainer to restart the service, and significantly slows workflow
development. This has affected the workflow team during development in 2007 and
2008, most recently with the CPAS service crashing when it is queried.
We don't know all of the causes of these crashes, but the lack of visibility of service
problems contributes to the instability of services. Service maintainers don't know when
their services crash or throw exceptions unless they manually log into their server and
check the processes and log files.
One cause is well known: large datasets sent as input will crash almost all caGrid
services. The XML parsing performed by the caGrid architecture uses the DOM
approach. This requires the entire XML object structure to be in memory at one time
and thus has a very large memory footprint. Consequently, sending large datasets
results in the server running out of memory. It doesn't take a very large input dataset
to cause this to happen; for example, the GISTIC caGrid service can be crashed using
an input file with 100,000 data points (this is less than a 200 KB binary input file).
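The memory behaviour described above can be illustrated with a short sketch. caGrid itself is Java-based, so this Python fragment is only an analogy: the `<point>` element name and dataset shape are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

def count_points_dom(xml_bytes):
    # DOM-style parsing: the whole document is materialized in memory at
    # once, so peak memory grows with the size of the input dataset.
    root = ET.fromstring(xml_bytes)
    return len(root.findall("point"))

def count_points_streaming(xml_bytes):
    # Streaming (SAX/StAX-style) parsing: elements are handled and then
    # discarded one at a time, keeping the memory footprint roughly flat.
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "point":
            count += 1
        elem.clear()  # release the subtree as soon as it has been consumed
    return count

# A dataset on the scale that crashed the GISTIC service (~100,000 points).
data = b"<dataset>" + b"<point>1.0</point>" * 100_000 + b"</dataset>"
```

Both functions see the same data, but only the DOM-style version must hold the full document tree in memory at once.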
caGrid Transfer has been developed to help with large datasets, but this adds other
issues. caGrid Transfer does not specify how the data should be formatted for transfer
between the client and server. If you simply serialize the caGrid objects to XML, then
you have not solved the problem – when your server reads the XML objects out of the
transferred file, it will deserialize the XML objects using DOM and run out of memory.
This was demonstrated by the team at Washington University School of Medicine:
If you pass the objects in a proprietary, non-XML format, then your server does not run
out of memory, but all clients must use your utilities to serialize and deserialize the data.
In addition, this architecture challenges a core concept of caGrid, since the data being
transferred is in a format that is not defined in the caDSR.
Using caGrid Transfer is non-trivial. In particular, it is not readily apparent how to return
results from the server to the client when using caGrid Transfer. The ASBP team is
working to document how to do this.
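One mitigation is to frame the transferred data as many small records rather than one monolithic XML document, so the receiver can deserialize incrementally. The sketch below is an assumption about how such framing could look, not the actual caGrid Transfer API; the `<point>` element is again invented.

```python
import io

def write_records(stream, values):
    # One record per line: the receiver can deserialize incrementally
    # instead of parsing one giant XML document with DOM.
    for value in values:
        stream.write("<point>%s</point>\n" % value)

def read_records(stream):
    # Memory use is bounded by the largest single record, not by the
    # size of the whole transferred file.
    for line in stream:
        yield line.strip().removeprefix("<point>").removesuffix("</point>")
```

Because each record is still well-formed XML, such a format could in principle remain caDSR-describable while avoiding the whole-document DOM parse.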
Recommendation: Several actions could be taken to address this issue, including
automatic restart, improved monitoring of services, and better handling of large
datasets.

Additional tooling should be developed to automatically restart services when they
crash. This would increase the uptime for the service, and reduce the amount of
communication required between a workflow developer and the service maintainer. An
example of a successful implementation of this architecture is the Orbix Daemon that
Iona Technologies developed for its CORBA Orb – when an Orbix service crashes, the
Orbix Daemon will restart it.
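An Orbix-Daemon-style restart loop can be approximated in a few lines. The restart limit, fixed backoff, and injectable `spawn` hook below are all assumptions added for illustration, not details from the report:

```python
import subprocess
import time

def supervise(cmd, max_restarts=5, backoff_seconds=1.0, spawn=subprocess.Popen):
    """Minimal watchdog: run `cmd` and restart it each time it exits,
    up to `max_restarts` times. `spawn` is injectable so the loop can
    be exercised without launching real processes."""
    restarts = 0
    while True:
        proc = spawn(cmd)
        proc.wait()  # blocks until the service process dies
        if restarts >= max_restarts:
            break
        restarts += 1
        time.sleep(backoff_seconds)  # simple fixed backoff before restarting
    return restarts
```

A production version would also log each restart so that the monitoring dashboard described below has something to display.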
A central monitoring service could be developed to provide a dashboard of service
health. Requirements for this monitoring service include:
- display of service health (up or down),
- provide subscription to receive notification when a service goes down,
- provide a log of exceptions thrown by the service [5], and
- provide a customizable dashboard, so that a service developer would have quick
  access to the health of his services of interest and not have to navigate through
  large lists of services.
This dashboard would help provide visibility to service problems to both the service
maintainer and to the caGrid community at large. Problematic services could be easily
identified and prioritized for additional development work.
[5] Clearly the central log could not capture exceptions that result when the caGrid
service cannot contact the logging service.
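The polling core of such a dashboard could look like the sketch below. Here `probe` stands in for whatever health-check call a real monitor would issue against a caGrid service; the report does not specify one, so it is left as a parameter:

```python
def check_health(endpoints, probe):
    """Classify each named endpoint as 'up' or 'down'.

    `endpoints` is a list of (name, target) pairs and `probe` is a
    callable returning True when the service responds; both are
    placeholders for a real caGrid metadata call."""
    status = {}
    for name, target in endpoints:
        try:
            status[name] = "up" if probe(target) else "down"
        except Exception:
            status[name] = "down"  # a thrown exception counts as an outage
    return status
```

Subscriptions and the exception log would sit on top of this loop, notifying maintainers whenever a service transitions from "up" to "down".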
From a "business" perspective, service stability is key to the adoption of caGrid
technology and the development of robust workflows. Institutions will be reluctant to
invest in the development of workflows that rely on a service layer that cannot be
guaranteed. To help mitigate this issue, we recommend that caBIG should actively
encourage the replication and maintenance of a production service at a minimum of
three independent institutions. This will provide a level of redundancy that the workflow
tools should be able to exploit to help create a more resilient production environment.
Finally, additional tool development should take place to improve the handling of large
datasets, enable the possibility of load balancing between replicated services and
permit the automatic fail over of submitted jobs between replicated services.
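The failover recommendation amounts to a retry loop over replicated endpoints. A minimal sketch, with `replicas` and `invoke` as placeholders for the real caGrid client machinery:

```python
def call_with_failover(replicas, invoke):
    """Try each replicated service endpoint in turn until one succeeds.

    `replicas` is a list of endpoint identifiers and `invoke` performs
    the actual service call; both are illustrative stand-ins."""
    last_error = None
    for endpoint in replicas:
        try:
            return invoke(endpoint)
        except Exception as err:
            last_error = err  # remember the failure, then try the next replica
    raise RuntimeError("all replicas failed") from last_error
```

With at least three independent deployments, as recommended above, a workflow engine using this pattern survives any single-site outage.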
Services Using Older Versions of caGrid
Some services used caGrid 0.5 (which was a prototype release) and were not
compatible with caGrid 1.0 and newer versions. An example is RProteomics, which was
developed by Duke University. This was funded by caBIG and was an early adopter of
caGrid 0.5. This service met the Silver level of caBIG compatibility when it was
developed; however, there has been no continuing support since 2007 to upgrade the
service to the current version of caGrid. For these services, manual upgrade is
necessary if the service is still of value.
caGrid v1.0 and on is designed to be fully backward compatible. Services can be
automatically upgraded to the latest version via Introduce. However, we observe that it
may take non-trivial effort to upgrade the services. Upgrading services from caGrid 1.1 to
1.2 was problematic on Mac machines. More specifically, bugs were encountered in
Introduce that prevented the upgraded services from advertising their semantic
metadata. After unsuccessfully working with developers at OSU to seek a possible
workaround, we had to regenerate the services and copy and paste code into them.
This took approximately 2 weeks of work. This problem may have been specific to Intel
based Mac users.
Recommendation: There are several ways to address this issue. Software bugs in the
current version of Introduce need to be fixed and tested. In addition, when caBIG core
functionality undergoes a major revision, all existing tools and applications will need to
be re-evaluated within the context of this release. Compatibility levels need to be
qualified with respect to a specific version of caGRID/CORE. For example, the
RProteomics service is classified as “Silver Level” compatible. This status was obtained
using caGRID 0.5 and no longer appears valid. One open issue we identified is who will
be responsible for maintaining caGrid services, as there seems to be no clear solution
when no explicit funding is available. Another suggestion is to explore the possibility of
using usage statistics to track which services need to be upgraded.
Need for Shim/Translation Services
As new consumers come to the grid, some of the initial questions they may have are:
1. "What services exist?"
2. "Do the services that exist solve the particular problem I'm trying to address?"
3. "How can I string these services together to create my workflow?"
If we assume both 1 and 2 are satisfied, the expectation is that 3 would be possible as
long as the data types used between services in the workflow are semantically and
syntactically interoperable. The issue we have seen is that some data types, while
semantically similar, are not the same syntactically and therefore cannot be used
between services without the use of a translation, or "shim", service.
As a concrete example, when extracting bioassay data from the caArray 1.6 grid
service and passing it to a preprocessing and clustering analytical service, the data
type package names were similar but not identical between the two services, forcing
the need for data type translation using an extra shim service.
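Mechanically, a shim of this kind is a field-by-field (or element-by-element) renaming between two syntactically different representations of the same semantics. All field names in this sketch are invented for illustration; the real caArray and clustering-service types differ mainly in package and element naming:

```python
def shim_bioassay(record):
    """Translate a record from one hypothetical bioassay data type to
    another that is semantically equivalent but named differently."""
    field_map = {
        "bioAssayId": "assay_id",          # source name -> target name
        "quantitationValue": "value",
    }
    # Rename mapped fields; pass any remaining fields through unchanged.
    return {field_map.get(key, key): val for key, val in record.items()}
```

A Generic Shim Service, as proposed below, would essentially maintain such mappings centrally rather than requiring each workflow author to hand-write them.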
In general, the expectation is (and should be) that these syntactic differences are
caught during model registration and, if the semantics of two data types are similar, a
common data type is created to address the needs of both. One issue is that no tooling
exists to find reused models and XML schemas. People work from their own
perspective. Getting to reuse and interoperability is part of Gold review. In the end, the
fact that it is easier to invent than to reuse has to be addressed.
The concept of a “Domain Analysis Model” has been introduced, which can be viewed
as an all-encompassing model of existing grid services in a particular domain where
overlapping/similar data types are unified, reducing the need for these shim services.
While the Domain Analysis Model will be useful for new services, shim services will still
be needed for existing services on the grid. More specifically, forcing service producers
to upgrade their services to conform to the Domain Analysis Model may prove difficult
and the need for shim services will still exist.
Recommendation: To avoid multiple different shim services used on the grid, we
propose a Generic Shim Service to translate between data types that differ syntactically.
Like all other grid services, this service will be compatibility reviewed. Consumers
wishing to develop their own shim services can still do so but the core infrastructure of
caGrid should offer a generic shim service to handle some data type translation (the
specifics of which types will need to be revisited).
Guidance is required to strike a balance between building shims for a few services and
building shims between all services. The reasons for shims should be determined, and
the few scenarios where they are worthwhile identified.
Grid Service Analytics
Use of common data elements (CDE) to support interoperability and workflow
Central to the function of caBIG is the use of common data elements (CDE) to describe
the information it consumes and produces. The structure of a CDE is outlined in
Figure 1. Since 2005, caBIG has attempted to define a standard set of CDEs. These
represent a range of concepts and can be accessed from the data standards repository
browser.

Figure 1: Structure of a Common Data Element
The following table describes the areas where caBIG has attempted to define a set of
standard CDE definitions:

Body Mass Index        Genomic Identifiers           Rep of Date
Body Surface Area      Language                      Rep of Time
CDC Ethnicity Code     Mailing Address               Sex and Gender
CDC Race Code          Marital Status Category       Social Security Number
Education Level        Organization                  Spec of Ethnicity
Email Address          Person Age                    Spec of Race
Family Relationship    Person Name                   Telephone Number
Func Perform Status    Person Religion Designation
Theoretically, these standards can greatly facilitate the composition of scientific
workflows, which involve service discovery and service inter-connectivity. During the
process of trying to define example workflows a number of issues were encountered
that pertain to the creation and discovery of common data elements as they relate to
services that might be included within a workflow.
Currently, it appears that very few new services are actually using the defined
standards. Shironoshita et al. (2008) [6] reported that only 6% of the available CDEs
are compatible with linking multiple services, and half of these elements could only
connect two data services.

To help visualize the issue, consider the concept of Gender. At the time of writing there
are nineteen CDE elements that describe this concept across approximately thirteen
projects. The following table describes a subset of these elements:
CDE                              Data Type                   Public ID   Project
Identification Patient Gender    java.lang.String            2475587     caTIES 2.0
Patient Gender                   java.lang.String            2388916     caTIES 2.0
Person Gender                    java.lang.String            2513152     Potential CDEs for Reuse
Participant Gender               Person Gender Text Type     2513661     caTissue_Core_caArray
Lung Neoplasm Gender             Patient Gender Category     2673983     BiospecimenCoreResource
Thus, based on these definitions, it would be impossible to connect the respective
services, as they do not share a common "Gender" element. Current practice defines
service data elements at a very granular, unique level. This severely restricts the ability
of the workflow designer to discover and connect services.
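The discovery problem can be made concrete with a small sketch: matching services strictly by CDE public ID (the IDs below are taken from the table above) fails, whereas matching on a shared underlying concept would succeed. The NCI Thesaurus concept code used here is an assumption for illustration only:

```python
# service -> (CDE public ID, underlying concept code).
# Public IDs come from the Gender table above; the concept code "C17357"
# ("Gender") is an assumed NCI Thesaurus mapping, used only to illustrate
# concept-level matching.
SERVICE_CDES = {
    "caTIES":   (2475587, "C17357"),
    "caTissue": (2513661, "C17357"),
}

def compatible_by_id(a, b):
    # Strict matching on CDE public ID: fails for these two services.
    return SERVICE_CDES[a][0] == SERVICE_CDES[b][0]

def compatible_by_concept(a, b):
    # Concept-level matching: succeeds despite distinct "Gender" CDEs.
    return SERVICE_CDES[a][1] == SERVICE_CDES[b][1]
```

This is exactly the kind of inference, from public ID up to underlying concept, that the recommendations below ask workflow search tooling to perform.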
As mentioned, a key facet of the workflow composition is the discovery of services.
Current caBIG service organization is somewhat biased towards a particular "tool or
application". While this approach does have advantages, in particular that it encourages
participation from institutions within the caBIG program, it has arguably compounded
the issue of CDE standardization, as these services and data elements are often
"retrofitted" under the caGRID umbrella. Furthermore, the current mechanisms for
service discovery tend to organize services within the context of that parent application.
While this is useful in some situations, this approach tightly couples the service
implementation and workflow design phase and thus requires the workflow author to
have an intimate knowledge of the deployed applications and services. Ideally the
design of a workflow should be independent of the implementation layer.
The current mechanisms for searching for services within caGRID are still very
primitive. For effective workflow composition it must be possible to clearly identify the
inputs, outputs and functions of a service from a domain as well as an IT perspective.
While caBIG does leverage known ontologies, these domain concepts are not
transparently propagated to the CDE, let alone the search utilities of a workflow
composition tool, an observation mirrored by Shironoshita et al. (2008) [6]. For a
workflow design tool to be effective, it must have the ability to triage a search for
compatible services and potentially infer possible solutions using not only the CDE
public ID but also its underlying concept and data type.

[6] Shironoshita EP, Jean-Mary YR, Bradley RM, Kabuka MR. semCDI: a query
formulation for semantic data integration in caBIG.
Recommendations: Based on these experiences the following is a suggested list of
best practices relating to the use and discovery of services for workflow composition:
1. All service data elements must be mapped to an existing standard CDE defined
within the caDSR; if a standard concept does not exist, then a new one must be
created. This recommendation correlates closely with the proposed caBIG Gold
Level compatibility model and emphasizes the need for the adoption of this
model in order to create an effective workflow environment.
- Convert the service data type into the data type defined by the standard CDE.
- Transform all values into those supported by the selected standard CDE term.
- Tag the service CDE with the standard CDE public ID.
- The mapping process should be internal to the grid service.
- The service-level CDE can be treated as a synonym for the standard CDE but
  may be qualified by value domain restrictions.
- The standard CDE term must contain all permissible values for that element.
2. Apply and expose controlled vocabulary and ontology concepts to both CDEs
and services, permitting the discovery of services from a domain rather than an IT
or application perspective. Gil et al. [7] have described the benefits of utilizing
semantics within workflow design and execution. To build an effective workflow, the
ability to search at the concept level is required; discovery of services based only on
CDE compatibility will be insufficient. This functionality does not appear to be readily
available at either the client or API levels within the current caGrid infrastructure.
[6] (continued) J Am Med Inform Assoc. 2008 Jul-Aug;15(4):559-68. Epub 2008 Apr
24. PMID: 18436897 [PubMed - indexed for MEDLINE].
[7] Gil, Y., Ratnakar, V., Deelman, E., Mehta, G. and Kim, J. Wings for Pegasus:
Creating Large-Scale Scientific Applications Using Semantic Representations of
Computational Workflows. Proceedings of the 19th Annual Conference on Innovative
Applications of Artificial Intelligence (IAAI), Vancouver, British Columbia, Canada,
July 22-26, 2007.
Figure 2: Service CDE to standard CDE tree (ontology: NCI Thesaurus; CDE as
synonym; CDE as standard term)
3. Actively encourage and bias the creation of application-independent grid services.
4. Automatically generate and display metrics for all deployed grid services which
describe the percentage reuse of pre-existing CDEs.
5. Directly rate and record comments about services within the caGRID portal,
thereby allowing other users to benefit from their experience and providing
feedback to the author. Workflow tools could also exploit this ranking when
suggesting options to an author. This functionality can be further exploited to
determine service usage levels. These metrics could be used to help justify
further expenditure in developing or maintaining a service. (See the Service
Stability recommendations above.)
6. Enable search restrictions based upon the input, output and functional
characteristics of any service. In theory, this recommendation will hopefully be
addressed by the adoption of the Gold compatibility level.
- The search process should automatically navigate the "service CDE to
  standard CDE tree" (Figure 2) and infer related services.
- Data services should be discoverable by the CQL common data elements
  they produce (output) and consume (input / restriction).
- Where possible, the general functional characteristics of a service should
  be mapped to known ontology concepts.
7. Programmatically expose the compatibility level of a service (Gold, Silver or
Bronze) within the context of a specific version of caGRID.
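The convert/transform/tag steps of recommendation 1 can be sketched as a simple value mapping inside the service. The local codes, the public ID association, and the permissible-value set below are illustrative assumptions, not caDSR content:

```python
# Hypothetical standard CDE, tagged with a public ID from the Gender table
# above; the permissible values are invented for illustration.
STANDARD_CDE = {
    "public_id": 2513152,  # "Person Gender"
    "permissible_values": {"Male", "Female", "Unknown"},
}

# Service-local codes mapped onto the standard CDE's value set.
LOCAL_TO_STANDARD = {"M": "Male", "F": "Female", "U": "Unknown"}

def to_standard(local_value):
    """Transform a service-local value into the standard CDE's value set;
    per recommendation 1, this mapping lives inside the grid service."""
    value = LOCAL_TO_STANDARD[local_value]
    # The standard CDE must contain every permissible value we emit.
    assert value in STANDARD_CDE["permissible_values"]
    return value
```

The service-local CDE then acts as a synonym for the standard CDE, qualified by the value-domain restriction encoded in the mapping table.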
5. Workflow Implementation
The final workflow that was implemented can be found here:
6. Group Members