									                              A Practical Guide to

                     caBIG® Integrative Query Tools

                            Prepared for the caBIG® Community
                                           by the
                           Documentation and Training Workspace
                      In cooperation with the caGrid Knowledge Center

                                        April 2011

A Practical Guide to caBIG® Integrative Query Tools                     1
This document presents an overview and comparison of the data query and integration tools
available through the cancer Biomedical Informatics Grid® (caBIG®). While many caBIG®
applications have built-in query functionality, the tools featured in this document are those
that can query across disparate data sources and/or data types, a critical component of
translational research. The tools covered are caGrid Portal, caGrid FQP, caB2B, caBIO Portlet,
caIntegrator, geWorkbench and Taverna.

While documentation exists for each of these tools individually (links provided below), there
has been a lack of material that compares and contrasts them. This document aims to fill that
gap by comparing functionality across the tools. The comparison is intended to help
deployment leads, project managers and implementation staff distinguish between the tools and
decide which might be best for their institution.

The content presented here was distilled from other sources (e.g., the caBIG® website and
Knowledge Center Wiki pages). It should be considered a living document that will change as
technologies, tools and the community change (see Feedback section).

The first part of this document describes each tool on its own, with links to additional
information. These overviews are followed by a table that compares the tools with respect to
specific features. The final section outlines considerations for selecting tools.

The Tools
caGrid Portal

The caGrid Portal provides several ways to query for and view information about data sets,
people, tools, communities, and institutions that participate in caBIG®. These entities make up
the caGrid Portal catalog. Users can create ad hoc communities around any aspect of the portal,
such as their institution, research focus, or deployment group, and these communities can be
private or public. Communities can have forums, message boards, file systems, news feeds,
calendars and wikis. The portal provides access to many tools, such as the caBIO Portlet. The
caGrid Portal is a true community application, providing access to popular social media tools
for bookmarking and sharing items of interest.

The Portal provides a web interface for the creation of CQL (Common Query Language) queries
against a single grid service or Distributed CQL (DCQL) queries against several grid data
services by utilizing the features of the Federated Query Processor (FQP), a core caGrid Service.
The caGrid Portal exposes a subset of FQP functionality, allowing users to execute a single
query against several services and retrieve an aggregated result containing the results from each
data service query. Once created, these user-generated queries can be stored on the Portal as
Shared Queries, which can be accessed by multiple users.

For more detailed information:
caGrid Portal Documentation:
caGrid Portal:


caB2B

caB2B (cancer Bench-to-Bedside) is a query tool that permits users to search and combine data
from virtually any caGrid data service. The suite is composed of three core components: the
Web Application, the Client Application and the Administrative Module. The Web Application
provides query templates for easy search and retrieval of microarray data (from caArray),
imaging data (from the National Biomedical Imaging Archive (NBIA)), specimen data (from
caTissue) and nanoparticle data (from caNanoLab) via caGrid. Installing the Client Application
and the Administrative Module provides increasing levels of customization, access to other data
sources, and query-building functionality.

For more detailed information:
caBIG Community Website:
Knowledge Center Wiki:

caBIO Portlet

The caBIO Portlet is a portal user interface built on top of caBIO (Cancer Bioinformatics
Infrastructure Objects). caBIO is a repository of gene annotations useful in biomedical research,
compiled from multiple primary data sources. These sources include CGAP, UniGene, and the
Cancer Gene Index (gene-disease-compound associations), as well as a number of array
manufacturers (for microarray annotations). The caBIO domain model covers the rapidly
advancing genomics and proteomics domains and integrates with clinical trial registration
information, pathway annotations and literature citations. This allows researchers to discover
associations in the data that were not visible in the separate datasets, thereby enhancing
cancer research and drug design. The Portlet is an alternative user interface for caBIO that
provides easier access to caBIO data for non-programmers through simple and templated searches.

For more detailed information:
caBIO Portlet Wiki:
caBIG Community Website:
caBIO Wiki:

caIntegrator

caIntegrator is a web-based tool that combines a dynamic back-end database with custom user
querying, for use with data analysis and visualization tools. It provides a single, common
software platform in which many different types of biological and clinical data from a study or
clinical trial can be stored and then analyzed together. This ability to explore the data is
valuable to both researchers and study managers. It allows researchers to set up custom,
caBIG®-compatible web portals to conduct integrative research without requiring programming
experience. These portals bring together heterogeneous clinical, microarray and medical imaging
data to enrich translational research. Researchers can execute, save and share queries to
identify and collect many types of data, enabling multidimensional analysis. caIntegrator uses
caGrid analytical services such as GenePattern and BioConductor, as well as several built-in
tools, to perform analysis on the integrated study data, including clinical survival data.

For more detailed information:
caBIG Community Website:
Knowledge Center Wiki:


geWorkbench

geWorkbench is a platform for integrated genomics, offering capabilities in the analysis and
visualization of gene expression, sequence, and protein structure data. It offers direct access to
numerous external data sources, including caArray, the Cancer Gene Index (gene-disease-compound
associations), The Cancer Genome Atlas, BioCarta and the Pathway Interaction Database, as well
as to sequence, molecular interaction, and protein structure databases. Written in Java for use
on the desktop, geWorkbench is cross-platform and has an extensible, component-based
architecture. It gives scientists transparent access to a number of external data sources and
algorithmic services, combining these with many built-in tools for analysis and visualization
(at present, more than 40 distinct analysis and visualization modules are available).

For more detailed information:
caBIG Community Website:
Knowledge Center Wiki:


Taverna

The Taverna Workbench, utilizing Taverna’s Workflow Management System (WMS), allows
users to construct complex scientific workflows, which can consist of multiple types of
components, each called a processor. These components may be located on different machines.
Their execution is orchestrated by the Taverna Engine (run from the Workbench), and the results
are gathered and shown in the Workbench.

The Workbench provides an infrastructure to set up, execute, and monitor scientific workflows,
giving users an environment in which in silico experiments can be defined and executed. A
workflow specifies the tasks that have to be performed during a specific in silico experiment.
Plugins allow users to search for services to include in their workflows, to share workflows
via the myExperiment site, and to include and run caGrid services. Taverna is bundled with the
caGrid Portal.

For more detailed information:
caBIG Community Website:
Taverna Website:

caGrid Federated Query Processor (FQP)

The caGrid Federated Query Processor (FQP) is a grid service that provides a mechanism to
perform basic distributed aggregations and joins of queries over multiple grid data services. As
caGrid data services all use a uniform query language, CQL, as well as a uniform grid interface,
the Federated Query Infrastructure can be used to express queries over any combination of
caGrid data services. Federated queries are expressed with a query language, DCQL, which is an
extension to CQL to express such concepts as joins against other data services, aggregations, and
target data services. Integration with FQP is performed programmatically through the use of the
FQP APIs, thus allowing developers to implement distributed querying in their Java-based
applications, such as the caGrid Portal.
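
As a rough illustration of the two operations described above (aggregating results from
several data services, and joining results across services), the sketch below uses plain Python
over in-memory lists. It does not use the FQP API or DCQL syntax; the service names, records,
and helper functions are all hypothetical and exist only to show the idea.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical stand-ins for grid data services (not real caGrid services or data).
SERVICES: Dict[str, List[Record]] = {
    "expression_service_a": [
        {"gene": "GENE1", "sample": "S1"},
        {"gene": "GENE2", "sample": "S2"},
    ],
    "expression_service_b": [
        {"gene": "GENE1", "sample": "S3"},
    ],
    "annotation_service": [
        {"gene": "GENE1", "pathway": "PATHWAY_X"},
    ],
}


def query(service: str, predicate: Callable[[Record], bool]) -> List[Record]:
    """Run one query against one (simulated) data service."""
    return [rec for rec in SERVICES[service] if predicate(rec)]


def aggregate(services: List[str], predicate: Callable[[Record], bool]) -> List[Record]:
    """Run the same query against several services and pool the results,
    tagging each record with the service it came from."""
    results: List[Record] = []
    for name in services:
        for rec in query(name, predicate):
            results.append({**rec, "source": name})
    return results


def join(left: List[Record], right_service: str, key: str) -> List[Record]:
    """Combine each left-hand record with the records of another service
    that share the same value for `key`."""
    index: Dict[str, List[Record]] = {}
    for rec in SERVICES[right_service]:
        index.setdefault(rec[key], []).append(rec)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


if __name__ == "__main__":
    hits = aggregate(
        ["expression_service_a", "expression_service_b"],
        lambda rec: rec["gene"] == "GENE1",
    )
    for row in join(hits, "annotation_service", "gene"):
        print(row)
```

In the real system the query and the join condition would be expressed declaratively in DCQL
and evaluated by the FQP service, rather than written as client-side loops like these.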

For more detailed information:
caGrid Federated Query Processor:
caGrid Distributed Common Query Language (DCQL):

Comparison Grid

Features                        caB2B            caGrid Portal      caGrid FQP         caBIO Portlet    caIntegrator       geWorkbench         Taverna

Base infrastructure             Web, Client,     Web                Web                Web              Web                Desktop app         Desktop & Web
                                                                                                                           (cross-platform)

Local install                   Yes              Yes                Yes                Yes (with        Yes                Yes                 Yes
                                                                                       caGrid Portal

Install level                   Basic (wizard    Intermediate       Intermediate       Intermediate     Intermediate       Basic (wizard       Basic (wizard
                                based)           (node setup)                                                              based)              based)

Stability level                 Stable           Stable/Mature      Stable             Mature           Stable             Stable              Mature

Hosted instance(s) (i.e., NCI)  Yes              Yes                Yes (through       Yes (through     Yes                No                  Yes (through
                                                                    caGrid Portal)     caGrid Portal                                           caGrid

Advanced customization (local   Yes (through     No                 Yes                No               Yes                Yes                 Yes
control access, join            admin module)
modifications, etc.)

Data sets types that can be

  Expression / molecular data   Yes (caArray)    NA (done through   NA (done through   NA               Yes (caGrid-       Yes (gene           Yes
                                                 caGrid node)       caGrid node)                        caArray-mRNA)      expression,
                                                                                                                           sequence,

  Imaging data                  Yes (NBIA)       NA (done through   NA (done through   NA               Yes (caGrid-       No                  Yes
                                                 caGrid node)       caGrid node)                        NBIA /

  Specimen/tissue data          Yes (caTissue)   NA (done through   NA (done through   NA               No                 No                  Yes
                                                 caGrid node)       caGrid node)

  Other clinical data           Yes (through     NA (done through   NA (done through   NA               Yes (via           No                  Yes
                                                 caGrid node)       caGrid node)                        spreadsheet)

  Nanoparticle data             Yes (caNanoLab)  NA (done through   NA (done through   NA               No                 No                  Yes
                                                 caGrid node)       caGrid node)

Integrated data sources available for query

  Expression / molecular data   Yes (caArray)    Yes (if available  Yes (if available  Yes (caBIO)      Yes (caArray,      Yes (caArray,       Yes (if
                                                 in caGrid)         in caGrid)                          caBIO)             caBIO (Cancer Gene  available in
                                                                                                                           Index,              caGrid)

  Imaging data                  Yes (NBIA)       Yes (if available  Yes (if available  No               Yes                No                  Yes (as
                                                 in caGrid)         in caGrid)                                                                 available in
                                                                                                                                               caGrid)

  Specimen/tissue data          Yes (caTissue)   Yes (if available  Yes (if available  No               No (future         No                  Yes (as
                                                 in caGrid)         in caGrid)                          release)                               available in
                                                                                                                                               caGrid)

  Other clinical data           Yes (through     Yes (if available  Yes (if available  Yes (as          Yes (portal set    No                  Yes (as
                                Admin            in caGrid)         in caGrid)         associated       up per study)                          available in
                                                                                       within caBIO)                                           caGrid)

  Nanoparticle data             Yes (caNanoLab)  Yes (if available  Yes (if available  No               No                 No                  Yes (as
                                                 in caGrid)         in caGrid)                                                                 available in
                                                                                                                                               caGrid)

  Metadata repositories         Yes (caDSR)      Yes (caDSR)        Yes (caDSR)        Yes              Yes (caDSR)        No                  Yes (as
                                                                                                                                               available in
                                                                                                                                               caGrid)

  Citations/literature          No               Yes (through       Yes (through       Yes (as          No                 Yes (Cancer Gene    Yes (as
                                                 caBIO Portlet      caBIO Portlet      associated                          Index through       available in
                                                 plug-in)           plug-in)           within caBIO)                       caBIO               caGrid)

Query methods / techniques      Form & keyword   Integrated &       Integrated &       Simple &         Integrated &       Integrated &        Workflow design
                                                 shared             shared             templated        shared             shared

Analysis and processing         No               No                 No                 No               Yes (caGrid        Yes (analysis &     Yes (processing
services                                                                                                analytical         visualization -     tools)
                                                                                                        services - KM      over 40 methods)
                                                                                                        plots/GenePattern)

Workflow building &             Yes              Yes (through       No                 No               No                 Yes (visualization) Yes (workflows)
visualization                   (visualization)  Taverna plug-in)

User interaction &              No               Yes (through       No                 No               Yes (shared        Yes (shared         No
communication tools                              communities,                                           queries)           queries)
                                                 shared queries,

Exploration & discovery         No               Yes (through       No                 Yes (as          No                 Yes                 No
(background research for                         catalog of                            associated
hypothesis)                                      resources)                            within caBIO)

Plug-ins to other tools         No               Yes (caBIO         Yes (caGrid        No               Yes (caDSR,        Yes (caGrid         Yes (as
                                                 Portlet)           Portal)                             caBIO, caArray,    caArray, caBIO      available in
                                                                                                        NBIA, GenePattern  (Cancer Gene Index, caGrid)
                                                                                                        & BioConductor)    CGAP, BioCarta),

Making a Decision – Some Considerations
The following discussion builds on the comparison grid above by offering additional
considerations that may help with tool selection. Ultimately, the right decision will depend on
an assessment of the specific needs, attributes, and resources of your institution.

To Share or Not?

One factor that may influence the choice of software is how interested researchers within the
institution are in sharing data on the grid. Researchers may be delighted to query and examine
data sets posted by others but be less comfortable granting access to their own data. There may
also be concerns about data security and sensitivity. Tools such as caB2B, caIntegrator,
geWorkbench, and Taverna can accommodate local data sets and queries. Assessing the level of
concern within your institution will help guide this decision.
Types of data and formats

A primary consideration in determining which tools are most suitable for your organization is
the data types and formats that your researchers most want to query or analyze. As the
comparison grid above shows, microarray data is strongly represented and covered by most of
these tools. Other data types (e.g., tissue/specimen and other clinical data) are not yet as
well covered.

caB2B can query tissue/specimen data (caTissue) that has been posted to the grid, and
caIntegrator can accept other clinical data through a pre-specified spreadsheet upload.
caIntegrator can also accept data that does not yet exist as a grid data type. These non-grid
data types are expected to be mapped to elements in caDSR (cancer Data Standards Repository),
but where elements do not exist in caDSR, caIntegrator will still accept them for upload.

Simply query or query, process and analyze?

When exploring the data types and formats discussed above, also consider what level of query
functionality your users are most interested in. If the primary need is to explore existing data
(either locally or via the grid), then caB2B, caGrid Portal, caGrid FQP or caBIO Portlet would
be suitable. These tools may also be particularly powerful for pre-study exploration and
hypothesis elaboration. For example, a query of the caBIO Portlet on ‘colorectal cancer’
returned the following results, where each count (#) is a link that can be drilled down and
explored:

“Evidence (3287), Library (572), Protocol Association (108), Clinical Trial Protocol (67), Disease Ontology
(26), Gene (10), Protein (7), Agent (5), Protein Alias (2)”

If there is a potential need to apply data processing steps and/or some analysis tools on top of the
query results or data, then caIntegrator, geWorkbench or Taverna may be more suitable options.
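
The narrowing-down described above can be sketched as a simple filter over rows of the
comparison grid. The example below transcribes two rows of the grid ("Analysis and processing
services" and "Hosted instance(s)") as Yes/No values, dropping the parenthetical qualifiers;
the dictionary keys and the `shortlist` function are hypothetical names chosen for this
illustration only.

```python
from typing import Dict, List

TOOLS: List[str] = [
    "caB2B", "caGrid Portal", "caGrid FQP", "caBIO Portlet",
    "caIntegrator", "geWorkbench", "Taverna",
]

# Two rows of the comparison grid, reduced to True ("Yes") / False ("No").
GRID: Dict[str, Dict[str, bool]] = {
    "analysis_and_processing": {
        "caB2B": False, "caGrid Portal": False, "caGrid FQP": False,
        "caBIO Portlet": False, "caIntegrator": True,
        "geWorkbench": True, "Taverna": True,
    },
    "hosted_instance": {
        "caB2B": True, "caGrid Portal": True, "caGrid FQP": True,
        "caBIO Portlet": True, "caIntegrator": True,
        "geWorkbench": False, "Taverna": True,
    },
}


def shortlist(required_features: List[str]) -> List[str]:
    """Return the tools marked "Yes" for every required feature."""
    return [
        tool for tool in TOOLS
        if all(GRID[feature][tool] for feature in required_features)
    ]


if __name__ == "__main__":
    # Tools offering analysis on top of query results.
    print(shortlist(["analysis_and_processing"]))
    # The same need, restricted to tools with a hosted instance.
    print(shortlist(["analysis_and_processing", "hosted_instance"]))
```

A filter like this only produces a starting shortlist; the qualifiers in the grid cells and the
considerations discussed in this section still need to be weighed for each candidate.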

If the data processing needed involves multiple, complex steps, and/or there is interest in
automating the steps, the Taverna Workbench is specifically designed to manage scientific
workflows.

When analyzing gene expression (microarray) data, consider the level of sophistication the
analysis requires. caIntegrator itself provides Kaplan-Meier (KM) plots and gene expression
plots. For more advanced analysis, caIntegrator can transfer data directly into GenePattern and
can also make use of several GenePattern grid services, which may be adequate for many
researchers. If more sophisticated array analysis is needed, geWorkbench has over 40 methods
available.

Desire to collaborate with others

If there is strong interest within your organization in finding others with similar research
interests (data or individuals) for potential multi-site research collaborations, then caGrid
Portal deserves consideration. This tool has a rich, unique set of features to foster
collaborative research. Users can create communities around a particular research interest, and
these communities can be private or public. From the Portal, these communities have forums,
message boards, file systems, news feeds, calendars and wikis at their disposal.

Level of IT support

Finally, as with the implementation of any software application in an organization,
consideration should be given to the complexity of the implementation (installation and
maintenance) as well as the complexity of using the software itself. Are there enough IT and/or
informatics resources within your organization to support it? This may influence whether you
decide to pursue a locally installed instance or an NCI-hosted instance.

Feedback

Further questions on the topics covered in this document are welcome and should be submitted
to the appropriate subject forums at the caGrid Knowledge Center. Feedback on this document
is also welcome and can be submitted to the General Discussions Forum at that site.

Acknowledgements & Credits

Lara C. Fournier, M.S.
(caBIG® Documentation and Training Workspace)
Bioinformatics Shared Resource
OHSU Knight Cancer Institute
Portland, Oregon


Fred Loney, M.S.
(caBIG® Deployment Lead)
Bioinformatics Shared Resource
OHSU Knight Cancer Institute
Portland, Oregon

Kenneth C. Smith, Ph.D.
(caBIG® Documentation and Training Workspace)
Joint Centers for Systems Biology
Columbia University
New York, New York

William Stephens
(caBIG® caGRID Knowledge Center)
Center for IT Innovations in Healthcare
Ohio State University

TJ Andrews, M.S.
(caBIG® Support Service Provider, caIntegrator Project)
ScenPro, Inc.
