National Cancer Institute
The Cancer Biomedical Informatics Grid™ (caBIG™)
2006 CODATA Conference Beijing, China
Mary Jo Deering, Ph.D.
Director, Informatics Dissemination
NCI Center for Bioinformatics
Cancer Biomedical Informatics Grid™ (caBIG™)
• Common, widely distributed infrastructure permits research community to focus on innovation
• Shared vocabulary, data elements, data models facilitate information exchange
• Collection of interoperable applications developed to common standards
• Raw published cancer research data is available for mining and integration
Cancer Biomedical Informatics Grid™ (caBIG™)
• caBIG™ infrastructure and tools are widely applicable outside cancer
• caBIG™ components may be used by anyone
caBIG™ principles
• Open source
• Open access
• Open development
• Federated
caBIG™'s Informatics Core
caBIG™ Operational Structure
caBIG™
Strategic Planning Workspace
Clinical Trials Mgmt Systems Workspace
Integrative Cancer Research Workspace
In Vivo Imaging Workspace
Tissue Banks & Pathology Tools Workspace
Training Workspace
Data Sharing & Intellectual Capital Workspace
caBIG™ Vocabularies and Common Data Elements Workspace
caBIG™ Architecture Workspace
2006 Clinical Trial Tools Development Activities
• caAERS
• Patient Study Calendar
• Lab Data Hub
• Making other CTMS systems caBIG™ compatible
Clinical Research IT Infrastructure
[Diagram: clinical research IT infrastructure. Labelled components:]
• Clinical Systems: Labs, EMR, Tissue, etc.
• Translation Service: HL7 v2.x and other formats to HL7 v3 (HL7/CAM SDK)
• HL7 transactional database
• Clinical Trials: Lifecycle Management, Adverse Events, Participant Registry, EDC, Clinical Data Mgmt, etc.
• Patient Health Record
• De-identification Services
• Research Data Warehouse
• External Reporting (HL7 v3, Janus) and Clinical Research Information Exchange: FDA, Sponsor, NCI, other
Integrated Cancer Research
• Microarray Repositories
• Data Analysis & Statistics
• Informatics for Proteomics
• Genome Annotation
• Pathways Tools
• Translational Tools
• Population Sciences and Cancer Control
Tissue Banks and Pathology Tools
• caTISSUE Core (WU) – Core specimen handling and tracking functions
• caTISSUE Clinical Annotation Engine (UPMC) – Annotation of specimens with clinical data
• caTIES (UPMC) – Text extraction and de-identification of surgical pathology reports
caTISSUE Core:
Register Specimen Group
caIMAGE – Cancer Images Database
• caIMAGE allows researchers to submit and retrieve images and annotations.
• Images are streamed for efficient access.
• Researchers can search images based on tissue, diagnosis and experiment information.
• Use of common terminology originating from the NCI Enterprise Vocabulary Server (EVS).
caBIG™ Compatibility
• caBIG™ is all about Interoperability
– Key is to create tools for sharing information
• Extensible infrastructure
– Expandable and modular software to plug into existing systems so current development efforts are not wasted
• Ensures partnerships
– Encourages relationships between academic, government and industry
• Evolving
– Compatibility guidelines are being translated into certification procedures
• Compatibility Guidelines at https://cabig.nci.nih.gov/guidelines_documentation
Interoperability
The ability of a system to access and use the parts or equipment of another system
• Syntactic interoperability
• Semantic interoperability
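The difference between the two kinds of interoperability can be illustrated with a minimal Python sketch. The records, field names, site names and mapping table below are all made up for illustration and do not come from caBIG™. Both payloads parse as JSON, so the sites are syntactically interoperable, but their values only become comparable once each local code is bound to a shared concept.

```python
import json

# Hypothetical exports from two sites; field names and codes are illustrative only.
site_a = json.loads('{"patient": "A-1", "sex": "F"}')
site_b = json.loads('{"patient": "B-7", "sex": "2"}')

# Hypothetical binding of each site's local code to a shared concept label.
SHARED_CONCEPT = {
    ("site_a", "F"): "Female",
    ("site_a", "M"): "Male",
    ("site_b", "1"): "Male",
    ("site_b", "2"): "Female",
}


def to_shared(site: str, record: dict) -> str:
    """Translate a site's local 'sex' code into the shared concept label."""
    return SHARED_CONCEPT[(site, record["sex"])]


# Syntactic interoperability: both records parse, yet the raw values disagree.
raw_match = site_a["sex"] == site_b["sex"]

# Semantic interoperability: after binding to shared concepts, the values agree.
semantic_match = to_shared("site_a", site_a) == to_shared("site_b", site_b)
```

In this toy case `raw_match` is false while `semantic_match` is true, which is the gap that shared vocabularies and common data elements are meant to close.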
caCORE
Security
Bioinformatics Objects
Common Data Elements
Enterprise Vocabulary
Professional Documentation
caCORE Software Development Kit Components
• UML Modeling Tool (any with XMI export)
• Semantic Connector (concept binding utility)
• UML Loader (model registration in caDSR)
• Codegen (middleware code generator)
• Security Adaptor (Common Security Module)
• caCORE SDK generates a caBIG-Silver compliant system
Grid Technology in caBIG™
• What is a 'Grid'?
– "A Grid is a system that coordinates resources that are not subject to centralized control using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service." - Ian Foster, Grid Today, July 20, 2002
• Grid Technology supplies two useful components to a network of computers:
– Advertising: Inform the network about the capabilities of new systems
– Discovery: Allow users to find resources that meet their needs
• The caGrid project is the 'Grid in caBIG™'; the actual infrastructure that data and analytical services will use to interoperate.
• The current caGrid is version 0.5; caGrid 1.0 in December.
• The combination of data and analytical service nodes in caBIG™ produced a design that utilizes a variety of standard Grid technologies including the Globus Toolkit and OGSA-DAI, DQP, GRAM, etc.
Test bed Infrastructure
caGrid 0.5 Test Bed
Cancer Biomedical Informatics Grid™ (caBIG™)
• caBIG™ infrastructure and tools are widely applicable outside cancer
• caBIG™ components may be used by anyone
Contact Information
Mary Jo Deering, Ph.D.
Director for Informatics Dissemination
NCI Center for Bioinformatics
National Cancer Institute
National Institutes of Health, USDHHS
6116 Executive Blvd. - #403
Rockville, MD 20852
(o) 301-496-3458
(f) 301-480-4222
deeringm@mail.nih.gov
Additional Background and Detail
• The following slides were not included in the presentation.
Current caBIG™ community
• NCI-designated Cancer Centers (50)
– Academic Centers (integrated into broader biomedical infrastructure)
– Stand-alone (community leaders)
– Community outreach
• NCI Divisions and Programs
• National Institutes of Health
• Other Government Agencies
• Industry
• International Groups
– Standards development organizations
– U.K.'s National Cancer Research Institute
• ~900 active participants
Four Domain Workspaces and two Cross Cutting Workspaces have been launched
DOMAIN WORKSPACE 1: Clinical Trial Management Systems
– Addresses the need for consistent, open and comprehensive tools for clinical trials management.
DOMAIN WORKSPACE 2: Integrative Cancer Research
– Provides tools and systems to enable integration and sharing of information.
DOMAIN WORKSPACE 3: Tissue Banks & Pathology Tools
– Provides for the integration, development, and implementation of tissue and pathology tools.
DOMAIN WORKSPACE 4: Imaging
– Provides for the sharing and analysis of in vivo imaging data.
CROSS CUTTING WORKSPACE 1: Vocabularies & Common Data Elements
– Responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery.
CROSS CUTTING WORKSPACE 2: Architecture
– Developing architectural standards and architecture necessary for other workspaces.
Strategic Level Workspaces
Data Sharing and Intellectual Capital
Addresses issues related to the sharing of data, applications and infrastructure both within the consortium and in the larger cancer research community.
Training
Developing strategies for providing training in the use of the caBIG developed resources including on-line tutorials, workshops, and training programs.
Strategic Planning
Assists in identifying strategic priorities for the development and evolution of the caBIG™ effort.
REMBRANDT: Building a robust translational research framework for brain tumor studies
REpository of Molecular BRAin Neoplasia DaTa
http://rembrandt.nci.nih.gov
Rembrandt Knowledgebase
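[Diagram: Rembrandt knowledgebase data flow. Labelled components:]
• Data sources: Expression array data, SNP array data, Clinical data, Proteomics data
• caIntegrator DataMart
• caBIG Analytic Tools
• Outcomes: Better understanding, Better treatments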
caBIG™ Compatibility Guidelines
• The caBIG™ compatibility guidelines are designed to ensure that systems designed in a Federated environment are still interoperable on the caBIG™ Grid, both syntactically and semantically
• Since achieving interoperability is a process, caBIG™ recognizes four levels of compatibility, starting from Legacy (not interoperable) through Bronze, Silver and Gold (fully interoperable)
• caBIG™ compatibility is all about interfaces rather than the scientific content of the system
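The four levels form an ordered scale, which could be modelled as below. This is a minimal sketch: only the level names and their order come from the slide, while the numeric values and the `meets` helper are assumptions made for illustration and are not part of any caBIG™ tooling.

```python
from enum import IntEnum


class Compatibility(IntEnum):
    """Compatibility levels in ascending order (numeric values are arbitrary)."""
    LEGACY = 0  # not interoperable
    BRONZE = 1
    SILVER = 2
    GOLD = 3    # fully interoperable


def meets(level: Compatibility, required: Compatibility) -> bool:
    """Hypothetical helper: True if `level` is at least the `required` level."""
    return level >= required


# Levels sort from least to most interoperable.
ordered_names = [level.name for level in sorted(Compatibility)]
```

For instance, `meets(Compatibility.SILVER, Compatibility.BRONZE)` is true, while a Legacy system does not meet a Silver requirement.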
[Figure: caBIG Compatibility Guidelines matrix, with one SYNTACTIC and three SEMANTIC categories]
Common Data Elements
• What do all those data classes and attributes actually mean, anyway?
• Data descriptors or “semantic metadata” required
• Computable, commonly structured, reusable units of metadata are “Common Data Elements” or CDEs.
• NCI uses the ISO/IEC 11179 standard for metadata structure and registration
• Semantics all drawn from Enterprise Vocabulary Service resources
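As a rough sketch of what a computable, reusable unit of metadata might look like, the Python below follows the general ISO/IEC 11179 idea that a data element pairs a data element concept (an object class plus a property) with a value domain. The class names, field names and the example element are assumptions for illustration and are not the caDSR schema or API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataElementConcept:
    object_class: str   # the kind of thing being described
    property_name: str  # the characteristic being recorded


@dataclass(frozen=True)
class ValueDomain:
    datatype: str
    permissible_values: Tuple[str, ...] = ()

    def accepts(self, value: str) -> bool:
        """An empty permissible-value list means the domain is not enumerated."""
        return not self.permissible_values or value in self.permissible_values


@dataclass(frozen=True)
class CommonDataElement:
    concept: DataElementConcept
    value_domain: ValueDomain
    definition: str


# Hypothetical example element; the wording and values are illustrative only.
patient_sex = CommonDataElement(
    concept=DataElementConcept(object_class="Patient", property_name="Sex"),
    value_domain=ValueDomain("string", ("Female", "Male", "Unknown")),
    definition="The sex of the patient as recorded at registration.",
)
```

Because the meaning (concept) is kept separate from the representation (value domain), two systems can reuse the same element and check values with `patient_sex.value_domain.accepts(...)`.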
Cancer Data Standards Repository (caDSR)
• Basic caDSR unit of metadata information to describe a datum is a Common Data Element or CDE
• Enterprise-class system for storing metadata, with APIs that give runtime access to both metadata and semantics
• Implements the ISO 11179 standard, a flexible model for describing arbitrary metadata
• Used to describe metadata associated with clinical case report forms and UML Models
Enterprise Vocabulary Services
• Controlled vocabulary resources for caCORE and the cancer research community
• Vocabulary Products and Services
– NCI Thesaurus
– NCI Metathesaurus
– External vocabularies
• NCI Thesaurus - controlled vocabulary source for metadata
– Has excellent coverage of cancer terminology
– Expands based on needs for additional terminology
– Based on concepts rather than terms
– Each concept has a unique identifier or CUI with definitions and synonyms
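The phrase "based on concepts rather than terms" can be illustrated with a toy lookup in which several terms resolve to one concept record. This is a sketch under stated assumptions: the identifiers, definitions and synonym lists below are invented placeholders, not entries from the NCI Thesaurus, and the functions are not part of any EVS API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    identifier: str            # placeholder ID, not a real thesaurus code
    preferred_name: str
    definition: str
    synonyms: Tuple[str, ...]


# Illustrative records only.
CONCEPTS = (
    Concept("EX-0001", "Neoplasm", "Illustrative definition text.", ("Tumor", "Tumour")),
    Concept("EX-0002", "Biopsy", "Illustrative definition text.", ("Bx",)),
)


def build_index(concepts: Iterable[Concept]) -> Dict[str, Concept]:
    """Map every preferred name and synonym (case-insensitive) to its concept."""
    index: Dict[str, Concept] = {}
    for concept in concepts:
        for term in (concept.preferred_name, *concept.synonyms):
            index[term.lower()] = concept
    return index


INDEX = build_index(CONCEPTS)


def lookup(term: str) -> Optional[Concept]:
    """Return the concept a term refers to, or None if the term is unknown."""
    return INDEX.get(term.lower())
```

Here `lookup("tumour")` and `lookup("Neoplasm")` return the same record, so data annotated with either term can be grouped under one identifier.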
Data Standards in caBIG™
• The V/CDE workspace is responsible for facilitating the development and ratification of Data Standards for caBIG™
• Data Standards can be Vocabularies or Common Data Elements (CDEs) with their associated controlled terminology
• A caBIG™ Data Standard is, in effect, a 'pre-approved' mechanism for semantically modeling an attribute or series of attributes in a data object. Ideally, having a standard available shortens development time for other projects that need to present such data
• Whenever possible, caBIG™ adopts standards that are derived from other standards bodies (HL7, ISO, USPS, UPU, W3C, etc.) and in general use within our community
• In the last year, the V/CDE workspace has developed a consensus driven mechanism for approving Data Standards and applied it to an increasing number of CDEs
caCORE Architecture
[Diagram: caCORE layered architecture. Labelled components:]
• Clients: HTTP Clients, SOAP Clients, Perl Clients, Java Applications
• Interfaces: Java, SOAP, XML
• Middleware: Web Application Server, APIs, Domain Objects [Gene, Disease, Agent, etc.], Data Access Objects
• Data: Biomedical Data, Common Data Elements, Enterprise Vocabulary
• Authorization
Use cases for caGrid
• Advertisement
– Service Provider composes service metadata describing the service and publishes it to grid.
• Discovery
– Researcher (or application developer) specifies search criteria describing a service of interest
– The researcher submits the discovery request to a discovery service, which identifies a list of services matching the criteria, and returns the list.
• Invocation
– Researcher (or application developer) instantiates the grid service and accesses its resources
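The three use cases can be walked through with a small in-memory stand-in. This is only a sketch of the advertise, discover and invoke pattern: it is not the caGrid API, and the class names, metadata keys and service names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceAd:
    """Metadata a provider publishes about one service (hypothetical shape)."""
    name: str
    metadata: Dict[str, str]
    invoke: Callable[[str], str]


class Registry:
    """In-memory stand-in for a discovery service."""

    def __init__(self) -> None:
        self._ads: List[ServiceAd] = []

    def advertise(self, ad: ServiceAd) -> None:
        # Advertisement: the provider publishes its service metadata.
        self._ads.append(ad)

    def discover(self, **criteria: str) -> List[ServiceAd]:
        # Discovery: return every service whose metadata matches all criteria.
        return [
            ad for ad in self._ads
            if all(ad.metadata.get(key) == value for key, value in criteria.items())
        ]


registry = Registry()
registry.advertise(ServiceAd(
    name="demo-microarray",
    metadata={"kind": "data", "domain": "microarray"},
    invoke=lambda query: f"demo-microarray handled: {query}",
))
registry.advertise(ServiceAd(
    name="demo-proteomics-stats",
    metadata={"kind": "analytical", "domain": "proteomics"},
    invoke=lambda query: f"demo-proteomics-stats handled: {query}",
))

# Discovery followed by invocation of the first matching service.
matches = registry.discover(kind="analytical", domain="proteomics")
result = matches[0].invoke("normalize spectra")
```

In a real grid the registry and the services would be remote and secured; the sketch keeps everything local so the flow of the three steps is easy to follow.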
caGrid 0.5 Services
• Data Services
– caBIO: Gene-centric bioinformatics objects
  • NCICB-Rockville, MD
– caArray: MAGE-OM compliant microarray repository
  • NCICB-Rockville, MD
  • Lombardi Cancer Center-Georgetown, DC
– gridPIR: Protein Information Resource
  • Lombardi Cancer Center-Georgetown, DC
– caTIES: Text Information Extraction System for pathology reports
  • UPMC-Pittsburgh, PA
– SNP500: Polymorphism database with population frequencies
  • NCI Core Genotyping Facility-Gaithersburg, MD
– caMOD II: Cancer Model Organism Database
  • NCI Mouse Models of Human Cancer Consortium (MMHCC)
• Analytical Service
– RProteomics: Statistical analysis of proteomics data
  • Duke-Durham, NC
caGrid Service-Oriented Architecture
[Diagram: OGSA-compliant service-oriented architecture. Labelled components:]
• Functions: Workflow, Resource Management, Metadata Management, Schema Management, Service Registry, ID Resolution, Service Description, Grid Communication Protocol, Transport, Security
• Technologies shown: BPEL, GRAM, Globus, Globus Toolkit, Mobius, OGSA-DAI, myProxy, GSI, caCORE, CAS
Enabling Technology
• The NCI provides freely available enabling technology for caBIG™ compatibility
• These technologies are distributed under a 'non-viral' open source license.
• caCORE
– Enterprise Vocabulary Services (EVS)
– Cancer Data Standards Repository (caDSR)
• caCORE Software Development Kit
– When complete process is followed, the outcome is a caBIG™ 'Silver' compliant data system.
How can my research benefit from caBIG™ Tools?
• Everything developed by the program is open source and freely available
• Training is available at https://cabig.nci.nih.gov/training
• The latest versions of all the software developed as part of the project can be obtained from the caBIG™ project gforge site:
– http://gforge.nci.nih.gov
caBIG™: Getting Involved
• To get involved with caBIG™:
– Track caBIG™ activities on the NCI's caBIG™ website, https://cabig.nci.nih.gov/
– Attend caBIG™ Annual Meeting, February 5-7, 2007, Wardman Park Marriott, Washington, DC
– Learn about the existing bioinformatics infrastructure, caCORE, at https://ncicb.nci.nih.gov/core
– Download currently available caBIG™ tools from the caBIG™ website at https://cabig.nci.nih.gov/inventory
– Sign up for the caBIG™ mailing list at http://list.nih.gov/archives/cabig_announce.html
• Please visit the main caBIG™ website for more information: https://cabig.nci.nih.gov/