National Cancer Institute
caArray Database in the cancer Biomedical Informatics Grid, caBIG
Project Goals:
• MIAME-compliant data annotation
• Data sharing
• Provide microarray data for analytical applications via the MAGE-OM API
• Data integration across different domains in caBIG/caGRID
The NCI Center for Bioinformatics: Who are we?
• The Center for Bioinformatics is the NCI’s strategic and tactical arm for research information management
• We collaborate with both intramural and extramural groups
• Production, service-oriented organization
• Mission to integrate and harmonize disparate research data
Motivation for the caBIG program
• Overwhelming volume of data
• Multitude of sources
• Each part of the health community speaks its own scientific “dialect” (e.g. lab values, genetic profile, clinical data)
• Lack of consensus on common standards and terms
• Lack of coordination across, and collaboration within, the cancer research enterprise
Cancer Biomedical Informatics Grid™ (caBIG™)
• Common, widely distributed infrastructure permits research community to focus on innovation
• Shared vocabulary, data elements, data models facilitate information exchange
• Collection of interoperable applications developed to common standards
• Biomedical research data will be available for mining and integration
Current caBIG™ community
• NCI-designated Cancer Centers (50)
– Academic Centers (integrated into broader biomedical infrastructure)
– Stand-alone (community leaders)
– Community outreach
• NCI Divisions
• NIH NECTAR grantees
• Government
• Industry
• International Groups
– Standards development organizations
– U.K.’s National Cancer Research Institute
• ~800 active participants
Four Domain Workspaces and two Cross Cutting Workspaces have been launched
DOMAIN WORKSPACE 1, Clinical Trial Management Systems: Addresses the need for consistent, open and comprehensive tools for clinical trials management.
DOMAIN WORKSPACE 2, Integrative Cancer Research: Provides tools and systems to enable integration and sharing of information.
DOMAIN WORKSPACE 3, Tissue Banks & Pathology Tools: Provides for the integration, development, and implementation of tissue and pathology tools.
DOMAIN WORKSPACE 4, Imaging: Provides for the sharing and analysis of in vivo imaging data.
CROSS CUTTING WORKSPACE 1, Vocabularies & Common Data Elements: Responsible for evaluating, developing, and integrating systems for vocabulary and ontology content, standards, and software systems for content delivery.
CROSS CUTTING WORKSPACE 2, Architecture: Developing architectural standards and architecture necessary for other workspaces.
caArray Data Portal
http://carraydb.nci.nih.gov
• Number of raw data files in the database: 699 (565 .cel, 134 .gpr)
• Number of public experiments in the database: 35
• Registered users: 306
• Number of downloads from the download center: 460
• Average unique users per month: 282
• Average time spent per session: 16 minutes
caArray: Compliance with Standardization Efforts
• MIAME
– Minimum Information About a Microarray Experiment
– 1.1 Draft 6 (April 1, 2002)
– http://www.mged.org/Workgroups/MIAME/miame_1.1.html
• MAGE-ML
– MicroArray and GeneExpression Object Model and Markup Language
– 1.1 (October 2003)
– http://www.omg.org/docs/formal/03-10-01.pdf
• MGED Ontology
– Microarray Gene Expression Data Ontology
– 1.1.8 (April 2004)
– http://mged.sourceforge.net/ontologies/MGEDontology.php
caAMEL-caArray Integration, October 2006
caBIG Compatibility Guidelines
caArray Workflow
MGED Ontology (MO) and caArray
Now: MO stored in caArray
Future: caArray integration with the Enterprise Vocabulary Services (EVS)
[Diagram labels: caArray, caCORE API, caCORE Server, Mediator, LexBIG API, EVS]
Two Categories:
1) Terms provided by MO, e.g. BioMaterialLabel
2) MO references existing ontologies, e.g. DiseaseState
LabelCompound: Compounds that are used for labeling extracts.
Ontology Entry
External (to the MGED ontology) controlled vocabulary or ontology that can be referred to, such as ICD-9 or Gene Ontology.
Used in classes:
– CellLine, CellType, ClinicalHistory, ClinicalTreatment, Compound, DevelopmentalStage, DiseaseStaging, DiseaseState, FamilyMember, GeographicLocation, Histology, Organism, OrganismPart, Phenotype, SequenceOntologyBioSequenceType, StrainOrLine, TargetedCellType, TestType, TumorGrading
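The two categories of MO terms described above (terms that MO provides itself, and MO classes whose values point to an external vocabulary) can be sketched as a single record type. This is a minimal illustration, not the caArray or MAGE-OM object model: the class name, field names, and example values (including the label compound and the ICD-9 code) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

MGED_ONTOLOGY = "MGED Ontology"


@dataclass(frozen=True)
class OntologyEntry:
    """A controlled-vocabulary term attached to an annotation (hypothetical model)."""
    category: str                    # MO class, e.g. "DiseaseState" or "LabelCompound"
    value: str                       # the term itself
    source: str = MGED_ONTOLOGY      # ontology or vocabulary that defines the term
    accession: Optional[str] = None  # identifier in the external source, if any

    @property
    def is_external(self) -> bool:
        """True when the term is defined outside the MGED Ontology."""
        return self.source != MGED_ONTOLOGY


# Category 1: a term provided by MO itself (value is illustrative).
label = OntologyEntry(category="LabelCompound", value="Cy3")

# Category 2: an MO class whose value refers to an external vocabulary.
disease = OntologyEntry(
    category="DiseaseState",
    value="malignant neoplasm of brain",
    source="ICD-9",
    accession="191",  # illustrative code, not taken from the slides
)

for entry in (label, disease):
    kind = "external reference" if entry.is_external else "MO term"
    print(f"{entry.category}: {entry.value} ({kind}, source: {entry.source})")
```

Keeping both categories in one type means annotation code can treat every term uniformly and consult `is_external` only when it needs to resolve the term against another vocabulary service.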
BioMaterialCharacteristics, Disease State
The name of the pathology diagnosed in the organism from which the biomaterial was derived. The disease state is normal if no disease has been diagnosed.
DiseaseStateDatabase
Ontology Entry in caArray
External (to the MGED ontology) controlled vocabulary or ontology that can be referred to, such as ICD-9 or Gene Ontology.
BioMaterialCharacteristics in caArray
caArray Data Portal & Data Analysis Tools
1. Data Portal: submission of original, raw data files with associated experiment and sample information (MIAME annotations).
2. Data analysis and visualization tools:
caBIG tools:
1. caWorkbench - Columbia
2. GenePattern - MIT/Broad
3. DWD - UNC Lineberger
4. GenePattern - MIT/Broad
5. VISDA - Georgetown
6. Cancer Molecular Pages - Burnham
7. Function Express - Wash U Siteman
8. webGenome - RTI
caArray Configuration (federated)
[Diagram: federated caArray installations (caArray 1 and NCICB). Each shows a caArray schema, caARRAY EJB, MAGE-OM API, and Security layer, with connections to caWorkbench, caBIO, and caDSR / EVS. Other labels: JAVA APP, MAGE-ML, GRID, caGRID (future).]
geWorkbench www.geworkbench.org
Columbia University, NY: Andrea Califano, Aris Floratos, Manjunath Kustagi
Key Features:
• Gene expression data
– Support for several formats
– Seamless integration with caArray
– Rich collection of visualizers
– Filtering, normalization, analysis components
– Reconstruction and visualization of molecular interaction networks
• Sequence analysis
– Sequence homology (BLAST, HMM, Smith-Waterman)
– Promoter analysis
– Motif discovery
– Synteny
• Annotations
– Integration with CGAP gene annotations and BioCarta pathways
– Mapping to GO categories
• Development platform
– Open source, Java-based
– Component architecture, facilitating customization
GenePattern: A platform for integrative genomics
Ted Liefeld, Broad Inst.
Input: MAGE data from caArray, file, or URL
Output: MAGE data file
Environments: Module Repository, Pipeline Environment, Graphical Environment, Programming Environments
65+ Modules: Microarray Analysis, Proteomics, Clustering, Marker selection, SNP/Copy #, Data visualization
Task Integrator: add your own analysis methods to GenePattern
Results shown: Heat Map, Prediction Results

Pipeline diagram labels, in source order:
– MAGE-ML ImportViewer: import MAGE-ML data
– Bicluster
– PreprocessDataset
– extract breast samples
– GeneNeighbors: compute nearest neighbors of cyclin D1 in breast cells
– SelectFeaturesColumns: extract ovary samples
– SelectFeaturesRows: get expression data for breast neighbors in ovary cells
– ConvertToMAGEML: write out MAGE XML file

R script shown on the slide (long lines are cut off at the right edge in the source):
# source("D:/CGP2003/GenePattern_modules/Golub_et_al_1999.R", echo = TRUE)
# GenePattern
#
# Molecular Classification of Cancer: Class Prediction by Gene Expression
#
# Summary: This R/GenePattern script implements the supervised prediction metho
# in Golub et al 1999, Science 286:531-537 (1999).
# Load and set up GenePattern commands and server
source("http://wilkins.wi.mit.edu:7070/gp/GenePattern.R", echo = FALSE, print.ev
server <- SOAPServer("http://wilkins.wi.mit.edu", "/axis/servlet/AxisServlet", 7
source(paste("http://", server@host, ":", server@port, "/gp/getAllTaskWrappers.j
# Neighborhood analysis
MS.out <- MarkerSelection("data.filename" = "http://www-genome.wi.mit.edu/mpr/pu
"class.filename" = ""
"pred.results.file" = "pred.results",
"data.results.file" = "data.results",
"num.permutations" = "25",
file.show(MS.out$pred.results)
file.show(MS.out$data.results.gct)
data <- read.table(MS.out$pred.results, header=T, sep="\t", skip=14)
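The last call in the R script above, read.table(..., header=T, sep="\t", skip=14), loads a tab-delimited results file after discarding 14 preamble lines. For readers working outside R, the same step might look like the following Python sketch. The file contents and column names here are invented for illustration; the actual layout of the GenePattern prediction results file is not shown on the slide.

```python
import csv
import io


def read_tab_delimited(handle, skip=14):
    """Skip `skip` leading lines, then parse a header row and tab-separated records."""
    for _ in range(skip):
        next(handle, None)  # discard preamble lines; tolerate files shorter than `skip`
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


# Hypothetical file: 14 preamble lines, one header row, two data rows.
preamble = "".join(f"# preamble line {i}\n" for i in range(14))
sample = preamble + "Sample\tPredicted\tConfidence\nS1\tALL\t0.91\nS2\tAML\t0.78\n"

rows = read_tab_delimited(io.StringIO(sample))
for row in rows:
    print(row["Sample"], row["Predicted"], row["Confidence"])
```

Values are returned as strings, so numeric columns such as the assumed Confidence field would need an explicit float() conversion before analysis.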
caGrid 0.5.4
Current stable, supported grid environment for caBIG
Advertise, Query, and Discover services and data
caGrid 0.5 Test Bed
Current caGrid 0.5 Nodes
NCICB
1. GUMS
2. CAMS
3. GME
4. caBIO
5. caArray
6. caMOD
7. SNP500Cancer
Duke
8. rProteomics
Pittsburgh Medical Center
9. caTIES
Lombardi Georgetown
10. PIR: proteomics inf. resource
[Map: test bed nodes at NCI (GUMS, CAMS, GME, Index Service, caBIO, caArray), Duke (rProteomics), Pittsburgh (caTIES), and Georgetown (PIR, caArray)]
http://cagrid-browser.nci.nih.gov
caGrid Development team:
Ohio State University, NCICB, SAIC, Oracle, Panther Informatics, TerpSys, BAH
Discovery utilizing caBIG™ Integrated Cancer Research Tools: “Idealized” ICR Data Flow
[Flow diagram]
Tools shown: caTIES, caTissue, caArray, GenePattern, FunctionExpress, Promoter DB, Pathways Tool, TrAPSS, Proteomics LIMS, Q5 (Analysis), PIR (Annotation)
Activities shown:
– Clinical Trials, Tumor Samples: 400 brain tumor tissue samples acquired
– Pathology reports
– Discrete and manual annotation on tissues; Clinical Annotation Modules
– Gene expression profiling
– Gene annotation
– Analysis: identify up-regulated genes in specific pathways
– Identify recurring promoter elements
– Mutation identification
– Potential Drug Targets and Biomarkers
NCICB Download: http://ncicb.nci.nih.gov/download
GForge site: http://gforge.nci.nih.gov
Training and Support ncicb@pop.nci.nih.gov
Acknowledgements
caArray team:
• Mervi Heiskanen (NCI), heiskame@mail.nih.gov
• Anand Basu (NCI)
• Xiaopeng Bian (NCI)
• Rob Daly (5am Solutions)
• Eric Tavela (5am Solutions)
• Leslie Power (5am Solutions)
• Steven Matyas (5am Solutions)
• Jerry Eads (NTVI)
caBIG Developers & Adopters
EVS team:
Gilberto Fragoso (NCI)
Frank Hartel (NCI)
Training & Support:
Don Swan (TerpSys)