CARE Center Informatics Subcommittee
Background and Progress Report for 10/20/06 Conference Call
Marcia Nizzari

Background Information
On CARE informatics, GAP (Genetic Analysis Platform) and Informatics Team

CaRE Center Informatics
• Builds on existing Genetic Analysis Platform
  – Operational for 2+ years
  – Genotyping and Resequencing
  – Code base successfully reused
• CaRE Center enhancements:
  – Data sharing strategy
  – Phenotype/Trait thesaurus, meta thesaurus
  – Customizable analytic pipelines

GAP by the Numbers…
• 510,000 lines of working source code
• Very large databases, one of the largest tables has 640 M rows
• Standard industry metrics (SLOCCount) estimate that this code base required
  – $18M to develop (actual: ~ $4.5M)
  – Staff of 40 for 3.5 years (actual: average 12/yr)
• Pretty good deal!
  – Truly a World Class informatics team

GAP by the #’s (cont’d.)
• User statistics:
  – 222 logins for the system (internal & external)
  – Nina has trained (since 12/05):
    • 47 people individually
    • Held 25 small group sessions
• Jira statistics:
  – 1,722 issues logged since Jira in production
  – 1,330 resolved
  – 392 open/in progress
  – Daniel Mirel is the champion Jira user

GAP by the #’s (cont’d)
• Informatics staff statistics:
  – Total software dev experience: ~200 years
  – Degrees held:
    • 1 PhD (neuroscience)
    • 6 masters degrees – 4 comp sci, 1 molbio, 1 manufacturing
    • 4 eng bachelor degrees – 2 EE’s, 1 ChemEng, 1 SoftEng
    • 3 biochem/molbio/biology bachelor degrees
    • 4 comp sci bachelor degrees
    • 1 physics bachelor degree

User Workflow in Software
[Figure: workflow diagram. Inputs: Samples & Clinical Information (Biological Sample Platform); Genome Sequence & Genetic Variation (Purchased SNPs (Celera), HapMap DCC, NCBI dbSNP). Steps: Project Management; Plan Experiment in Project Management; Execute Genotyping Experiment; Execute Resequencing Experiment; Genotyping / Resequencing Pipeline (ESP); Genetic Analysis. Caption: Perform analysis and loop back to next round of experiment planning.]

CARE Association Study Workflow
[Figure: two-column workflow diagram.
Production: Sample Mgt, Project Mgt, Genotyping
  – Upload Samples, Peds, Individuals, Sample Data
  – Create Experiments (Samples x Features)
  – Design and Execute Experiments
  – QC/Curate Results
Analysis: Gene Pattern + custom analysis tools
  – Compile Phenotypes
  – Summarize/Filter Project
  – PLINK
  – Association & Feature Statistics
  – Viewers
  – Custom Algorithms, Viewers
Shared components: Web Services, DBs, LIMS DBs, Data Vault]

Phenotype Component Conceptual Architecture
[Figure: layered architecture diagram. Labels: Thesauri, Meta Thesaurus for CARE; Controlled Vocabulary; Constraints; One ontology – either Group or Project specified; Base Component; Phenotype Inquiry; Phenotype Capture and Validation.]

CARE Progress Report
• PhenoMall functionality
  – Rapid enhancement of capture function
    • Meta data
    • Mapping of all CARE phenotypes looks good
  – Major enhancements for pheno inquiry
  – Informatics goals of pilot
    • Figure out how far up the controlled vocab/thesaurus stack we need to go
    • What curation tools are needed?
    • Requirements gathering beyond pilot
• Awaiting the decision on data sharing…

Deliverables
• NIH Application/System Security Plan
  – Two major revisions, July 17th and Oct 16th
  – Security officer at NHLBI is Cindy Walczak
• When data sharing model decided:
  – Research technologies, approaches, make recommendation to subcommittee
  – Spec/design and review by subcommittee
• Working pilot
  – Need to discuss when to demo
  – Feb meeting in Bethesda??

Security Considerations

Security Layers - General
• There are at least three levels:
  – MIT firewalls
    • Penetration testing, Tripwire, packet monitoring, etc.
  – Broad
    • New Cisco firewalls
    • Route to host servers – Explicit Allows only
    • Wireless access goes out to MIT firewall
    • Open jack goes to Broad firewall
  – CARE Center application itself

[Figure: network diagram spanning The World, MIT, and The Broad Institute. Labels: Internet; MIT “Cloud”; Core Router; Radius (used for authentication for VPN access); Firewalls (two Cisco ASA 5540); Host A; Host B; LIMS DB; Unregistered 10.10 domain (Open jack, Wireless).
Access Rules for Subnets: Explicit allows; must be in the list to permit access
  – http = 80 -> host
  – Ssh = 22 -> host
  – https = 443 (SSL)
Allow Rules: Explicit allows, e.g., allow host on LIMS to talk to host on server]

Security Layers - Application
• Genetic Analysis Platform application security:
  – Role-based security
  – Passwords that expire
  – Audit trails track user activity
• Detailed information available in NIH Application/System Security Plan for CARE Center

Summary: Issues/Questions
• Scope of phenotype-related enhancements
• Group/Project structure for CaRE Center
• CaRE user visibility into Process Dashboard/LIMS
• Data release model decision
  – Data Enclave scenarios and security
• User training and doco
  – Analysis methodology
  – System and security training

Security for Production & Analysis
[Figure: security context diagram.
Users in JAAS domain: BSP Lab Technician; CaRE Scientist; CaRE Cohort Technician; Broad Lab Technician, Coordinator
Components: Biological Samples Platform; Project Management; Analysis Pipelines; Process/LIMS; PIPS DB; Feature DB
Security contexts:
  – BSP Security Context (Sample Collection)
  – Proj Mgt Security Context (Project): Groups, Projects, Grants, Panels, Feature Sets, Sample Sets
  – CaRE Analysis Security Context (Scope based on rules of Data Enclave, could cover multiple Projects)
  – Lab Security Context (X-Project)
Shareable Objects: Peds, Individuals, Phenotypes, Samples, Features
LSIDs]

Postlude

How Users Can Help
• Specify! We need things nailed down…
• The classic specification:
  – Genesis 6:14 - 16 (NKJV)
    14 "Make yourself an ark of gopherwood; make rooms in the ark, and cover it inside and outside with pitch.
    15 "And this is how you shall make it: The length of the ark shall be three hundred cubits, its width fifty cubits, and its height thirty cubits.
    16 "You shall make a window for the ark, and you shall finish it to a cubit from above; and set the door of the ark in its side. You shall make it with lower, second, and third decks.
• We live in the world of 0’s and 1’s!

Informatics Development Team
Jason Carey
Kristian Cibulskis
Michael Dinsmore
Tim Fennell
George Grant
Bob Handsaker
Nina Lapchyk
Pei Lin
James Nemesh (CH)
Huy Nguyen
Howard Rafal
Greg Rushton
Dennis Ryan
David Tefft
Alex Thomson
Ellen Winchester
Alec Wysoker
Names in bold have significant time allocated to CARE center activity.