Open Innovation
Shared by: hedongchenchen | Posted: 12/3/2011 | Language: English | Pages: 62

The Integration of Biological Data
Using Semantic Web Technologies
Agenda
• Introduction to Semantic Web
• Semantic Solutions at Lilly
• W3C’s Semantic Web for Health Care and Life
Sciences Interest Group
Introduction to the Semantic Web
Drivers for the Semantic Web
• Business models develop rapidly these days, so infrastructure
that supports change is needed
• Organizations are increasingly forming and disbanding
collaborations
• Data is growing so quickly that it is no longer possible for
individuals to identify patterns in their heads
• Increasing recognition of the benefits of collective intelligence
Characterizing the Semantic Web
• Semantic Web is an interoperability technology
• An architecture for interconnected communities and
vocabularies
• A set of interoperable standards for knowledge exchange
Creating a Web of Data
[Figure: layered diagram with three labelled layers: Applications; Graph representation; Data in various formats]
Source: Ivan Herman
Resource Description Framework (RDF)
Source: W3C
RDFS and OWL
• RDFS
• Is a simple vocabulary for describing properties and classes of RDF
resources
• Provides semantics for hierarchies of properties and classes
• Designed to support inferencing
• OWL
• Explicitly represents meaning of terms in vocabularies and the
relationships between those terms
• Separate layers have been defined balancing expressibility vs.
implementability (OWL Lite, OWL DL, OWL Full)
• Supports inferencing
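As a rough illustration of what "inferencing" means for RDFS class hierarchies, the sketch below applies two standard RDFS entailment rules (transitivity of rdfs:subClassOf and type inheritance) to a toy set of triples. The class and instance names are hypothetical and chosen only for illustration; a real deployment would use an RDF store or reasoner rather than hand-written loops.

```python
# Toy triple set: each fact is a (subject, predicate, object) tuple.
# All identifiers are hypothetical examples, not from a real ontology.
TRIPLES = {
    ("MCH1Antagonist", "rdfs:subClassOf", "MCHAntagonist"),
    ("MCHAntagonist", "rdfs:subClassOf", "MOA"),
    ("drugX", "rdf:type", "MCH1Antagonist"),
}


def rdfs_closure(triples):
    """Apply two RDFS entailment rules until no new triples appear."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            if p != "rdfs:subClassOf":
                continue
            for (s2, p2, o2) in inferred:
                # rdfs11: subClassOf is transitive
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdfs:subClassOf", o2))
                # rdfs9: an instance of a subclass is an instance of the superclass
                if p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))
        if new <= inferred:
            return inferred
        inferred |= new


closure = rdfs_closure(TRIPLES)
```

After the closure is computed, a query for everything of type MOA would also find drugX, although that fact was never stated explicitly.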
SPARQL as a Unifying Source
Source: Ivan Herman
Semantic Web Solutions at Lilly
Discovery Metadata: Goals
• Integrate master data throughout the discovery
process to enable information sharing/integration for
the scientific community
• Model key relationships between master data classes
• Provide ability to integrate disparate data sets quicker than the
normal warehouse paradigm typically allows
• Create a re-usable and sustainable semantic implementation
• Allow for user-driven, manual curation of key data
relationships
Discovery Metadata: Ontology
[Figure: ontology diagram. Labelled data sources: SAP, Legacy, REFDB, GSM, NCBI, Manual Curation]
Discovery Metadata: Architecture
[Figure: three-tier architecture diagram]
• Apps tier: Application 1, Application 2, Application 3, …
• SOA tier: SOA Layer/Enterprise Service Bus with Authentication
(Web Services, Visualizers, Data Access Components), accessing the
data tier through SQL and SPARQL
• Data tier: Source Model 1, Source Model 2, Source Model 3, Source
Model 4, …; Local Assertions; Top Level Ontology; Provenance
• ETL feeds the data tier from other sources: other tools,
spreadsheets, RDBMS
CATIE: Overview
• Clinical Antipsychotic Trials of Intervention
Effectiveness (CATIE)
• Was the most comprehensive independent trial ever
completed to examine existing antipsychotic therapies
for schizophrenia
• Provides detailed information comparing the
effectiveness and side effects of five medications
currently used to treat schizophrenia
• Greatly enhances the knowledge available to guide
treatment choices for people with schizophrenia
CATIE: Goals
• Determine whether semantic integration and analysis
of the CATIE data set in the context of metabolic and
signal transduction pathways with receptor affinities
can provide answers to specific scientific questions:
• Which pathways are associated with response to the 5
different schizophrenia drugs?
• How do these pathways compare between treatment arms?
• Which receptors are associated with response to the 5
schizophrenia drugs?
• How are the pathways, receptors and the drug response
genes from the CATIE data set related?
CATIE: Drugs and Data Sets
• CATIE Drugs:
• Olanzapine
• Perphenazine
• Quetiapine
• Risperidone
• Ziprasidone
• Datasets:
• Entrez Gene
• PubChem
• Assay (Receptor Affinity Data)
• KEGG
• Reactome
• BioCyc
• TRANSPATH
CATIE: Architecture
[Figure: data flow diagram]
• Inputs: data in multiple formats (flat file, tab-delimited, XML)
and a top-level ontology
• RDF integrated using the Jena programming API
• Loaded into two stores: AllegroGraph (native RDF triple store)
and Oracle 11g (RDF store)
• SPARQL querying performed against each store
CATIE: Conclusions
• Efficient semantic integration can be accomplished by
using RDF
• Powerful complex data modeling can be achieved by
using graph principles inherent in RDF
• Easy translation of scientific questions to graph
queries using SPARQL and SEM_MATCH
• Customized outputs can easily be generated by
making slight changes in the SPARQL query pattern
Competitive Intelligence: Overview
• Competitive Intelligence is the purposeful, ethical and
coordinated monitoring of competitors in any
industry within a specific marketplace to:
• Strategically gain foreknowledge of recent developments in
your competitors' plans
• Make calculated, informed business decisions and formulate
operational strategy
• Provide a mechanism for actively surveying public
information for competitive intelligence in Endocrine
Competitive Intelligence: Goals
• Does such a CI effort significantly benefit from a
semantic component?
• Does the project significantly benefit from semantic
integration?
• Are there pre-existing ontologies for company and
method of action domains?
• Does NLP or text mining work for this kind of data?
• Does “buried” knowledge exist within datasets that can
be discovered using inference and reasoning?
Competitive Intelligence: Integration Challenges
[Figure: examples of integration challenges. Panel titles: Syntactic Variations; Parent - Child Relations; Semantic Variations]
Syntactic Variations
• Company: Merck & Co; Merck & Co Inc; Merck; Merck & Co Ltd
• MOA: Alpha-glucosidase inhibitor; Glucosidase inhibitor alpha
• MOA: IGF binding protein-3 stimulator; IGF binding protein stimulator-3
Semantic Variations
• Company: Amgen Boulder Inc; Applied Molecular Genetics Inc; Synergen Inc; Amgen
• MOA: Serotonin 2A receptor antagonists; 5-HT 2 receptor antagonist; 5-HT2a antagonist
• MOA: Peroxisome proliferator-activated receptor delta antagonist; PPAR delta antagonist
• MOA: Melanin concentrating hormone receptor 1 antagonists; MCH receptor-1 antagonist
Competitive Intelligence: NLP
Raw Endocrine Data → Terms from Thomson Pharma (listed in order of increasing complexity)
• Bayer Corp → Bayer Corp
• Dopamine receptor agonist → Dopamine receptor agonist
• SGLT inhibitor → SGLT inhibitor
• Eli Lilly → Eli Lilly & Co Ltd
• STAT transcription factor stimulant → STAT stimulator
• Alpha-glucosidase inhibitors → Glucosidase inhibitor-alpha
• Peroxisome proliferator-activated receptor delta antagonist → PPAR delta antagonist
• 5 Hydroxytryptamine 2C agonist → 5HT 1c agonist
• Opioid kappa receptor antagonists → Kappa opioid antagonist
• Serotonin 1B receptor agonists → 5-HT 1d beta agonist
NLP Methods Used:
• Semantic Normalization
• Fuzzy Distance
• Ignoring Stop Words
• Regular Expressions
• Tokenization
• Rule-based Mapping
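A minimal sketch of how tokenization, stop-word removal and a fuzzy distance might combine to map such label variants onto each other. It is not the method used in the project: the stop-word list, the crude plural stripping and the use of difflib's similarity ratio are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop-word list for company and MOA labels.
STOP_WORDS = {"inc", "ltd", "co", "corp", "and"}


def normalize(label):
    """Lowercase, tokenize, drop stop words, strip a plural 's', sort tokens."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Crude singularization so "inhibitors" and "inhibitor" compare equal.
    kept = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in kept]
    # Sorting makes the comparison insensitive to word order.
    return " ".join(sorted(kept))


def similarity(a, b):
    """Fuzzy similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same_company = normalize("Eli Lilly") == normalize("Eli Lilly & Co Ltd")
same_moa = normalize("Alpha-glucosidase inhibitors") == normalize("Glucosidase inhibitor-alpha")
close_moa = similarity("Opioid kappa receptor antagonists", "Kappa opioid antagonist")
```

Purely lexical steps like these handle the syntactic variants; pairs such as "Serotonin 1B receptor agonists" and "5-HT 1d beta agonist" share almost no tokens and would still need rule-based or dictionary-driven mapping.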
Competitive Intelligence: Data Model
[Figure: RDF graph of the data model]
• Classes (targets of rdf:type): MOA, Drug, Company, Disease
• Properties: hasSubClass, hasMOA, hasDrug, hasStatus, hasCompany,
hasTherapeuticArea, alternativeLabel
• MOA nodes: Melanin-concentrating hormone receptor antagonists;
GPR-24 antagonist; Melanin concentrating hormone receptor 1
antagonists; MCH 1 antagonists; MCH receptor-1 antagonist
• Company nodes: Applied Molecular Genetics Inc; Amgen Boulder;
Amgen Inc; Synergen Inc; Abgenix; Avidia Inc
• Other nodes: Phase 2; Obesity; Blank Node
Competitive Intelligence: Inferencing
Given the company name "Applied Molecular Genetics Inc", get the MOAs that this company is working on:
PREFIX TLO: <http://www.lilly.com/POC/TopLevel-ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?Endo_MOA ?Company_All_Labels
WHERE {
  # Find the company resource that carries the given name as one of its labels
  ?Company_Res TLO:SynonymousLabels "Applied Molecular Genetics Inc"^^xsd:string .
  ?Company_Res TLO:preferredLabel ?Company_Pref_Label .
  # Expand to every synonymous label of that company
  ?Company_Res TLO:SynonymousLabels ?Company_All_Labels .
  # Match drug records filed under any of those labels and return their MOAs
  ?Drug_Info TLO:hasAssociatedCompany ?Company_All_Labels .
  ?Drug_Info TLO:hasMOA ?Endo_MOA
}
Case 1: without semantic integration and inference: 0 results
Case 2: with semantic integration and inference: 18 results, for example:
• Leptin stimulator (Amgen)
• Agouti related protein inhibitor (Amgen Inc)
• Neuropeptide Y antagonist (Amgen)
• Melanocortin MC4 antagonist (Amgen Inc)
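The difference between the two cases can be mimicked in a few lines of plain Python. This is only a sketch: the labels and the four MOA rows come from the slides, but grouping all of the labels under a single company resource, and the resource key itself, are assumptions made for illustration.

```python
# Synonymous labels for one company resource (hypothetical key, labels from the slides).
COMPANY_LABELS = {
    "company/amgen": {
        "Amgen", "Amgen Inc", "Amgen Boulder Inc",
        "Applied Molecular Genetics Inc", "Synergen Inc",
    },
}

# Drug records as (company label used in the source data, MOA).
DRUG_RECORDS = [
    ("Amgen", "Leptin stimulator"),
    ("Amgen Inc", "Agouti related protein inhibitor"),
    ("Amgen", "Neuropeptide Y antagonist"),
    ("Amgen Inc", "Melanocortin MC4 antagonist"),
]


def moas_without_integration(name):
    """Exact string match on the company label only."""
    return sorted(moa for label, moa in DRUG_RECORDS if label == name)


def moas_with_integration(name):
    """Resolve the name to a company resource, then match any of its labels."""
    labels = set()
    for synonyms in COMPANY_LABELS.values():
        if name in synonyms:
            labels |= synonyms
    return sorted(moa for label, moa in DRUG_RECORDS if label in labels)


plain = moas_without_integration("Applied Molecular Genetics Inc")
expanded = moas_with_integration("Applied Molecular Genetics Inc")
```

No record is filed under the literal string "Applied Molecular Genetics Inc", so the exact match finds nothing, while the synonym expansion reaches the records filed under "Amgen" and "Amgen Inc".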
Competitive Intelligence: Conclusions
• Semantic integration (instance mapping using NLP) coupled with
an RDF data model was successful in answering questions in
Competitive Intelligence
• Ontologies provide a powerful framework of dictionaries
and taxonomic relations that support reasoning and inference over
the data for knowledge discovery
• Manual curation is a tedious, error-prone and labor-intensive task
• A semi-automated computer-based solution that utilizes
ontologies, semantic integration and NLP could drastically reduce
the manual curation effort while maintaining high-quality information
Metadata Repository: Goals
• Aggregate experiment metadata from diverse relational
databases into an Oracle 11g database for scientific investigation
• Provide a unified vocabulary for scientific investigation
• Avoid a complex architecture and extended development effort
• Realize benefits in the near-term
• Preprocess metadata to improve efficiency
• Characterize the types of questions that the ontology should answer
• Identify stable semantic technologies; do not employ parsers
• Allow semantic and relational databases to work together
• Provide browser, visualization, and query access into repository
Metadata Repository: Ontology
[Figure: ontology diagram]
• Classes: Project, Study, Experiment, Assay, DiseaseState, Plate,
Well, Chip, Chip Type, Protocol, Compound, Sample, Model, Tissue,
CellLine, Treatment, Reagent, DNA Reagent, RNA Reagent, Protein
Reagent, Software, Hardware, Probe, Gene, GeneList, ViralBatch,
ClinicalData, MESH, GO
• Properties: hasProject, hasStudy, hasAssay, hasPlate, hasChip,
hasChipType, hasDiseaseState, hasProtocol, hasSample, hasModel,
hasTissue, hasCellline, hasCompound, hasReagent, hasGene,
hasTreatment, hasSource, hasSourceTissue, IsPartOf, hasMESHId,
hasGOId, subclass
Metadata Repository: Architecture
• Iterative queries on metadata-defined items
• Metadata and raw data are aggregated to provide additional
context for analysis
[Figure: architecture diagram. Components shown: Query; Visualization; Experimental Metadata Repository; Annotation Services. Data sources shown: Agilent Expression; Affy Expression; Illumina Expression; aCGH; RNAi Screening Results; TMA Database; Mutation Analysis; SNP]
Metadata Repository: Implementation
• Protégé Ontology Editor
• Oracle Semantic Technologies 11g
• D2R Map (Database to RDF Mapping)
• C# development in Visual Studio 2005
• Current data sources include:
• Expression Data: Affymetrix, Illumina, Agilent
• aCGH Data
• RNAi Screening Data
• Reagent Data
• Gene Ontology (GO)
• Medical Subject Headings (MeSH)
• Currently ~30 million triples
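To make the idea of database-to-RDF mapping concrete, here is a minimal sketch that turns rows of a relational table into triples. It does not use or imitate D2R Map's own mapping language; the table layout, column names and the example.org namespace are hypothetical, and only the hasSample property name is borrowed from the ontology slide.

```python
import sqlite3

BASE = "http://example.org/meta/"  # hypothetical namespace


def rows_to_triples(conn):
    """Map each row of a (hypothetical) experiment table to RDF-style triples."""
    triples = []
    query = "SELECT experiment_id, sample_id FROM experiment ORDER BY experiment_id"
    for exp_id, sample_id in conn.execute(query):
        subject = f"{BASE}experiment/{exp_id}"
        # One triple typing the row as an Experiment ...
        triples.append((subject, "rdf:type", f"{BASE}Experiment"))
        # ... and one triple per foreign-key style column.
        triples.append((subject, f"{BASE}hasSample", f"{BASE}sample/{sample_id}"))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (experiment_id INTEGER, sample_id TEXT)")
conn.executemany("INSERT INTO experiment VALUES (?, ?)", [(1, "S10"), (2, "S11")])
triples = rows_to_triples(conn)
```

Each relational row becomes a subject URI, and each column becomes a property linking it to another resource, which is what lets metadata from separate databases be merged into one graph.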
Metadata Repository: Conclusion
• It’s now possible for users to ask questions such as:
• Get all the interactions for methylases that are involved in
colon cancer. For all these genes, get the expression and
aCGH values for all LSCDD colon cancer samples
• Find cell lines in which RNAi data has been generated
using Dharmacon reagents
• Retrieve the antibodies that have been used to assess the
AKT1 pathway activity in MCF7
• Find all the experiments that were done using my sample
• Find all samples that are grade III colorectal cancer. For
these samples, retrieve the expression, mutation and
aCGH data
External Collaborations
• RDF Access to Relational Databases - Chris Bizer, Eric Prud'hommeaux
• Scalability testing of relational to RDF mapping approaches
• End User Semantic Web Authoring - David Karger
• Enhancing the scalability and robustness of the Exhibit and Potluck tools
• Scientist-Driven Semantic Integration of Knowledge in Alzheimer's
Disease - Tim Clark, June Kinoshita
• Project to develop an integrated knowledge infrastructure for the neuromedical
research community, pairing rich digital semantic context with the ever-growing
digital scientific content on the web
• Provenance Collection and Management - Carole Goble, Beth Plale
• Project to develop a metadata taxonomy for global data at Lilly which enables
the rapid integration of data and mining/analysis algorithms into dataflows
which support clinical and discovery decisions
• W3C’s Health Care and Life Sciences Interest Group
Conclusion
• Semantic Web provides a flexible framework for data
integration
• Data integration needs (and issues) abound at Lilly
• Lilly is seeing tangible benefits in multiple projects
from the Semantic Web
• Focus on incremental adoption of the technology
• Tools are improving, but more work is needed
• Lilly’s use of Semantic Web technology isn’t atypical among
health care and life sciences organizations
W3C Semantic Web for Health Care
and Life Sciences Interest Group
What is the Mission of HCLS IG?
The mission of HCLS is to develop, advocate for, and
support the use of Semantic Web technologies for
biological science, translational medicine and health
care. These domains stand to gain tremendous benefit
by adoption of Semantic Web technologies, as they
depend on the interoperability of information from many
domains and processes for efficient decision support.
Task Forces
• Terminology – Semantic Web representation of existing resources
• Task lead - John Madden
• BioRDF – integrated neuroscience knowledge base
• Task lead - Kei Cheung
• Linking Open Drug Data – aggregation of Web-based drug data
• Task lead - Chris Bizer
• Scientific Discourse – building communities through networking
• Task leads - Tim Clark, John Breslin
• Clinical Observations Interoperability – patient recruitment in trials
• Task lead - Vipul Kashyap
• Other Projects: Clinical Decision Support, URI Workshop,
Collaborations with CDISC & HL7
Terminology Task Force
Task Lead: John Madden
Participants: Chimezie Ogbuji, Helen Chen, Holger
Stenzhorn, Mary Kennedy, Xiashu Wang, Rob Frost,
Jonathan Borden, Guoqian Jiang
Terminology: Overview
• Goal is to identify use cases and methods for extracting
Semantic Web representations from existing, standard
medical record terminologies, e.g. UMLS
• Methods should be reproducible and, to the extent
possible, not lossy
• Identify and document issues along the way related to
identification schemes and the expressiveness of the relevant
languages
• Initial effort will start with SNOMED-CT and UMLS
Semantic Networks and focus on a particular sub-domain
(e.g. pharmacological classification)
BioRDF Task Force
Task Lead: Kei Cheung
Participants: Scott Marshall, Eric Prud’hommeaux,
Susie Stephens, Andrew Su, Steven Larson, Huajun
Chen, TN Bhat, Matthias Samwald, Erick Antezana,
Rob Frost, Ward Blonde, Holger Stenzhorn, Don
Doherty
BioRDF: Answering Questions
Goals: Get answers to questions posed to a body of
collective knowledge in an effective way
Knowledge used: Publicly available databases, and text
mining
Strategy: Integrate knowledge using careful modeling,
exploiting Semantic Web standards and technologies
BioRDF: Looking for Targets for Alzheimer’s
• Signal transduction pathways are
considered to be rich in “druggable”
targets
• CA1 Pyramidal Neurons are
known to be particularly damaged
in Alzheimer’s disease
• Casting a wide net, can we find
candidate genes known to be
involved in signal transduction and
active in Pyramidal Neurons?
Source: Alan Ruttenberg
BioRDF: Integrating Heterogeneous Data
[Figure: integrated data sources: PDSPki, Gene Ontology, Reactome, NeuronDB, BAMS, Antibodies, Allen Brain Atlas, BrainPharm, Entrez Gene, MESH, Literature, PubChem, Mammalian Phenotype, SWAN, AlzGene, Homologene]
Source: Susie Stephens
BioRDF: SPARQL Query
Source: Alan Ruttenberg
BioRDF: Results: Genes, Processes
DRD1, 1812 adenylate cyclase activation
ADRB2, 154 adenylate cyclase activation
ADRB2, 154 arrestin mediated desensitization of G-protein coupled receptor protein signaling pathway
DRD1IP, 50632 dopamine receptor signaling pathway
DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 G-protein coupled receptor protein signaling pathway
GNG3, 2785 G-protein coupled receptor protein signaling pathway
GNG12, 55970 G-protein coupled receptor protein signaling pathway
DRD2, 1813 G-protein coupled receptor protein signaling pathway
ADRB2, 154 G-protein coupled receptor protein signaling pathway
CALM3, 808 G-protein coupled receptor protein signaling pathway
HTR2A, 3356 G-protein coupled receptor protein signaling pathway
DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second messenger
SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second messenger
MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide second messenger
CNR2, 1269 G-protein signaling, coupled to cyclic nucleotide second messenger
HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second messenger
GRIK2, 2898 glutamate signaling pathway
GRIN1, 2902 glutamate signaling pathway
GRIN2A, 2903 glutamate signaling pathway
GRIN2B, 2904 glutamate signaling pathway
ADAM10, 102 integrin-mediated signaling pathway
GRM7, 2917 negative regulation of adenylate cyclase activity
LRP1, 4035 negative regulation of Wnt receptor signaling pathway
ADAM10, 102 Notch receptor processing
ASCL1, 429 Notch signaling pathway
HTR2A, 3356 serotonin receptor signaling pathway
ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)
PTPRG, 5793 transmembrane receptor protein tyrosine kinase signaling pathway
EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway
NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway
CTNND1, 1500 Wnt receptor signaling pathway
Callout: Many of the genes are related to AD through gamma secretase (presenilin) activity
Source: Alan Ruttenberg
LODD Task Force
Task Lead: Chris Bizer
Participants: Anja Jentzsch, Kristin Tolle, Eric
Prud’hommeaux, Don Doherty, Susie Stephens, Bosse
Andersson, Scott Marshall, Glen Newton, Michel
Dumontier, TN Bhat, Oktie Hassanzadeh
LODD: Introduction
Use Semantic Web technologies to
1. publish structured data on the Web
2. set links between data from one data source to data within other data sources
[Figure: Linked Data Browsers, Linked Data Mashups and Search Engines operating over "Things" connected by typed links across data sources A, B, C, D and E]
Source: Chris Bizer
LODD: Potential Links between Data Sets
Source: Chris Bizer
LODD: Data Set Evaluation
Source: Chris Bizer
LODD: Potential questions to answer
• Physicians and Pharmacists
• What are alternative drugs for a given indication (disease)?
• What are equivalent drugs (generic version of a brand name, or the
chemical name of an active ingredient)?
• Are there ongoing clinical trials for a drug?
• Patients
• What background information is available about a drug?
• What are the contraindications of a drug?
• Which alternative drugs are available?
• What are the results of clinical trials for a drug?
• Pharmaceutical Companies
• What are other companies with drugs in similar areas?
• Which companies have a similar therapeutic focus?
Source: Chris Bizer
LODD: Linked Version of ClinicalTrials.gov
• Total number of triples:
6,998,851
• Number of Trials:
61,920
• RDF links to other data
sources: 177,975
• Links to:
• DBpedia and YAGO
(from intervention and conditions)
• GeoNames (from locations)
• Bio2RDF.org's PubMed (from references)
Source: Chris Bizer
LODD: Mashing Clinical Trials and Geo
[Figure callouts: Classification of Places; Geo Coordinates]
Source: Chris Bizer
Scientific Discourse Task Force
Task Leads: Tim Clark, John Breslin
Participants: Uldis Bojars, Paolo Ciccarese, Sudeshna
Das, Ronan Fox, Tudor Groza, Christoph Lange,
Matthias Samwald, Elizabeth Wu, Holger Stenzhorn,
Marco Ocana, Kei Cheung, Alexandre Passant
Scientific Discourse: Overview
Source: Tim Clark
Scientific Discourse: Goals
• Provide a Semantic Web platform for scientific
discourse in biomedicine
• Linked to
– key concepts, entities and knowledge
• Specified
– by ontologies
• Integrated with
– existing software tools
• Useful to
– Web communities of working scientists
Source: Tim Clark
Scientific Discourse: Some Parameters
• Discourse categories: research questions, scientific assertions
or claims, hypotheses, comments and discussion, and evidence
• Biomedical categories: genes, proteins, antibodies, animal
models, laboratory protocols, biological processes, reagents,
disease classifications, user-generated tags, and bibliographic
references
• Driving biological project: cross-application of discoveries,
methods and reagents in stem cell, Alzheimer and Parkinson
disease research
• Informatics use cases: interoperability of web-based research
communities with (a) each other, (b) key biomedical ontologies, (c)
algorithms for bibliographic annotation and text mining, and (d) key
resources
Source: Tim Clark
Scientific Discourse: SWAN+SIOC
• SIOC
• Represent activities and contributions of online communities
• Integration with blogging, wiki and CMS software
• Use of existing ontologies, e.g. FOAF, SKOS, DC
• SWAN
• Represents scientific discourse (hypotheses, claims, evidence,
concepts, entities, citations)
• Used to create the SWAN Alzheimer knowledge base
• Active beta participation of 144 Alzheimer researchers
• Ongoing integration into SCF Drupal toolkit
Source: Tim Clark
COI Task Force
Task Lead: Vipul Kashyap
Participants: Eric Prud’hommeaux, Helen Chen,
Jyotishman Pathak, Rachel Richesson, Holger
Stenzhorn
COI: Bridging Bench to Bedside
• How can existing Electronic Health Records (EHR)
formats be reused for patient recruitment?
• Quasi standard formats for clinical data:
• HL7/RIM/DCM – healthcare delivery systems
• CDISC/SDTM – clinical trial systems
• How can we map across these formats?
• Can we ask questions in one format when the data is represented in
another format?
Source: Holger Stenzhorn
COI: Use Case
Pharmaceutical companies pay a lot to test drugs
Pharmaceutical companies express protocol in CDISC
-- precipitous gap --
Hospitals exchange information in HL7/RIM
Hospitals have relational databases
Source: Eric Prud’hommeaux
Inclusion Criteria
Type 2 diabetes on diet and exercise therapy or
monotherapy with metformin, insulin
secretagogue, or alpha-glucosidase inhibitors, or
a low-dose combination of these at 50%
maximal dose. Dosing is stable for 8 weeks prior
to randomization.
…
?patient takes metformin .
Source: Holger Stenzhorn
Exclusion Criteria
Use of warfarin (Coumadin), clopidogrel
(Plavix) or other anticoagulants.
…
?patient doesNotTake anticoagulant .
Source: Holger Stenzhorn
Criteria in SPARQL
?medication1 sdtm:subject ?patient ;
             spl:activeIngredient ?ingredient1 .
?ingredient1 spl:classCode 6809 .      # metformin
OPTIONAL {
  ?medication2 sdtm:subject ?patient ;
               spl:activeIngredient ?ingredient2 .
  ?ingredient2 spl:classCode 11289 .   # anticoagulant
}
FILTER (!BOUND(?medication2))
Source: Holger Stenzhorn
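The OPTIONAL plus !BOUND pattern expresses negation: keep a patient only if no matching anticoagulant medication exists. The same logic is sketched below in plain Python over a handful of made-up records; the patient identifiers and the flat record layout are hypothetical, and only the two class codes are taken from the slide.

```python
METFORMIN = 6809        # class code used for metformin on the slide
ANTICOAGULANT = 11289   # class code used for anticoagulants on the slide

# Hypothetical medication records: (patient, class code of the active ingredient).
MEDICATIONS = [
    ("p1", 6809),
    ("p2", 6809),
    ("p2", 11289),
    ("p3", 11289),
]


def eligible_patients(medications):
    """Patients who take metformin and have no anticoagulant on record."""
    codes_by_patient = {}
    for patient, code in medications:
        codes_by_patient.setdefault(patient, set()).add(code)
    return sorted(
        patient
        for patient, codes in codes_by_patient.items()
        # Inclusion criterion: a metformin record must exist.
        # Exclusion criterion: no anticoagulant record may exist,
        # which mirrors OPTIONAL { ... } FILTER (!BOUND(?medication2)).
        if METFORMIN in codes and ANTICOAGULANT not in codes
    )


eligible = eligible_patients(MEDICATIONS)
```

Patient p2 satisfies the inclusion criterion but is removed by the exclusion criterion, and p3 never satisfies the inclusion criterion in the first place.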
Getting Involved
• Benefits to getting involved include:
• early access to use cases and best practices
• influence on standards recommendations
• cost-effective exploration of new technology through collaboration
• Get involved by contacting the chairs:
• team-hcls-chairs@w3.org