Open Innovation

Shared by: hedongchenchen
Posted: 12/3/2011 | Language: English | Pages: 62
The Integration of Biological Data Using Semantic Web Technologies

Agenda

• Introduction to Semantic Web
• Semantic Solutions at Lilly
• W3C’s Semantic Web for Health Care and Life Sciences Interest Group

Introduction to the Semantic Web

Drivers for the Semantic Web

• Business models develop rapidly these days, so infrastructure that supports change is needed
• Organizations are increasingly forming and disbanding collaborations
• Data is growing so quickly that individuals can no longer identify patterns in their heads
• There is increasing recognition of the benefits of collective intelligence

Characterizing the Semantic Web

• Semantic Web is an interoperability technology
• An architecture for interconnected communities and vocabularies
• A set of interoperable standards for knowledge exchange

Creating a Web of Data





[Figure: three layers, top to bottom: Applications; Graph representation; Data in various formats]

Source: Ivan Herman

Resource Description Framework (RDF)









Source: W3C

RDFS and OWL

• RDFS
  • Is a simple vocabulary for describing properties and classes of RDF resources
  • Provides semantics for hierarchies of properties and classes
  • Designed to support inferencing
• OWL
  • Explicitly represents meaning of terms in vocabularies and the relationships between those terms
  • Separate layers have been defined balancing expressibility vs. implementability (OWL Lite, OWL DL, OWL Full)
  • Supports inferencing
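
The kind of inferencing that class hierarchies enable can be sketched in a few lines of Python. This is only an illustration of the RDFS subclass rule (if X has type C and C is a subclass of D, then X also has type D), not a real reasoner; the `ex:` class and instance names are hypothetical and do not come from the slides.

```python
def infer_types(triples):
    """Apply the RDFS subclass rule repeatedly until nothing new is derived."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        for s, p, o in list(inferred):  # iterate over a snapshot
            if p != "rdf:type":
                continue
            for child, parent in subclass_pairs:
                new_triple = (s, "rdf:type", parent)
                if o == child and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


# Hypothetical example data: one instance and a two-level class hierarchy.
triples = {
    ("ex:Olanzapine", "rdf:type", "ex:AntipsychoticDrug"),
    ("ex:AntipsychoticDrug", "rdfs:subClassOf", "ex:Drug"),
    ("ex:Drug", "rdfs:subClassOf", "ex:ChemicalEntity"),
}

closure = infer_types(triples)
```

After running it, `closure` also contains the two derived `rdf:type` statements for `ex:Olanzapine`, so a query for all instances of `ex:Drug` would find it even though that fact was never asserted directly.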

SPARQL as a Unifying Source









Source: Ivan Herman

Semantic Web Solutions at Lilly

Discovery Metadata: Goals

• Integrate master data throughout the discovery process to enable information sharing/integration for the scientific community
  • Model key relationships between master data classes
  • Provide the ability to integrate disparate data sets more quickly than the normal warehouse paradigm typically allows
  • Create a reusable and sustainable semantic implementation
• Allow for user-driven, manual curation of key data relationships

Discovery Metadata: Ontology









[Figure: ontology diagram; recoverable source labels: SAP, Legacy, REFDB, GSM, NCBI, Manual Curation]

Discovery Metadata: Architecture

[Figure: three-tier architecture; recoverable labels, top to bottom:]
• APPS: Application 1, Application 2, Application 3, …
• SOA: SOA Layer/Enterprise Service Bus with Authentication (WebServices, Visualizers, DataAccess Components); access via SQL, SPARQL, and ETL
• DATA: Source Model 1, Source Model 2, Source Model 3, Source Model 4, …; Other Sources; Local Assertions; Top Level Ontology; Provenance; Other Tools; Spreadsheets; RDBMS

CATIE: Overview

• Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE)
  • Was the most comprehensive independent trial ever completed to examine existing antipsychotic therapies for schizophrenia
  • Provides detailed information comparing the effectiveness and side effects of five medications currently used to treat schizophrenia
  • Greatly enhances the knowledge available to guide treatment choices for people with schizophrenia

CATIE: Goals

• Determine whether semantic integration and analysis of the CATIE data set in the context of metabolic and signal transduction pathways with receptor affinities can provide answers to specific scientific questions:
  • Which pathways are associated with response to the 5 different schizophrenia drugs?
  • How do these pathways compare between treatment arms?
  • Which receptors are associated with response to the 5 schizophrenia drugs?
  • How are the pathways, receptors and the drug response genes from the CATIE data set related?

CATIE: Drugs and Data Sets

• CATIE Drugs:
  • Olanzapine
  • Perphenazine
  • Quetiapine
  • Risperidone
  • Ziprasidone
• Datasets:
  • Entrez Gene
  • PubChem
  • Assay (Receptor Affinity Data)
  • KEGG
  • Reactome
  • BioCyc
  • TRANSPATH

CATIE: Architecture

[Figure: data flow; recoverable labels, top to bottom:]
• Inputs: Data in Multiple Formats (flat file, tab-delimited, XML) and a Top-Level Ontology
• RDF integrated using the Jena Programming API
• Stores: AllegroGraph (native RDF triple store) and Oracle 11g (RDF store)
• SPARQL querying is performed against each store

CATIE: Conclusions

• Efficient semantic integration can be accomplished by using RDF
• Powerful complex data modeling can be achieved by using graph principles inherent in RDF
• Easy translation of scientific questions to graph queries using SPARQL and SEM_MATCH
• Customized outputs can easily be generated by making slight changes in the SPARQL query pattern
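
To make the last two points concrete, here is a minimal Python sketch of graph-pattern matching over triples. It is a toy stand-in for a SPARQL engine, not the Jena, AllegroGraph, or Oracle implementation used in the project; the `ex:` predicates and the small data set are illustrative assumptions.

```python
def match(pattern, triple, binding):
    """Return an extended binding if the pattern fits the triple, else None."""
    new_binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it or check the existing binding
            if new_binding.setdefault(term, value) != value:
                return None
        elif term != value:  # constant: must match exactly
            return None
    return new_binding


def query(patterns, triples):
    """Join all triple patterns, returning one binding dict per solution."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Illustrative data only; predicate names are hypothetical.
triples = [
    ("ex:Olanzapine", "ex:bindsReceptor", "ex:HTR2A"),
    ("ex:Risperidone", "ex:bindsReceptor", "ex:DRD2"),
    ("ex:HTR2A", "ex:inPathway", "ex:SerotoninSignaling"),
    ("ex:DRD2", "ex:inPathway", "ex:DopamineSignaling"),
]

# "Which pathways are associated with each drug?"
all_pathways = query(
    [("?drug", "ex:bindsReceptor", "?receptor"),
     ("?receptor", "ex:inPathway", "?pathway")],
    triples,
)

# A slight change to the pattern narrows the output to one pathway.
dopamine_only = query(
    [("?drug", "ex:bindsReceptor", "?receptor"),
     ("?receptor", "ex:inPathway", "ex:DopamineSignaling")],
    triples,
)
```

The second query differs from the first only in replacing the `?pathway` variable with a constant, which is the kind of small pattern change the slide refers to.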

Competitive Intelligence: Overview

• Competitive Intelligence (CI) is the purposeful, ethical and coordinated monitoring of competitors within a specific marketplace in order to:
  • Gain strategic foreknowledge of competitors' plans and recent developments
  • Make calculated, informed business decisions and formulate operational strategy
• Provide a mechanism for actively surveying public information for competitive intelligence in Endocrine

Competitive Intelligence: Goals

• Does such a CI effort significantly benefit from a semantic component?
• Does the project significantly benefit from semantic integration?
• Are there pre-existing ontologies for the company and mechanism of action (MOA) domains?
• Does NLP or text mining work for this kind of data?
• Does “buried” knowledge exist within datasets that can be discovered using inference and reasoning?

Competitive Intelligence: Integration Challenges

[Figure panels: Syntactic Variations; Semantic Variations; Parent – Child Relations]

Syntactic Variations
• Company: Merck & Co; Merck & Co Inc; Merck; Merck & Co Ltd
• MOA: Alpha-glucosidase inhibitor; Glucosidase inhibitor alpha
• MOA: IGF binding protein-3 stimulator; IGF binding protein stimulator-3

Semantic Variations
• Company: Amgen Boulder Inc; Applied Molecular Genetics Inc; Synergen Inc; Amgen
• MOA: Serotonin 2A receptor antagonists; 5-HT 2 receptor antagonist; 5-HT2a antagonist
• MOA: Peroxisome proliferator-activated receptor delta antagonist; PPAR delta antagonist
• MOA: Melanin concentrating hormone receptor 1 antagonists; MCH receptor-1 antagonist

Competitive Intelligence: NLP

Term pairs, listed in order of increasing matching complexity:

Raw Endocrine Data | Terms from Thomson – Pharma
Bayer Corp | Bayer Corp
Dopamine receptor agonist | Dopamine receptor agonist
SGLT inhibitor | SGLT inhibitor
Eli Lilly | Eli Lilly & Co Ltd
STAT transcription factor stimulant | STAT stimulator
Alpha-glucosidase inhibitors | Glucosidase inhibitor-alpha
Peroxisome proliferator-activated receptor delta antagonist | PPAR delta antagonist
5 Hydroxytryptamine 2C agonist | 5HT 1c agonist
Opioid kappa receptor antagonists | Kappa opioid antagonist
Serotonin 1B receptor agonists | 5-HT 1d beta agonist

NLP Methods Used:
• Semantic Normalization
• Fuzzy Distance
• Ignoring Stop Words
• Regular Expressions
• Tokenization
• Rule-based Mapping
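
Several of the listed methods (tokenization with a regular expression, ignoring stop words, rule-based mapping, and a fuzzy distance) can be sketched together in Python. This is a minimal illustration under stated assumptions, not the pipeline used in the project: the stop-word list, the single abbreviation rule, the crude plural stripping, and the choice of `difflib.SequenceMatcher` as the fuzzy measure are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumed stop words for company names; not taken from the project.
STOP_WORDS = {"inc", "co", "corp", "ltd", "and"}

# Hypothetical rule-based mapping from a long form to its abbreviation.
ABBREVIATIONS = {
    "peroxisome proliferator activated receptor": "ppar",
}


def tokenize(term):
    """Lowercase the term and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", term.lower())


def normalize(term):
    """Reduce a term to a canonical, order-independent token string."""
    text = " ".join(tokenize(term))
    for long_form, short_form in ABBREVIATIONS.items():
        text = text.replace(long_form, short_form)
    tokens = []
    for token in text.split():
        if token in STOP_WORDS:
            continue
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]  # crude singularization
        tokens.append(token)
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Fuzzy similarity in [0, 1] between two normalized terms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With these assumptions, pairs that differ only in word order, punctuation, plurals, or company suffixes (for example "Alpha-glucosidase inhibitors" and "Glucosidase inhibitor-alpha") normalize to the same string. Pairs that rely on domain synonyms, such as "Serotonin 1B receptor agonists" and "5-HT 1d beta agonist", would still need curated mappings or an ontology.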

Competitive Intelligence: Data Model

[Figure: RDF graph for one drug record; recoverable labels:]
• Classes (via rdf:type): Drug, MOA, Company, Disease
• Properties: hasDrug, hasMOA, hasStatus, hasCompany, hasTherapeuticArea, hasSubClass, alternativeLabel
• Drug: represented as a blank node, with status Phase 2 and therapeutic area Obesity
• MOA labels: Melanin-concentrating hormone receptor antagonists; GPR-24 antagonist; Melanin concentrating hormone receptor 1 antagonists; MCH 1 antagonists; MCH receptor-1 antagonist
• Company labels: Applied Molecular Genetics Inc; Amgen Boulder; Amgen Inc; Synergen Inc; Abgenix; Avidia Inc

Competitive Intelligence: Inferencing

Given company name: Applied Molecular Genetics Inc
Get the MOAs that this company is working on

PREFIX TLO:
PREFIX xsd:
SELECT DISTINCT ?Endo_MOA ?Company_All_Labels
WHERE {
  ?Company_Res TLO:SynonymousLabels "Applied Molecular Genetics Inc"^^xsd:string .
  ?Company_Res TLO:preferredLabel ?Company_Pref_Label .
  ?Company_Res TLO:SynonymousLabels ?Company_All_Labels .
  ?Drug_Info TLO:hasAssociatedCompany ?Company_All_Labels .
  ?Drug_Info TLO:hasMOA ?Endo_MOA
}









Case 1: without semantic integration and inference: 0 results

Case 2: with semantic integration and inference: 18 results, including:

MOA | Company label
Leptin stimulator | Amgen
Agouti related protein inhibitor | Amgen Inc
Neuropeptide Y antagonist | Amgen
Melanocortin MC4 antagonist | Amgen Inc
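
The difference between the two cases can be reproduced with a small Python sketch. It mimics the shape of the query above (look up the company resource by any synonymous label, expand to all of its labels, then find drugs associated with any of them), but the resource identifiers `company:1`, `drug:1`, and `drug:2` are hypothetical and the data is a two-row excerpt, not the project's data set.

```python
# Toy triples; predicate names follow the query, identifiers are made up.
triples = {
    ("company:1", "SynonymousLabels", "Applied Molecular Genetics Inc"),
    ("company:1", "SynonymousLabels", "Amgen"),
    ("company:1", "SynonymousLabels", "Amgen Inc"),
    ("drug:1", "hasAssociatedCompany", "Amgen"),
    ("drug:1", "hasMOA", "Leptin stimulator"),
    ("drug:2", "hasAssociatedCompany", "Amgen Inc"),
    ("drug:2", "hasMOA", "Agouti related protein inhibitor"),
}


def moas_for(label, expand_synonyms):
    """Return the MOAs of drugs associated with a company label."""
    labels = {label}
    if expand_synonyms:
        # Find company resources carrying this label, then all their labels.
        companies = {s for s, p, o in triples
                     if p == "SynonymousLabels" and o == label}
        labels |= {o for s, p, o in triples
                   if p == "SynonymousLabels" and s in companies}
    drugs = {s for s, p, o in triples
             if p == "hasAssociatedCompany" and o in labels}
    return sorted(o for s, p, o in triples if p == "hasMOA" and s in drugs)


without_integration = moas_for("Applied Molecular Genetics Inc", expand_synonyms=False)
with_integration = moas_for("Applied Molecular Genetics Inc", expand_synonyms=True)
```

No drug record mentions "Applied Molecular Genetics Inc" literally, so the plain lookup returns nothing, while the synonym-expanded lookup reaches the records filed under "Amgen" and "Amgen Inc".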

Competitive Intelligence: Conclusions

• Semantic integration (instance mapping using NLP) coupled with the RDF data model was successful in answering Competitive Intelligence questions
• Ontologies provide a powerful framework of dictionaries and taxonomic relations that supports reasoning and inference over the data for knowledge discovery
• Manual curation is a tedious, error-prone and labor-intensive task
• A semi-automated, computer-based solution that uses ontologies, semantic integration and NLP could drastically reduce the manual curation effort while maintaining high-quality information

Metadata Repository: Goals

• Aggregate experiment metadata from diverse relational databases into an Oracle 11g repository for scientific investigation
• Provide a unified vocabulary for scientific investigation
• Avoid a complex architecture and extended development effort
• Realize benefits in the near term
• Preprocess metadata to improve efficiency
• Characterize the type of questions that the ontology should answer
• Identify stable semantic technologies; do not employ parsers
• Allow semantic and relational databases to work together
• Provide browser, visualization, and query access into the repository

Metadata Repository: Ontology

[Figure: ontology diagram; recoverable labels:]
• Classes: Project, Study, Experiment, Assay, DiseaseState, Protocol, Plate, Well, Chip, Chip Type, Compound, Sample, Model, Tissue, CellLine, Treatment, Reagent, DNA Reagent, RNA Reagent, Protein Reagent, Probe, Gene, GeneList, ViralBatch, ClinicalData, Software, Hardware, MESH, GO
• Properties: hasProject, hasStudy, hasAssay, hasDiseaseState, hasProtocol, hasPlate, hasChip, hasChipType, hasSample, hasModel, hasTissue, hasCellline, hasCompound, hasReagent, hasGene, hasTreatment, hasSource, hasSourceTissue, IsPartOf, hasMESHId, hasGOId, subclass

Metadata Repository: Architecture

• Iterative queries on metadata-defined items
• Metadata and raw data are aggregated to provide additional context for analysis

[Figure: Query and Visualization sit on top of the Experimental Metadata Repository and Annotation Services; recoverable data source labels: Agilent Expression, Affy Expression, Illumina Expression, aCGH, RNAi Screening Results, TMA Database, Mutation Analysis, SNP]

Metadata Repository: Implementation

• Protégé Ontology Editor
• Oracle Semantic Technologies 11g
• D2R Map (Database to RDF Mapping)
• C# development in Visual Studio 2005
• Current data sources include:
  • Expression data: Affymetrix, Illumina, Agilent
  • aCGH data
  • RNAi screening data
  • Reagent data
  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH)
• Currently ~30 million triples

Metadata Repository: Conclusion

• It’s now possible for users to ask questions such as:
  • Get all the interactions for methylases that are involved in colon cancer. For all these genes, get the expression and aCGH values for all LSCDD colon cancer samples
  • Find cell lines in which RNAi data has been generated using Dharmacon reagents
  • Retrieve the antibodies that have been used to assess the AKT1 pathway activity in MCF7
  • Find all the experiments that were done using my sample
  • Find all samples which are grade III colorectal cancer. For these samples, retrieve the expression, mutation and aCGH data

External Collaborations

• RDF Access to Relational Databases - Chris Bizer, Eric Prud'hommeaux
  • Scalability testing of relational to RDF mapping approaches
• End User Semantic Web Authoring - David Karger
  • Enhancing the scalability and robustness of the Exhibit and Potluck tools
• Scientist-Driven Semantic Integration of Knowledge in Alzheimer's Disease - Tim Clark, June Kinoshita
  • Project to develop an integrated knowledge infrastructure for the neuromedical research community, pairing rich digital semantic context with the ever-growing digital scientific content on the web
• Provenance Collection and Management - Carole Goble, Beth Plale
  • Project to develop a metadata taxonomy for global data at Lilly which enables the rapid integration of data and mining/analysis algorithms into dataflows which support clinical and discovery decisions
• W3C’s Health Care and Life Sciences Interest Group

Conclusion

• Semantic Web provides a flexible framework for data integration
• Data integration needs (and issues) abound at Lilly
• Lilly is seeing tangible benefits from the Semantic Web in multiple projects
• Focus on incremental adoption of the technology
• Tools are improving, but more work is needed
• Lilly's use of Semantic Web technology isn’t atypical in health care and life sciences organizations

W3C Semantic Web for Health Care and Life Sciences Interest Group

What is the Mission of HCLS IG?

The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for biological science, translational medicine and health care. These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on the interoperability of information from many domains and processes for efficient decision support.

Task Forces

• Terminology – Semantic Web representation of existing resources
  • Task lead - John Madden
• BioRDF – integrated neuroscience knowledge base
  • Task lead - Kei Cheung
• Linking Open Drug Data – aggregation of Web-based drug data
  • Task lead - Chris Bizer
• Scientific Discourse – building communities through networking
  • Task leads - Tim Clark, John Breslin
• Clinical Observations Interoperability – patient recruitment in trials
  • Task lead - Vipul Kashyap
• Other Projects: Clinical Decision Support, URI Workshop, Collaborations with CDISC & HL7

Terminology Task Force

Task Lead: John Madden
Participants: Chimezie Ogbuji, Helen Chen, Holger Stenzhorn, Mary Kennedy, Xiashu Wang, Rob Frost, Jonathan Borden, Guoqian Jiang

Terminology: Overview

• Goal is to identify use cases and methods for extracting Semantic Web representations from existing, standard medical record terminologies, e.g. UMLS
• Methods should be reproducible and, to the extent possible, not lossy
• Identify and document issues along the way related to identification schemes and the expressiveness of the relevant languages
• Initial effort will start with SNOMED-CT and UMLS Semantic Networks and focus on a particular sub-domain (e.g. pharmacological classification)

BioRDF Task Force

Task Lead: Kei Cheung
Participants: Scott Marshall, Eric Prud’hommeaux, Susie Stephens, Andrew Su, Steven Larson, Huajun Chen, TN Bhat, Matthias Samwald, Erick Antezana, Rob Frost, Ward Blonde, Holger Stenzhorn, Don Doherty

BioRDF: Answering Questions

Goals: Get answers to questions posed to a body of collective knowledge in an effective way

Knowledge used: Publicly available databases, and text mining

Strategy: Integrate knowledge using careful modeling, exploiting Semantic Web standards and technologies

BioRDF: Looking for Targets for Alzheimer’s

• Signal transduction pathways are considered to be rich in “druggable” targets
• CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease
• Casting a wide net, can we find candidate genes known to be involved in signal transduction and active in Pyramidal Neurons?

Source: Alan Ruttenberg

BioRDF: Integrating Heterogeneous Data



[Figure: integrated data sources: PDSPki, Gene Ontology, Reactome, NeuronDB, BAMS, Antibodies, Allen Brain Atlas, BrainPharm, Entrez Gene, MESH, Literature, PubChem, Mammalian Phenotype, SWAN, AlzGene, Homologene]

Source: Susie Stephens

BioRDF: SPARQL Query









Source: Alan Ruttenberg

BioRDF: Results: Genes, Processes

Gene, ID | Process
DRD1, 1812 | adenylate cyclase activation
ADRB2, 154 | adenylate cyclase activation
ADRB2, 154 | arrestin mediated desensitization of G-protein coupled receptor protein signaling pathway
DRD1IP, 50632 | dopamine receptor signaling pathway
DRD1, 1812 | dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 | dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 | G-protein coupled receptor protein signaling pathway
GNG3, 2785 | G-protein coupled receptor protein signaling pathway
GNG12, 55970 | G-protein coupled receptor protein signaling pathway
DRD2, 1813 | G-protein coupled receptor protein signaling pathway
ADRB2, 154 | G-protein coupled receptor protein signaling pathway
CALM3, 808 | G-protein coupled receptor protein signaling pathway
HTR2A, 3356 | G-protein coupled receptor protein signaling pathway
DRD1, 1812 | G-protein signaling, coupled to cyclic nucleotide second messenger
SSTR5, 6755 | G-protein signaling, coupled to cyclic nucleotide second messenger
MTNR1A, 4543 | G-protein signaling, coupled to cyclic nucleotide second messenger
CNR2, 1269 | G-protein signaling, coupled to cyclic nucleotide second messenger
HTR6, 3362 | G-protein signaling, coupled to cyclic nucleotide second messenger
GRIK2, 2898 | glutamate signaling pathway
GRIN1, 2902 | glutamate signaling pathway
GRIN2A, 2903 | glutamate signaling pathway
GRIN2B, 2904 | glutamate signaling pathway
ADAM10, 102 | integrin-mediated signaling pathway
GRM7, 2917 | negative regulation of adenylate cyclase activity
LRP1, 4035 | negative regulation of Wnt receptor signaling pathway
ADAM10, 102 | Notch receptor processing
ASCL1, 429 | Notch signaling pathway
HTR2A, 3356 | serotonin receptor signaling pathway
ADRB2, 154 | transmembrane receptor protein tyrosine kinase activation (dimerization)
PTPRG, 5793 | transmembrane receptor protein tyrosine kinase signaling pathway
EPHA4, 2043 | transmembrane receptor protein tyrosine kinase signaling pathway
NRTN, 4902 | transmembrane receptor protein tyrosine kinase signaling pathway
CTNND1, 1500 | Wnt receptor signaling pathway

Callout: Many of the genes are related to AD through gamma secretase (presenilin) activity

Source: Alan Ruttenberg

LODD Task Force

Task Lead: Chris Bizer
Participants: Anja Jentzsch, Kristin Tolle, Eric Prud’hommeaux, Don Doherty, Susie Stephens, Bosse Andersson, Scott Marshall, Glen Newton, Michel Dumontier, TN Bhat, Oktie Hassanzadeh

LODD: Introduction

Use Semantic Web technologies to
1. publish structured data on the Web
2. set links from data in one data source to data in other data sources

[Figure: Linked Data Browsers, Linked Data Mashups and Search Engines sit on top of data sources A to E, whose “Thing” resources are connected across sources by typed links]

Source: Chris Bizer
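
The idea of typed links between data sources can be illustrated with a short Python sketch that merges triples from three sources and follows only selected link types. The source prefixes `a:`, `b:`, `c:`, the resource names, and the predicates other than `owl:sameAs` are hypothetical placeholders, not identifiers from the LODD data sets.

```python
from collections import deque


def reachable(start, triples, link_types):
    """Breadth-first traversal that follows only the given typed links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in triples:
            if s == node and p in link_types and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen


# Three hypothetical data sources, each publishing its own triples.
source_a = [("a:trial1", "a:intervention", "b:drugX")]
source_b = [("b:drugX", "owl:sameAs", "c:drugX"),
            ("b:drugX", "b:label", "Drug X")]
source_c = [("c:drugX", "c:indication", "c:diseaseY")]

merged = source_a + source_b + source_c

linked = reachable(
    "a:trial1",
    merged,
    link_types={"a:intervention", "owl:sameAs", "c:indication"},
)
```

Starting from a trial in source A, the traversal reaches the matching drug in source B, its equivalent in source C, and that drug's indication, without any of the sources having to share a schema.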

LODD: Potential Links between Data Sets









Source: Chris Bizer

LODD: Data Set Evaluation









Source: Chris Bizer

LODD: Potential questions to answer

• Physicians and Pharmacists
  • What are alternative drugs for a given indication (disease)?
  • What are equivalent drugs (generic version of a brand name, or the chemical name of an active ingredient)?
  • Are there ongoing clinical trials for a drug?
• Patients
  • What background information is available about a drug?
  • What are the contraindications of a drug?
  • Which alternative drugs are available?
  • What are the results of clinical trials for a drug?
• Pharmaceutical Companies
  • What are other companies with drugs in similar areas?
  • Which companies have a similar therapeutic focus?

Source: Chris Bizer

LODD: Linked Version of ClinicalTrials.gov



• Total number of triples: 6,998,851
• Number of trials: 61,920
• RDF links to other data sources: 177,975
• Links to:
  • DBpedia and YAGO (from interventions and conditions)
  • GeoNames (from locations)
  • Bio2RDF.org's PubMed (from references)

Source: Chris Bizer

LODD: Mashing Clinical Trials and Geo









[Figure: mashup of clinical trial data with geographic data; recoverable labels: Classification of Places, Geo Coordinates]

Source: Chris Bizer

Scientific Discourse Task Force

Task Leads: Tim Clark, John Breslin
Participants: Uldis Bojars, Paolo Ciccarese, Sudeshna Das, Ronan Fox, Tudor Groza, Christoph Lange, Matthias Samwald, Elizabeth Wu, Holger Stenzhorn, Marco Ocana, Kei Cheung, Alexandre Passant

Scientific Discourse: Overview









Source: Tim Clark

Scientific Discourse: Goals

• Provide a Semantic Web platform for scientific discourse in biomedicine
• Linked to
  – key concepts, entities and knowledge
• Specified
  – by ontologies
• Integrated with
  – existing software tools
• Useful to
  – Web communities of working scientists

Source: Tim Clark

Scientific Discourse: Some Parameters

• Discourse categories: research questions, scientific assertions or claims, hypotheses, comments and discussion, and evidence
• Biomedical categories: genes, proteins, antibodies, animal models, laboratory protocols, biological processes, reagents, disease classifications, user-generated tags, and bibliographic references
• Driving biological project: cross-application of discoveries, methods and reagents in stem cell, Alzheimer and Parkinson disease research
• Informatics use cases: interoperability of web-based research communities with (a) each other, (b) key biomedical ontologies, (c) algorithms for bibliographic annotation and text mining, and (d) key resources

Source: Tim Clark

Scientific Discourse: SWAN+SIOC

• SIOC
  • Represents activities and contributions of online communities
  • Integration with blogging, wiki and CMS software
  • Use of existing ontologies, e.g. FOAF, SKOS, DC
• SWAN
  • Represents scientific discourse (hypotheses, claims, evidence, concepts, entities, citations)
  • Used to create the SWAN Alzheimer knowledge base
  • Active beta participation of 144 Alzheimer researchers
  • Ongoing integration into SCF Drupal toolkit

Source: Tim Clark

COI Task Force

Task Lead: Vipul Kashyap
Participants: Eric Prud’hommeaux, Helen Chen, Jyotishman Pathak, Rachel Richesson, Holger Stenzhorn

COI: Bridging Bench to Bedside

• How can existing Electronic Health Records (EHR) formats be reused for patient recruitment?
• Quasi-standard formats for clinical data:
  • HL7/RIM/DCM – healthcare delivery systems
  • CDISC/SDTM – clinical trial systems
• How can we map across these formats?
  • Can we ask questions in one format when the data is represented in another format?

Source: Holger Stenzhorn

COI: Use Case

Pharmaceutical companies pay a lot to test drugs
Pharmaceutical companies express protocol in CDISC

-- precipitous gap --

Hospitals exchange information in HL7/RIM
Hospitals have relational databases

Source: Eric Prud’hommeaux

Inclusion Criteria



Type 2 diabetes on diet and exercise therapy or monotherapy with metformin, insulin secretagogue, or alpha-glucosidase inhibitors, or a low-dose combination of these at 50% maximal dose. Dosing is stable for 8 weeks prior to randomization.

?patient takes metformin .

Source: Holger Stenzhorn

Exclusion Criteria



Use of warfarin (Coumadin), clopidogrel (Plavix) or other anticoagulants.

?patient doesNotTake anticoagulant .

Source: Holger Stenzhorn

Criteria in SPARQL

?medication1 sdtm:subject ?patient ;
             spl:activeIngredient ?ingredient1 .
?ingredient1 spl:classCode 6809 .      # metformin
OPTIONAL {
  ?medication2 sdtm:subject ?patient ;
               spl:activeIngredient ?ingredient2 .
  ?ingredient2 spl:classCode 11289 .   # anticoagulant
}
FILTER (!BOUND(?medication2))

Source: Holger Stenzhorn
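
The OPTIONAL plus FILTER (!BOUND(...)) idiom in the query above expresses negation: keep patients who match the inclusion pattern and for whom no excluded medication can be found. A minimal Python sketch of the same logic follows; the class codes are the ones shown on the slide, while the patient identifiers and the flat record layout are hypothetical.

```python
METFORMIN = 6809        # class code used for the inclusion criterion
ANTICOAGULANT = 11289   # class code used for the exclusion criterion

# Hypothetical medication records: one row per patient and ingredient class.
medications = [
    {"patient": "p1", "class_code": METFORMIN},
    {"patient": "p2", "class_code": METFORMIN},
    {"patient": "p2", "class_code": ANTICOAGULANT},
    {"patient": "p3", "class_code": ANTICOAGULANT},
]


def eligible_patients(records):
    """Patients on metformin for whom no anticoagulant record exists."""
    on_metformin = {r["patient"] for r in records
                    if r["class_code"] == METFORMIN}
    on_anticoagulant = {r["patient"] for r in records
                        if r["class_code"] == ANTICOAGULANT}
    # Set difference plays the role of OPTIONAL + FILTER(!BOUND(...)).
    return sorted(on_metformin - on_anticoagulant)


eligible = eligible_patients(medications)
```

Like the SPARQL version, this treats the absence of a record as "does not take", which is only as reliable as the completeness of the underlying medication data.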

Getting Involved

• Benefits of getting involved include:
  • early access to use cases and best practices
  • influence on standards recommendations
  • cost-effective exploration of new technology through collaboration
• Get involved by contacting the chairs:
  • team-hcls-chairs@w3.org


