Spreading Semantics Over
Biology
Phillip Lord
Newcastle University
Overview
• Conclusions
• Data Integration in ComparaGRID
• Annotation in CARMEN and CISBAN
• Computing with Semantics
• The future
Conclusions
• Thin Semantics is Good
• More Semantics is Better
• Shared Semantics is Wonderful
Key Problems
• Scalability
– Both in technology and processes
• Usability
• Autonomy
Methods for Data Integration
– Combining data from multiple, autonomous
data sources.
• TAMBIS
– ontology driven mediation of querying
• EcoCyc
– ontology driven schema for warehousing
• BioPAX
– ontology defined interchange format.
• More recently, ComparaGRID
ComparaGRID
• 6 Investigators
• 5 Researchers
• Sites: Roslin, Newcastle, Manchester, Cambridge, John Innes, NCYC
• Commenced: 2003
ComparaGRID’s Problem Domain
Many Model Organism Databases
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Data Models, Model Data
Databases and Knowledge
database domain ontology
Sequence
Molecule Representation
SequenceRecord DNA SequenceRepresentation
id length seqString
S_hasID S_hasLength S_hasSeqStr
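The mapping between database columns and ontology properties shown on this slide can be sketched in a few lines. This is only a minimal illustration, assuming a database row is available as a dictionary; the function name `row_to_triples`, the subject naming scheme and the sample row are hypothetical, while the column names (id, length, seqString) and property names (S_hasID, S_hasLength, S_hasSeqStr) come from the slide.

```python
# Hypothetical sketch: lift one SequenceRecord row into ontology-style triples.
# Column and property names are taken from the slide; everything else is assumed.
COLUMN_TO_PROPERTY = {
    "id": "S_hasID",
    "length": "S_hasLength",
    "seqString": "S_hasSeqStr",
}


def row_to_triples(row):
    """Return (subject, property, value) triples for one database row."""
    subject = f"SequenceRecord/{row['id']}"  # assumed naming scheme
    return [
        (subject, COLUMN_TO_PROPERTY[column], row[column])
        for column in COLUMN_TO_PROPERTY
        if column in row
    ]


# Invented sample row, for illustration only.
example_row = {"id": "rec1", "length": 10, "seqString": "acatttctac"}
triples = row_to_triples(example_row)
```

Keeping the column-to-property table as data rather than code means a schema change only touches the mapping, which is the kind of separation the slide's diagram suggests.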
The Fluxion Stack
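[Diagram: four layers, Raw data, Syntax, Semantics and Aggregation, connected by JDBC, OWL and OWL; components are the Raw data service, Pub service, Trans service and integrator; arrows labelled data and query]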
The difficulties
• The Cost of Integration
– building ontologies is often hard
• The Cost of Managing Change
– biological knowledge tends to undergo a lot of
flux
• The Scalability of Expressive Ontologies.
Getting the Semantics Upfront
• Instead of annotating heterogeneous data
sources after the event, why not do so
upfront?
• Originators of the data are likely to
understand it best.
• Spreads the cost among those
contributing.
CARMEN
Code, Analysis, Repository and
Modelling for e-Neuroscience
www.carmen.org.uk
Engineering and Physical
Sciences Research Council
Consortium & Profile
• £4M over 4 years
• 20 Investigators
• Sites: Stirling, St. Andrews, Newcastle, York, Manchester, Sheffield, Leicester, Cambridge, Warwick, Plymouth, Imperial
• Commenced 1st October 2006
Virtual Laboratory for Neurophysiology
• Enabling sharing and
collaborative exploitation of
data, analysis code and
expertise that are not
physically collocated
The need for clear metadata
• Most neuroscience data is relatively simple in structure
• But often contextually complex
• Sometimes associated with behavioural features
How do we represent…
In silico Analysis
Derived data
Laboratory
Experiments
Functional Genomics Experiment
(FuGE)
• Model of common components in science
investigations, such as materials, data, protocols,
equipment and software.
• Provides a framework for capturing complete
laboratory workflows, enabling the integration of
pre-existing data formats.
Re-use
Brain anatomy Taxonomy
BIRNLex, FMA NCBI Taxonomy
CARMEN
Sample preparation
sepCV
What we need – lab based
Age/stage development Subject preparation
Experiment process CARMEN Subject training
Equipment Subject stimulus
Subject task
What we need – In silico
File formats Data structures
CARMEN
Statistics Algorithms
Software
Align with OBI
Ontology for Biomedical Investigations
• Aims to provide an ontology for the life sciences
• Consortium of 15 communities, from crop
science to neuroscience
• CARMEN will align and contribute to OBI
The Difficulties
• Even with a lot of pre-existing work there is a lot to describe
• OBI has 15 communities involved in it
– Bio-Imaging: Jeff Grethe, Biomedical Informatics Research Network (BIRN) Coordinating Center, University of California, San Diego (Winter 2007); Daniel Rubin, Radiological Society of North America (RSNA), National Center for Biomedical Ontology at Stanford Medical Informatics and the Department of Radiology, Stanford University (Winter 2007); Bill Bug, Biomedical Informatics Research Network (BIRN), Laboratory of Bioimaging and Anatomical Informatics, in the Department of Neurobiology and Anatomy, Drexel University College of Medicine (Spring 2006)
– Cellular Assays: Stefan Wiemann, DKFZ
– Clinical Investigations: Jennifer Fostel, Clinical Trial Ontology, NIEHS, National Institute for Environmental Health Sciences (Spring 2004); Tina Hernandez-Boussard, Department of Genetics, Stanford Medical School (Fall 2007)
– Crop Sciences: Richard Bruskiewich, Generation Challenge Programme, IRRI
– Electrophysiology: Frank Gibson, CARMEN, School of Computing Science, Newcastle University (Spring 2007)
– Environmental Omics: Norman Morrison, NERC Environmental Bioinformatic Centre and School of Computer Science, The University of Manchester (Spring 2004)
– Flow Cytometry: Ryan Brinkman, ISAC and FICCS, British Columbia Cancer Research Center and University of British Columbia in the Department of Medical Genetics, Vancouver, BC, Canada (Spring 2004)
– Genomics/Metagenomics: Dawn Field, Genome Catalogue, NERC Centre for Ecology and Hydrology (Winter 2005); Tanya Gray (Winter 2005)
– Immunology: Richard Scheuermann, ImmPort, FICCS, BioHealthBase, University of Texas Southwestern Medical Center, in the Department of Pathology and Division of Biomedical Informatics (Spring 2006); Bjoern Peters, Immune Epitope Database and Analysis Resource, La Jolla Institute for Allergy and Immunology (Spring 2006)
– In Situ Hybridization and Immunohistochemistry: Eric Deutsch, MISFISHIE
– Metabolomics: Susanna Sansone, MSI, The European Bioinformatics Institute EBI-EMBL, NET Project (Spring 2004); Daniel Schober (Spring 2006)
– Neuroinformatics: Bill Bug, Biomedical Informatics Research Network (BIRN), Laboratory of Bioimaging and Anatomical Informatics, in the Department of Neurobiology and Anatomy, Drexel University College of Medicine (Spring 2006); Frank Gibson, CARMEN, School of Computing Science, Newcastle University (Spring 2007)
– Nutrigenomics: Philippe Rocca-Serra, RSBI, The European Bioinformatics Institute EBI-EMBL, NET Project (Spring 2004)
– Polymorphism: Tina Hernandez-Boussard, PharmGKB, Department of Genetics, Stanford Medical School (Winter 2006, Fall 2007)
– Proteomics: Susanna Sansone, PSI, The European Bioinformatics Institute EBI-EMBL, NET Project (Spring 2004); Daniel Schober (Spring 2006); Luisa Montecchi, The European Bioinformatics Institute EBI-EMBL (Spring 2006); Chris Taylor; Trish Whetzel (Spring 2004); Frank Gibson, School of Computing Science, Newcastle University (Spring 2007)
– Toxicogenomics: Jennifer Fostel, Toxicogenomics, NIEHS, National Institute for Environmental Health Sciences (Spring 2004); Susanna Sansone, RSBI, The European Bioinformatics Institute EBI-EMBL, NET Project (Spring 2004)
– Transcriptomics: Susanna Sansone, MGED, The European Bioinformatics Institute EBI-EMBL, NET Project (Spring 2004); Philippe Rocca-Serra (Spring 2004); Trish Whetzel (Spring 2004); Chris Stoeckert, Department of Genetics and Center for Bioinformatics, University of Pennsylvania (Spring 2004); Gilberto Fragoso, NCI Center for Bioinformatics (Spring 2004); Joe White; Helen Parkinson, The European Bioinformatics Institute EBI-EMBL (Spring 2004); Mervi Heiskanen; Liju Fan, Ontology Workshop, LLC, Columbia, MD, USA (Spring 2004); Helen Causton, Imperial College (Spring 2004)
Information Extraction
• More semantics is better?
• How do we extract the information?
http://en.wikipedia.org/wiki/Image:Brain_090407.jpg
Centre for Integrated Systems
Biology of Ageing and
Nutrition (CISBAN)
Identification of novel interactions between nutrition and damage
using automated yeast screening and analysis
Screen mutants for sensitivity
to damage/nutrition
[Figure: robot-handled screen under four conditions. ‘Folate’: +, -, +, -; ‘MMS’: -, -, +, +]
• Data curation.
• Functional analysis.
• Interactions with in silico
programme.
Reference set of 5,000
mutant strains
CISBAN dataflow
Data Entry with SYMBA
http://symba.sourceforge.net/
CARMEN and CISBAN
• We can provide more semantics upfront
• This should make data more explicit
• If we still need to integrate, it should be easier.
• Like much of biology, these projects are largely
using structurally simple, non-SW-based
technologies.
• This is a lot of effort to go to; what do we hope to
gain?
Yeast Hub
YeastHub: a semantic web use
case for integrating data in the life
sciences domain
Kei-Hoi Cheung, Kevin Y. Yip, Andrew Smith,
Remko deKnikker, Andy Masiar and Mark
Gerstein
doi:10.1093/bioinformatics/bti1026
A rapturous reception
• So the general idea is take a bunch of data,
convert it to RDF, dump it into a RDF triple store
[…] to discover interesting things ?
– http://www.nodalpoint.org/user/greg
• Putting a lot of RDF in a bucket isn’t integration.
Not unless the RDF is the same schema and
using the same concepts
– Carole Goble, University of Manchester
A thin layer of semantics.
• Inverse Document Frequency (IDF) is a term-weighting method used to rank documents;
rare words carry more information than common ones.
• In this case, YeastHub has a common semantics describing the type
of document.
• “protein” or “sequence” occur a lot in UniProt, but less in the bulk
corpus
• Rather than treating all documents equally, they use IDF twice.
• Leveraging Biological Identifier Relationships and Related Documents to Enhance Information
Retrieval for Proteomics -- Smith et al., 10.1093/bioinformatics/btm452 – Bioinformatics
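The intuition that rare words carry more information can be made concrete with the textbook IDF formula, log(N / n_t), where N is the number of documents and n_t the number containing the term. The sketch below is a generic illustration only and does not reproduce the two-level scheme of Smith et al.; the toy corpus and the function name are invented.

```python
import math


def inverse_document_frequency(term, documents):
    """Textbook IDF: log(N / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # convention chosen for this sketch: unseen terms score 0
    return math.log(len(documents) / containing)


# Toy corpus, invented for illustration: each document is a set of words.
corpus = [
    {"protein", "sequence", "kinase"},
    {"protein", "sequence", "phosphatase"},
    {"protein", "yeast"},
    {"folate", "yeast"},
]

idf_protein = inverse_document_frequency("protein", corpus)  # common word
idf_folate = inverse_document_frequency("folate", corpus)    # rare word
```

Computing the same quantity separately within one document type and across the bulk corpus would give two different weights for a word such as “protein”, which is one plausible reading of using IDF twice.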
Thin Semantics
• The semantics of YeastHub is not deep.
• But even a thin layer of semantics is
useful.
• If we modify our technologies to use it.
• A large part of library sciences has been
encoded in 15 tags – Dublin Core
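To show how thin such a vocabulary is, the sketch below lists the 15 elements of the Dublin Core Metadata Element Set and checks a record against them. The record, its values and the helper name `unknown_elements` are hypothetical and only serve as illustration.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def unknown_elements(record):
    """Return the keys of a metadata record that are not Dublin Core elements."""
    return sorted(set(record) - DUBLIN_CORE_ELEMENTS)


# Hypothetical record describing a dataset; all values are invented.
record = {
    "title": "Yeast screen plate images",
    "creator": "Example Lab",
    "date": "2007-01-01",
    "colour": "blue",  # deliberately not a Dublin Core element
}
bad = unknown_elements(record)
```

Even a check this simple only works because everyone agrees on the same small set of names, which is the point of the slide.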
Using Ontology to Classify
Members of a Protein Family
• Katy Wolstencroft (Bioinformatics)
• Daniele Turi (Instance Store)
• Phil Lord (myGrid)
• Lydia Tabernero (Protein Scientist)
• Matt Horridge, Nick Drummond et al (Protégé OWL)
• Andy Brass and Robert Stevens (Bioinformatics)
The Protein Phosphatases
• A large superfamily of proteins
• Motifs determine a protein’s place within
the family
• Recognising that motifs imply class
membership is normally manual
• Can these be captured in an ontology?
Phosphatase Functional
Domains
Andersen et al (2001) Mol. Cell. Biol. 21 7117-36
Definition of Tyrosine
Phosphatase
Class TyrosineReceptorProteinPhosphatase
EquivalentTo: Protein That
- (contains atLeast-1 ProteinTyrosinePhosphataseDomain) and
- (contains 1 TransmembraneDomain)
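The definition above can be read as a test on a protein's domain composition. The following is a rough sketch of that reading in plain Python, assuming that "contains 1 TransmembraneDomain" means exactly one. It is a closed-world check over a known list of domains, not what an OWL reasoner does under open-world semantics, and the example domain lists are invented.

```python
from collections import Counter


def is_receptor_tyrosine_phosphatase(domains):
    """Check the slide's definition against a list of domain names.

    Assumption: 'contains 1 TransmembraneDomain' is read as exactly one.
    """
    counts = Counter(domains)
    return (
        counts["ProteinTyrosinePhosphataseDomain"] >= 1
        and counts["TransmembraneDomain"] == 1
    )


# Hypothetical domain compositions, invented for illustration.
receptor_like = [
    "TransmembraneDomain",
    "ProteinTyrosinePhosphataseDomain",
    "ProteinTyrosinePhosphataseDomain",
]
cytoplasmic_like = ["ProteinTyrosinePhosphataseDomain"]
```

Here `is_receptor_tyrosine_phosphatase(receptor_like)` holds and `is_receptor_tyrosine_phosphatase(cytoplasmic_like)` does not; in the ontology the same conditions are stated once, declaratively, and the reasoner does the classification.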
Classifying Proteins
>uniprot|Q15262|PTPK_HUMAN Receptor-type protein-tyrosine
phosphatase kappa precursor (EC 3.1.3.48) (R-PTP-kappa).
MDTTAAAALPAFVALLLLSPWPLLGSAQGQFSAGGCTFDDGPGACDYHQDLYDDFEWVHV
SAQEPHYLPPEMPQGSYMIVDSSDHDPGEKARLQLPTMKENDTHCIDFSYLLYSQKGLNP
GTLNILVRVNKGPLANPIWNVTGFTGRDWLRAELAVSSFWPNEYQVIFEAEVSGGRSGYI
AIDDIQVLSYPCDKSPHFLRLGDVEVNAGQNATFQCIATGRDAVHNKLWLQRRNGEDIPV………..
InterPro
Codify
Translate
Instance Store
Reasoner
Results
• Human phosphatases have been classified using the system
• The ontology system refined classification
- DUSC contains zinc finger domain
characterised and conserved – but not in classification
- DUSA contains a disintegrin domain
previously uncharacterised – evolutionarily conserved
• We have automated a part of the scientific process
– We have defined our domain model in a computational form
– We have collected some data
– We have let the reasoner test whether the model fits the data
• The semantics here are deeper than with YeastHub, which allows us to reason
Summary
• Ontologies have been used in life sciences
for data integration
• Increasingly, they are being used to describe
the data early in the scientific process
• Even thin semantics can be exploited for
information retrieval
• Richer semantics allows more use of
computational inference
Richer Expressivity
• There are applications of more expressive
semantics
• Can we move from specific software to
generic software with specific knowledge
models?
• But scalability and usability remain the
bottlenecks
Industrialisation
• Semantics in the life sciences is moving
from small to large scale
– building ontologies has now become very
committee driven
– we don’t understand ontology engineering as
well as we do software engineering
– Encapsulation, modularisation, continuous
integration.
Future
• ComparaGRID has semantics describing
schemas, which means data integration can
happen on the fly.
• Death to data warehouses!
• CARMEN and CISBAN are gathering
semantically enriched data in the first
place. An End to Integration!
• Semantics during dissemination
• Knowledge for All.
Acknowledgements
The ComparaGRID consortium is Madhuchhanda Bhattacharjee, Richard Boys,
Tony Burdett, Rob Davey, Jo Dicks, David Marshall, Andy Law, Phillip
Lord, Trevor Paterson, Matthew Pocock, Peter Rice, Ian Roberts, Robert
Stevens, Paul Watson, Darren Wilkinson, Neil Wipat and Andy Gibson
CISBAN is Tom Kirkwood (PI), Thomas von Zglinicki (PI), David Lydall (PI),
Anil Wipat (PI), Stephen Addinall (Research Associate), Suzanne Advani
(Technician), Kim Clugston (Research Associate), Sharon Denley (PA to
Professor Tom Kirkwood), Amanda Greenall (Research Associate), Jennifer
Hallinan (Research Associate), Dominic Kurian (Research Associate),
Conor Lawless (Research Associate), Guiyuan Lei (Research Associate),
Allyson Lister (Research Associate), Mandy Maddick (Research Associate),
Satomi Miwa (Research Associate), Glyn Nelson (Research Associate),
Bob Nicholson (Superintendent), Sharon Oljslagers (Technician), Joao
Passos (Research Associate), Carole Proctor (Research Associate), Daryl
Shanley (Research Associate), Oliver Shaw (Research Associate), Donna
Stark (Research Secretary), Laura Steedman (Technician), Joyce Wang
(Technician), Darren Wilkinson (Professor of Stochastic Modelling)
CARMEN Acknowledgements
Professor Colin Ingram, Professor Jim Austin, Professor Leslie Smith, Professor
Paul Watson, Dr. Stuart Baker, Professor Roman Borisyuk, Dr. Stephen Eglen,
Professor Jianfeng Feng, Dr. Kevin Gurney, Dr. Tom Jackson, Dr. Marcus Kaiser, Dr.
Phillip Lord, Dr. Paul Overton, Dr. Stefano Panzeri, Dr. Rodrigo Quian Quiroga, Dr.
Simon Schultz, Dr. Evelyne Sernagor, Dr. V. Anne Smith, Dr. Tom Smulders,
Professor Miles Whittington, Christoph Echtermeyer, Martyn Fletcher,
Frank Gibson, Mark Jessop, Dr. Bojian Liang, Juan Martinez-Gomez,
Dr. Chris Mountford, Agah Ogungboye, Georgios Pitsilis, Dr. Daniel Swan
Holiday Pictures
Questions?