Image Driven Ontology Editing (The Morphster Project)
Image Driven Ontology Editing
(The Morphster Project)
Professor Daniel P. Miranker
Department of Computer Sciences
Institute of Cell and Molecular Biology
University of Texas at Austin
Austin, Texas, USA
http://cs.utexas.edu/~miranker
Biology Collaborators: Professor Timothy Rowe, Dr. Julian Humphries, Kerin Claeson
CS Students: Sapna Bafna, Rui Mao, Hamid Tirmizi
Scope of the talk
• Application Context:
– Cyberinfrastructure for the NSF ATOL Grand Challenge,
with respect to image-based information
– Technology Integration
• LSIDs
• SOA - Image related services
• OBO/OWL Semantic Web
• [Computer] Scientific issues arose very quickly
– Semantics of Images… we have none.
2
NSF Grand Challenge
“Assembly of a framework phylogeny, or Tree of Life, for all 1.7 million described species…”
“Phylogeny, the genealogical map for all lineages of life on earth, provides an overall framework to facilitate information retrieval and biological prediction”
“This is the overall goal of the Assembling the Tree of Life activity” [NSF]
3
An Algorithmic Focus on Tree Reconstruction From
Sequences
• Computational Problem
– Central
– Shared among groups
[Tree figure with leaves T1 T2 T3 T4 T5 T6 T7]
Character matrix (rows are taxa, columns are characters):
Taxa  C1  C2  C3  C4  C5  C6
T1    A   C   G   T   T   A
T2    C   C   C   T   T   A
T3    G   C   G   G   T   A
…
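As a rough illustration of why the matrix is the shared computational input, the sketch below stores the three rows shown on the slide and counts the characters at which each pair of taxa differs. Mismatch counting is only one possible first step toward a tree; the slide does not say which reconstruction method is used, and the names `matrix`, `differences`, and `pairwise` are made up for this example.

```python
from itertools import combinations

# Character matrix from the slide: one string of states (C1..C6) per taxon.
matrix = {
    "T1": "ACGTTA",
    "T2": "CCCTTA",
    "T3": "GCGGTA",
}

def differences(a: str, b: str) -> int:
    """Count the characters at which two taxa have different states."""
    return sum(x != y for x, y in zip(a, b))

# Pairwise mismatch counts, keyed by (taxon, taxon).
pairwise = {
    (s, t): differences(matrix[s], matrix[t])
    for s, t in combinations(sorted(matrix), 2)
}

for pair, count in pairwise.items():
    print(pair, count)
```

A distance-based method would start from a table like `pairwise`; character-based methods work on the matrix columns directly.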
4
Morphological Formulation
Thanks, copyrighted Berkeley Material
5
Morphological Formulation (finer scale)
6
Central Data Model
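[Diagram: character states and taxa form a matrix; a tree reconstruction algorithm turns the matrix into trees]
Study set-up: character states, taxa, matrix
Study result: trees
An instance = a Treebase record (Treebase == Genbank for phylogenetic trees)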
7
Legacy Challenge
[Diagram: character states and taxa form a matrix; a tree reconstruction algorithm turns the matrix into trees]
In the matrix representation, character states are coded as integers:
• No data provenance
• No metadata
• No semantic content
• (Treebase does have a lot of comment fields)
This might be acceptable if the tree were the end, but it is a means.
8
Systematics’ Workflow
1. Collect Field Specimens
2. Specimen Collections
• Collection Management
• Collection Data
3. Study Creation
• choose a selection of specimens
• determine features
– choice of genes (alignments)
– choice of morphological feature
• results in a matrix
4. Phylogenetic Reconstruction
• Treebase (public archive)
5. Interpretation
• Geographic distribution (biodiversity)
• Speciation
• Gene function
9
Scientists Have Their Groups (by organism type)
Character source databases: Genbank, Swissprot
• Don’t agree on names of the taxa
• Don’t even agree on the number of bytes to represent the string
Groups: Bacteria, Dinosaurs, Plants 1, Plants 2, …
[Diagram: each group has its own character states, taxa, matrix, reconstruction efforts, and trees]
10
Where Does Morphster Fit In?
1. Collect Field Specimens
2. Specimen Collections
• Collection Management
• Collection Data
3. Morphster
• choose a selection of specimens
• collect imagery
• support image annotation
• create a matrix
4. Phylogenetic Reconstruction
• Treebase (public archive)
11
Initially Three Promises
1. Address character state richness issue
A. Using GUIDs (LSIDs)
B. Suggest a standard set of services on image and character sources.
2. Source critical nomenclature from authoritative sources
A. Taxonomic names
B. Nomina Anatomica (N.A.)
– Agreed-upon exemplar anatomies for a group
– Appearing as ontologies
3. Record observations as image annotations in a database.
12
Morphster Dataflow
Inputs to Morphster:
• uBio: Catalog of Taxonomic Names
• UTCT: Image Database (one of many)
• ZFIN: Anatomical Ontology (one of many)
Output of Morphster: matrix
13
uBio (Universal Biological Indexer and Organizer)
– Manages information about organisms
– Attempts to solve naming problems
– Provides globally unique Life Science
Identifiers (LSID) to taxa
– Web service that serves RDF
Acalyptus
uBio.org
14
Annotated Specimen Proxies
Giant Hummingbird Skull (Patagona gigas)
DigiMorph.org
15
Agnostic of Organism Group
Primary concern is imagery
Rosa minutifolia
• uBio: Catalog of Taxonomic Names
• UTCT: Image Database (one of many)
• ZFIN: Anatomical Ontology
• Treebase: no image capacity
16
LSID as Surrogate (pointer) in Distributed Data Structure
• For each biological data set of interest,
Assign an LSID
• Properties of LSIDs
1. Protocol to locate data.
2. Data never changes
3. Data is never deleted
17
Treebase: No image storage, use LSIDs instead
• uBio (Catalog of Taxonomic Names): L2: Rosa minutifolia
• UTCT Image Database (one of many): L3
• ZFIN Anatomical Ontology: L4
• Treebase: define a character state, assign LSID L1
– L1 = L2 L3 L4 … (uBio, UTCT, ontology term)
• Data provenance
18
Standard Set of Services
Across Types of Databases
“get.somethings”
SourceID.Get.Something(Obj)
e.g. SourceID:Obj
• Get.Full-image()
• Get.Text-description()
• Get.Author()
• Get.location()
[Diagram: character states and taxa form a matrix; reconstruction efforts turn the matrix into trees]
SourceID + Obj == LSID
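To make the "get.somethings" idea concrete, here is a minimal sketch of one source that answers a few of the calls listed above. The class `ToySource`, its in-memory `records`, the source name `utct.example`, and the sample record are all hypothetical; the slide only names the calls and the rule that SourceID plus Obj identifies the object.

```python
from dataclasses import dataclass, field

@dataclass
class ToySource:
    """In-memory stand-in for one image or character source (hypothetical)."""
    source_id: str
    records: dict = field(default_factory=dict)

    def get_text_description(self, obj: str) -> str:
        return self.records[obj]["text"]

    def get_author(self, obj: str) -> str:
        return self.records[obj]["author"]

    def identifier(self, obj: str) -> str:
        # The slide's "SourceID + Obj", written in its SourceID:Obj form.
        return f"{self.source_id}:{obj}"

# Made-up sample data, for illustration only.
source = ToySource(
    "utct.example",
    {"image27": {"text": "skull, dorsal view", "author": "unknown"}},
)
print(source.identifier("image27"), source.get_text_description("image27"))
```

Because every source answers the same calls, a client such as Morphster would not need to know which kind of database sits behind a given identifier.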
19
Use Case
Enable a browsable hypertext experience detailing the information in
the study that produced this tree.
20
Focus (in the proposal stage)
• Functional:
– Databasing scientific observation of phenomena in the form
of image annotations.
• Technological:
– Development and Demonstration of Web Service
Architecture
21
Contributions to Date
• System for adding LSIDs to legacy [RDBMS] data
sources
• OBO to OWL round trip translation
– (one theorem short of formal semantics for OBO)
• Software is mature enough to create “Illustrated
Nomina Anatomica”
22
At the start: design the data model
Example records:
• Character:
– (stamen, shape of stamen)
• Character States:
– (shape of stamen, (
• Round,
• Pointed,
• Cleft) )
• Image Annotation:
– (Image27, taxa17, (
• (shape of stamen, cleft)
• (length of stamen, long-relative to ovum)
• (number of petals, 12)
• (petal type,….)))
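The example records above could be written down as plain data structures, as in the sketch below. The class and field names are assumptions made for illustration, and the `undeclared_states` check is an added convenience, not something the slide describes.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Character:
    structure: str   # e.g. "stamen"
    name: str        # e.g. "shape of stamen"

@dataclass(frozen=True)
class CharacterStates:
    character: str   # name of the character these states belong to
    states: tuple    # allowed states for that character

@dataclass
class ImageAnnotation:
    image: str
    taxon: str
    observations: list = field(default_factory=list)  # (character, state) pairs

shape = Character("stamen", "shape of stamen")
shape_states = CharacterStates("shape of stamen", ("round", "pointed", "cleft"))
annotation = ImageAnnotation("Image27", "taxa17", [
    ("shape of stamen", "cleft"),
    ("length of stamen", "long-relative to ovum"),
    ("number of petals", 12),
])

def undeclared_states(ann: ImageAnnotation, declared: CharacterStates) -> list:
    """Return observations of the declared character whose state is not listed."""
    return [
        (char, state) for char, state in ann.observations
        if char == declared.character and state not in declared.states
    ]

print(undeclared_states(annotation, shape_states))
```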
23
But nearly all the values are scientific terms.
• Character:
– (stamen, shape of stamen)
• Character States:
– (shape of stamen, (
• Round,
• Pointed,
• Cleft) )
• Image Annotation:
– (Image27, taxa17, (
• (shape of stamen, cleft)
• (length of stamen, long-relative to ovum)
• (number of petals, 12)
• (petal type,….)))
If all the terms are defined in an ontology,
what’s left for the relational data model?
24
Given: Morphster must import ontologies
All the “data” is best served by ontological representation
25
Scientific Study Comprises:
• Identifying a putative character
• Defining exemplars of the states.
– Text
– Image
– Image annotations
• Associating a Taxon with a
character state
– Also image driven
26
Oops
• Morphster must include ontology function
• The DBMS function is better served by an ontology
• The system is big enough,
– Let’s get rid of the DBMS.
27
Which ontology language, OBO or OWL?
• Most ATOL projects are OBO
– Mature but limited technology
– Basis of Gene Ontology (GO)
– 60 other ontologies
• Most Computer Scientists are working on OWL
– Larger, more powerful infrastructure
– More jobs
• Answer: Yes:
28
OBO --> OWL --> OBO Round Trip
• OWL (meaning the semantic web) is strictly more
powerful than OBO
29
Two Layer Cake Methodology
• Recall: the Semantic Web is organized as a hierarchy of
languages:
30
Two Layer Cake Methodology
• Computational Linguistic Approach (Grammar)
– For each OBO construct,
• determine the least expressive layer in the OWL stack necessary to
express the construct.
31
Result of the Analysis 1:
[Table columns: Examples of Elements | OBO Layer Cake | Semantic Web Layer Cake]
Subsets of the OBO language
• organized into a hierarchy of expressive power
• direct correspondence to the Semantic Web
32
Result of the Analysis 2
• Identification of OBO constructs without isomorphism
to OWL
33
Mismatched Constructs
• OWL has a rigorous embodiment of globally unique
identifiers and namespaces.
• OBO has built-in synonym concepts.
• OBO has a built-in concept of ontology subset.
– Anticipated disagreements
– Not addressed in OWL
34
GUID Issue Most Interesting
• Class identifiers and namespaces need global IDs
– As we map class identifiers and namespaces to OWL
• Record the mapping in the OWL ontology.
• Provides the information to losslessly reverse the transformation
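A minimal sketch of the recorded-mapping idea follows. It turns OBO-style identifiers into global IRIs and keeps the IRI-to-identifier table so the translation can be reversed exactly. The base IRI, the function names, and the use of a Python dictionary in place of annotations stored inside the OWL ontology are all assumptions for illustration.

```python
BASE_IRI = "http://example.org/obo/"  # hypothetical global namespace

def to_owl(obo_ids):
    """Map each OBO identifier to a global IRI and record the mapping."""
    mapping = {}
    for obo_id in obo_ids:
        prefix, local = obo_id.split(":", 1)
        mapping[f"{BASE_IRI}{prefix}_{local}"] = obo_id
    return mapping

def to_obo(iris, mapping):
    """Reverse the translation using only the recorded mapping."""
    return [mapping[iri] for iri in iris]

obo_ids = ["GO:0008150", "GO:0003674"]
mapping = to_owl(obo_ids)
round_trip = to_obo(list(mapping), mapping)
print(round_trip == obo_ids)
```

Without the recorded table, the reverse step would have to guess how each IRI was built, which is where a round trip can lose information.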
35
Image as Class Exemplar (not)
Consider: inheritance
[Diagram: polygon has sides; rectangle and triangle are subclasses of polygon]
• By inheritance,
– Rectangle has sides
– Triangle has sides
Consider: image exemplar
[Diagram: polygon Appears_in Poly.jpg; rectangle and triangle are subclasses of polygon]
• The hexagon is not an example of a rectangle or triangle.
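The contrast can be sketched with ordinary classes: a property such as "has sides" is correctly inherited, while an exemplar image attached to the superclass must not be. The names `exemplar_images`, `inherited_exemplar`, and `own_exemplar` are hypothetical and exist only for this illustration.

```python
class Polygon:
    has_sides = True   # a property: safe for subclasses to inherit

class Rectangle(Polygon):
    pass

class Triangle(Polygon):
    pass

# The exemplar image is attached to the polygon class only.
exemplar_images = {Polygon: "Poly.jpg"}

def inherited_exemplar(cls):
    """Naive lookup that walks up the hierarchy (the mistake on the slide)."""
    for ancestor in cls.__mro__:
        if ancestor in exemplar_images:
            return exemplar_images[ancestor]
    return None

def own_exemplar(cls):
    """Lookup that treats an exemplar as belonging to exactly one class."""
    return exemplar_images.get(cls)

print(Rectangle.has_sides, inherited_exemplar(Rectangle), own_exemplar(Rectangle))
```

The naive lookup hands the polygon's image to Rectangle, which is wrong when that image shows a hexagon.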
36
LSIDs: The Social Contract
• Collect and analyze data
– Results are permanent
– Experiments may be duplicated
• Today experiments comprise
– Collecting data from many places
– Long data analysis pipeline
– Data is published on web pages that disappear.
37
Laboratory Notebook vs. Computer File
But will the computer files be there in 40 years?
Is this important?
• Old data
– Can be revisited for new results
– Resolve controversies (both scientific and personal)
• Example: Photo 51
– Watson, Crick & Wilkins received the Nobel Prize for the DNA double helix
– Women scientists are concerned that Rosalind Franklin did not get enough
credit.
– Look at her notebook page with “Photo 51”. It’s clear she may deserve
the most credit; history books are being rewritten
• http://www.pbs.org/wgbh/nova/photo51/
• http://www.chemheritage.org/classroom/chemach/pharmaceuticals/watson-crick.html
38
Example: The Morphster Project
• Morphster is part of the Assembling the Tree of Life
grand challenge problem.
– Build an evolutionary (phylogenetic) tree for all the species
on the planet.
39
What are the parts of the LSID protocol?
1. Service Oriented Architecture
• Both SOAP/WSDL and Get/Post (REST) interfaces
2. Assignment Services
• Assign an LSID to a data set
3. Discovery Service
• Special places to determine where to find service for an LSID, (like
DNS services)
4. Resolution Service
• Details what services are associated with an LSID
5. Retrieval Service
• getData(LSID)
• getMetaData(LSID)
40
Simple version:
1. Machinery to make a simple distributed-computing idea work.
2. Given data D
• New LSID L = assignLSID(D)
• getData(L) returns D
• getMetadata(L) returns the data type of D
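The simple version can be sketched as a single in-memory authority. This is a toy under stated assumptions: `ToyAuthority`, the `example.org` authority name, and the counter used for object identifiers are invented here, and a real deployment would also need the discovery and resolution services.

```python
from itertools import count

class ToyAuthority:
    """One in-memory LSID authority (hypothetical, for illustration)."""

    def __init__(self, authority: str, namespace: str):
        self.authority = authority
        self.namespace = namespace
        self._next_id = count(1)
        self._store = {}

    def assign_lsid(self, data) -> str:
        lsid = f"URN:LSID:{self.authority}:{self.namespace}:{next(self._next_id)}"
        self._store[lsid] = data
        return lsid

    def get_data(self, lsid: str):
        return self._store[lsid]

    def get_metadata(self, lsid: str) -> dict:
        # Metadata here is just the data type of D, as on the slide.
        return {"type": type(self._store[lsid]).__name__}

authority = ToyAuthority("example.org", "specimens")
D = b"CT scan bytes"
L = authority.assign_lsid(D)
print(L, authority.get_data(L) == D, authority.get_metadata(L))
```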
41
Complete Interface from the Standard Document
42
Syntax of the LSID String
Fields:
• "URN"
• "LSID"
• authority identification
• namespace identification
• object identification
• optionally: revision identification.
Examples:
URN:LSID:ebi.ac.uk:SWISSPROT.accession:P34355:3
URN:LSID:rcsb.org:PDB:1D4X:22
Notice:
• Use of domain names (e.g. ebi.ac.uk) for authorities
– Enables using DNS until LSID discovery services are built
• Good: an easy way to get started.
• Bad: people think LSIDs are URLs.
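The field list above translates into a small parser. The sketch below is an illustration only: `LSID` and `parse_lsid` are names chosen here, and the parser checks nothing beyond the field count and the two fixed prefixes.

```python
from typing import NamedTuple, Optional

class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]

def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its fields; the revision is optional."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"wrong number of fields: {text!r}")
    if parts[0].upper() != "URN" or parts[1].upper() != "LSID":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)

print(parse_lsid("URN:LSID:ebi.ac.uk:SWISSPROT.accession:P34355:3"))
print(parse_lsid("URN:LSID:rcsb.org:PDB:1D4X:22"))
```

The authority field comes back as a bare domain name, which is exactly why it is easy to mistake an LSID for a URL.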
43
Basic Rules of the Protocol
• Data can never be changed
• Data can have versions
– Old versions never disappear (or change)
– getData(LSID), if no version number, newest version is
returned.
• Metadata
– Can be changed
– Can be available in different syntactic formats
• Data
– Can be available in different syntactic formats
• But must promise to be identical
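These rules can be sketched as an append-only store: publishing never overwrites, each version keeps its own revision number, and a request without a revision returns the newest version. `VersionedStore` and its method names are hypothetical, and the revision is assumed to be the sixth colon-separated field of the LSID string.

```python
class VersionedStore:
    """Append-only data with mutable metadata (illustrative sketch)."""

    def __init__(self):
        self._versions = {}   # base LSID -> list of payloads, oldest first
        self._metadata = {}   # base LSID -> metadata, which may change

    def publish(self, base: str, data: bytes) -> str:
        versions = self._versions.setdefault(base, [])
        versions.append(bytes(data))          # old versions are never touched
        return f"{base}:{len(versions)}"

    def get_data(self, lsid: str) -> bytes:
        parts = lsid.split(":")
        if len(parts) == 6:                   # explicit revision requested
            base, revision = ":".join(parts[:5]), int(parts[5])
            return self._versions[base][revision - 1]
        return self._versions[lsid][-1]       # no revision: newest version

    def set_metadata(self, base: str, metadata: dict) -> None:
        self._metadata[base] = dict(metadata)

    def get_metadata(self, base: str) -> dict:
        return dict(self._metadata.get(base, {}))

store = VersionedStore()
base = "URN:LSID:example.org:specimens:42"
v1 = store.publish(base, b"first")
v2 = store.publish(base, b"second")
print(store.get_data(v1), store.get_data(base))
```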
44
Schema Driven Assignment of LSIDs
• Recognize that most data is in relational database
management systems.
• “Exporting” part of a database is a standard database
concept.
• Define the exported part of a database using an
export schema
• Replace assignment service by automatically assigning
an LSID to each exported row.
– Does not violate the protocol, provided the rest of the
protocol is correctly implemented
45
The Full Picture: Could be Implemented by a Compiler
CREATE LSIDVIEW SpecimenImage AS
SELECT scientific_name, specimen_image
FROM Specimen
[Diagram labels: Legacy Database Catalog → Compiler → Triggers; RDF Descriptions of the Data; Ancillary Table Definitions to Implement LSID Resolution. Runtime: Legacy Database, Archived Data, LSID Resolution Runtime serving Data and MetaData]
46
Example:
• Suppose we have a table: DrugScreen
Protein-Name  Sequence  Spectra  Date-Created  Drug-Interaction
Hemoglobin    x         y        10/27/06      Yes
Insulin       S         T        12/01/05      No
• We want to make it available
– But leave Drug-Interaction out; it is private
Protein-Name  Sequence  Spectra  Date-Created
Hemoglobin    x         y        10/27/06
Insulin       S         T        12/01/05
– Export schema (Protein-Name, Sequence, Spectra, Date-Created)
47
Created a Language to Specify This:
Create LSIDVIEW PublicScreenData as
Select Protein-Name, Sequence, Spectra, Date-Created
From DrugScreen
Where ;
Internally we build a table:
LSID   Protein-Name  Sequence  Spectra  Date-Created
Psd-1  Hemoglobin    x         y        10/27/06
Psd-2  Insulin       S         T        12/01/05
…
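The internal table can be imitated with SQLite from Python's standard library, as sketched below. `LSIDVIEW` is the talk's own language and is not available in SQLite, so the export is done by hand here; the underscore column names and the `Psd-n` numbering by row order are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE DrugScreen (
    protein_name TEXT, sequence TEXT, spectra TEXT,
    date_created TEXT, drug_interaction TEXT)""")
conn.executemany("INSERT INTO DrugScreen VALUES (?, ?, ?, ?, ?)", [
    ("Hemoglobin", "x", "y", "10/27/06", "Yes"),
    ("Insulin", "S", "T", "12/01/05", "No"),
])

# Internal table: one LSID per exported row, private column left out.
conn.execute("""CREATE TABLE PublicScreenData (
    lsid TEXT PRIMARY KEY, protein_name TEXT, sequence TEXT,
    spectra TEXT, date_created TEXT)""")

rows = conn.execute(
    "SELECT protein_name, sequence, spectra, date_created "
    "FROM DrugScreen ORDER BY rowid").fetchall()
for n, row in enumerate(rows, start=1):
    conn.execute("INSERT INTO PublicScreenData VALUES (?, ?, ?, ?, ?)",
                 (f"Psd-{n}", *row))

exported = conn.execute(
    "SELECT * FROM PublicScreenData ORDER BY lsid").fetchall()
for row in exported:
    print(row)
```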
48
Database Triggers
• SQL programming construct: Create Trigger ….
• On each insert or update to a table
– A user-defined program is called instead
[Diagram: Insert new row → Trigger: Do]
49
Relational Databases Implement Triggers
• Provides an easy implementation
Create Trigger populateLSIDdata
On DrugScreen
Instead of Insert
// do the original insert
// also insert into the LSIDtable
• Copy new data into separate tables
– Helps us guarantee no changes to the data
– Everything in the existing database stays the same
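The trigger idea can be tried out with SQLite. One substitution to note: SQLite allows INSTEAD OF triggers only on views, so this sketch uses an AFTER INSERT trigger on the table, which leaves the original insert untouched and copies the exported columns into a separate table. The column names and the `Psd-` prefix are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DrugScreen (
    protein_name TEXT, sequence TEXT, spectra TEXT,
    date_created TEXT, drug_interaction TEXT);

-- Separate archive table; rows here are never updated.
CREATE TABLE LSIDtable (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    protein_name TEXT, sequence TEXT, spectra TEXT, date_created TEXT);

-- AFTER INSERT stands in for the slide's "Instead of Insert".
CREATE TRIGGER populateLSIDdata AFTER INSERT ON DrugScreen
BEGIN
    INSERT INTO LSIDtable (protein_name, sequence, spectra, date_created)
    VALUES (NEW.protein_name, NEW.sequence, NEW.spectra, NEW.date_created);
END;
""")

conn.execute(
    "INSERT INTO DrugScreen VALUES ('Hemoglobin', 'x', 'y', '10/27/06', 'Yes')")
# A later change to the working table does not reach the archived copy.
conn.execute(
    "UPDATE DrugScreen SET sequence = 'changed' WHERE protein_name = 'Hemoglobin'")

archived = conn.execute(
    "SELECT 'Psd-' || id, protein_name, sequence FROM LSIDtable").fetchall()
print(archived)
```

Keeping the archived copy in its own table is what lets the legacy application keep updating `DrugScreen` while the data behind an LSID stays fixed.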
50
Laboratory Notebooks in Life Science
Given an experiment:
• Laboratory Notebook
– Records the design of experiment
– Chronicles the execution of an experiment
• Time of each step
• Anything unusual about each step
• Observations of how things are proceeding
– Records the final measurements and other results
• Means to duplicate (repeat) experiments
51