Image Driven Ontology Editing (The Morphster Project)

					 Image Driven Ontology Editing
    (The Morphster Project)
            Professor Daniel P. Miranker

             Department of Computer Sciences
           Institute of Cell and Molecular Biology
                University of Texas at Austin
                     Austin, Texas, USA
               http://cs.utexas.edu/~miranker

Biology Collaborators:                    CS Students:
Professor Timothy Rowe                    Sapna Bafna
Dr. Julian Humphries                      Rui Mao
Kerin Claeson                             Hamid Tirmizi
Scope of the talk
• Application Context:
   – Cyberinfrastructure for the NSF ATOL Grand Challenge,
     with respect to image-based information
   – Technology Integration
      • LSIDs
      • SOA - Image related services
      • OBO/OWL Semantic Web




• [Computer] Scientific issues arose very quickly
   – Semantics of Images… we have none.



                                                             2
NSF Grand Challenge

                      “Assembly of a
                      framework phylogeny, or
                      Tree of Life, for all 1.7
                      million described
                      species…”

                      “Phylogeny, the
                      genealogical map for all
                      lineages of life on earth,
                      provides an overall
                      framework to facilitate
                      information retrieval and
                      biological prediction”

                      “This is the overall goal
                      of the Assembling the
                      Tree of Life
                      activity”[NSF]


                                                   3
An Algorithmic Focus on Tree Reconstruction From
Sequences
• Computational Problem
     – Central
     – Shared among groups

   [Figure: phylogenetic tree with leaves T1 through T7]

                     Characters
 Taxa    C1   C2      C3   C4     C5 C6
 T1      A    C       G    T      T   A
 T2      C    C       C    T      T   A
 T3      G    C       G    G      T   A
 …
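
The character matrix above can be held as a simple mapping from taxon to a row of character values. The sketch below is only an illustration: it copies the three visible rows from the slide and computes pairwise Hamming distances, the kind of summary a distance-based reconstruction method could take as input. The choice of Hamming distance and the function names are assumptions; the talk does not name a particular reconstruction algorithm.

```python
from itertools import combinations

# Rows are taxa, columns are characters C1..C6 (values copied from the slide).
matrix = {
    "T1": "ACGTTA",
    "T2": "CCCTTA",
    "T3": "GCGGTA",
}


def hamming(a: str, b: str) -> int:
    """Count the characters at which two equally long rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same number of characters")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(rows: dict[str, str]) -> dict[tuple[str, str], int]:
    """Return the Hamming distance for every unordered pair of taxa."""
    return {
        (s, t): hamming(rows[s], rows[t])
        for s, t in combinations(sorted(rows), 2)
    }


distances = pairwise_distances(matrix)
```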



                                                                     4
Morphological Formulation




Thanks, copyrighted Berkeley Material
                                        5
Morphological Formulation (finer scale)




                                          6
Central Data Model


   [Diagram: Characters, States, and Taxa --> Matrix --> Tree Reconstruction
    Algorithm --> Trees. The left side is labeled "Study Set Up", the right
    side "Study Result".]

   An instance = a Treebase record (Treebase == Genbank for phylogenetic trees)

                                                                        7
Legacy Challenge


   [Diagram: Characters, States, and Taxa --> Matrix --> Tree Reconstruction
    Algorithm --> Trees]

               In Matrix Representation:
               Character states are coded as integers
                   • No data provenance
                   • No metadata
                   • No semantic content
                   • (Treebase does have a lot of comment fields)

               Might be OK if the tree were the end, but it is a means
                                                                          8
Systematics’ Workflow
      1. Collect Field Specimens

  2. Specimen Collections
                                                         5. Interpretation
   Collection
  Management                                             • Geographic distribution
                                                         (biodiversity)
                                                         • Speciation
          Collection Data                                • Gene function

                                        3. Study Creation
                     choose a selection of specimens
                    • determine features
                         • choice of genes (alignments)
                         • choice of morphological feature
                    • results in a matrix
                                                      4. Phylogenetic
                                                      Reconstruction


                                                                    Treebase
                                                                 (public archive)
                                                                                     9
Scientists Have Their Groups (by organism type)

      Character Source
        Databases        • Don’t agree on names of the taxa
          Genbank        • Don’t even agree on the number of bytes to
                         represent the string
          Swissprot

          Bacteria

                             Char
          Dinosaurs
                             States
          Plants 1                    Matrix   Reconstruction
                                                  Efforts
                                                                Trees


                             Taxa

          Plants 2


                .

                                                                        10
Where Does Morphster Fit In?
      1. Collect Field Specimens

  2. Specimen Collections
   Collection
  Management

          Collection Data

                                  Morphster
                    • choose a selection of specimens
                    • collect imagery
                    • support image annotation
                    • create a matrix


                                                        4. Phylogenetic
                                                        Reconstruction


                                                                      Treebase
                                                                   (public archive)
                                                                                      11
Initially Three Promises
1.        Address character state richness issue
     A.        Using GUIDs (LSIDs)
     B.        Suggest a standard set of services on image and character sources.


2.        Source critical nomenclature from authoritative sources
     A.        Taxonomic names
     B.        Nomina Anatomica (N.A.)
                 Agreed upon exemplar anatomies for a group
                 Appearing as Ontologies


3.        Record observations as image annotations in a database.




                                                                                    12
Morphster Dataflow


           uBio


Catalog of Taxonomic Names



           UTCT

                                         matrix
     Image Database          Morphster
      (one of many)


          ZFIN


       Anatomical
        Ontology
      (one of many)
                                                  13
uBio (Universal Biological Indexer and Organizer)


   – Manages information about organisms
   – Attempts to solve naming problems

   – Provides globally unique Life Science
     Identifiers (LSID) to taxa

   – Web service that serves RDF




                             Acalyptus

                                                    uBio.org
                                                               14
Annotated Specimen Proxies




Giant Hummingbird Skull (Patagona gigas)




                                           DigiMorph.org
                                                       15
Agnostic of Organism Group

                             Primary concern is Imagery

      Rosa minutifolia
            uBio

Catalog of Taxonomic Names




            UTCT                 1


                                                          Treebase
     Image Database
      (one of many)

                                                      No image
            ZFIN                                      capacity

       Anatomical
        Ontology

                                                                     16
LSID as Surrogate (pointer) in Distributed Data Structure


•   For each biological data set of interest,
              Assign an LSID

•   Properties of LSIDs
    1. Protocol to locate data.
    2. Data never changes
    3. Data is never deleted




                                                            17
Treebase: No image storage, use LSIDs instead


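   [Diagram: three sources, each object identified by an LSID:
      L2: Rosa minutifolia, from uBio (catalog of taxonomic names)
      L3: an image, from UTCT (image database, one of many)
      L4: a term, from ZFIN (anatomical ontology)
    Define a character state and assign it an LSID:
      L1 = L2 L3 L4 …   (uBio name, UTCT image, ontology term)
    Treebase stores L1; the component LSIDs provide data provenance.]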
                                                                           18
Standard Set of Services
           Across Types of Databases

           “get.somethings”



            SourceID.Get.Something(Obj)
            e.g.                                        SourceID:Obj
                     • Get.Full-image()
                  • Get.Text-description()
                        • Get.Author()
                       • Get.location()
            Taxa


                               Reconstruction
            Char      Matrix                    Trees
                                  Efforts


            States




                     SourceID + Obj == LSID
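
As a rough sketch of what a uniform "get.something" interface might look like, the code below gives every source the same four calls and splits a `SourceID:Obj` identifier to find the owning source. The class names, the `resolve` helper, and the sample entry are all hypothetical; only the four service names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """What one source knows about one object (all sample values hypothetical)."""
    full_image: bytes
    text_description: str
    author: str
    location: str


class Source:
    """A database exposing the same four calls regardless of what it stores."""

    def __init__(self, source_id: str, entries: dict[str, Entry]):
        self.source_id = source_id
        self._entries = dict(entries)

    def get_full_image(self, obj: str) -> bytes:
        return self._entries[obj].full_image

    def get_text_description(self, obj: str) -> str:
        return self._entries[obj].text_description

    def get_author(self, obj: str) -> str:
        return self._entries[obj].author

    def get_location(self, obj: str) -> str:
        return self._entries[obj].location


def resolve(sources: dict[str, Source], identifier: str) -> tuple[Source, str]:
    """Split 'SourceID:Obj' and return the owning source plus the object key."""
    source_id, obj = identifier.split(":", 1)
    return sources[source_id], obj


# Hypothetical entry in a hypothetical UTCT source.
utct = Source(
    "UTCT",
    {"img27": Entry(b"\x89PNG", "skull, lateral view", "A. Curator", "Austin, Texas")},
)
sources = {utct.source_id: utct}
source, obj = resolve(sources, "UTCT:img27")
description = source.get_text_description(obj)
```

Because every source answers the same calls, client code such as a tree browser would not need to know which kind of database sits behind an identifier.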

                                                                       19
Use Case
  Enable a browsable hypertext experience detailing the information in
  the study that produced this tree.




                                                                         20
Focus (in the proposal stage)
• Functional:
   – Databasing scientific observation of phenomena in the form
     of image annotations.


• Technological:
   – Development and Demonstration of Web Service
     Architecture




                                                                  21
Contributions to Date
• System for adding LSIDs to legacy [RDBMS] data
  sources
• OBO to OWL round trip translation
   – (one theorem short of formal semantics for OBO)
• Software is mature enough to create “Illustrated
  Nomina Anatomica”




                                                       22
At the start: design the data model
Example records:
• Character:
   – (stamen, shape of stamen)
• Character States:
   – (shape of stamen, (
                           • Round,
                           • Pointed,
                           • Cleft) )


• Image Annotation:
   – (Image27, taxa17, (
                           •   (shape of stamen, cleft)
                           •   (length of stamen, long-relative to ovum)
                           •   (number of petals, 12)
                           •   (petal type,….)))
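
The example records above could be written down as plain data classes, roughly as follows. This is a minimal sketch, not the project's schema: the class and function names are invented for illustration, and the structure "petal" for the character "number of petals" is an assumption, since the slide gives a structure only for "shape of stamen".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Character:
    structure: str  # e.g. "stamen"
    name: str       # e.g. "shape of stamen"


@dataclass
class ImageAnnotation:
    image: str
    taxon: str
    # Maps a character name to the state observed in this image.
    observations: dict[str, str] = field(default_factory=dict)


# Character states: character name -> allowed values (from the slide).
character_states = {"shape of stamen": {"round", "pointed", "cleft"}}


def observe(annotation: ImageAnnotation, character: Character, state: str) -> None:
    """Record an observation, rejecting states not declared for the character."""
    allowed = character_states.get(character.name)
    if allowed is not None and state not in allowed:
        raise ValueError(f"{state!r} is not a state of {character.name!r}")
    annotation.observations[character.name] = state


shape = Character("stamen", "shape of stamen")
petals = Character("petal", "number of petals")  # structure is an assumption
annotation = ImageAnnotation("Image27", "taxa17")
observe(annotation, shape, "cleft")
observe(annotation, petals, "12")
```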
                                                                           23
But nearly all the values are scientific terms.

• Character:
    – (stamen, shape of stamen)
• Character States:
    – (shape of stamen, (
                              •   Round,
                              •   Pointed,
                              •   Cleft) )
• Image Annotation:
    – (Image27, taxa17, (
                              •   (shape of stamen, cleft)
                              •   (length of stamen, long-relative to ovum)
                              •   (number of petals, 12)
                              •   (petal type,….)))


  If all the terms are defined in an ontology,
      what’s left for the relational data model?
                                                                              24
Given: Morphster must import ontologies




  All the “data” is best served by ontological representation




                                                                25
Scientific Study Comprises:


                        • Identifying a putative character

                        • Defining exemplars of the states.
                              – Text
                              – Image
                              – Image annotations



                        • Associating a Taxon with a
                          character state
                              – Also image driven




                                                              26
Oops
• Morphster must include ontology function
• The DBMS function is better served by an ontology
• The system is big enough,
   – Let’s get rid of the DBMS.




                                                   27
Which ontology language, OBO or OWL?

• Most ATOL projects are OBO
  – Mature but limited technology
  – Basis of Gene Ontology (GO)
  – 60 other ontologies


• Most Computer Scientists are working on OWL
  – Larger, more powerful infrastructure
  – More jobs



• Answer: Yes:

                                                28
OBO --> OWL --> OBO Round Trip
• OWL (meaning the semantic web) is strictly more
  powerful than OBO




                                                    29
Two Layer Cake Methodology


• Recall, Semantic Web is organized as a hierarchy of
  languages:




                                                        30
Two Layer Cake Methodology
• Computational Linguistic Approach (Grammar)
  – For each OBO construct,
     • determine the least expressive layer in the OWL stack necessary to
       express the construct.




                                                                            31
Result of the Analysis 1:




   Examples of Elements         OBO Layer Cake        Semantic Web Layer Cake




   Subsets of the OBO language
   • organized into a hierarchy of expressive power
   • direct correspondence to the Semantic Web



                                                                                32
Result of the Analysis 2
• Identification of OBO constructs without isomorphism
  to OWL




                                                         33
Mismatched Constructs
• OWL has a rigorous embodiment of globally unique
  identifiers and namespaces.

• OBO has built-in synonym concepts.



• OBO has a built-in concept of ontology subset.
   – Anticipated disagreements
   – Not addressed in OWL




                                                   34
GUID Issue Most Interesting


• Class identifiers and namespaces need global IDs
   – As we map class identifiers and namespaces to OWL
      • Record the mapping in the OWL ontology.
      • Provides the information to losslessly reverse the transformation



                                                                           35
Image as Class Exemplar (not)
Consider: inheritance
   [Diagram: polygon --has--> sides; rectangle and triangle are subclasses of polygon]

• By inheritance,
     – Rectangle has sides
     – Triangle has sides

Consider: image exemplar
   [Diagram: polygon --Appears_in--> Poly.jpg; rectangle and triangle are subclasses of polygon]

• The hexagon is not an example of a rectangle or
  triangle.
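
The contrast on this slide can be reproduced with a toy class hierarchy in which every property is inherited by subclasses. This is a sketch under assumptions: the dictionaries and the `lookup` function are invented here, and Poly.jpg is taken to depict a hexagon, as the slide's wording suggests.

```python
# Subclass relation and properties attached directly to each class.
is_a = {"rectangle": "polygon", "triangle": "polygon"}
properties = {"polygon": {"has": "sides", "appears_in": "Poly.jpg"}}


def lookup(cls, prop):
    """Return the value of prop for cls, walking up the is_a chain."""
    while cls is not None:
        if prop in properties.get(cls, {}):
            return properties[cls][prop]
        cls = is_a.get(cls)
    return None


# Sound: every rectangle has sides.
rectangle_has = lookup("rectangle", "has")

# Unsound: the same rule claims rectangles appear in Poly.jpg,
# although that image is assumed to show a hexagon.
rectangle_image = lookup("rectangle", "appears_in")
```

The second lookup returns the polygon's image for the rectangle, which is the wrong conclusion the slide warns about: an exemplar image cannot be modeled as an ordinary inherited property.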



                                                                                    36
LSIDs: The Social Contract
• Collect and analyze data
   – Results are permanent
   – Experiments may be duplicated


• Today experiments comprise
   – Collecting data from many places
   – Long data analysis pipeline

   – Data is published on web pages that disappear.




                                                      37
Laboratory Notebook vs. Computer File

But will the computer files be there in 40 years?

Is this important?
• Old data
   – Can be revisited for new results
   – Resolve controversies (both scientific and personal)

• Example: Photo 51
   – Watson, Crick & Wilkins received the Nobel Prize for the DNA double
     helix
   – Women scientists are concerned that Rosalind Franklin did not get enough
     credit.
   – Look at her notebook page with “Photo 51”. It’s clear she may deserve
     the most credit; history books are being rewritten
       •   http://www.pbs.org/wgbh/nova/photo51/
       •   http://www.chemheritage.org/classroom/chemach/pharmaceuticals/watson-crick.html

                                                                                             38
Example: The Morphster Project
• Morphster is part of the Assembling the Tree of Life
  grand challenge problem.
   – Build an evolutionary (phylogenetic) tree for all the species
     on the planet.




                                                                     39
What are the parts of the LSID protocol?
1.   Service Oriented Architecture
     •   Both SOAP/WSDL and Get/Post (REST) interfaces

2.   Assignment Services
     •   Assign an LSID to a data set

3.   Discovery Service
     •   Designated places to determine where to find the service for an LSID (like
         DNS services)

4.   Resolution Service
     •   Details what services are associated with an LSID

5.   Retrieval Service
     •   getData(LSID)
     •   getMetaData(LSID)
                                                                                40
Simple version:
1. Machinery to make a simple distributed computing
   idea work.
2. Given data D
   •   New LSID L = assignLSID(D)

   •   getData(L), returns D
   •   getMetadata(L), returns the data type of D
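
The simple version can be sketched as an in-memory store with three operations. This is an illustration only, not the LSID standard's interface: the class name, the `example.org` authority, the `specimens` namespace, and the sequential object identifiers are all assumptions.

```python
import itertools


class LsidStore:
    """Toy store: data can be assigned an LSID and read back, never altered."""

    def __init__(self, authority: str, namespace: str):
        self._prefix = f"URN:LSID:{authority}:{namespace}"
        self._next_id = itertools.count(1)
        self._data: dict[str, bytes] = {}
        self._types: dict[str, str] = {}

    def assign_lsid(self, data: bytes, data_type: str) -> str:
        """Store an immutable copy of the data and return its new LSID."""
        lsid = f"{self._prefix}:{next(self._next_id)}"
        self._data[lsid] = bytes(data)
        self._types[lsid] = data_type
        return lsid

    def get_data(self, lsid: str) -> bytes:
        return self._data[lsid]

    def get_metadata(self, lsid: str) -> dict[str, str]:
        return {"type": self._types[lsid]}


store = LsidStore("example.org", "specimens")  # hypothetical authority and namespace
lsid = store.assign_lsid(b"ACGTTA", "text/plain")
```

The store deliberately offers no update or delete method, mirroring the promise that data behind an LSID never changes and is never deleted.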




                                                      41
Complete Interface from the Standard Document








                                                   42
Syntax of the LSID String
Fields:
• "URN"
•    "LSID"
•    authority identification
•    namespace identification
•    object identification
•    optionally: revision identification.

Examples:
URN:LSID:ebi.ac.uk:SWISSPROT.accession:P34355:3
URN:LSID:rcsb.org:PDB:1D4X:22

Notice:
• Use of domain names (e.g. ebi.ac.uk) for authorities
    –   Enables using DNS until LSID discovery services are built
        •   Good to get started,
        •   Bad, people think LSIDs are URLs.
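
The field list above translates directly into a small parser. The sketch below splits on colons and treats the revision as optional; the `Lsid` type and `parse_lsid` name are hypothetical, and accepting the `URN:LSID` prefix in any letter case is an assumption.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID string into its fields; the revision may be absent."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    if parts[0].upper() != "URN" or parts[1].upper() != "LSID":
        raise ValueError(f"missing URN:LSID prefix: {text!r}")
    if any(field == "" for field in parts[2:]):
        raise ValueError(f"empty field in LSID: {text!r}")
    return Lsid(*parts[2:])


swissprot = parse_lsid("URN:LSID:ebi.ac.uk:SWISSPROT.accession:P34355:3")
pdb = parse_lsid("URN:LSID:rcsb.org:PDB:1D4X:22")
```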


                                                                    43
Basic Rules of the Protocol
• Data can never be changed
• Data can have versions
   – Old versions never disappear (or change)
   – getData(LSID), if no version number, newest version is
     returned.
• Metadata
   – Can be changed
   – Can be available in different syntactic formats
• Data
   – Can be available in different syntactic formats
      • But all formats must promise to be identical in content
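
The versioning rules can be sketched as an append-only list per LSID. The sketch assumes revisions are numbered 1, 2, 3, … in publication order and that the unversioned LSID has exactly five colon-separated fields; the class name and the sample identifier are hypothetical.

```python
class VersionedStore:
    """Append-only store: new versions are added, old ones are never touched."""

    def __init__(self):
        self._versions: dict[str, list[bytes]] = {}

    def publish(self, base_lsid: str, data: bytes) -> str:
        """Add a new version and return the LSID that names it exactly."""
        versions = self._versions.setdefault(base_lsid, [])
        versions.append(bytes(data))
        return f"{base_lsid}:{len(versions)}"

    def get_data(self, lsid: str) -> bytes:
        """Return the named revision, or the newest one if none is given."""
        parts = lsid.split(":")
        base_lsid = ":".join(parts[:5])
        versions = self._versions[base_lsid]
        if len(parts) == 6:
            revision = int(parts[5])
            if not 1 <= revision <= len(versions):
                raise KeyError(lsid)
            return versions[revision - 1]
        return versions[-1]


base = "URN:LSID:example.org:specimens:42"  # hypothetical identifier
versioned = VersionedStore()
first = versioned.publish(base, b"v1")
second = versioned.publish(base, b"v2")
```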



                                                              44
Schema Driven Assignment of LSIDs
• Recognize that most data is in relational database
  management systems.

• “Exporting” part of a database is a standard database
  concept.

• Define the exported part of a database using an
                       export schema
• Replace assignment service by automatically assigning
  an LSID to each exported row.
   – Does not violate the protocol, provided the rest of the
     protocol is correctly implemented

                                                               45
The Full Picture: Could be Implemented by a Compiler


      CREATE LSIDVIEW SpecimenImage AS
      SELECT scientific_name, specimen_image
      FROM   Specimen

   [Diagram: a Compiler takes the LSIDVIEW definition and the Legacy Database
    Catalog as input. It produces Triggers on the Legacy Database, Ancillary
    Table Definitions to Implement LSID Resolution, and RDF Descriptions of
    the Data. The triggers feed Archived Data; the LSID Resolution Runtime
    serves Data and MetaData.]




                                                                                              46
Example:
• Suppose we have a table: DrugScreen
   Protein-Name   Sequence   Spectra   Date-Created   Drug-Interaction
   Hemoglobin     x          y         10/27/06       Yes
   Insulin        S          T         12/01/05       No



• We want to make it available
    – But leave drug-interaction out, it is private
              Protein-Name    Sequence    Spectra    Date-Created
              Hemoglobin      x           y          10/27/06
              Insulin         S           T          12/01/05


    – Export schema (Protein-Name, Sequence, Spectra, Date-Created)


                                                                       47
Created a Language to Specify This:
   Create LSIDVIEW PublicScreenData as
   Select Protein-Name, Sequence, Spectra, Date-Created
   From DrugScreen
   Where ;


Internally we build a table:

       LSID     Protein-Name   Sequence   Spectra   Date-Created
       Psd-1    Hemoglobin     x          y         10/27/06
       Psd-2    Insulin        S          T         12/01/05
       …        …
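
The effect of such a view can be imitated in a few lines: project each row onto the export schema and prepend an automatically assigned identifier. This is a sketch, not the project's compiler; the `lsid_view` function is hypothetical, and the `Psd-n` identifiers simply follow the table on the slide rather than full LSID syntax.

```python
# Rows of the DrugScreen table from the example slide.
drug_screen = [
    {"Protein-Name": "Hemoglobin", "Sequence": "x", "Spectra": "y",
     "Date-Created": "10/27/06", "Drug-Interaction": "Yes"},
    {"Protein-Name": "Insulin", "Sequence": "S", "Spectra": "T",
     "Date-Created": "12/01/05", "Drug-Interaction": "No"},
]

# Columns that may leave the database; Drug-Interaction stays private.
export_schema = ["Protein-Name", "Sequence", "Spectra", "Date-Created"]


def lsid_view(rows, schema, prefix):
    """Keep only exported columns and give each row its own identifier."""
    return [
        {"LSID": f"{prefix}-{number}", **{column: row[column] for column in schema}}
        for number, row in enumerate(rows, start=1)
    ]


public_screen_data = lsid_view(drug_screen, export_schema, "Psd")
```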




                                                               48
Database Triggers
• SQL programming construct
      Create Trigger ….


• On each insert or update to a table
   – A user-defined program is called

   [Diagram: "Insert new row" --> Trigger: "Do instead:"]




                                                                49
Relational Databases Implement Triggers
• Provides an easy implementation

Create Trigger populateLSIDdata
On DrugScreen
Instead of Insert
         // do the original insert
         // also insert into the LSIDtable




• Copy new data into separate tables
   – Helps us guarantee no changes to the data
   – Everything in the existing database stays the same
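
A runnable approximation of this trigger, using SQLite through Python's standard `sqlite3` module, might look like the sketch below. It differs from the slide in one respect: SQLite allows `INSTEAD OF` triggers only on views, so the sketch uses an `AFTER INSERT` trigger, which leaves the original insert untouched and then copies the exported columns. The column names, the second trigger that rejects updates, and the `Psd-` prefix are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DrugScreen (
    protein_name TEXT, sequence TEXT, spectra TEXT,
    date_created TEXT, drug_interaction TEXT
);

-- Separate table holding only the exported columns, keyed by a new id.
CREATE TABLE LSIDtable (
    lsid INTEGER PRIMARY KEY AUTOINCREMENT,
    protein_name TEXT, sequence TEXT, spectra TEXT, date_created TEXT
);

-- After each insert into the legacy table, copy the public columns.
CREATE TRIGGER populateLSIDdata AFTER INSERT ON DrugScreen
BEGIN
    INSERT INTO LSIDtable (protein_name, sequence, spectra, date_created)
    VALUES (NEW.protein_name, NEW.sequence, NEW.spectra, NEW.date_created);
END;

-- Refuse any change to archived rows, so the data never changes.
CREATE TRIGGER freezeLSIDdata BEFORE UPDATE ON LSIDtable
BEGIN
    SELECT RAISE(ABORT, 'LSID data is immutable');
END;
""")

insert = "INSERT INTO DrugScreen VALUES (?, ?, ?, ?, ?)"
conn.execute(insert, ("Hemoglobin", "x", "y", "10/27/06", "Yes"))
conn.execute(insert, ("Insulin", "S", "T", "12/01/05", "No"))

rows = conn.execute(
    "SELECT 'Psd-' || lsid, protein_name FROM LSIDtable ORDER BY lsid"
).fetchall()
```

Existing applications keep inserting into `DrugScreen` exactly as before; the archive table fills up as a side effect.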


                                                          50
Laboratory Notebooks in Life Science

Given an experiment:
• Laboratory Notebook

   – Records the design of experiment
   – Chronicles the execution of an experiment
       • Time of each step
       • Anything unusual about each step
       • Observations of how things are proceeding
   – Records the final measurements and other results



• Means to duplicate (repeat) experiments



                                                        51