
									The Joy of Ontology

     Suzanna Lewis
    SMI Colloquium
     April 20th, 2006
                Sections
 Why make an ontology
 What is an ontology
 How to create an ontology
   Logically
   Technically
   Organizationally
 National Center for Biomedical Ontology
 Case study: Phenotypes, our current work
  on OBD
Why make an ontology?

    What is the motivation?
  The Problem(s) with data
Inaccessibility of widely distributed
 data
Overabundance of information
Speed and performance
Interpreting the data syntactically
Interpreting the data semantically
       We started the GO
 To develop a shared language adequate for the
  annotation of molecular characteristics across
  organisms.
 To agree on a mutual understanding of the
  definition and meaning of any word used, and
  thus to support cross-database queries.
 To provide database access via these common
  terms to gene product annotations and
  associated sequences.
Annotation of Yeast Microarray
      Clusters Using GO
     GENE    PROCESS                     FUNCTION                            CELLULAR COMPONENT
     SDH1    tricarboxylic acid cycle    succinate dehydrogenase             mitochondrial inner membrane
     NDI1    oxidative phosphorylation   NADH dehydrogenase                  mitochondrial inner membrane
     QCR7    electron transport          ubiquinol--cytochrome-c reductase   mitochondrial inner membrane
     COX6    oxidative phosphorylation   cytochrome-c oxidase                mitochondrial inner membrane
     RIP1    electron transport          Rieske Fe-S protein                 mitochondrial inner membrane
     COX15   oxidative phosphorylation   cytochrome-c oxidase                mitochondrial inner membrane
     CYT1    electron transport          cytochrome-c1                       mitochondrial inner membrane
     COR1    electron transport          ubiquinol--cytochrome-c reductase   mitochondrial inner membrane
     SDH3    tricarboxylic acid cycle    succinate dehydrogenase subunit     mitochondrial inner membrane
     SDH4    tricarboxylic acid cycle    succinate dehydrogenase subunit     mitochondrial inner membrane
     SDH2    tricarboxylic acid cycle    succinate dehydrogenase subunit     mitochondrial inner membrane
     MDH1    tricarboxylic acid cycle    malate dehydrogenase                mitochondrial matrix
     QCR6    electron transport          ubiquinol--cytochrome-c reductase   mitochondrial inner membrane
     CYT1    electron transport          cytochrome-c1                       mitochondrial inner membrane

Microarray data from Figure 2K of Eisen et al. (1998). Cluster analysis and
display of genome-wide expression patterns, Proc. Natl. Acad. Sci. 95
(25): 14863-14868.
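
As a rough illustration of why shared GO terms help interpret a cluster, the sketch below counts how often each term occurs among a few rows copied from the table above. The tuple layout and the helper name `summarize` are assumptions made for this example; they are not part of any GO tool.

```python
from collections import Counter

# A few rows copied from the table above: (gene, process, function, component).
ROWS = [
    ("SDH1", "tricarboxylic acid cycle", "succinate dehydrogenase", "mitochondrial inner membrane"),
    ("NDI1", "oxidative phosphorylation", "NADH dehydrogenase", "mitochondrial inner membrane"),
    ("QCR7", "electron transport", "ubiquinol--cytochrome-c reductase", "mitochondrial inner membrane"),
    ("COX6", "oxidative phosphorylation", "cytochrome-c oxidase", "mitochondrial inner membrane"),
    ("MDH1", "tricarboxylic acid cycle", "malate dehydrogenase", "mitochondrial matrix"),
]


def summarize(rows, column):
    """Count how often each term occurs in one column (1=process, 2=function, 3=component)."""
    return Counter(row[column] for row in rows)


component_counts = summarize(ROWS, 3)
process_counts = summarize(ROWS, 1)
```

Because every curator used the same term for the same component, a plain count is enough to show that most genes in this sample share one cellular location.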
The Challenge of Communication
    Ontologies are essential to
make sense of biomedical data




A Portion of the OBO Library




    Motivation: to capture
      biological reality
 Inferences and decisions we make are
  based upon what we know of the
  biological reality.
 An ontology is a computable
  representation of this underlying
  biological reality.
 Enables a computer to reason over the
  data in (some of) the ways that we do.
What is an ontology
  Ontology (as a branch of
        philosophy)
 The science of what exists in every area of
  reality
 The classification of entities: what kinds of
  things exist
 The relations between these entities
 Defines a scientific field's vocabulary and
  the canonical formulations of its theories.
 Seeks to solve problems which arise in
  these domains.
    A biological ontology is:
A machine interpretable representation
 of some aspect of biological reality

  what kinds of things exist?
      eye disc, sense organ, eye, ommatidium
  what are the relationships between these things?
      eye develops from eye disc
      eye is_a sense organ
      ommatidium part_of eye
     Entity: a definition
anything which exists, including
things and processes, functions and
qualities, beliefs and actions,
software and images
Representation: a definition
An image, idea, map, picture,
 name, description ... which refers to,
 or is intended to refer to, some entity
 or entities in reality

this ‘or is intended to refer to’ should
 always be assumed
Ontologies represent types in
           reality




    ontology         reality
Scientific data represent
   instances in reality
  Two kinds of representational
             artifact
Databases, inventories, images:
 represent what is particular in reality =
 instances

Ontologies, terminologies, catalogs:
 represent what is general in reality
 (exists in multiple instances) = types
 (universals, kinds)
Ontologies are not for representing
   concepts in people’s heads

   reality




             ontology
    The researcher has a cognitive
  representation of what is general,
    based on his knowledge of the science

[Figure labels: cognitive representation, ontology]
types
   substance
      organism
         animal
            mammal
               cat
                  siamese
            frog
instances
   An ontology is like a scientific text;
it is a representation of types in reality
  Cognitive
representation




   Ontology = a              reality
   representation of types
  Atomic representational
     unit: a definition
 terms, icons, bar codes, alphanumeric
  identifiers ... which
  1. refer, or are intended to refer, to entities in
     reality, and
  2. are not built out of further sub-
     representations


 Representational units are the atoms in
  the domain of representations
Modular representational
   unit: a definition

A representation which is built out of
other representational units, which
together form a structure that mirrors
a corresponding structure in reality
The Periodic Table
    Ontology: a definition
 A modular, representational
  artifact whose representational
  units are intended to represent
  1. types in reality
  2. the relations between these types
     which are true universally (i.e. for all
     instances)
     lung is_a anatomical structure
     lobe of lung part_of lung
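
One way to picture this definition in code is as a list of (type, relation, type) triples. The sketch below uses only the two example statements from the slide; the triple layout and the helper names `types_in` and `related` are assumptions for illustration, not an established ontology API.

```python
# Each entry asserts a relation between two types that holds for all instances.
# The two triples are the examples given on the slide.
RELATIONS = [
    ("lung", "is_a", "anatomical structure"),
    ("lobe of lung", "part_of", "lung"),
]


def types_in(relations):
    """Collect every type mentioned on either side of a relation."""
    found = set()
    for subject, _, obj in relations:
        found.add(subject)
        found.add(obj)
    return found


def related(relations, subject, relation):
    """Return the types directly linked to `subject` by `relation`."""
    return [o for s, r, o in relations if s == subject and r == relation]
```

Note that nothing here refers to a particular lung: the representational units stand for types, and instance data would live elsewhere.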
How to create an ontology

             Part 1
     The logic and science
In computer science, there is an
  information handling problem
   Different groups of data-gatherers
    develop their own idiosyncratic terms in
    which they represent information.
   To put this information together, methods
    must be found to resolve terminological
    and conceptual incompatibilities.
   Again, and again, and again…
      The Reality
Do not assume that data integration
can be brought about by somehow
‘mapping’ incompatible, low quality
ontologies built for different
purposes
   Two flavors of ontology
1. Application ontology

2. Reference ontology
   Application Ontology
An application ontology is
comparable to an engineering
artifact such as a software tool. It is
constructed for specific practical
purposes.
   Reference Ontology
A reference ontology is analogous
to a scientific theory; it seeks to
optimize representational adequacy
to its subject matter
            Assumptions
 There are best practices in ontology
  development which should be followed
  to create stable high-quality ontologies

 Shared high quality ontologies foster
  cross-disciplinary and cross-domain re-use
  of data, and create larger communities
Why do we need rules/standards
     for good ontology?
 Ontologies must be intelligible both to humans
  (for annotation) and to machines (for reasoning
  and error-checking)
 Unintuitive rules lead to errors in classification
 Simple, intuitive rules facilitate training of
  curators and annotators
 Common rules allow alignment with other
  ontologies (and thus cross-domain exploitation
  of data)
 Logically coherent rules enhance harvesting of
  content through automatic reasoning systems
 Ontologies built according to
common logically coherent rules
   Will make entry easier and yield a safer
    growth path

   You can start small, annotating your data
    with initial fragments of a well-founded
    ontology, confident that the results will still
    be usable when the ontology grows
    larger and richer
         The OBO Foundry

A subset of OBO ontologies whose developers
agree in advance to accept a common set of
principles designed to assure
 intelligibility to biologist curators, annotators, users
 formal robustness
 stability
 compatibility
 interoperability
 support for logic-based reasoning
       The OBO Foundry

1. The ontology is open and available to be used by
   all.
2. The developers of the ontology agree in
   advance to collaborate with developers of other
   OBO Foundry ontologies where domains overlap.
3. The ontology is in, or can be instantiated in, a
   common formal language.
4. The ontology possesses a unique identifier space
   within OBO.
5. The ontology provider has procedures for
   identifying distinct successive versions.
6. The ontology has a clearly specified and clearly
    delineated content.
7. The ontology includes textual definitions for all
    terms.
8. The ontology is well-documented.
9. The ontology has a plurality of independent
    users.
10. The ontology uses relations which are
    unambiguously defined following the pattern of
    definitions laid down in the OBO Relation
    Ontology.
       Orthogonality
Ontology groups who choose to be
 part of the OBO Foundry thereby
 commit themselves to collaborating
 to resolve disagreements which arise
 where their respective domains
 overlap
    agreed on relations
 The success of ontology alignment demands
  that ontological relations (is_a, part_of, ...)
  have the same meanings in the different
  ontologies to be aligned.
 See “Relations in Biomedical Ontologies”,
  Genome Biology May 2005, Barry Smith,
  Werner Ceusters, Bert Klagges, Jacob
  Köhler, Anand Kumar, Jane Lomax, Chris
  Mungall, Fabian Neuhaus, Alan L Rector,
  and Cornelius Rosse
Three fundamental dichotomies

    continuants vs. occurrents
   dependent vs. independent
       types vs. instances
     ONTOLOGIES ARE
 REPRESENTATIONS OF TYPES
        IN REALITY
    For example in the GO
 Molecules, cell components, organisms are
  independent continuants which have functions
 Functions are dependent continuants which
  become realized through special sorts of
  processes we call functionings
 Processes are occurrents; they include functionings,
  side-effects, stochastic processes
Continuants (aka endurants)

have continuous existence in time
preserve their identity through
 change
exist in toto whenever they exist at
 all
  Occurrents (aka processes)

have temporal parts
unfold themselves in successive
 phases
exist only in their phases
Continuants vs. Occurrents
    Anatomy vs. Physiology
       Snapshot vs. Video
         Stocks vs. Flows
    Commodities vs. Services
     Products vs. Processes
      Dependent entities


require independent continuants as
 their bearers

There is no grin without a cat
      Dependent vs.
 independent continuants
Independent continuants
 (organisms, cells, molecules,
 environments)

Dependent continuants (qualities,
 shapes, roles, propensities, functions)
  E.g. the acidity of this gut
All occurrents are dependent
           entities

They are dependent on those
 independent continuants which are
 their participants (agents, patients,
 media ...)
     GO’s three ontologies

   molecular function:   dependent continuant
   biological process:   occurrent
   cellular component:   independent continuant
molecular process    cellular process     organism-level biological process     (e.g. pumping blood)
   functioning          functioning          functioning
molecular function   cellular function    organism-level biological function    (e.g. to pump blood)

molecule             cellular component   organism                              (e.g. heart)
            UBO
  Upper Biomedical Ontology
   Continuant: 3D
      Independent Continuant
      Dependent Continuant
         Quality
         Function
      Spatial Region
   Occurrent: 4D
      Functioning
      Side-Effect, Stochastic Process, ...

     instances (in space and time)
How to create an ontology

             Part 2
     The technical aspects
Separate the Database from
       the Ontology

For extensibility
For generality
For reasonability
For interoperability
Why ontologies are worth it
 Minimize database maintenance costs
 Communication between researchers

 As well as
   Better query facilities
   Ability to draw inferences
   Detect correlations
   Facilitate computational interpretation of text
   And more…
Before: domain knowledge is
embedded in the db schema
   Gene table, Exon table, RNA table, Protein table
Embedding domain knowledge
 in the db schema is expensive

 The logical description and the physical
  database description of the biology are
  co-mingled
 Therefore new biological knowledge will force:
      Schema changes: e.g. new tables
      Query changes: that explicitly refer to tables
      Middleware changes: to retrieve and format
      GUI changes: to display
After: domain knowledge is
embedded in the ontology


   feature
    table
Ontology driven db schema is
 less expensive to maintain
 The logical description and the physical
  database description of the biology are
  developed independently
 Therefore new biological knowledge will
  only require:
   Ontology changes: e.g. new terms
   GUI changes: display
   No schema changes
   No query changes
   No middleware changes
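
A minimal sketch of the "after" design, using Python's built-in sqlite3 module: one generic feature table whose rows point at ontology terms. The table layout, the `T:` identifiers, and the term "pseudogene" are hypothetical choices made for this example, not the schema of any particular database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE feature (
        uid TEXT PRIMARY KEY,
        type_id TEXT NOT NULL REFERENCES term(id)
    );
""")

# Hypothetical term identifiers; the biology lives in these rows, not in the schema.
conn.executemany("INSERT INTO term VALUES (?, ?)",
                 [("T:1", "gene"), ("T:2", "exon")])
conn.executemany("INSERT INTO feature VALUES (?, ?)",
                 [("f1", "T:1"), ("f2", "T:2")])

# One query serves every feature type, so it never names a type-specific table.
QUERY = """
    SELECT f.uid FROM feature f JOIN term t ON t.id = f.type_id
    WHERE t.name = ? ORDER BY f.uid
"""


def features_of(type_name):
    """Return the UIDs of all features classified with the named term."""
    return [row[0] for row in conn.execute(QUERY, (type_name,))]


# New biological knowledge arrives as a new term row; no table is added or altered.
conn.execute("INSERT INTO term VALUES (?, ?)", ("T:3", "pseudogene"))
conn.execute("INSERT INTO feature VALUES (?, ?)", ("f3", "T:3"))
```

The same `features_of` function answers for the new type without modification, which is the point of keeping the logical description out of the physical schema.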
Reality: ultimately this is what the ontology must reflect

Step 1: Build an ontology that reflects reality

Step 2: Data capture (observations and experimentation)
     Database: UIDs serving as proxies for instances
        Patient IDs
        Sequence accessions
        Genetic strain IDs
        …

Step 3: Classify data using the ontology
      Ontology is a contract
      between researchers

 A common language that allows us to share
  knowledge

 Researchers have a stake in it:
    Every individual will benefit by being able to
     accurately interpret someone else’s data
    No more pre-scrubbing
    No more time spent translating
          Rules on types
 Don’t confuse types with instances
 Don’t confuse instances with leaf nodes
 Don’t confuse types with ideas
 Don’t confuse types with ways of getting
  to know types
 Don’t confuse types with ways of talking
  about types
 Don’t confuse types with data about
  types
An astronomy ontology
  should not include
      'Buzz Aldrin'
         Rules on terms
Terms should be in the singular
Avoid abbreviations even when it is
 clear in context what they mean
 (‘breast’ for ‘breast tumor’)
Think of each term ‘A’ in an
 ontology as shorthand for a term of
 the form:
‘the type A’
       Rules on Definitions
 The terms used in a definition should be
  simpler (more intelligible) than the term to
  be defined; otherwise the definition
  provides no assistance
 Definitions should be intelligible to both
  machines and humans
   to human understanding
      Humans need clarity and modularity
   to machine processing
      Machines can cope with the full formal
       representation
      Confusing non-useful
          definitions
 Swimming
   Swimming is healthy and has 8 letters


 Poland
   The name of Poland
When defining terms use
 Aristotelian definitions
The definition of ‘A’ takes the form:
   an A =def. a B which ...
   where B is A’s parent in the hierarchy


A human being =def. an animal which is
rational
A helicase =def. an enzyme which
catalyzes the hydrolysis of ATP to unwind
the DNA helix
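
The pattern above can be sketched as a small data structure in which every term stores its parent B and the "which ..." clause. The dictionary layout and the helper names `article`, `render`, and `parent_of` are assumptions for illustration; the two entries are the slide's own examples.

```python
# term -> (parent B, the "which ..." clause that sets A apart from other Bs)
DEFINITIONS = {
    "human being": ("animal", "is rational"),
    "helicase": ("enzyme", "catalyzes the hydrolysis of ATP to unwind the DNA helix"),
}


def article(noun):
    """Pick 'a' or 'an' from the first letter (a rough rule, good enough here)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def render(term):
    """Write the definition out in the form: an A =def. a B which ..."""
    parent, differentia = DEFINITIONS[term]
    return f"{article(term)} {term} =def. {article(parent)} {parent} which {differentia}"


def parent_of(term):
    """The parent in the hierarchy is read directly from the definition."""
    return DEFINITIONS[term][0]
```

Because the parent is part of each definition, the is_a hierarchy and the textual definitions cannot drift apart.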
        Use of Aristotelian
           definitions
 Makes defining terms easier
 Each definition encapsulates in modular
  form the entire parentage of the defined
  term
 The entire information content of the
  FMA’s term hierarchy and definitions can
  be translated very cleanly into a
  computer representation
 Now accepted by GO
Summary: How to build your
       ontology
 Keep It Simple:
   lowest possible barrier to entry
 Technology independence
 Clarity of definitions and scope
 “With new data, we change our minds”
   An ontology must adapt to reflect current
    understanding of reality
   Mechanisms to support change
How to build an ontology

            Part 3
      The sociology and
    organizational aspects
  Elements for Success: GO
 A Community with a common vision
 A pool of talented and motivated
  developers/scientists
 A mix of academic and commercial
 An organized, light weight approach to
  product development
 A leadership structure
 Communication
 A well-defined scope, (our “business”)
                      Adapted from “Open Source Menu for Success”
         Gene Ontology
Community annotation production
“Extreme Programming” techniques
 to distributed ontology generation
  Revision control
  Nightly conflict resolution
  Users are integral to the team
  Rapid iterations
Why survey
   Domain covered?
   Public?
   Active?
   Applied?
   Community?
   yes / no: Salvage, Develop, Improve
   Collaborate & Learn
  Is your domain covered?
Due diligence & background
           research

Step 1: Learn what is out there
  The most comprehensive list is on the
   OBO site. http://obo.sourceforge.net
Assess ontologies critically and
 realistically.
       Is it privately held?
    Ontologies must be shared
 Communities form scientific theories
   that seek to explain all of the existing evidence
   and can be used for prediction
 These communities are all directed to the same
  biological reality, but have their own perspective
 The computable representation must be shared
 Ontology development is inherently
  collaborative
 Open ontologies become connected to
  instance data & this feeds back on ontology
  development
       Is it active?
Pragmatic assessment of an
         ontology

Is there access to help, e.g.:
  help-me@weird.ontology.net ?
Does a warm body answer help mail
 within a ‘reasonable’ time—say 2
 working days ?
        Is it applied?
Toy ontologies are not useful
 Every ontology improves when it is applied to
  actual instances of data
 It improves even more when these data are
  used to answer research questions
 There will be fewer problems in the ontology and
  more commitment to fixing remaining problems
  when important research data is involved that
  scientists depend upon
 Be very wary of ontologies that have never been
  applied
 Work with that community
To improve (if you found one)
To develop (if you did not)
                            Improve
How?


                          Collaborate
                          and Learn
 Community vs. Committee?
 Many people have (understandably)
  reservations about collaborative
  development, because it can easily be
  confused with design-by-committee
  projects.
 Members of a committee represent
  themselves.
 Members of a community represent their
  community.
 Design for purpose - not in
          abstract
Who will use it?
  If no one is interested, then go back to
   bed
What will they use it for?
  Define the domain
Who will maintain it?
  Be pragmatic and modest
   Start with a concrete
proposal —not a blank slate.
 But do not commit your ego to it.
 Distribute to a small group you respect:
   With a shared commitment.
   With broad domain knowledge.
   Who will engage in vigorous debate without
    engaging their egos (or, at least not too
    much).
   Who will do concrete work.
                      Step 1:
 Alpha0: the first proposal - broad in breadth but
  shallow in depth. By one person with broad
  domain knowledge.
    Distribute to a small group (<6).
    Get together for two days and engage in vigorous
     discussion. Be open and frank. Argue, but do not be
     dogmatic.
 Reiterate over a period of months. Do as much
  as possible face-to-face, rather than by
  phone/email. Meet for 2 days every 3 months or
  so.
                Step 2:
Distribute Alpha1 to your group.
All now test this Alpha1 in real life
  By classifying representations of
   instances with these types
Do not worry if (at this stage) you
 do not have tools.
              Step 3:
Reconvene as a group for two days.
Share experiences from
 implementation:
  Can your Alpha1 be implemented in a
   useful way ?
  What are the conceptual problems ?
  What are the structural problems ?
                   Step 4:
 Establish a mechanism for change.
   Use CVS or Subversion.
   Limit the number of editors with write
    permission (ideally to one person).
 Unique stable identifiers
 History tracking of changes
 Release a Beta1.
 Seriously implement Beta1 in real life.
 Build the ontology in depth.
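
To make "unique stable identifiers" and "history tracking" concrete, here is a minimal sketch of a registry that mints each identifier once and logs every change. The class name `TermRegistry`, the `EX` prefix, and the seven-digit format are hypothetical; in practice CVS or Subversion would hold the history.

```python
class TermRegistry:
    """Mints identifiers that are never reused and records every change."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.next_number = 1
        self.names = {}    # identifier -> current name
        self.history = []  # (identifier, action, detail), oldest first

    def add(self, name):
        """Create a new term and return its permanent identifier."""
        ident = f"{self.prefix}:{self.next_number:07d}"
        self.next_number += 1
        self.names[ident] = name
        self.history.append((ident, "add", name))
        return ident

    def rename(self, ident, new_name):
        """Change a term's name; the identifier stays the same."""
        old = self.names[ident]
        self.names[ident] = new_name
        self.history.append((ident, "rename", f"{old} -> {new_name}"))


registry = TermRegistry("EX")
first = registry.add("eye")
second = registry.add("midface")
registry.rename(first, "compound eye")
```

Annotations that stored the identifier rather than the name remain valid after the rename.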
               Step 5:
After about 6 months reconvene
 and evaluate.
Is the ontology suited to its purpose ?
Is it, in practice, usable ?
Are we happy about its broad
 structure and content ?
                Step 6:
Go public.
  Release ontology to community.
  Release the products of the
   annotations.
  Invite broad community input and
   establish a mechanism for this (e.g.
   SourceForge).
                Step 7:
Proselytize
  Publish in a high profile journal
  Engage new user groups
Emphasize openness
Write a grant
                        Step 8:
 Iterate
 Improvements come in two forms
    It is impossible to get it right the 1st (or 2nd, or 3rd, …) time.
    What we know about reality is continually growing
Improve; Collaborate and Learn
   Step 9:



Bon appetit
      Use the power of
combination and collaboration
  as far as possible don’t reinvent
  ontologies are like telephones: they are
   valuable only to the degree that they are
   used and networked with other
   ontologies
  but choose working telephones
  most telephones were broken when the
   technology was first being developed
The National Center for
 Biomedical Ontology
     Stanford – Berkeley
  Mayo – Victoria – Buffalo
 UCSF – Oregon – Cambridge
Bioinformatics and Computational Biology
    National Centers for
Biomedical Computing—2005
1. National Center for Integrative
   Biomedical Informatics (Michigan)
2. National Center for Multi-Scale
   Study of Cellular Networks
   (Columbia)
3. National Center for Biomedical
   Ontology
 Stanford: Tools for ontology alignment,
  indexing, and management (Cores 1, 4–7:
  Mark Musen)
 Lawrence–Berkeley Labs: Tools to use
  ontologies for data annotation (Cores 2, 5–7:
  Suzanna Lewis)
 Mayo Clinic: Tools for access to large
  controlled terminologies (Core 1: Chris Chute)
 Victoria: Tools for ontology and data
  visualization (Cores 1 and 2: Margaret-Anne
  Storey)
 University at Buffalo: Dissemination of best
  practices for ontology engineering (Core 6:
  Barry Smith)
   Driving Biological Projects

Trial Bank: UCSF, Ida Sim
FlyBase: Cambridge, Michael Ashburner
ZFIN: Oregon, Monte Westerfield
Case study: Phenotypes,
our current work on OBD
Animal disease models
    Humans                        Animal models
    Mutant Gene                   Mutant Gene
    Mutant or missing Protein     Mutant or missing Protein
    Mutant Phenotype (disease)    Mutant Phenotype (disease model)
SHH-/+   SHH-/-
shh-/+   shh-/-
Phenotype
(clinical sign) = entity+      quality
     P1       = eye     +      hypoteloric
     P2       = midface +      hypoplastic
     P3       = kidney  +      hypertrophied

     ZFIN:                      PATO:
       eye                       hypoteloric
       midface             +     hypoplastic
       kidney                    hypertrophied
Phenotype
(clinical sign) = entity   + quality


Anatomical ontology
Cell & tissue ontology
Developmental ontology
                           +      PATO
Gene ontology              (phenotype and trait ontology)
  biological process
  cellular component
Phenotype
(clinical sign) = entity +   quality
     P1        = eye     +   hypoteloric
     P2        = midface +   hypoplastic
     P3        = kidney  +   hypertrophied


  Syndrome = P1 + P2 + P3
   (disease)
               = holoprosencephaly
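
Treating each phenotype as an (entity, quality) pair makes a syndrome a set of such pairs, which can then be compared across records. The sketch below uses the slide's three phenotypes; the names `SYNDROMES`, `matching_syndromes`, and `shared_phenotypes` are assumptions for illustration.

```python
# Each phenotype is an (entity, quality) pair; values are taken from the slide.
P1 = ("eye", "hypoteloric")
P2 = ("midface", "hypoplastic")
P3 = ("kidney", "hypertrophied")

# Syndrome = P1 + P2 + P3
SYNDROMES = {"holoprosencephaly": frozenset({P1, P2, P3})}


def matching_syndromes(observed):
    """Names of syndromes whose every phenotype appears in `observed`."""
    seen = set(observed)
    return sorted(name for name, signs in SYNDROMES.items() if signs <= seen)


def shared_phenotypes(a, b):
    """Phenotypes that two records (e.g. a patient and a mutant line) have in common."""
    return set(a) & set(b)
```

Comparison works only because both records draw entities and qualities from the same ontologies; free-text descriptions would not match.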
 Human holoprosencephaly    Zebrafish shh    Zebrafish oep
OMIM gene  ZFIN gene  FlyBase gene    FlyBase mut pub  ZFIN mut pub  mouse  rat  SNOMED  OMIM disease
LAMB1      lamb1      LanB1           5                15                        39      -
FECH       fech       Ferrochelatase  2                5             2           29      Protoporphyria, Erythropoietic
GLI2       gli2a      ci              388              41                        22      -
SLC4A1     slc4a1     CG8177          7                7                         19      Renal Tubular Acidosis, RTADR
MYO7A      myo7a      ck              84               5             9      3    16      Deafness; DFNB2; DFNA11
ALAS2      alas2      Alas            1                7                         14      Anemia, Sideroblastic, X-Linked
KCNH2      kcnh2      sei             27               3                         12      -
MYH6       myh6       Mhc             166              3             1           12      Cardiomyopathy, Familial Hypertrophic; CMH
TP53       tp53       p53             64               3             3      19   11      Breast Cancer
ATP2A1     atp2a1     Ca-P60A         32               6                    1    11      Brody Myopathy
EYA1       eya1       eya             251              5             4           6       Branchiootorenal Dysplasia
SOX10      sox10      Sox100B         1                17            4           4       Waardenburg-Shah Syndrome
National Center for Biomedical Ontology
                 Capture and index experimental results

   Open Biomedical Ontologies (OBO)    BioPortal    Open Biomedical Data (OBD)

 Revise biomedical understanding
 Relate experimental data to results from other sources
            Phenotype as an
              observation
  context: environment, genetic
  The class of thing observed: ontology
  evidence: publication, figure, assay, sequence ID
Review of proposed EAV EQ
          model
A phenotype is described using an
 Entity-Quality double
Entities are drawn from various OBO
 ontologies—cell, anatomies, GO, …
Qualities are drawn from one
 ontology—PATO
   Separation of concerns
Not phenotypes:
  Genotype
  Environment
  Assay, measurement systems
  Images
schema:
 Association = Genotype Phenotype Environment Assay
 Phenotype = Entity Quality
 Entity = OBOClassID
 Quality = PATOClassID
               2003 Pilot study
 Trial of EAV model on small collection of
  genotypes
    FlyBase
    ZFIN
 Genes were non-orthologous
 New curations - in progress
    orthologous genes with clinical relevance
    Use the same data model and exchange format?
Example data records

Genotype Entity   Value
npo      gut      dysplastic
         gut      small
r210     retina   irregular
         brain    fused
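
The table above leaves the genotype cell blank on some rows. Assuming a blank cell means "same genotype as the row above" (an interpretation, not something the slide states), the rows can be expanded into complete records as sketched below; `fill_down` is a hypothetical helper name.

```python
# Rows as they appear in the table: (genotype, entity, quality); "" marks a blank cell.
ROWS = [
    ("npo", "gut", "dysplastic"),
    ("", "gut", "small"),
    ("r210", "retina", "irregular"),
    ("", "brain", "fused"),
]


def fill_down(rows):
    """Repeat the last seen genotype on rows where the genotype cell is blank."""
    records = []
    current = None
    for genotype, entity, quality in rows:
        if genotype:
            current = genotype
        if current is None:
            raise ValueError("first row must name a genotype")
        records.append({"genotype": current, "entity": entity, "quality": quality})
    return records


RECORDS = fill_down(ROWS)
```

Each resulting record is self-contained, which is the form needed when records are exchanged between databases.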
 ZFIN schema extension: stages

Genotype   Stage                     Entity   Quality
npo        Hatching:Pec-fin          gut      dysplastic
           Hatching:Pec-fin          gut      small
r210       Hatching:Long-pec         retina   irregular
           Larval:Protruding-mouth   brain    fused
                        Stages

Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality
Entity = OBOClassID
Stage = OBOAnatomicalStageClassID
Quality = PATOClassID




                                      * means zero or more
    Monadic and relational
          qualities
 Monadic:
    the quality inheres in a single entity
 Relational:
    the quality inheres in two or more entities
        sensitivity of an organism to a kind of drug
        sensitivity of an eye to a wavelength of light
    can turn relational qualities into cross-product monadic
     qualities
        e.g. sensitivityToRedLight
        better to use relational qualities
            avoids redundancy with existing ontologies
         Incorporating relational
                qualities

Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality Entity*
Entity = OBOClassID
Quality = PATOVersion2ClassID



  Example data record:
  Phenotype = “organism” sensitiveTo “puromycin”
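
A relational quality can be sketched by giving the phenotype an optional list of additional entities, matching the trailing `Entity*` in the schema. This is an illustrative sketch only; the field name `related_entities` and the helper `is_relational` are hypothetical, and labels stand in for class IDs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phenotype:
    entity: str                              # bearer of the quality
    quality: str                             # PATO class (label as stand-in)
    related_entities: Tuple[str, ...] = ()   # empty for monadic qualities

    def is_relational(self) -> bool:
        # A quality is relational when it inheres in more than one entity.
        return len(self.related_entities) > 0


monadic = Phenotype("gut", "small")
relational = Phenotype("organism", "sensitiveTo", ("puromycin",))

print(monadic.is_relational(), relational.is_relational())
```

With this shape, "puromycin" is referenced from its own ontology instead of being baked into a pre-composed quality such as sensitivityToRedLight.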
          Measurable qualities
 Some qualities are inexact and implicitly relative to a
  wild-type or normal quality
    relatively short, relatively long, relatively reduced
    easier than explicitly representing:
        this tail length shorter-than ‘mouse’ wild-type tail length
 Some qualities are determinable
    use a measure function
        unit, value, {time}
    this tail has length L
        measure(L, cm) = 2
 Keep measurements separate from (but linked to)
  quality ontology
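
The measure function can be sketched as a lookup over measurements stored apart from the quality terms. This is a minimal sketch under stated assumptions: `QualityInstance`, `record`, and the `measurements` store are hypothetical names, and only the unit, value, and optional time come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class QualityInstance:
    bearer: str   # the particular thing, e.g. "this tail"
    quality: str  # the determinable quality, e.g. "length"


@dataclass(frozen=True)
class Measurement:
    unit: str
    value: float
    time: Optional[str] = None  # optional, as in {time}


# Measurements live in their own store, keyed by quality instance,
# so the quality ontology itself stays free of numeric values.
measurements: Dict[QualityInstance, List[Measurement]] = {}


def record(q: QualityInstance, m: Measurement) -> None:
    measurements.setdefault(q, []).append(m)


def measure(q: QualityInstance, unit: str) -> Optional[float]:
    """Return the first recorded value of q in the given unit, if any."""
    for m in measurements.get(q, []):
        if m.unit == unit:
            return m.value
    return None


L = QualityInstance(bearer="this tail", quality="length")
record(L, Measurement(unit="cm", value=2))
print(measure(L, "cm"))
```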
   Incorporating measurements

Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Quality Entity* Measurement*
Measurement = Unit Value (Time)
Entity = OBOClassID
Quality = PATOVersion2ClassID



     Example data record:
     Phenotype = “gut” “acidic” Measurement = “pH” 5
The Methodology of Annotations
   Scientific curators use experimental
    observations reported in the biomedical
    literature to link gene products with GO
    terms in annotations.
   The gene annotations taken together
    yield a slowly growing
    computer-interpretable map of biological reality.
   The process of annotating literature also
    leads to improvements and extensions of
    the ontology itself, which institutes a
    virtuous cycle of improvement in the
    quality and reach of future annotations
    and of future versions of the ontology.
   When we annotate the
  record of an experiment
we use terms representing types to
 capture what we learn about the
 instances
  this experiment as a whole (a process)
  these instances experimented upon
    the instances are typical
    they are representatives of a type
              Ontology

 A thing of beauty is a joy forever
With acknowledgement and thanks to: Mark
Musen, Barry Smith, Sima Misra, Chris Mungall,
        Daniel Rubin, and David Hill
Interpretation of the schema
How should EAV data records be
 interpreted by a computer?
  What are the instances?
Is the EAV schema just to improve
 database searching?
  Can it be used for meaningful cross-
   species comparisons?
What is the Entity slot used for?
Genotype      Entity                         Quality
npo           gut                            dysplastic
              gut                            small
r210          retina                         irregular
              brain                          fused
tm84          d/v pattern formation          abnormal

              blood islands                  number increased
Bsb[2]        elongation of arista lateral   arrested
C-alpha[1D]   adult behaviour                uncoordinated


                                             2003 trial data: FB & ZFIN
What is the Entity slot used for?
  In practical terms:
    An ID from one of the following ontologies
       GO CC, BP, and MF
       Species-specific anatomical ontology
       OBO Cell
    Or a cross-product
       e.g. acidification GO:0045851
          which has_location (OBO_REL) midgut FBbt00005383
       [example from FBal0062296: Acidification in the
        midgut of homozygous larvae is often less than in
        wild-type larvae]
  But what does it mean in the context of
   an annotation?
  Universals and particulars
 An ontology consists of universals (classes)
   Fruitfly, wing, flight
 Experimental data generally concerns
  particulars (instances) that instantiate
  universals
   this particular wing of this particular fruitfly
   this particular fruitfly participating in this
    particular flight from here to there
 In annotation we often use a class ID as a
  proxy for an (unnamed) instance (or
  collection of instances)
 It is important to always keep this
  distinction in mind
  What is the Quality slot used
              for?
Genotype      Entity                         Quality               Quality
npo           gut                            structure             dysplastic
              gut                            relative size         small
r210          retina                         pattern               irregular
              brain                          structure             fused
tm84          d/v pattern formation          qualitative           abnormal
              blood islands                  relative number       number increased
Bsb[2]        elongation of arista lateral   process               arrested
C-alpha[1D]   adult behaviour                behavioral activity   uncoordinated


                                                           2003 trial data: FB & ZFIN
                  Qualities
 Treat as Qualities
   a dependent entity
      a quality must have independent entity(s) as bearer
      the quality inheres_in the bearer
 Examples
   The particular shape of this ball
   The particular structure of this wing
   The particular length of this tail
   The particular rate of synaptic transmission
    between these two neurons
    Attribute universals vs
     Attribute particulars




 In an EAV annotation, a PATO class ID typically serves as a proxy for
  an unnamed attribute instance
 Universals (classes) must always be defined in terms of their instances



   A Formal Theory of Substances, Qualities, and Universals
      Fabian NEUHAUS Pierre GRENON Barry SMITH
  How is the attribute slot used?
Genotype      Entity                         Attribute             Attribute
npo           gut                            structure             dysplastic
              gut                            relative size         small
r210          retina                         pattern               irregular
              brain                          structure             fused
tm84          d/v pattern formation          qualitative           abnormal
              blood islands                  relative number       number increased
Bsb[2]        elongation of arista lateral   process               arrested
C-alpha[1D]   adult behaviour                behavioral activity   uncoordinated
                 Current Model

Association = Genotype Phenotype Environment Assay
Phenotype = Stage* Entity Attribute
Entity = OBOClassID
Stage = OBOAnatomicalStageClassID
Attribute = PATOClassID




                                      * means zero or more
Composite phenotype classes
 Mammalian phenotype has composite
  phenotype classes
   e.g. “reduced B cell number”
 Compose at annotation time or ontology
  curation time?
   False dichotomy
 Core 2 will help map between composite
  class based annotation and EAV
  annotation
  Interpreting annotations
Annotations are data records
  typically use class IDs
  implicitly refer to instances
How do we map an annotation to
 instances?
Important for using annotations
 computationally
   Interpreting annotations (I)

 What does an EA (or EAV) annotation mean?
    Annotation:
       Genotype=“FBal00123” E=“brain” A=“fused”
    presumed implied meaning:
       this organism
           has_part x, where
               x instance_of “brain”
               x has_quality “fused”
    or in natural language:
       “this organism has a fused brain”
 Various built-in assumptions
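
The presumed mapping from an annotation to statements about instances can be sketched as a function that emits triples. This is an illustrative sketch, not a defined part of the model: the function name `interpret`, the placeholder `x`, and the string used for the unnamed organism are all assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def interpret(genotype: str, entity: str, attribute: str) -> List[Triple]:
    """Expand an EA annotation into the implied instance-level statements."""
    organism = f"this organism ({genotype})"  # unnamed instance bearing the genotype
    part = "x"                                # unnamed instance of the entity class
    return [
        (organism, "has_part", part),
        (part, "instance_of", entity),
        (part, "has_quality", attribute),
    ]


for triple in interpret("FBal00123", "brain", "fused"):
    print(triple)
```

The built-in assumptions are visible in the code: that the entity is a part of the organism, that exactly one instance is meant, and that the attribute is borne by that instance.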
Interpreting annotations (II)

 What does this mean:
   annotation:
      Genotype=“FBal00123” E=“wing” A=“absent”
   using same mapping as annotation I:
      fly98 has_part x, where
          x instance_of “wing”
          x has_quality “absent”
   or in natural language:
      this fly has a wing which is not there!
   What we really intend:
      NOT(this organism has_part x, where x instance_of
       “wing”)
    Interpreting annotations (II)

 What does this mean:
   annotation:
       Genotype=“FBal00123” E=“wing” A=“absent”
   using same mapping as annotation I:
       this organism has_part x, where
           x instance_of “wing”
           x has_quality “absent”
   or in natural language:
      this fly has a wing which is not there!
   What we really intend:
       this organism has_quality “wingless”
       “wingless” = the property of having count(has_part “wing”)=0
  Are our computational
representations intended to
      capture reality?
           Does this matter?

 If we simply use the colloquial expression
  “absent”
 What are the consequences?
    Basic search will be fine
       e.g. “find all wing phenotypes”
 Logical reasoners will compute incorrect results
    Computers will not be able to reason correctly
    We must explicitly provide specific rules for certain
     attributes such as “absent”
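
One way to picture such an explicit rule is a special case in the mapping from annotation to instance-level statements. The sketch below is an assumption-laden illustration, not a proposed implementation: the function names and the "NOT"/"ASSERT" modes are hypothetical, and only the "absent" rule from the slides is covered.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def naive_mapping(entity: str, attribute: str) -> List[Triple]:
    # The default reading: the organism has a part bearing the attribute.
    return [
        ("this organism", "has_part", "x"),
        ("x", "instance_of", entity),
        ("x", "has_quality", attribute),
    ]


def interpret(entity: str, attribute: str) -> Tuple[str, List[Triple]]:
    """Return (mode, triples); mode "NOT" negates the whole pattern."""
    if attribute == "absent":
        # Intended meaning: the organism has no part of this type at all.
        return ("NOT", [
            ("this organism", "has_part", "x"),
            ("x", "instance_of", entity),
        ])
    return ("ASSERT", naive_mapping(entity, attribute))


print(interpret("wing", "absent"))
print(interpret("brain", "fused"))
```

Without the special case, a reasoner given the naive triples for "absent" would conclude that a wing instance exists.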
 Interpreting annotations (III)

 What does this mean:
   annotation:
      E=“digit” A=“supernumerary”
   using same interpretation as annotation I:
      this organism has_part x, where
          x instance_of “digit”
          x has_quality “supernumerary”
   or in natural language:
      this organism has a particular finger which is supernumerary!!
   What we really intend:
      this person has_quality “supernumerary finger”
      “supernumerary finger” = the property of having
       count(has_part “digit”) > wild-type
Interpreting annotations (IV)
 What does this mean:
   annotation:
     Gt=“mp001” E=“brown fat cell” A=“increased
       quantity”
   using same mapping as annotation I:
      this organism has_part x, where
          x instance_of “brown fat cell”
          x has_quality “increased quantity”
   or in natural language:
      this organism has a particular brown fat cell which is
       increased in quantity
   What we really intend:
      this organism has_part population_of(“brown fat cell”)
       which has_quality increased size
       Other use cases
Spermatocyte devoid of asters
Homeotic transformations
Increased distance between wing
 veins
Some vs. all
      Alternate perspectives
 process vs. state
    regulatory processes:
        acidification of midgut has_quality reduced rate
        midgut has_quality low acidity
 development vs. behavior
    wing development has_quality abnormal
    flight has_quality intermittent
 granularity (scale)
    chemical vs. molecular vs. cell vs. tissue vs. anatomical part

								