									Data Representation and the
     Role of Ontologies


     PHAR 201/Bioinformatics I
        Philip E. Bourne
   Department of Pharmacology, UCSD


  • Prerequisite reading: Genome Research (2001) 11:1425-1433
              PHAR 201 Lecture 4, 2012   1
 Consider this Course a Workflow in How You Will Handle Data
        (Regardless of Type) For the Rest of Your Lives
 We Use Macromolecular Structure Data to Illustrate the Process
   And Hence Learn Structural Bioinformatics in the Process

 • Data in
 • Understand the scope and complexity of the data
 • Understand the experiment to understand the errors
 • Recognize redundancy in the data
 • Understand the methods to physically instantiate the model
 • Understand how to best represent (model) the data
 • Classify the data
 • Analyze the data
 • Discover new science from the data


                      Agenda

• Before there were ontologies there was mmCIF

• Briefly review the history of ontology development

• Review the Gene Ontology (GO)
  – Motivation
  – Features
  – Related research activities around GO



             The PDB Format
• A full description is here
• It was designed around an 80 column punched
  card!
• It was designed to be human readable
• It is used by almost every piece of software that
  deals with structural data




      The PDB Format - Records
• Every PDB file may be broken into a number of lines
  terminated by an end-of-line indicator. Each line in the
  PDB entry file consists of 80 columns. The last character
  in each PDB entry should be an end-of-line indicator.

• Each line in the PDB file is self-identifying. The first six
  columns of every line contain a record name, left-justified
  and blank-filled. This must be an exact match to one of the
  stated record names.

• The PDB file may also be viewed as a collection of record
  types. Each record type consists of one or more lines.

• Each record type is further divided into fields.
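
The record rules above can be sketched in a few lines of Python. This is a minimal illustration, not a complete parser: the function names and the sample lines are made up for the example, and the only rule taken from the slide is that the record name sits in the first six columns of an 80-column line.

```python
from collections import defaultdict


def record_name(line: str) -> str:
    """Return the record name held in columns 1-6 (left-justified, blank-filled)."""
    return line[:6].rstrip()


def group_by_record_type(lines):
    """Group lines into record types, keyed by record name."""
    records = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n").ljust(80)  # pad short lines to 80 columns
        records[record_name(line)].append(line)
    return dict(records)


# Hypothetical sample lines, for illustration only.
sample = [
    "HEADER    TOXIN",
    "REMARK   3 REFINEMENT.",
    "REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE",
    "END",
]

groups = group_by_record_type(sample)
print({name: len(rows) for name, rows in groups.items()})
```

Splitting each record type further into fields requires a separate column table per record name, which is where most of the parsing effort goes.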
The PDB Format – An Example –
         The Header




The PDB Format – An Example – The
       Atomic Coordinates




The Description – Atom Records




 What is Wrong with this Approach?
• The description and the data are separate
• Parsing is a nightmare – the most complex piece
  of code we have in our research laboratory
  probably remains the PDB parser
• There are no relationships between items of data
• Some data just cannot be parsed
• The fixed column format cannot represent some of
  today’s structures …
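
One concrete limit of fixed columns can be shown with a short sketch. It assumes the five-column atom serial number field (columns 7-11) of the published PDB format description; the function name is hypothetical.

```python
def format_serial(serial: int) -> str:
    """Format an atom serial number into a 5-column, right-justified field."""
    field = f"{serial:>5d}"
    if len(field) > 5:
        # The number is wider than the field, so the fixed layout cannot hold it.
        raise ValueError(f"serial {serial} does not fit in 5 columns")
    return field


print(repr(format_serial(99999)))  # the largest serial that still fits

try:
    format_serial(100000)
except ValueError as err:
    print(err)
```

A structure with more atoms than the field can number has to be split or renumbered, which is one reason large structures end up spread over multiple files.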


Structures are Spread Over Multiple Files –
     Most Users are Not Aware of this




                     PDB Format -
  Important Components of the Data are Lost to
               All But Humans

REMARK   3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK   3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK   3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK   3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK   3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK   3 ANGSTROMS.


REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK   4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK   4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK   4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK   4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK   4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK   4 STRUCTURE (*2EBX*) WERE USED.
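
To see why such remarks are effectively lost to software, consider what it takes to recover the R value from the REMARK 3 text above. The sketch below is one possible approach, not an established parser: the regular expression is an assumption about the wording and fails as soon as an author phrases the sentence differently.

```python
import re

# REMARK 3 lines copied from the example above.
remark_3 = [
    "REMARK   3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF",
    "REMARK   3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R",
    "REMARK   3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN",
    "REMARK   3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL",
    "REMARK   3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0",
    "REMARK   3 ANGSTROMS.",
]

# Strip the 11-character "REMARK   3 " prefix and rejoin the wrapped sentence.
text = " ".join(line[11:].strip() for line in remark_3)


def find_r_value(free_text):
    """Return the R value if the text uses the exact phrase 'R VALUE IS x.xxx'."""
    match = re.search(r"R VALUE IS (\d+\.\d+)", free_text)
    return float(match.group(1)) if match else None


print(find_r_value(text))                             # matches this entry's wording
print(find_r_value("THE FINAL R-FACTOR WAS 0.168"))   # same fact, different wording
```

The second call returns None even though the sentence carries the same information, which is the core weakness of storing data as prose.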

   mmCIF Was Developed to
   Address these Problems


Methods in Enzymology. 1997 277, 571-590




   mmCIF – Scope of the Initial
            Effort
• All PDB data should be captured
• Describe a paper’s material and methods
  section
• Describe biologically active molecule
• Fully describe secondary structure but not
  tertiary or quaternary
• Describe details of chemistry (inc. 2D)
• Meaningful 3D views

      mmCIF - Extract from a Data File
loop_
  _atom_site.group_PDB
  _atom_site.type_symbol
  _atom_site.label_atom_id
  _atom_site.label_comp_id
  _atom_site.label_asym_id
  _atom_site.label_seq_id
  _atom_site.label_alt_id
  _atom_site.Cartn_x
  _atom_site.Cartn_y
  _atom_site.Cartn_z
  _atom_site.occupancy
  _atom_site.B_iso_or_equiv
  _atom_site.footnote_id
  _atom_site.entity_id
  _atom_site.entity_seq_num
  _atom_site.id
 ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
 ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
 ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
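
Because every value in the loop is named by the item list that precedes it, reading the extract takes very little code. The following is a minimal sketch under simplifying assumptions (whitespace-separated values, no quoted strings, no multi-line text fields); `parse_loop` is a hypothetical helper, not part of any mmCIF library.

```python
def parse_loop(text: str):
    """Parse one simple mmCIF loop_ block into a list of row dictionaries."""
    names, rows = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            names.append(line)          # an item name such as _atom_site.Cartn_x
        else:
            values = line.split()       # a data row: one value per item name
            if len(values) != len(names):
                raise ValueError(f"expected {len(names)} values, got {len(values)}")
            rows.append(dict(zip(names, values)))
    return rows


# The first two rows of the extract above.
extract = """
loop_
  _atom_site.group_PDB
  _atom_site.type_symbol
  _atom_site.label_atom_id
  _atom_site.label_comp_id
  _atom_site.label_asym_id
  _atom_site.label_seq_id
  _atom_site.label_alt_id
  _atom_site.Cartn_x
  _atom_site.Cartn_y
  _atom_site.Cartn_z
  _atom_site.occupancy
  _atom_site.B_iso_or_equiv
  _atom_site.footnote_id
  _atom_site.entity_id
  _atom_site.entity_seq_num
  _atom_site.id
 ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
 ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
"""

rows = parse_loop(extract)
print(rows[1]["_atom_site.label_atom_id"], rows[1]["_atom_site.Cartn_x"])
```

Unlike the PDB format, the lookup is by name rather than by column position, so adding a new item to the loop does not break existing readers.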

       mmCIF - Extract from the Dictionary
save__atom_site.Cartn_x
  _item_description.description
;        The x atom site coordinate in angstroms specified according to
         a set of orthogonal Cartesian axes related to the cell axes as
         specified by the description given in
         _atom_sites.Cartn_transform_axes.
;
  _item.name              '_atom_site.Cartn_x'
  _item.category_id          atom_site
  _item.mandatory_code            no
  _item_aliases.alias_name '_atom_site_Cartn_x'
  _item_aliases.dictionary      cifdic.c94
  _item_aliases.version        2.0
   loop_
  _item_dependent.dependent_name
                     '_atom_site.Cartn_y'
                     '_atom_site.Cartn_z'
  _item_related.related_name '_atom_site.Cartn_x_esd'
  _item_related.function_code associated_esd
  _item_sub_category.id         cartesian_coordinate
  _item_type.code            float
  _item_type_conditions.code esd
  _item_units.code           angstroms
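
A dictionary entry like this one is machine-readable, so software can check data against it. The sketch below hand-copies a few attributes of the entry into a Python dict (it does not parse a real dictionary file) and uses them to validate a data row; `DICTIONARY`, `CASTS` and `check_row` are names invented for the illustration.

```python
# Hand-copied subset of the dictionary entry shown above.
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category_id": "atom_site",
        "mandatory": False,
        "type": "float",
        "units": "angstroms",
        "dependents": ["_atom_site.Cartn_y", "_atom_site.Cartn_z"],
    },
}

# Map dictionary type codes to Python conversion functions.
CASTS = {"float": float, "int": int}


def check_row(row, dictionary):
    """Return a list of problems found in one data row."""
    problems = []
    for name, spec in dictionary.items():
        if name not in row:
            if spec["mandatory"]:
                problems.append(f"{name} is mandatory but missing")
            continue
        try:
            CASTS[spec["type"]](row[name])
        except ValueError:
            problems.append(f"{name}={row[name]!r} is not a {spec['type']}")
        for dependent in spec["dependents"]:
            if dependent not in row:
                problems.append(f"{name} requires {dependent}")
    return problems


good = {"_atom_site.Cartn_x": "25.360", "_atom_site.Cartn_y": "30.691",
        "_atom_site.Cartn_z": "11.795"}
bad = {"_atom_site.Cartn_x": "abc", "_atom_site.Cartn_y": "30.691"}

print(check_row(good, DICTIONARY))
print(check_row(bad, DICTIONARY))
```

The point of the design is that the checks live in the dictionary, not in the program: a new item or a changed type needs a dictionary edit, not new code.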
                 Summary
• mmCIF has provided the PDB with a robust data
  representation that serves as the conceptual and
  physical schema upon which the current RCSB,
  PDBe and PDBj are built
• This work predated XML and XML Schema but
  embodies the important concepts inherent in these
  descriptions
• mmCIF was later converted exactly into XML; the
  XML version is now used more than mmCIF, but
  much less than the old PDB format
• PDB format will be phased out over a period of
  years
                      Agenda

• Before there were ontologies there was mmCIF

• Briefly review the history of ontology development

• Review the Gene Ontology (GO)
  – Motivation
  – Features
  – Related research activities around GO



Formal Definitions Taken from Knowledge
             Engineering ….
1. A systematic account of existence.

2. (From philosophy) An explicit formal specification of how to
   represent the objects, concepts and other entities that are
   assumed to exist in some area of interest and the relationships
   that hold among them.

3. For AI systems, what "exists" is that which can be represented.
   When the knowledge about a domain is represented in a
   declarative language, the set of objects that can be represented
   is called the universe of discourse. We can describe the ontology
   of a program by defining a set of representational terms.
   Definitions associate the names of entities in the universe of
   discourse (e.g. classes, relations, functions or other objects) with
   human-readable text describing what the names mean, and
   formal axioms that constrain the interpretation and well-formed
   use of these terms. Formally, an ontology is the statement of a
   logical theory.

Formal Definitions Taken from Knowledge
        Engineering Continued

4. A set of agents that share the same ontology will be able to
   communicate about a domain of discourse without necessarily
   operating on a globally shared theory. We say that an agent
   commits to an ontology if its observable actions are consistent
   with the definitions in the ontology. The idea of ontological
   commitment is based on the Knowledge-Level perspective.

5. The hierarchical structuring of knowledge about things by
   subcategorizing them according to their essential (or at least
   relevant and/or cognitive) qualities. See subject index. This is an
   extension of the previous senses of "ontology" (above) which has
   become common in discussions about the difficulty of
   maintaining subject indices.




    We will not focus too much on
      the formal definitions 


  But more on how these formal
concepts have been applied to biology

The History of Ontologies from a Biological
              Perspective …
• Early biological database efforts (1990s)
  adopted knowledge bases as a model, e.g.
  RiboWeb
• They used products from the AI
  community, e.g. Ontolingua
• Some of the concepts of knowledge bases
  remain – notably ontologies, but they are
  now mostly cast in more familiar
  commercial frameworks, e.g. relational
  databases
 The History of Ontologies from a Biological
           Perspective Continued
• The biological community in general was slow to see the
  value
• The medical informatics community adopted ontologies
  early
• In the late 1990s database providers in particular began to
  work together – the Gene Ontology (GO) being a major
  product of this effort
• 1998-2004: ontologies were the rage, warranted
  their own session at bioinformatics meetings and were
  taken seriously by the biological community
• 2004 onward: accepted as part of biological data
  representation and use
The History of Ontologies from a Biological
          Perspective Continued
• Centers established to support the
  maintenance of ontologies:
  – The Open Biomedical Ontologies (OBO)
    Foundry
  – National Center for Biomedical Ontology
    (BioPortal 2.0)




    What Isn’t An Ontology?
• A database or program
  – Because it shares internal formats only; it is not
    global
• A table of contents
  – Because it is not a formal representation of the
    concepts
• A terminology (aka controlled vocabulary)
  – Because it is a set of terms without a formal
    structure of how they relate


     Examples of Valuable Terminologies
    (Controlled Vocabularies) That Are Not
                  Ontologies

•   ICD-9 for diseases
•   SNOMED/RCD codes for symptoms
•   EC Numbers (?)
•   Taxonomy
•   SMILES strings


      Ontology As Language
• The ontology becomes the language of the
  domain it describes

• The language = syntax + semantics

• While that language must be understood
  by computers, human readability counts


          Ontology as Contract

Purposes of Ontologies            Parties to the contract
• data exchange                   • programmers
• unification/translation         • data admins
• calling knowledge services      • programmers, netbots
• representing theories           • scientists
• human communication             • collaborators

         Ontology Specifications
• XML – provides a syntax for structured documents
• XML Schema - a language for structuring XML
  documents and adding data types
• RDF - a data model for objects and the relations
  between them, represented in XML
• RDF Schema – describes properties and classes
  of RDF resources with semantics to generalize
• OWL 2 – Web Ontology Language – adds more
  vocabulary particularly of relationships between
  classes (e.g. disjointness, cardinality)
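
The layering above can be illustrated without any RDF tooling. In this sketch plain Python tuples stand in for RDF triples; the property names `rdf:type`, `rdfs:subClassOf` and `owl:disjointWith` are real vocabulary terms, while every `ex:` resource is invented for the example. Real work would use an RDF library and a reasoner rather than these hand-written functions.

```python
# (subject, predicate, object) triples; all ex: names are hypothetical.
TRIPLES = {
    ("ex:Enzyme", "rdfs:subClassOf", "ex:Protein"),        # RDF Schema: class hierarchy
    ("ex:Kinase", "rdfs:subClassOf", "ex:Enzyme"),
    ("ex:Protein", "owl:disjointWith", "ex:NucleicAcid"),  # OWL: class disjointness
    ("ex:abl1", "rdf:type", "ex:Kinase"),                  # RDF: a typed resource
}


def superclasses(cls):
    """All classes reachable from cls through rdfs:subClassOf (transitive)."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in TRIPLES:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found


def types_of(resource):
    """Direct types of a resource plus the types inferred from the hierarchy."""
    direct = {o for s, p, o in TRIPLES if s == resource and p == "rdf:type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls)
    return inferred


def violates_disjointness(resource):
    """True if the resource belongs to two classes declared disjoint."""
    types = types_of(resource)
    return any(p == "owl:disjointWith" and s in types and o in types
               for s, p, o in TRIPLES)


print(sorted(types_of("ex:abl1")))
print(violates_disjointness("ex:abl1"))
```

Each layer adds meaning a program can act on: RDF states the facts, RDF Schema lets types be inferred, and OWL lets contradictions be detected.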
    Here is Another One..
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22_colored.html




   References: Ontologies in
        Bioinformatics
• Bio-ontologies workshops since 1997
• Historical papers on knowledge
  sharing
• mmCIF as an ontology - Westbrook
  and Bourne (2000) Bioinformatics
  16(2) 159-168 [PDF]
• Review 2006 – Bodenreider and
  Stevens Briefings in Bioinformatics
                      Agenda

• Before there were ontologies there was mmCIF

• Briefly review the history of ontology development

• Review the Gene Ontology (GO)
  – Motivation
  – Features
  – Related research activities around GO



                  References
• GO Itself - Creating the Gene Ontology Resource:
  Design and Implementation Genome Research
  (2001) 11:1425-1433
• Nucleic Acids Res. 2010 Jan;38(Database issue):D331-
  5. Epub 2009 Nov 17.
• The GO Website - http://www.geneontology.org
• Application of GO –
  The Gene Ontology Annotation (GOA) project:
  implementation of GO in SWISS-PROT, TrEMBL, and
  InterPro Genome Res. 2003 Apr;13(4):662-72. Epub
  2003 Mar 12

            Brief History
• Started by Saccharomyces Genome
  Database, FlyBase and the Mouse
  Genome Database

• Grown to a consortium of members (see
  here)



  Roles of the GO Consortium
• Write and maintain the ontologies
  themselves
• Associate the ontologies with genes in the
  respective databases of members
• Provide tools to facilitate the development
  and maintenance of ontologies



            Gene Ontology (GO)
                    http://www.geneontology.org/

•   Three levels of annotation:

    – Molecular function - what a gene product does at the
      biochemical level

    – Biological process - a broad biological perspective – not
      currently a pathway (no dynamics or dependencies)

    – Cellular component - location within cellular structures (e.g. Golgi
      apparatus) and macromolecular complexes (e.g. ribosome)



GO Goals
From Genome Res 2001
  Aug;11(8):1425-33




                Structure of GO-
         Directed Acyclic Graph (DAG)

Example from molecular function:

Parents:  Transmembrane receptor        Protein tyrosine kinase
                   ^                              ^
                   | is_a                         | is_a
Child:    Transmembrane receptor protein tyrosine kinase
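
What makes this a DAG rather than a tree is that a child may have more than one parent. A minimal sketch of the example follows; the three terms come from the slide, while the shared root "molecular function" and both function names are assumptions added for the illustration.

```python
# child -> list of is_a parents; a child may have several parents.
PARENTS = {
    "transmembrane receptor protein tyrosine kinase": [
        "transmembrane receptor",
        "protein tyrosine kinase",
    ],
    "transmembrane receptor": ["molecular function"],   # assumed root, for illustration
    "protein tyrosine kinase": ["molecular function"],  # assumed root, for illustration
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def is_acyclic(parents=PARENTS):
    """A graph is acyclic if no term is its own ancestor."""
    return all(term not in ancestors(term, parents) for term in parents)


print(sorted(ancestors("transmembrane receptor protein tyrosine kinase")))
print(is_acyclic())
```

A gene product annotated to the child term is implicitly annotated to every ancestor, along both parent paths.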



            Structure of GO-
     Directed Acyclic Graph (DAG)
     Relationship of Child to Parent
• is_a – the child is an instance of the parent
  – A mitotic chromosome is_a chromosome

• part_of – the child is a component of the parent
  – A telomere is part_of a chromosome
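
In a data structure, the two relationship types are simply labels on the edges of the graph. The short sketch below stores the two examples above as labeled edges; `EDGES` and `children_of` are hypothetical names.

```python
from collections import defaultdict

# (child, relationship, parent), taken from the two examples above.
EDGES = [
    ("mitotic chromosome", "is_a", "chromosome"),
    ("telomere", "part_of", "chromosome"),
]


def children_of(parent):
    """Group the children of a term by relationship type."""
    grouped = defaultdict(list)
    for child, relation, target in EDGES:
        if target == parent:
            grouped[relation].append(child)
    return dict(grouped)


print(children_of("chromosome"))
```

Keeping the label on the edge lets a query separate kinds of chromosome from parts of a chromosome, which a single unlabeled parent link cannot do.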


Example - Molecular Function




Example - Biological Process




Example - Cellular Location




Use of GO within the PDB
   http://pdb.rcsb.org




Use of GO Within the Open Literature




              Some Issues –
 Levels of Granularity – Species Specificity


• Chitin metabolism is part of cuticle
  synthesis in fly

• Chitin metabolism is part of cell wall
  organization in yeast


             Some Issues
• GO is dynamic – parent child relationships
  can change
• When does a process begin and end?
• is_a and part_of are not always clear – is the actin
  cytoskeleton is_a cytoskeleton or part_of
  cytoskeleton?
• A community effort

 Relationship to Gene Products
• A gene product is a protein or functional
  RNA
• A gene product may have more than one
  function and therefore be related to
  multiple GO terms
• The name of a gene product may only
  reflect one of its functions


     GO is Really 3 Independent
            Ontologies
• Annotation of a gene product by one ontology is
  independent of its annotation by another
  ontology
• Example: Products of the MDH1, MDH2 and
  MDH3 genes are all isoforms of malate
  dehydrogenase in yeast with the same
  function, but localize to different cellular
  locations and are involved in different
  biochemical processes
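
The independence of the three ontologies shows up naturally if each annotation records which ontology it belongs to. The sketch below is illustrative only: the term labels are simplified and typed in by hand, not retrieved from the GO database, and the `Annotation` class and `terms` function are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    ontology: str  # "molecular_function", "biological_process" or "cellular_component"
    term: str


# Illustrative, hand-entered labels; not pulled from GO.
ANNOTATIONS = [
    Annotation("MDH1", "molecular_function", "malate dehydrogenase activity"),
    Annotation("MDH2", "molecular_function", "malate dehydrogenase activity"),
    Annotation("MDH3", "molecular_function", "malate dehydrogenase activity"),
    Annotation("MDH1", "cellular_component", "mitochondrion"),
    Annotation("MDH2", "cellular_component", "cytosol"),
    Annotation("MDH3", "cellular_component", "peroxisome"),
]


def terms(gene_product, ontology):
    """Terms assigned to one gene product within one ontology."""
    return {a.term for a in ANNOTATIONS
            if a.gene_product == gene_product and a.ontology == ontology}


for gene in ("MDH1", "MDH2", "MDH3"):
    print(gene, terms(gene, "molecular_function"), terms(gene, "cellular_component"))
```

Querying by function groups the three isoforms together, while querying by component separates them; neither answer constrains the other.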

          Evidence Codes
• The evidence for assigning a gene product
  to a GO term itself has a controlled
  vocabulary
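
A controlled vocabulary maps well onto an enumeration: any code outside the list is rejected. The sketch below includes only a small subset of the GO evidence codes, and the annotation tuples use made-up gene and term names.

```python
from enum import Enum


class Evidence(Enum):
    """A small subset of the GO evidence codes."""
    IDA = "Inferred from Direct Assay"
    IMP = "Inferred from Mutant Phenotype"
    ISS = "Inferred from Sequence or structural Similarity"
    TAS = "Traceable Author Statement"
    IEA = "Inferred from Electronic Annotation"


def parse_evidence(code: str) -> Evidence:
    """Accept only codes that belong to the vocabulary."""
    try:
        return Evidence[code]
    except KeyError:
        raise ValueError(f"{code!r} is not in the evidence vocabulary") from None


# Hypothetical annotations: (gene product, GO term label, evidence code).
annotations = [
    ("geneA", "kinase activity", parse_evidence("IDA")),
    ("geneB", "kinase activity", parse_evidence("IEA")),
]

# Keep only annotations that do not rest on electronic inference alone.
reviewed = [a for a in annotations if a[2] is not Evidence.IEA]
print([gene for gene, _, _ in reviewed])
```

Filtering on the evidence code is a common first step in analyses that should not rely on purely computational annotations.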




Research Applications of GO




Research Applications of GO



