Pathways Databases A Case Study in Computational Symbolic Theories

							Pathways Databases: A Case Study in
 Computational Symbolic Theories


       Author: Peter D. Karp


        Presented by: Gunjan Gupta
              April 13, 2004
                        Overview


1.   What is gene co-expression, and why discover it?
2.   Using DNA microarray data to discover functional gene
     co-expression.
3.   Discussion of how to use phylogenetic information to filter
     out false positives in the microarray data.
4.   Detailed discussion of the proposed solution.
  A *Complete Scientific Theory of a simple organism
                  is very Complex
    A “simple” organism such as E. coli is very complex.

                                Example, for E. coli:
                                - 791 chemical compounds and 744
                                       enzyme-catalyzed biochemical reactions
                                       in the metabolic pathways.
                                - 4290 “protein coding” genes
                                - A perhaps larger number of mostly
                                       unknown RNA-encoding genes




*Explains all the observed processes in an organism
 Too complex (large) for a single scientist to grasp
More complex than a factory producing cars:

         Imagine a car factory that not only produces cars, but produces
         other factories of its own kind, and everything else needed to
         produce, reproduce, repair, fend off competition, fight foreign
         attacks, adapt to a changing environment, & evolve !
 Traditional methods of representing theoretical
                 “knowledge”


1.   Technical papers in journals
2.   Diagrams, flowcharts & tables
3.   Mostly natural language descriptions more suitable for
     humans.
4.   Raw and partially processed data in large biological
     databases: for example, indexed by search criteria and
     annotated in English.
     Why are traditional methods not enough?

1.   Cannot represent scientific theories for biological systems –
     too complex.
2.   Automatic deductions are not possible because the theory is not
     stored in a structured, computer-readable representation.
3.   Natural Language Processing is not yet developed enough
     to allow a computer to search human-readable data.
4.   Only bits and pieces understandable by one person at any
     given time.
5.   Large-scale patterns and deductions might never be found in
     the absence of any one individual who can see the “big
     picture”.
6.   Addition of details from experimentation does not necessarily
     improve our understanding of the whole system.
         Pathway Databases for Biology
1.   A Database of Scientific theory storing “qualitative
     information” – semantics of the theory in a well-defined
     form.
2.   Useful for representing theories that are mostly
     non-numerical in nature; a biological system is well suited.
3.   Definition: A pathway is a linked set of biochemical
     reactions, linked as follows:
     [Figure: example pathway linking Reaction 1, Reaction 2, Enzyme 3,
     Enzyme 4, Raw Material 5, Reaction 6, Substrate 7, and Chemical 8.
     Callout: theory of relationships between 1, 2, 3, 4, 5, 6 & 7 defined.]
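
The definition above can be sketched as a small data structure. This is a minimal illustration, assuming that two reactions are "linked" when a product of one is a substrate of the next. The class and function names are hypothetical, and the way the figure's entities are wired together here is a guess, since the original diagram's arrows are not recoverable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """One enzyme-catalyzed reaction (hypothetical toy model)."""
    name: str
    substrates: frozenset
    products: frozenset
    enzyme: str


def linked(a, b):
    """True if some product of reaction a is a substrate of reaction b."""
    return bool(a.products & b.substrates)


def is_pathway(reactions):
    """A list of reactions forms a linear pathway if each one feeds the next."""
    return all(linked(a, b) for a, b in zip(reactions, reactions[1:]))


# Assumed wiring, for illustration only.
r1 = Reaction("Reaction 1", frozenset({"Raw Material 5"}),
              frozenset({"Substrate 7"}), "Enzyme 3")
r2 = Reaction("Reaction 2", frozenset({"Substrate 7"}),
              frozenset({"Chemical 8"}), "Enzyme 4")

print(is_pathway([r1, r2]))  # r1 produces what r2 consumes
print(is_pathway([r2, r1]))  # reversed order is not linked
```

Real pathways branch and merge, so a graph representation would be needed beyond this linear sketch.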
Different Kinds of Biological Pathways Database
                     (PDB)

 1.       Only Metabolic Pathways (majority): (time= generally fast)
      –      e.g. Production of ATP in the Mitochondria
 2.       Signaling Pathways (time= generally medium)
      –      Intra-cellular: e.g. cell-membrane to nucleus
      –      Extra-cellular: e.g. growth factors for nerve or skin, say
             after an injury.
 3.       Genetic-regulatory pathways (time= generally medium/slow)
      –      Gene expression control: make more as needed, for example
             synthesize amino acids that are deficient in food by
             increasing expression of the related genes.
 4.       A combination of one or more of the above (e.g. PAD,
          EcoCyc).
Pathways Databases: an intersection of 4 fields



          Genomics              Biochemistry

          Artificial            Databases
          Intelligence
              Examples of current PDBs


•       Metabolic Pathway Database
    –      EcoCyc (author) : http://www.ecocyc.org/


•       Signaling Pathways
    –      SPAD: http://www.grt.kyushu-u.ac.jp/spad/


•       Genetic-regulatory pathways

    –      BIOBASE: http://www.gene-regulation.com/
                        EcoCyc Project


•       EcoCyc: Encyclopedia of Escherichia coli K12 Genes and
        Metabolism
    –      At http://www.ecocyc.org/
    –      Links to a bigger site, http://biocyc.org/server.html, containing
           PDBs for other organisms, including human.
What Metabolic & Genetic Pathway Info does
            EcoCyc Contain ?
For each Enzyme in the PDB –
•     Detailed description of the reaction catalyzed by the enzyme.
•     Genes to which the enzyme maps, if available.
•     The range of substrates the enzyme accepts.
•     Chemicals that inhibit or activate the enzyme.
•     Its subunit structure.
•     Each small-molecule enzyme substrate.

Pathway types included for E. coli
    –    Biosynthesis of cellular building blocks.
    –    Extraction of carbon from food.
    –    Extraction of chemical energy from food.
           Tools & Visualization in EcoCyc
1.       SRI International (http://www.sri.com/) developed the
         visualization and search tool called Pathway Tools as an
         intelligent user interface.
2.       Allows the user to exploit the semantic information in the
         PDB and write complex queries.
3.       Results are visualized as a pathway graph called the Overview
         Diagram.
4.       A variety of criteria for the queries –
     –      Name matching
     –      Classification hierarchy (taxonomy, metabolic pathways)
5.       Example:
     –      Find all reactions that are activated or inhibited by a given
            metabolite.
6.       Genetics data can be superimposed on the visualization (see demo).
7.       Users can create a PGDB (Pathway/Genome Database) for a new
         organism and share it.
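
The example query in item 5 can be sketched over toy records. This is only an illustration of the kind of lookup involved, not the Pathway Tools query interface. The record layout, the field names, and the enzyme and metabolite names are all hypothetical.

```python
# Hypothetical toy records; field names are illustrative, not the EcoCyc schema.
ENZYMES = [
    {"enzyme": "Enzyme A", "reaction": "Reaction 1",
     "activators": {"Metabolite X"}, "inhibitors": {"Metabolite Y"}},
    {"enzyme": "Enzyme B", "reaction": "Reaction 2",
     "activators": set(), "inhibitors": {"Metabolite X"}},
    {"enzyme": "Enzyme C", "reaction": "Reaction 3",
     "activators": {"Metabolite Z"}, "inhibitors": set()},
]


def reactions_regulated_by(metabolite, enzymes=ENZYMES):
    """Map each reaction whose enzyme is activated or inhibited by
    `metabolite` to the kind of regulation found."""
    hits = {}
    for record in enzymes:
        if metabolite in record["activators"]:
            hits[record["reaction"]] = "activated"
        elif metabolite in record["inhibitors"]:
            hits[record["reaction"]] = "inhibited"
    return hits


print(reactions_regulated_by("Metabolite X"))
```

Because the activator and inhibitor relationships are stored as structured fields rather than English text, the query is a direct lookup.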
                        Visualization
[Figure: pathway visualization with callouts for metabolite categories,
individual reactions, and the glycolysis region.]
     An application: “automated” inter-species
             comparison of reactions
1.     Yellow shows reactions in E. coli that match those of another
       species, S. cerevisiae, found by database lookup.
2.     Mostly automated layout using Pathway Tools (the author did some
       manual fitting for this diagram in the paper).
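
A database lookup of this kind can be sketched as a set intersection, assuming each organism's reactions are stored under shared identifiers. The identifiers and the two reaction sets below are made up for illustration and are not taken from EcoCyc or any yeast database.

```python
def shared_reactions(species_a, species_b):
    """Reactions present in both organisms (these would be highlighted)."""
    return species_a & species_b


def coverage(species_a, species_b):
    """Fraction of species_a's reactions that also occur in species_b."""
    if not species_a:
        return 0.0
    return len(species_a & species_b) / len(species_a)


# Hypothetical reaction identifiers.
ecoli = {"RXN-1", "RXN-2", "RXN-3", "RXN-4"}
yeast = {"RXN-2", "RXN-4", "RXN-5"}

print(sorted(shared_reactions(ecoli, yeast)))
print(coverage(ecoli, yeast))
```

The comparison only works if both databases name the same reaction the same way, which is one reason a shared ontology matters.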
Exploiting Knowledge Representation AI tools in a
                     PDB

   1.    Build an ontology: a DB schema defining the precise
         relationships between entities (use UML?).
   2.    Encode a theory using the Ontology
   3.    EcoCyc ontology consists of 1000 object-oriented classes
         encoding key concepts of biology and biochemistry.
   4.    Extend Ontology when new concepts are found that
         cannot be derived using existing Ontology.
   5.    Use KR techniques that exploit symbolic AI reasoning to,
         say:
        •     Build an inference engine (see Bruce Porter’s
             Knowledge Machine for example) on the PDB, or
        •     Perform specific global inferences using specific
             relationship searches.
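
As a rough sketch of what a class-based ontology buys, the snippet below stores a few is-a links and answers a simple inference question by following them. The class names and hierarchy are illustrative assumptions, not the actual EcoCyc ontology.

```python
# Hypothetical is-a links: child class -> parent class.
IS_A = {
    "Enzyme": "Protein",
    "Protein": "Macromolecule",
    "Macromolecule": "Chemical",
    "Small-Molecule": "Chemical",
}


def ancestors(cls, is_a=IS_A):
    """All superclasses of cls, nearest first, found by following is-a links."""
    chain = []
    while cls in is_a:
        cls = is_a[cls]
        chain.append(cls)
    return chain


def is_kind_of(cls, ancestor, is_a=IS_A):
    """Inference: cls is a kind of ancestor if they are equal or linked by is-a."""
    return cls == ancestor or ancestor in ancestors(cls, is_a)


print(ancestors("Enzyme"))
print(is_kind_of("Enzyme", "Chemical"))
print(is_kind_of("Small-Molecule", "Protein"))
```

Nothing states directly that an enzyme is a chemical. The fact is derived from the stored relationships, which is the simplest form of the symbolic inference mentioned in item 5.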
Example of Symbolic Inferences on EcoCyc


•    Results of a search changed the simplistic notion of what
     a gene is:
    – In E. coli, 1 out of 7 enzymes catalyzes more
        than one reaction, and in almost 1 out of 7 cases
        a reaction is used in more than one pathway
        (overlapping sub-graphs or clusters). This can be
        treated as a discovered theory.
•    Characterized the relationships among transcription factors
     (see next slide).
•    Other interesting theories found using PDBs:
    – Scale-free network topology that follows a power law
        for both metabolic and genetic networks.
    – Deletion of proteins with high connectivity is more
        likely to kill the organism.
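
The "1 out of 7" style of result comes from a global count over relationships in the database. A minimal sketch of such a count follows. The enzyme, reaction, and pathway identifiers are toy values, so the fractions printed here say nothing about E. coli.

```python
from collections import defaultdict

# Hypothetical relationship tables.
CATALYZES = [("E1", "R1"), ("E1", "R2"), ("E2", "R3"), ("E3", "R4")]
IN_PATHWAY = [("R1", "P1"), ("R1", "P2"), ("R2", "P1"),
              ("R3", "P2"), ("R4", "P3")]


def fraction_with_multiple(pairs):
    """Fraction of left-hand items linked to more than one right-hand item."""
    links = defaultdict(set)
    for left, right in pairs:
        links[left].add(right)
    if not links:
        return 0.0
    multi = sum(1 for targets in links.values() if len(targets) > 1)
    return multi / len(links)


# Share of enzymes catalyzing more than one reaction.
print(fraction_with_multiple(CATALYZES))
# Share of reactions used in more than one pathway.
print(fraction_with_multiple(IN_PATHWAY))
```

The same function answers both questions because both are counts over a many-to-many relationship.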
Example 2: Characterizing Transcription Factors
   “inter-relationship” in a genetic pathway
  [Figure: two panels]
  - Most do not regulate themselves or other transcription factors.
  - Just two dominate the relationships, making the tree very shallow.
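
Both observations can be checked mechanically once the regulatory relationships are stored as a graph. The sketch below uses a made-up network with hypothetical names TF1 to TF6, and it assumes the only cycles are self-loops (autoregulation).

```python
# Hypothetical network: transcription factor -> transcription factors it regulates.
REGULATES = {
    "TF1": {"TF2", "TF3", "TF4"},
    "TF2": {"TF5"},
    "TF3": set(),
    "TF4": {"TF4"},  # autoregulation (self-loop)
    "TF5": set(),
    "TF6": set(),
}


def non_regulators(regulates):
    """TFs that regulate no transcription factor, not even themselves."""
    return {tf for tf, targets in regulates.items() if not targets}


def depth(tf, regulates):
    """Longest regulatory chain starting at tf, counted in TFs.
    Self-loops are ignored; other cycles are assumed absent."""
    targets = regulates.get(tf, set()) - {tf}
    return 1 + max((depth(t, regulates) for t in targets), default=0)


print(sorted(non_regulators(REGULATES)))
print(max(depth(tf, REGULATES) for tf in REGULATES))
```

A small maximum depth relative to the number of factors corresponds to the shallow tree described on the slide.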
                         EcoCyc Demo


•       Demo 1: Search demo
    –      Go to site http://www.ecocyc.org/
    –      Click on DB search
    –      Search for “Glycolysis” and show to class if time permits ..
•       Demo 2: Combining pathways with Gene expression data –
    –      Go to site http://www.ecocyc.org:1555/expression.html
    –      Specify data file as http://biocyc.org/coli.dat (might have to
           save it locally to work).
    –      Select “absolute” display level.
    –      Enter ratio for numerator as 1.
                          Issues/Comments
1.   The idea itself is quite powerful: combining sequence data with
     DNA array data. The results seem to be quite good, but:
2.   Too many steps. The results are hard to quantify except empirically
     because of the complexity: at least 5 levels of successive
     transformations (1. genes to meta-genes via BLAST, 2. Pearson
     correlation, 3. order-based probability, 4. network thresholding,
     5. clustering).
3.   Some heuristics are not explained much. For clustering, for example,
     probabilities are transformed into a 2-d space even though the
     original problem was a graph. Why not partition the graph directly?
4.   Limited quality of clusters because of the 2-d translation. Not clear
     why and how the data fits into a 2-d space. Not clear if the
     translation using the P-value is a metric.
5.   The paper was obviously not written by plain computer scientists:
     a lot of interesting discoveries and analysis after the method was used.
          One line summary of the paper ..




Using phylogenetics to filter out gene co-expressions in
micro-array data that are not functionally relevant.

						