Pathways Databases: A Case Study in
Computational Symbolic Theories
Author: Peter D. Karp
Presented by: Gunjan Gupta
April 13, 2004
1. What is Gene Co-expression and why discover gene co-
2. Using DNA microarray data for discovering functional gene
3. Discussion on how to use phylogenetic information to filter
out false positives in the micro-array data.
4. Detailed discussion of the proposed solution.
A *Complete Scientific Theory of a simple organism
is very Complex
A “simple” organism such as E. coli is very complex.
Example, for E. coli:
- 791 chemical compounds & 744 enzyme-catalyzed
biochemical reactions for the metabolic pathways.
- 4290 “protein coding” genes
- A perhaps larger number of mostly
unknown RNA-encoding genes
*Explains all the observed processes in an organism
Too complex (large) for a single scientist to grasp
More complex than a factory producing cars –
Imagine a car factory that not only produces cars, but produces
other factories of its own kind, and everything else needed to
produce, reproduce, repair, fend off competition, fight foreign
attacks, adapt to a changing environment, and evolve!
Traditional methods of representing theoretical
1. Technical papers in journals
2. Diagrams, flowcharts & tables
3. Mostly natural language descriptions more suitable for
4. Raw and partially processed data in large biological
databases: example, indexed by search criteria, annotated in
Why are traditional methods not enough?
1. Cannot represent scientific theories for biological systems –
2. Automatic deductions are not possible because the theory is not
stored in a structured, computer-readable representation.
3. Natural language processing is not yet developed enough
to allow a computer to search human-readable data.
4. Only bits and pieces understandable by one person at any
5. Large-scale patterns and deductions might never be found in
the absence of any one individual who can see the “big picture”.
6. Addition of details from experimentation does not necessarily
improve our understanding of the whole system.
Pathway Databases for Biology
1. A Database of Scientific theory storing “qualitative
information” – semantics of the theory in a well-defined
2. Useful for representing theories that are mostly non-
numerical in nature – a biological system is well suited.
3. Definition: A Pathway is a linked set of biochemical
reactions, linked as follows –
[Figure: example pathway diagram. Recoverable labels: Reaction 1, Reaction 2, Enzyme 3, Enzyme 4, Raw Material 5, Reaction 6, Substrate 7, Chemical 8.]
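The definition above can be illustrated with a small data structure. This is only a sketch: it assumes two reactions are linked when a product of one is a substrate of the other, and the reaction, enzyme, and compound names are hypothetical placeholders loosely borrowed from the figure labels, not EcoCyc data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    name: str
    enzyme: str
    substrates: frozenset
    products: frozenset


def link_reactions(reactions):
    """Return directed (upstream, downstream) links between reactions.

    Assumption: reaction A feeds reaction B when some product of A
    is also a substrate of B.
    """
    links = []
    for a in reactions:
        for b in reactions:
            if a is not b and a.products & b.substrates:
                links.append((a.name, b.name))
    return links


# Hypothetical toy pathway; every name here is a placeholder.
toy_pathway = [
    Reaction("reaction-1", "enzyme-3",
             frozenset({"raw-material-5"}), frozenset({"substrate-7"})),
    Reaction("reaction-2", "enzyme-4",
             frozenset({"substrate-7"}), frozenset({"chemical-8"})),
]

pathway_links = link_reactions(toy_pathway)
print(pathway_links)
```

Here the only link found is from reaction-1 to reaction-2, because the product of the first is the substrate of the second.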
Different Kinds of Biological Pathways Database
1. Only Metabolic Pathways (majority): (time= generally fast)
– e.g. Production of ATP in the Mitochondria
2. Signaling Pathways (time= generally medium)
– Intra-cellular: e.g. cell-membrane to nucleus
– Extra-cellular: e.g. growth-factors for nerve, skin, say for
3. Genetic-regulatory pathways (time= generally medium/slow)
– gene expression control, make more as needed, example
synthesize amino-acids deficient in food by increasing
expression of related genes.
4. A combination of one or more of the above (e.g. PAD,
Pathways Databases: an intersection of 4 fields
Examples of current PDBs
• Metabolic Pathway Database
– EcoCyc (author) : http://www.ecocyc.org/
• Signaling Pathways
– SPAD: http://www.grt.kyushu-u.ac.jp/spad/
• Genetic-regulatory pathways
– BIOBASE: http://www.gene-regulation.com/
• EcoCyc: Encyclopedia of Escherichia coli K12 Genes and
– At http://www.ecocyc.org/
– Links to a bigger site : http://biocyc.org/server.html containing
PDB for other organisms including human.
What Metabolic & Genetic Pathway Info does
EcoCyc Contain ?
For each Enzyme in the PDB –
• Detailed description of reaction catalyzed by each enzyme.
• Genes to which the enzymes map, if available.
• The range of substrates the enzyme would accept.
• Chemicals that inhibit or activate the Enzyme.
• Its subunit structure.
• Each small molecule enzyme substrate
Pathway types for E. coli included
– Biosynthesis of cellular building blocks.
– Extraction of Carbon from food.
– Extraction of chemical energy from food.
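The per-enzyme information listed above could be modeled as a simple record. The sketch below is an illustration only: the class name, field names, and example values are assumptions made here and do not reflect the actual EcoCyc schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnzymeEntry:
    """Hypothetical record mirroring the per-enzyme fields on the slide."""
    name: str
    reaction: str                       # description of the catalyzed reaction
    gene: Optional[str] = None          # gene the enzyme maps to, if available
    accepted_substrates: List[str] = field(default_factory=list)
    inhibitors: List[str] = field(default_factory=list)
    activators: List[str] = field(default_factory=list)
    subunits: List[str] = field(default_factory=list)

    def regulators(self):
        """All chemicals that inhibit or activate this enzyme, sorted."""
        return sorted(set(self.inhibitors) | set(self.activators))


# Placeholder values, not real E. coli data.
example_entry = EnzymeEntry(
    name="enzyme-A",
    reaction="compound-X -> compound-Y",
    inhibitors=["compound-Z"],
    activators=["compound-W"],
)
print(example_entry.regulators())
```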
Tools & Visualization in EcoCyc
1. SRI International (http://www.sri.com/) developed the
visualization and search tool called Pathway Tools as an
intelligent user interface.
2. Allows the user to exploit the semantic information in the
PDB and write complex queries.
3. Visualize results as a Pathway graph called Overview
4. A variety of criteria for the queries –
– Name matching
– Classification hierarchy (taxonomy, metabolic pathways)
– Find all reactions that are activated or inhibited by a given
6. Superimposing genetics data on the visualization (see demo)
7. User can create a PGDB for a new organism and share it.
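One of the query criteria above, finding all reactions activated or inhibited by a given regulator, can be sketched over toy records. This is not the Pathway Tools query interface; the record layout and names are hypothetical, and since the slide text is cut off after "by a given", the regulator is assumed here to be a chemical.

```python
# Hypothetical toy records; not real EcoCyc data or the Pathway Tools API.
reactions = [
    {"name": "reaction-1", "activators": {"compound-A"}, "inhibitors": set()},
    {"name": "reaction-2", "activators": set(),
     "inhibitors": {"compound-A", "compound-B"}},
    {"name": "reaction-3", "activators": {"compound-C"}, "inhibitors": set()},
]


def regulated_by(records, chemical):
    """Names of reactions listing `chemical` as an activator or inhibitor."""
    return [
        r["name"]
        for r in records
        if chemical in (r["activators"] | r["inhibitors"])
    ]


hits = regulated_by(reactions, "compound-A")
print(hits)
```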
An application: “automated” inter-species
comparison of reactions
1. Yellow shows reactions in E. coli that match with another
species, S. cerevisiae, found as a result of database lookup.
2. Mostly automated layout using Pathway Tools (some manual
fitting was done by author in this paper for this diagram).
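The database lookup behind such a comparison can be thought of as a set intersection. The sketch below assumes reactions in both species carry a shared identifier; the identifiers and the colour mapping are made up for illustration and are not how Pathway Tools is known to implement it.

```python
def shared_reactions(species_a, species_b):
    """Reaction identifiers present in both species, sorted."""
    return sorted(set(species_a) & set(species_b))


# Hypothetical identifiers; real data would come from the two PDBs.
ecoli_reactions = ["rxn-1", "rxn-2", "rxn-3", "rxn-4"]
yeast_reactions = ["rxn-2", "rxn-4", "rxn-5"]

matches = shared_reactions(ecoli_reactions, yeast_reactions)

# Colour each E. coli reaction for display: yellow when matched.
highlight = {
    rxn: ("yellow" if rxn in matches else "default")
    for rxn in ecoli_reactions
}
print(matches)
print(highlight)
```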
Exploiting Knowledge Representation AI tools in a
1. Building an Ontology: DB schema defining the precise
relationships between entities: use UML?
2. Encode a theory using the Ontology
3. EcoCyc ontology consists of 1000 object-oriented classes
encoding key concepts of biology and biochemistry.
4. Extend Ontology when new concepts are found that
cannot be derived using existing Ontology.
5. Use KR techniques that exploit Symbolic AI reasoning to
• build an inference engine (see Bruce Porter’s
Knowledge Machine for example) on the PDB or
• Perform specific global inferences using specific
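As a rough illustration of an object-oriented ontology, the sketch below defines a tiny class hierarchy and a query that uses subclass relationships. The class names and the choice to place proteins under chemicals are assumptions made here for illustration; they are not taken from the EcoCyc ontology.

```python
class BioEntity:
    """Root of a hypothetical mini-ontology."""

    def __init__(self, name):
        self.name = name


class Chemical(BioEntity):
    pass


class Protein(Chemical):
    pass


class Enzyme(Protein):
    pass


class Gene(BioEntity):
    pass


def instances_of(cls, entities):
    """Names of all entities that belong to `cls` or any of its subclasses."""
    return [e.name for e in entities if isinstance(e, cls)]


# Placeholder instances, not real biological data.
knowledge_base = [Enzyme("enzyme-A"), Chemical("compound-X"), Gene("gene-1")]

chemicals = instances_of(Chemical, knowledge_base)
print(chemicals)
```

Because Enzyme inherits from Chemical in this sketch, a query for chemicals also returns the enzyme: a very small example of inference by class membership.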
Example of Symbolic Inferences on EcoCyc
• Results of a search changed the simplistic notion of what
a gene is:
– In E. coli, 1 out of 7 enzymes was found to catalyze more
than one reaction, and in almost 1 out of 7 cases
a reaction is used in more than one pathway
(overlapping sub-graphs or clusters): this can be
treated as a discovered theory.
• Characterized the transcription factors relationship (see
• Other interesting theories found using PDB –
– Scale free network topology that follows a power law
for both Metabolic and Genetic networks.
– Deletion of proteins with high connectivity more
likely to kill the organism.
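The first kind of global query above (enzymes catalyzing more than one reaction, reactions used in more than one pathway) can be sketched as a grouping over relation pairs. The facts below are hypothetical toy data, and the helper name is invented for this example.

```python
from collections import defaultdict

# Hypothetical (enzyme, reaction) and (pathway, reaction) facts.
catalyzes = [
    ("enzyme-A", "rxn-1"),
    ("enzyme-A", "rxn-2"),
    ("enzyme-B", "rxn-3"),
    ("enzyme-C", "rxn-3"),
]
pathway_members = [
    ("pathway-P", "rxn-1"),
    ("pathway-P", "rxn-2"),
    ("pathway-Q", "rxn-2"),
    ("pathway-Q", "rxn-3"),
]


def keys_with_multiple_values(pairs):
    """Keys that are associated with more than one distinct value."""
    groups = defaultdict(set)
    for key, value in pairs:
        groups[key].add(value)
    return sorted(k for k, vals in groups.items() if len(vals) > 1)


# Enzymes that catalyze more than one reaction.
multi_reaction_enzymes = keys_with_multiple_values(catalyzes)

# Reactions that appear in more than one pathway (pairs flipped to key by reaction).
shared_pathway_reactions = keys_with_multiple_values(
    (rxn, pathway) for pathway, rxn in pathway_members
)

print(multi_reaction_enzymes)
print(shared_pathway_reactions)
```

On a full PDB, dividing the length of each result by the total number of enzymes or reactions would give fractions comparable to the "1 out of 7" figures quoted above.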
Example 2: Characterizing Transcription Factors
“inter-relationship” in a genetic pathway
Most do not regulate themselves or Just two dominate the
other transcription factors relationships making the tree
• Demo 1: Search demo
– Go to site http://www.ecocyc.org/
– Click on DB search
– Search for “Glycolysis” and show to class if time permits ..
• Demo 2: Combining pathways with Gene expression data –
– Go to site http://www.ecocyc.org:1555/expression.html
– Specify data file as http://biocyc.org/coli.dat (might have to
save it locally to work).
– Select “absolute” display level.
– Enter ratio for numerator as 1.
1. The idea itself is quite powerful – combining sequence data with
DNA array, and results seem to be quite good, but ..
2. Too many steps – hard to quantify the results except empirically,
because of the complexity. At least 5 levels of successive
transformations (1.genes to meta-genes via blast 2. Pearson
correlation 3. order based probability 4. network thresholding 5.
3. Some heuristics are not explained much – for clustering,
transformation into 2-d space from probability for example, where
the original problem was a graph- why not directly partition the
4. Limited quality of clusters because of 2-d translation. Not clear why
and how the data fits into a 2-d space. Not clear if the translation
using P-value is a metric.
5. The paper was obviously not written by plain computer scientists:
a lot of interesting discoveries and analysis after the method was used.
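Two of the transformation steps named in point 2, Pearson correlation and network thresholding, can be sketched as follows. This is a simplification made for illustration: it thresholds the raw correlation directly and skips the meta-gene and order-based probability steps, and the expression values and the 0.9 cutoff are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coexpression_edges(profiles, threshold):
    """Gene pairs whose expression profiles correlate at or above `threshold`."""
    names = sorted(profiles)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                edges.append((a, b))
    return edges


# Hypothetical expression profiles across four conditions.
expression = {
    "gene-1": [1.0, 2.0, 3.0, 4.0],
    "gene-2": [2.0, 4.0, 6.0, 8.0],
    "gene-3": [4.0, 3.0, 2.0, 1.0],
}

edges = coexpression_edges(expression, threshold=0.9)
print(edges)
```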
One line summary of the paper ..
Using phylogenetics to filter out gene co-expressions in
micro-array data that are not functionally relevant.