1. Introduction and background
Overview and significance: This proposal concerns the development of
bioinformatic tools to enable the global study of genes in related biosynthetic
pathways. With the complete genome of an organism, it is possible to identify
the genes involved in the biochemical pathways that have been well studied. The
expression of all the genes in an organism, including those in biochemical pathways,
can be studied using microarray technology. To my knowledge, nobody has developed
a system that allows gene expression data to be observed easily in the context
of a whole pathway. The working example we are using to develop these tools is
amino acid biosynthetic pathways in A. thaliana. The bioinformatic tools we create
in this project could be applied to the study of the regulation of any biochemical
pathway in any organism.
Background: In A. thaliana, it has been shown that the transcription of
genes involved in nitrogen assimilation into amino acids is regulated by a complex
combination of signals including light, carbon, and nitrogen. For example, light
activates the expression of the GLN2 gene (glutamine synthetase gene) and represses
the transcription of the ASN1 gene (asparagine synthetase gene). This regulatory
mechanism allows plants to synthesize glutamine (Gln) when carbon skeletons are
abundant, or convert Gln to asparagine (Asn), a more carbon-efficient
nitrogen-transport compound, when carbon skeletons are limiting. This light
regulation of gene expression is mediated in part by the phytochrome photoreceptor
and in part by light-induced changes in carbon metabolites. Sucrose applied to the
media of dark-grown plants can mimic these effects of light; it induces GLN2
expression and represses ASN1 expression in the absence of light. Amino acids
are able to antagonize these effects of sucrose on gene expression. These studies
and others have led to the hypothesis that plants have mechanisms to sense both
carbon and nitrogen metabolites and to in turn regulate the expression of genes
encoding enzymes in nitrogen and carbon metabolism.
To explore the interactions of light, carbon, and nitrogen effects on gene
expression, the Coruzzi lab has used combinatorial design as a tool to effectively
test the effect of each input, alone and in combination, on gene regulation. GLN2 and ASN1 are
being used as “sentinel” genes to report on which C:N treatments were effective
in gene induction or repression. Subsequently, selected RNA samples will be used
to probe an A. thaliana Affymetrix DNA chip containing ~8,000 genes. A complete
genome chip containing all 25,000 genes is scheduled to be available in early 2002.
The resulting microarray data will be subjected to cluster analysis to determine
co-regulated gene sets. Preliminary clustering analysis of microarray data from
nitrate treatments revealed a cluster of 7 metabolically related genes encoding
enzymes involved in nitrogen metabolism or in the biosynthesis of cofactors
involved in nitrate reduction and nitrogen metabolism. This suggests
that genes in biochemically related pathways in A. thaliana might be coordinated
by metabolic transcriptional control mechanisms. This is reminiscent of the yeast
general control mechanism, in which amino acid starvation is known to induce genes
in all the amino acid biosynthetic pathways. Herein, we propose to develop
bioinformatic tools that will enable us to globally analyze the expression of all
the genes involved in amino acid biosynthesis in A. thaliana and to determine
whether specific steps in a pathway or in related pathways are subject to
metabolic control of gene transcription. These experiments will allow us to determine
how factors such as light, carbon, and nitrogen globally regulate amino acid
biosynthetic genes in plants and whether there is a master regulatory system for
the biosynthesis of all of the amino acids in plants.
Need for a database of biochemical pathways that can be used to query
microarray expression datasets: My goal is to develop bioinformatic tools that
will enable researchers to study the regulation of genes in biochemically related
pathways. As a first step, I propose to construct a database called the "Pathways
And Genome Expression Database" (PAGED). Initially, I will use amino acid
biosynthetic pathways as a model to construct this microarray searchable database.
This is the first choice with which to build and test PAGED because the enzymes
in these pathways and the connections between them are well studied. Once
developed, PAGED can be expanded to include other biosynthetic pathways including
sugar and nucleoside metabolism, as well as others. I am well aware that pathway
databases (KEGG and one through ExPASy) already exist for general metabolic
pathways. However, these are not suitable for our purpose for two reasons:
1. In some cases, amino acid biosynthesis in A. thaliana is subtly
different from that in bacteria and even yeast. These differences, while
subtle, are important and are not contained in those databases.
2. More importantly, the existing biochemical pathway databases cannot be
used to query microarray expression datasets.
In order to construct PAGED, our input data will be derived from our knowledge
of amino acid biosynthesis enzymes and genes in A. thaliana. The first step is
to identify all the steps in each of the 20 amino acid biosynthetic pathways in
A. thaliana. Between published data and the complete genome sequence of A. thaliana, the
data needed to construct the amino acid biosynthetic pathways of A. thaliana already exist.
However, to my knowledge the genes for all 20 amino acid biosynthetic pathways
of A. thaliana have not been gathered in one location. For some amino acids, like
methionine and tryptophan, the pathway has been experimentally determined in A.
thaliana. For others, like phenylalanine and tyrosine, there is very little
experimental data. In such cases, we have relied heavily on sequence alignment data
to identify putative orthologs in A. thaliana from organisms in which a pathway
has been experimentally determined. In some cases, multiple biosynthetic routes
are possible, and I have tried to use sequence alignment data to validate the
existence or absence of each biosynthetic route in A. thaliana. Based on my
considerable expertise as a biochemist, and my knowledge of computer-based
orthology searches from my Ph.D. studies, I have already begun to identify A.
thaliana genes involved in amino acid biosynthesis, and I have reconstructed at
least one pathway for the biosynthesis of each amino acid. It is possible that
A. thaliana contains as yet unidentified amino acid biosynthetic pathways. Therefore,
PAGED will be constructed to allow for growth and revision as more information becomes
available. PAGED will also be designed to be queried by a biologist to answer
biologically relevant questions, such as “If nutrient-starved plants are fed
a carbon source, but not a nitrogen source, how are the genes involved in histidine
biosynthesis affected?” or “What genes for enzymes are affected and in what
biochemical pathways are they involved?”
Construction of the Pathways And Genome Expression Database (PAGED): PAGED
is being constructed in PostgreSQL so that queries of microarray
expression data like those described above can easily be made in SQL.
Java will be used to create a user interface, so the database will be accessible
from any Java-enabled web browser.
Over time, more biosynthetic pathways (e.g. sugar metabolic pathways, which
interface with amino acid synthesis) and more microarray expression datasets from
more treatment conditions will be added, making PAGED more comprehensive. This
will be a first step in creating a "virtual plant", which will allow growth
conditions to be tested ahead of time to determine how certain genes will be
affected. To my knowledge no such microarray searchable gene database for
biochemical pathways exists for any organism, and PAGED will be a model for others.
Careful design of PAGED is necessary to ensure that it will easily be able to grow
and that it will contain information (such as compartmentalization of proteins,
equilibrium constants for reactions, and known post-translational regulation of
enzymes, e.g. by inhibitors) that will guarantee that it has uses beyond gene
regulation. To ensure flexibility, we propose to create the pathway database using
two different approaches. In one approach, the "pathway on the fly" approach,
we will input all the biochemical reactions and allow the pathways to be created
from the database. In the second approach, we will strictly define each pathway
in the database. Both databases will be constructed and tested for their ability
to grow and answer biologically relevant queries.
For the "pathway on the fly" approach, a query asking, “Starting with
molecule A, what enzymes are needed to make molecule X, and how is the transcription
of their genes affected under condition Y?” will require the construction of a
transitive closure over the reactions to build the pathway. It will
also require that each biochemical reaction be given a unique identifier. We have
considered using Enzyme Commission (E.C.) numbers; however, an E.C. number records
only what chemistry occurred, not how it occurred.
For example, there are two known forms of methylmalonyl-CoA decarboxylase, one that uses
biotin and one that does not. The chemistry in each case is different, but they
have the same E.C. number (4.1.1.41). Such information is important to us, because
it is known from nitrate treatments of A. thaliana that not only the genes involved
in nitrate assimilation but also the genes involved in making
cofactors for those enzymes are up-regulated. One approach would be to assign
each reaction a number, based not only on the overall reaction, but also on what
cofactors are required for the reaction to occur. This approach is being
investigated and may only require modifying the E.C. number in a few situations.
This approach also requires that naming of molecules be consistent throughout the
database. If multiple names (e.g. α-ketoglutarate is also 2-oxoglutarate) are
given to the same molecule, it will adversely affect the pathway-building process.
To avoid this problem, each molecule will be given a unique number. Such a system
is used by DBGET in the Ligand database, and we will investigate using their
nomenclature system or one similar to it.
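The unique-number idea above can be illustrated with a minimal Python sketch, assuming a simple lookup table in which every known name of a molecule points to one ID. The IDs, the table name NAME_TO_ID, and the helper molecule_id are hypothetical and are not taken from the Ligand database.

```python
# Hypothetical molecule IDs for illustration only; not real Ligand database numbers.
NAME_TO_ID = {
    "alpha-ketoglutarate": 100,
    "2-oxoglutarate": 100,  # synonym of alpha-ketoglutarate, so it shares the ID
    "glutamate": 101,
}


def molecule_id(name):
    """Return the unique ID for a molecule name, ignoring case and outer spaces."""
    return NAME_TO_ID[name.strip().lower()]


# Both names resolve to the same molecule, so pathway building sees one node.
print(molecule_id("2-Oxoglutarate") == molecule_id("alpha-ketoglutarate"))
```

Because reactions would refer only to the ID, adding a new synonym should require one new row in the name table and no change to any reaction.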
Table 1: Definition of biochemical reactions used in PAGED. The database will
be designed to handle cases where a reaction has multiple reactants and products;
for simplicity, such cases are not shown in the
example below. From the information below, several pathways with different
starting and ending points could be built by PAGED. For example, if a query asked about the
biosynthesis of molecule number 80 starting from molecule number 10, a pathway
would be built by PAGED by combining reactions 1, 2, and 6 sequentially.
reaction ID   reactants   stoichiometry   products   stoichiometry
1             10          1               20         1
2             20          1               50         1
3             30          1               40         1
4             10          1               60         1
5             40          1               70         1
6             50          1               80         1
7             40          1               60         1
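The pathway-building step described in the Table 1 caption can be sketched in Python as a breadth-first search over the reaction table. This is only one simple way to walk the reachable part of the reaction set, chosen here as an assumption; it is not necessarily how PAGED will be implemented. The function name build_pathway is hypothetical, and the sketch handles only the single-reactant, single-product reactions of Table 1.

```python
from collections import deque

# Reactions copied from Table 1: reaction ID -> (reactant ID, product ID).
# Every stoichiometry in Table 1 is 1, so it is omitted in this sketch.
REACTIONS = {
    1: (10, 20),
    2: (20, 50),
    3: (30, 40),
    4: (10, 60),
    5: (40, 70),
    6: (50, 80),
    7: (40, 60),
}


def build_pathway(reactions, start, target):
    """Return the shortest list of reaction IDs from start to target, or None."""
    # Index the reactions by reactant so each molecule's outgoing steps are easy to find.
    by_reactant = {}
    for reaction_id, (reactant, product) in sorted(reactions.items()):
        by_reactant.setdefault(reactant, []).append((reaction_id, product))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        molecule, path = queue.popleft()
        if molecule == target:
            return path
        for reaction_id, product in by_reactant.get(molecule, []):
            if product not in visited:
                visited.add(product)
                queue.append((product, path + [reaction_id]))
    return None  # target cannot be reached from start


print(build_pathway(REACTIONS, 10, 80))  # [1, 2, 6]
```

The same call with a start of 10 and a target of 70 returns None, since no chain of reactions in Table 1 connects those two molecules.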
In order for PAGED to be searchable by enzyme, reactant, and product names,
it will require tables to convert the enzyme and molecule names to numbers, which
then can be used to search other tables. Even our initial design of PAGED will
require several tables:
1. molecule ID, molecule name
2. reaction ID, reactants (a molecule ID), stoichiometry
3. reaction ID, products (a molecule ID), stoichiometry
4. reaction ID, enzyme name, free energy
5. gene, enzyme name
6. Affymetrix ID, gene
7. experiment, treatments, conditions
8. experiment, treatment, replica, Affymetrix ID, microarray value
The first five tables are needed to contain the reactions, enzymes, genes and
related information to build the pathways. The sixth table will be needed to join
the information in the first five tables to the expression data and the experimental
methods, which will be in the seventh and eighth tables.
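As a rough sketch of how the eight tables listed above could be joined, the Python below builds them in an in-memory SQLite database (standing in for PostgreSQL only so that the example is self-contained) and asks how the genes for one reaction respond in one treatment. All table names, column names, probe IDs, and experiment labels are hypothetical, and the numeric values are made-up placeholders borrowed from the hypothetical example in Table 2.

```python
import sqlite3

# Table and column names are hypothetical; the contents follow the eight-item list.
SCHEMA = """
CREATE TABLE molecules   (molecule_id INTEGER, molecule_name TEXT);
CREATE TABLE reactants   (reaction_id INTEGER, molecule_id INTEGER, stoichiometry INTEGER);
CREATE TABLE products    (reaction_id INTEGER, molecule_id INTEGER, stoichiometry INTEGER);
CREATE TABLE reactions   (reaction_id INTEGER, enzyme_name TEXT, free_energy REAL);
CREATE TABLE genes       (gene TEXT, enzyme_name TEXT);
CREATE TABLE probes      (affymetrix_id TEXT, gene TEXT);
CREATE TABLE experiments (experiment TEXT, treatment TEXT, conditions TEXT);
CREATE TABLE expression_values (experiment TEXT, treatment TEXT, replica INTEGER,
                                affymetrix_id TEXT, microarray_value REAL);
"""


def make_demo_db():
    """Create the tables and load a few placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO reactions VALUES (1, 'serine acetyltransferase', -57.8)")
    conn.executemany("INSERT INTO genes VALUES (?, ?)",
                     [("SAT1", "serine acetyltransferase"),
                      ("SAT2", "serine acetyltransferase")])
    conn.executemany("INSERT INTO probes VALUES (?, ?)",
                     [("probe_A", "SAT1"), ("probe_B", "SAT2")])
    conn.execute("INSERT INTO experiments VALUES ('exp1', 'carbon', 'placeholder')")
    conn.executemany("INSERT INTO expression_values VALUES (?, ?, ?, ?, ?)",
                     [("exp1", "carbon", 1, "probe_A", 8.3),
                      ("exp1", "carbon", 1, "probe_B", -1.4)])
    return conn


def expression_for_reaction(conn, reaction_id, experiment, treatment):
    """Join reaction -> enzyme -> gene -> probe -> expression value."""
    query = """
        SELECT g.gene, r.enzyme_name, e.microarray_value
        FROM reactions r
        JOIN genes g ON g.enzyme_name = r.enzyme_name
        JOIN probes p ON p.gene = g.gene
        JOIN expression_values e ON e.affymetrix_id = p.affymetrix_id
        WHERE r.reaction_id = ? AND e.experiment = ? AND e.treatment = ?
        ORDER BY g.gene
    """
    return conn.execute(query, (reaction_id, experiment, treatment)).fetchall()


rows = expression_for_reaction(make_demo_db(), 1, "exp1", "carbon")
for row in rows:
    print(row)
```

In this sketch the probe table plays the role of the sixth table: it is the only link between the pathway side (reactions, enzymes, genes) and the microarray measurements.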
While internally PAGED will use numbers to represent reactions and molecules
(Table 1), those numbers will not be part of the output. Using the tables discussed
above, those numbers can be transformed back to molecule and enzyme names. See
Table 2 for a hypothetical output for PAGED.
Table 2: A hypothetical output for PAGED from a query involving cysteine
metabolism. Some of the information contained in the table is not real, but just
used as an example. The user will be able to control exactly what information
is contained in the output.
gene name   enzyme name                    reactants                 products              reaction free energy   fold change
SIR1        sulfite reductase              sulfite                   sulfide               -1.2                   +2.1
SAT1        serine acetyltransferase       serine, acetyl-CoA        O-acetylserine, CoA   -57.8                  +8.3
SAT2        serine acetyltransferase       serine, acetyl-CoA        O-acetylserine, CoA   -57.8                  -1.4
OSL1        O-acetylserine lyase (thiol)   O-acetylserine, sulfide   cysteine, acetate     -23.2                  +5.7
The other approach to build PAGED is to strictly define each
biochemical pathway and to enter them as a preformed set into the database. This
approach would require that a table be built in the database for each pathway.
Using this approach, an enzyme that catalyzes the same reaction in different
pathways will be entered into the database multiple times. For the biosynthetic
pathways for the twenty amino acids this is not difficult, but as the database
grows it may become more cumbersome. This method, however, does avoid transitive
closures and the need for the database to build the pathway. These
aspects ensure that the pathways are known to be biologically relevant. The tables
required for this more rigid database can be constructed by the "pathway on
the fly" database, which will also be a test for its accuracy.
Applications of PAGED to analyze microarray expression sets: PAGED can help
uncover whether amino acid biosynthetic pathways are co-regulated at the
transcriptional level or if subsets of amino acids (e.g. aromatic amino acids)
are co-regulated under certain conditions. While it is interesting to know how
different amino acid biosynthetic genes are regulated under certain conditions,
it is more interesting to know why that regulation occurs. In some cases, the
final product will be the amino acid itself: GLN2 is induced to produce Gln, which
is used to store and transport organic nitrogen. In other cases, the final product
will be downstream of the amino acid, e.g. methionine biosynthesis may be induced
under certain growth treatments so that ethylene production can increase. PAGED
can easily be expanded to include the downstream pathways into which the amino
acids feed, which will help determine which other pathways are affected. To date,
several different techniques exist to identify co-regulated genes. Genes can be
clustered for those that are expressed similarly under many different growth
conditions, or the promoter regions of genes that are expressed similarly under
one condition can be searched for patterns that may constitute a cis-acting
element. PAGED will complement both techniques because it will be designed
to handle expression data from many different conditions and to identify biochemical
relationships between genes based on biochemical pathways.
PAGED can also help answer other questions. One question of interest is,
what genes must be regulated in order to control flux through a pathway?
Historically, it was believed that organisms control flux through a pathway by
regulating a single step in the pathway. However, metabolic control analysis
studies have indicated that changing the activity or amount of a single enzyme
would be insufficient to change the overall flux through a multistep pathway.
Since then, experimental evidence has been found supporting both views. While at this
stage our analysis does not require the measurement of metabolite concentrations
(which is required to strictly determine flux through a pathway), knowing how
organisms regulate the genes of proteins involved in a pathway under a variety
of conditions will certainly suggest how organisms control flux through a pathway.
In addition, the database is being designed to contain information that will allow
it to help study regulation beyond gene control.
The information and technology exist to construct PAGED, the Pathways And
Genome Expression Database. As expression data for A. thaliana increases, it will
be necessary that its storage and analysis be organized in some rational manner.
PAGED will allow it to be organized so that it will be easily searchable by biologists
to answer biologically relevant questions related to the regulation of pathways.
To my knowledge, no database like this exists for any organism, and PAGED will
be a first step in creating a virtual plant. Such a database will encourage
collaborations with other groups interested in metabolic pathways and control of
3. Training Objectives
The fields of computer sciences, biology, and biochemistry are becoming more
and more intertwined. The computer sciences allow the analysis of large amounts
of data like that created by genomic and microarray studies. The use of computers
to analyze data from microarray and genomic studies allows one to determine what
biochemical pathways are in an organism and how they are affected under different
conditions. The knowledge of the regulation of genes will allow the determination
of affected pathways, and the final products of the pathways can be determined
based on classical biochemical studies. Such knowledge will drive biology as
researchers attempt to understand the effects of the final product of the
biochemical pathways in the organism. It is therefore necessary to be well trained
in all three fields to be involved in this process at all the steps. I already
have training in biochemistry, and during this grant, I will receive new training
in A. thaliana molecular biology, genomics, and genetics in the Coruzzi lab. I
will also receive new training in computer sciences from the NYU Courant
bioinformatics group and through courses selected by my advisors. While this
training will center around the analysis of gene expression data and database
creation, management, and querying, it will introduce me to new ways to use computer
sciences to solve problems in the biological sciences. In my doctoral work, I
was struck by the ability of computers to predict the function of proteins based
on sequence alignment data. I think predictions drive science, and therefore
making predictions is necessary to be a good scientist. Computers aid in the
process by allowing one to sort through large amounts of data, determine what is
useful, and then make a prediction based on that data. The use of computers to
make predictions in the biological sciences is growing and that trend will
continue. I would like to be part of that trend, and this proposal puts me in
a position to receive training from three experts (Gloria Coruzzi, Dennis Shasha,
and Bud Mishra) in the fields of genomics and bioinformatics.
4. Choice of Sponsor and Institution
This proposal is the result of a collaboration between the Coruzzi lab in
the New York University Biology Department and the Mishra and Shasha group of the
Courant Institute at New York University. The Coruzzi lab is experienced in
studying the transcription of genes involved in nitrogen metabolism in plants.
In order to study this problem, they use a wide variety of techniques including
reverse and forward genetics and microarray analysis. Dr. Mishra's group at the
Courant Institute is a leader in the bioinformatics field and is involved in diverse
bioinformatic projects including microarray expression clustering and database
management. Dr. Shasha has been involved in analysis of expression data sets and
cis-element analysis. Several members of both groups are receiving training in
biology and computer sciences, which creates a rich environment for me to receive
my new bioinformatic training.