DATABASES & PATHWAYS 2011
● Genome, transcriptome, proteome, phenome (mutant phenotype), biochemical, and metabolic pathway
databases and their associated tools offer powerful ways to investigate metabolism.
● Genomics-driven approaches (‘database mining’) complement classical biochemical approaches to the
metabolism of all organisms, including plants.
Sequence and expression information (from genomes, transcriptomes, proteomes, etc.) complements
biochemical information in several ways:
1. Identifying genes for plant enzymes. Because enzymes (and some transporters) are conserved,
homology searches (with BLAST programs) using prokaryotic, yeast, or animal sequences as query can
identify the corresponding plant proteins, and show whether they are encoded by single genes or gene families.
Searching plant genomes in this way can show which enzymes are present and which are absent. This in
turn allows ‘metabolic reconstruction’, i.e. predicting metabolic capabilities (the metabolic pathways that are
present) from DNA sequence data alone.
- Plant sequences can be expressed heterologously (e.g., in E. coli or yeast, with a tag to facilitate
purification), and the recombinant proteins can be characterized. This is especially useful for low-
abundance or unstable proteins, which are difficult or impossible to isolate from plants in sufficient
amounts for study.
- The functions encoded by plant sequences can be investigated using functional complementation in
2. Predicting organellar targeting, localization in membranes. Genomic sequences, cDNAs, and ESTs
can give information about the organellar targeting of enzymes, via their characteristic signal sequences,
and about whether proteins have membrane-spanning domains and hence are likely to be located in
membranes.
Organellar proteome databases can provide high-throughput experimental support for these predictions.
Knowing organellar location can rule in or out possible metabolic functions of proteins.
3. Predicting biochemical function from expression data (microarrays, RNAseq). When, where, and at
what level a gene is expressed can likewise provide clues about function. Correlated patterns of gene
expression (‘co-expression’) in relation to development, environment, or genetic changes (e.g., knocking out
or overexpressing genes) can point to related function.
Similar information can come from digital gene expression profiles, in which differentially expressed genes
are detected from variation in the count of their cognate ESTs in libraries.
4. Discovering enzymes and pathways by comparative genomics. By looking for functional linkages
among genes in bacteria and archaea (gene fusions, conserved gene clusters, and co-occurrence patterns) it is
possible to:
- Identify enzymes and transporters that are ‘missing’ from known pathways
- Discover new enzymes, pathways, and processes.
Once a new prokaryotic enzyme has been found by this approach, its counterpart can be sought in plants via
homology searches. Conversely, if an unknown plant enzyme has prokaryotic homologs, comparative
genomic analysis of the latter can help predict the function of the enzyme in both groups. This is a powerful
approach because prokaryotes share many pathways with plants.
This part of the course introduces web resources needed to extract the above types of information, and
illustrates how to use them.
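The ‘metabolic reconstruction’ idea from point 1 above can be sketched in a few lines of Python. This is a minimal illustration, not how KEGG or the Cyc databases work internally: the two toy pathways, the list of EC numbers with homology hits, and the function name reconstruct are all made up for the example.

```python
# Hypothetical pathway definitions: pathway name -> EC numbers of its steps.
# These toy entries are illustrative only; they are not taken from KEGG or MetaCyc.
PATHWAYS = {
    "toy pathway A": ["1.1.1.1", "2.5.1.15", "3.5.4.16"],
    "toy pathway B": ["1.5.1.2", "1.5.1.20"],
}


def reconstruct(found_ecs, pathways=PATHWAYS):
    """For each pathway, list the steps with and without a homolog in the genome."""
    found = set(found_ecs)
    report = {}
    for name, steps in pathways.items():
        present = [ec for ec in steps if ec in found]
        missing = [ec for ec in steps if ec not in found]
        report[name] = {
            "present": present,
            "missing": missing,
            "complete": not missing,
        }
    return report


# EC numbers for which a homology search found a candidate gene (made-up result).
genome_ecs = ["1.1.1.1", "2.5.1.15", "1.5.1.2", "1.5.1.20"]
result = reconstruct(genome_ecs)

for name, info in result.items():
    if info["complete"]:
        print(f"{name}: all steps have candidate genes")
    else:
        print(f"{name}: no candidate for {', '.join(info['missing'])}")
```

A pathway flagged as incomplete here is only a prompt for further work: the ‘missing’ step may be carried out by a non-homologous enzyme, which is where the comparative genomics approach of point 4 comes in.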
NCBI http://www.ncbi.nlm.nih.gov/ Entrez nucleotide and protein databases; BLAST similarity search
CD-Search http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi NCBI Conserved Domain Database.
Well-annotated models for ancient domains and full-length proteins. Example:
>Ureaplasma urealyticum ATP synthase C chain (EC 3.6.3.14)
Multalin Sequence Alignment http://bioinfo.genotoul.fr/multalin/multalin.html Aligns protein or DNA
sequences (output in color) and draws simple phylogenetic trees.
ClustalW Sequence Alignment http://www.genome.jp/tools/clustalw/ ClustalW protein or DNA
sequence alignment (also has MAFFT and PRRN alignment programs)
ExPASy Translate Tool http://www.expasy.ch/tools/dna.html Translates a DNA sequence in all 6 frames.
Phylogeny.fr http://www.phylogeny.fr/ Web-based, robust phylogenetic analysis for the non-specialist or
MEGA http://www.megasoftware.net/ The MEGA5 phylogeny program – downloads and manual.
Targeting prediction (membranes, chloroplast, mitochondrion, vacuole, etc) and targeting peptide
TMHMM http://www.cbs.dtu.dk/services/TMHMM/ Prediction of transmembrane helices in proteins.
WoLF PSORT http://wolfpsort.org/
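For readers who want to see what a six-frame translation (as offered by the ExPASy Translate Tool listed above) involves, here is a minimal Python sketch. It is not ExPASy's code. It assumes the standard genetic code only, and the function names and the short test sequence are invented for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids in TCAG order of the three codon positions.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one frame; a trailing partial codon is dropped, unknown codons give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate a DNA sequence in the three forward and three reverse frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


example = "ATGGCCAAATAA"  # made-up sequence, not a real gene
frames = six_frames(example)
for name in sorted(frames):
    print(name, frames[name])
```

Stop codons are shown as ‘*’; the frame containing the longest stretch without a stop is usually the one of interest.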
METABOLIC PATHWAY RESOURCES
Swiss-Prot Enzyme http://ca.expasy.org/enzyme/ Enzyme nomenclature data base (linked to SWISS-
PROT protein database, BRENDA, KEGG, etc)
BRENDA http://www.brenda-enzymes.info/ Comprehensive enzyme database.
KEGG http://www.genome.jp/kegg/ The Kyoto Encyclopedia of Genes and Genomes. Includes
metabolic pathways, and compound structures that can be captured.
BioCyc, EcoCyc, MetaCyc, YeastCyc http://BioCyc.org/ EcoCyc - Encyclopedia of E. coli Genes and
Metabolism; MetaCyc - Metabolic Encyclopedia. Also computationally-derived pathway/genome databases.
AraCyc http://www.arabidopsis.org/biocyc/index.jsp Similar to BioCyc, for Arabidopsis. Software allows
querying, graphical representation of pathways, and overlay of expression data on the biochemical pathway
KEGG and the various Cyc databases have similar aims but each has features the others lack.
Beware! These metabolic pathway databases have weaknesses:
- They have omissions and errors in their pathways – so they should be checked against the literature.
- Proteins are very often (for non-model organisms, almost always) assigned functions based solely on
homology - but it is not clear from the database that this is what has been done.
- To reach firm conclusions it is therefore necessary to go to the literature to find whether a putative
function has been authenticated biochemically or genetically.
PLANT GENOME RESOURCES
TAIR http://www.arabidopsis.org/ The Arabidopsis information resource.
Maizesequence.org http://www.maizesequence.org/index.html Browser providing the latest sequence
and annotation of the maize genome from the Maize Genome Sequencing Project
MaizeGDB http://www.maizegdb.org/ Maize genetics and genomics database
Gramene http://www.gramene.org/ Curated, open-source, data resource for comparative genome analysis
PLANT TRANSCRIPTOME RESOURCES
Golm Transcriptome database Microarray based. http://csbdb.mpimp-
golm.mpg.de/csbdb/dbxp/ath/ath_xpmgq.html Gives an overview of expression and searches for co-
ATTED http://atted.jp/ Microarray based. Searches for co-expression patterns in Arabidopsis (and also
rice); shows gene networks, not just lists of correlated genes.
Gene Indices http://compbio.dfci.harvard.edu/tgi/ EST based. Analyzes public EST data (contig
assembly, analysis of expression patterns).
PLANT PROTEOME RESOURCES
PPDB http://ppdb.tc.cornell.edu/ The Plant Proteome DataBase
SUBA II http://www.plantenergy.uwa.edu.au/suba2/ SUB-cellular location database for Arabidopsis
proteins (includes GFP and MS-MS data)
PLANT PHENOME RESOURCES
SeedGenes http://www.seedgenes.org/ Genes that give a seed phenotype when disrupted by mutation.
Chloroplast2010 http://www.plastid.msu.edu/ Large set of phenotypic data for homozygous mutants of
RAPID http://rarge.gsc.riken.jp/phenome/ RIKEN Arabidopsis Phenome Information Database,
phenotypic data in transposon-insertional mutants.
BAPDB http://bioweb.ucr.edu/bapdb/ Bioassay And Phenotype DataBase
PlantMetabolomics http://tht.vrac.iastate.edu:81/ Consortium profiling the metabolome of specific T-
DNA knockout alleles for targeted genes
COMPARATIVE GENOMICS (‘PHYLOGENOMICS’) RESOURCES
STRING http://string.embl.de/ STRING is a database of known and predicted protein-protein relationships,
derived from genomic context (fusions, conserved gene clusters, co-occurrence), high throughput
experiments (co-expression), and the literature. STRING quantitatively integrates data from bacteria and
SEED http://www.theseed.org/wiki/Main_Page Database with ~3,000 genomes, many analysis tools.
Very useful for gene cluster analysis.
Browsers compatible with SEED: DOWNLOAD FIREFOX 3.6 (PC or Mac) or SAFARI 5.1 (PC or Mac)
To request a SEED account: Go to http://rast.nmpdr.org/rast.cgi * Click ‘Register a new account’,
complete the form, hit ‘Request’ button * After an automated email reply, a password will be emailed.
JGI Phylogenetic Profiler Tool http://img.jgi.doe.gov/cgi-
bin/w/main.cgi?section=PhylogenProfiler&page=phyloProfileForm Searches for correlated/anticorrelated
USING METABOLIC PATHWAY RESOURCES
• SWISS-PROT ENZYME Enzyme nomenclature database http://ca.expasy.org/enzyme/
ENZYME is a repository of information on enzyme nomenclature, with links to other databases. It describes
enzymes that have been given an EC (Enzyme Commission) number, and the reactions they catalyze. It can
be searched in various ways, e.g. by EC number, by common name, by substrate or product.
Example: alcohol dehydrogenase = EC 1.1.1.1 ENZYME entry page → Links to:
BRENDA (convenient entry point)
KEGG (Kyoto University Ligand Chemical Database; maps – glycolysis)
PDB (protein structure database)
Cloned enzymes in SwissProt (not exhaustive but curated, i.e. high quality)
• BRENDA Enzyme database http://www.brenda-enzymes.info/ BRENDA is an extensively
referenced enzyme data information system; it includes data on substrate specificity, physical and kinetic
characteristics, inhibitors, sources, cloning, purification etc.
Example: alcohol dehydrogenase EC 1.1.1.1
• KEGG Kyoto Encyclopedia of Genes and Genomes http://www.genome.jp/kegg/
KEGG computerizes knowledge of molecular and cell biology in terms of pathways that consist of
interacting molecules or genes and provides links from gene catalogs produced by genome sequencing. It
covers regulatory pathways and molecular assemblies as well as metabolic pathways. Its metabolic pathway
maps have links to the enzymes and compounds.
Example: KEGG PATHWAY * 1.8 Metabolism of Cofactors and Vitamins - Folate biosynthesis * Note
that all enzymes (EC numbers) and intermediates are clickable, e.g. * 22.214.171.124 and its * product (structure
can be captured). Note that this is a composite metabolic scheme. It includes methanopterin biosynthesis
(found only in methane-producing microbes) and tetrahydrobiopterin synthesis (found in animals). Note the
pulldown table (top left) of folate biosynthesis enzymes in different organisms; when an organism is
selected, the enzymes putatively encoded in its genome are colored green.
• EcoCyc, MetaCyc http://BioCyc.org/ and AraCyc
EcoCyc- Encyclopedia of E. coli Genes and Metabolism: Describes the genome and biochemical
machinery of E. coli. Contains annotations of all E. coli genes, and their DNA sequences, and describes all
known pathways of E. coli small-molecule metabolism. Each pathway and its component reactions and
enzymes have detailed annotations, and are extensively referenced.
MetaCyc - Metabolic Encyclopedia: A metabolic-pathway database that describes pathways, reactions, and
enzymes of various organisms, especially microbes. MetaCyc contains the E. coli pathways of EcoCyc, plus
other pathways from the literature and on-line sources, with citations to the sources of pathways.
Example: MetaCyc * Search tab * Pathways * List of all pathways * Glycine betaine biosynthesis plants *
Note that all elements in pathway are clickable.
AraCyc at TAIR http://www.arabidopsis.org/biocyc/index.jsp * Search AraCyc * Browse Ontology –
pathways * Biosynthesis * Amino acids * Superpathway of Lysine/Threonine/Methionine biosynthesis *
Click ‘More detail’ 2x to display genes corresponding to pathway steps.
• ORGANELLAR TARGETING
Example: 10-Formyltetrahydrofolate deformylase (PurU) is an enzyme found in E. coli and many other
bacteria (e.g., the cyanobacterium Nostoc) that hydrolyzes 10-formyltetrahydrofolate, releasing formate. The
Arabidopsis genome encodes two homologs of E. coli PurU (At5g47435 and At4g17360).
>E_coli gi|548645|sp|P37051|PURU_ECOLI FORMYLTETRAHYDROFOLATE DEFORMYLASE (FORMYL-FH(4) HYDROLASE)
>Nostoc gi|186681065|ref|YP_001864261.1| formyltetrahydrofolate deformylase [Nostoc punctiforme PCC 73102]
>At5g47435 gi|18422794|ref|NP_568682.1| formyltetrahydrofolate deformylase, putative [Arabidopsis thaliana]
>At4g17360 gi|15236046|ref|NP_193467.1| formyltetrahydrofolate deformylase, putative [Arabidopsis thaliana]
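When many sequences are captured, it helps to pull the identifiers out of header lines like the four above. The sketch below assumes the old NCBI layout seen in these headers (a short label, then gi|number|database|accession| followed by a description and, sometimes, an organism in square brackets). The function name parse_ncbi_header is hypothetical, and headers in any other layout are rejected rather than guessed at.

```python
def parse_ncbi_header(line):
    """Split '>label gi|number|db|accession|description [organism]' into fields."""
    text = line.strip().lstrip(">")
    label, _, rest = text.partition(" ")
    parts = rest.split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected header layout: {line!r}")
    _, gi, database, accession, description = parts
    description = description.strip()
    organism = None
    # An organism name, when present, is given in square brackets at the end.
    if description.endswith("]") and "[" in description:
        description, _, organism = description[:-1].rpartition("[")
        description = description.strip()
    return {
        "label": label,
        "gi": gi,
        "database": database,
        "accession": accession,
        "description": description,
        "organism": organism,
    }


header = (
    ">At5g47435 gi|18422794|ref|NP_568682.1| "
    "formyltetrahydrofolate deformylase, putative [Arabidopsis thaliana]"
)
fields = parse_ncbi_header(header)
print(fields["label"], fields["accession"], fields["organism"])
```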
Targeting predictions for the At5g47435 and At4g17360 proteins using:
TargetP: http://www.cbs.dtu.dk/services/TargetP/ Paste in both Arabidopsis sequences * Check ‘Plant’,
‘Perform cleavage site predictions’
Predotar: http://urgi.versailles.inra.fr/predotar/predotar.html Paste in both Arabidopsis sequences
The consensus of the prediction algorithms is that both proteins are mitochondrial. To check this, align them
with the bacterial PurU sequences using Multalin http://bioinfo.genotoul.fr/multalin/multalin.html *
Alignment shows that both Arabidopsis proteins have N-terminal extensions of ~35 residues (this is a typical
size for a mitochondrial targeting peptide). * Align just the two Arabidopsis sequences – note that the N-
terminal extensions are not conserved (typical of targeting sequences).
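The length of an N-terminal extension can also be read off an alignment programmatically. The following is a minimal sketch, assuming the alignment has been saved as gapped strings of equal length with ‘-’ as the gap character. The two short sequences are invented for illustration (they are not the PurU sequences), and n_terminal_extension is a hypothetical helper name.

```python
def n_terminal_extension(aligned_query, aligned_reference, gap="-"):
    """Count query residues lying before the first aligned residue of the reference."""
    if len(aligned_query) != len(aligned_reference):
        raise ValueError("aligned sequences must have the same length")
    # First alignment column in which the reference has a residue.
    first_ref_col = next(
        (i for i, ch in enumerate(aligned_reference) if ch != gap),
        len(aligned_reference),
    )
    # Residues (non-gap characters) of the query before that column.
    return sum(1 for ch in aligned_query[:first_ref_col] if ch != gap)


# Made-up aligned sequences: the plant protein starts 12 residues earlier.
plant = "MLSRLARRSLSTAMKIAVLG"
bacterium = "------------MKIALLG-"

print(n_terminal_extension(plant, bacterium))
```

An extension found this way is only suggestive; it should agree with the TargetP or Predotar predictions before being interpreted as a targeting peptide.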
Targeting – proteome databases with experimental findings
PPDB http://ppdb.tc.cornell.edu/ Click on ‘Accession’ * Paste AGI number(s) in box, e.g. At1g03475
(Coproporphyrinogen III oxidase) * Click on link(s) * Displays proteomic evidence in database and
SUBA II http://www.plantenergy.uwa.edu.au/suba2/ * Paste AGI number(s) in box, e.g. At1g03475 *
Click Arabidopsis Gene Identifier ‘Add’, then ‘Submit’ * Displays evidence.
• PHYLOGENETIC TREES
Using Phylogeny.fr: Select ‘One Click’ mode * Paste in 4 PurU sequences above * Click ‘Submit’ *
Carries out, in sequence, alignment with MUSCLE, Maximum Likelihood tree-building with PhyML, and
tree-drawing with TreeDyn. The analysis runs the aLRT statistical test, which gives results similar to the
bootstrap procedures but is much faster. * Download the tree in preferred image format. ‘Advanced’ and ‘A
la carte’ modes are available for experienced users.
Using MEGA5: Open MEGA5 * Under ‘Align’, select ‘Edit/build alignment’→ Select ‘New’ → Select
‘Protein’ → Paste in 4 PurU sequences above → ‘Align by ClustalW’ → ‘OK’ * Under ‘Data’ select
‘Export Alignment’, ‘MEGA format’ → Name file, save in an appropriate folder, close box *
To observe alignment: Click on TA..Data * Open file * Click on TA icon → Click on ‘C’ to color identical
residues, close window
To draw a Neighbor-Joining phylogenetic tree: Under ‘Phylogeny’ select ‘Construct/test neighbor-joining
tree’ (this is the second option from the top) * Click ‘Yes’ if file is currently active; if no file is active, select
one * In ‘Analysis preferences’ box, set ‘Test of phylogeny’ to ‘Bootstrap method’ and select 1000 bootstrap
replications * Click ‘Compute’ * Expand tree window, click on ‘Bootstrap consensus tree’ * Click on ‘View’
to change style of tree * Click on ‘Image’ to save tree.
• MICROARRAY DATABASES
Microarrays: The 22K Affymetrix chip contains most Arabidopsis genes, so in principle it can be used to
monitor the expression of almost all metabolic genes. However, many metabolic genes have low expression
levels, and so cannot be monitored with confidence. Genes with low average expression levels tend to give
large numbers of spurious co-expression matches.
mRNA abundance in general correlates broadly with protein abundance and with in-vivo metabolic fluxes.
Therefore digital gene expression data can indicate which organs have a pathway and which do not, and
whether a pathway is likely to be a major or minor one. Note also that primary metabolic pathways are
expressed everywhere and always, and that secondary pathways by definition are not. Unexpected
differences in expression may provide clues about genetic control of pathways, e.g. an enzyme whose
transcript level varies more than that of others in the pathway (i.e. is highly regulated, not constitutive) may
be an important control point in the pathway.
Microarray-based gene expression profiling using the Golm Transcriptome database
For an overview of expression in different organs and in different environmental conditions: On face page,
paste in one or several AGI numbers e.g.
(At3g12930 is the plastid Iojap protein; At5g47190 is chloroplast ribosomal protein L19; At2g39800 is the
first enzyme of proline biosynthesis)
* Scroll down to graphs. Note positive correlation between At3g12930 and At5g47190. Note induction of
At2g39800 by stresses.
To search for positively and negatively correlated genes, go to Transcript Co-Response, Single Gene Query,
paste in At5g47190 * Select a dataset (‘Matrix’), e.g. developmental series * select an output, e.g. positive,
top 100 of co-responding genes * Scroll down list of hits – note many strong correlations with other
chloroplast ribosomal proteins, which associate together to form the protein complexes of the ribosome * At
bottom of page note pie-chart showing predominance of metabolic enzymes related to photosynthesis and
tetrapyrrole (= chlorophyll) biosynthesis * Repeat, changing output to negative, top 100 of co-responding
genes * Note disparate nature of correlated genes (which fall into many different categories in the pie chart).
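The idea behind a co-response search can be sketched as ranking genes by the correlation of their expression profiles with that of a query gene. The example below uses a plain Pearson correlation on invented numbers. The Golm database and ATTED use their own datasets and statistics, so this is an illustration of the principle only, and the gene names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2 points)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / sqrt(sxx * syy)


def rank_coexpressed(query, profiles):
    """Return (gene, r) pairs for all other genes, most positively correlated first."""
    scores = [
        (gene, pearson(profiles[query], values))
        for gene, values in profiles.items()
        if gene != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up expression values across five samples (e.g. a developmental series).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}

ranking = rank_coexpressed("geneA", profiles)
for gene, r in ranking:
    print(f"{gene}\t{r:+.2f}")
```

The top of the list corresponds to a ‘positive’ output and the bottom to a ‘negative’ one. With genes of low average expression, noise can dominate such correlations, which is one reason for the spurious matches mentioned earlier.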
Microarray-based gene expression profiling using ATTED http://atted.jp/
On face page Search box select ‘Gene ID’, paste an AGI number, e.g. At5g47190 in box, click ‘Search’ *
Click on link (‘Target’ box summarizes targeting predictions) * Displays coexpressed gene network around
At5g47190 * Note the many proteins related to plastid ribosomes * Click on coexpressed gene list for more
coexpressed genes * Check all 4 boxes * Default ranking is by all datasets *
Rankings in individual datasets (e.g. tissue type, abiotic stress) can also be displayed * In ‘Link’ column,
graph icon displays correlation data points * Osa homolog column shows putative rice ortholog, clicking on
link displays correlation list for rice genes * Note many ribosome associations of rice homolog of best
Arabidopsis hit, At1g32990 (PRPL11)
Alternatively, on face page, click on ‘CoExSearch’ * Paste At5g47190 in box → Obtain same list as above
but checkable * Check top three genes on the list, click ‘Resubmit for selected query guide genes’ →
Recruits still more plastid ribosome-related genes.
On face page, click on ‘CoexViewer’ * Enter At5g47190 and At1g32990 in boxes * Plots correlation of
To draw a custom network around these two genes, on face page, click ‘NetworkDrawer’ * Paste
At5g47190 and At1g32990 into box, click ‘Submit’ * Displays two linked coexpression networks.
• DIGITAL GENE EXPRESSION PROFILES (ELECTRONIC NORTHERNS)
Another way to gather data on gene expression, based on the abundance of ESTs in libraries.
ESTs: In cDNA libraries from which many randomly selected clones have been sequenced, the relative
abundance of cDNAs reflects the relative abundance of mRNAs, so that differentially expressed genes can
be detected from variations in the counts of their cognate ESTs.
Constructing an EST-based gene expression profile: Capture the protein sequence of interest, e.g.,
‘histidine decarboxylase’ from tomato:
>Histidine decarboxylase – Lycopersicon esculentum (tomato)
Go to Gene Indices http://compbio.dfci.harvard.edu/tgi/ * Select Plant Gene Indices, Tomato, BLAST *
On BLAST page select tblastn, tomato * Hits several contigs (assemblies of ESTs from the same gene) –
showing that there is a small gene family. Best hit (identical to search sequence) = TC223282 * Click on
TC223282, scroll down, click on ‘Expression Summary’ button * Displays a pre-computed electronic
Northern.
Total ESTs found in TC223282: 202
Cat#    Library                                                 # of ESTs   % of library
T1775   tomato breaker fruit                                    141         0.91
T1391   tomato red ripe fruit                                   23          0.59
#GHQ    Lycopersicon esculentum maturing                        21          0.10
#K3K    Solanum lycopersicum cv. Micro-Tom                      16          0.09
#CN1    Normalized cDNA library from ripening tomato pericarp   1           0.07
This result shows that this protein is fruit-specific.
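The arithmetic behind the electronic Northern above can be made explicit. The ‘% of library’ figure is the EST count for the contig divided by the total number of ESTs sequenced from that library, so the approximate library size can be back-calculated from the two reported numbers. The sketch below uses only the counts and percentages in the table. Because the percentages are rounded, the library sizes are estimates, and the helper name implied_library_size is made up.

```python
# (catalog number, ESTs in TC223282, % of library) as reported in the table above.
rows = [
    ("T1775", 141, 0.91),
    ("T1391", 23, 0.59),
    ("#GHQ", 21, 0.10),
    ("#K3K", 16, 0.09),
    ("#CN1", 1, 0.07),
]


def implied_library_size(count, percent):
    """Estimate total ESTs in a library from a count and its rounded percentage."""
    return round(count / (percent / 100))


total = sum(count for _, count, _ in rows)
print(f"Total ESTs in contig: {total}")

for cat, count, percent in rows:
    share = 100 * count / total  # share of this contig's ESTs from the library
    size = implied_library_size(count, percent)
    print(f"{cat}: {share:.1f}% of contig ESTs, library of roughly {size} ESTs")
```

Comparing ‘% of library’ rather than raw counts matters because libraries differ greatly in size: a single EST in a small library can represent a similar relative abundance to a dozen ESTs in a large one.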
• PLANT PHENOME DATABASES
Although not as developed as phenotype databases for mutants in model microorganisms, there are several
such resources for plants, and they are growing fast.
SeedGenes http://www.seedgenes.org/ Covers ~350 Arabidopsis genes that give a seed phenotype when
disrupted by mutation. Click ‘Enter’, click ‘Access the SeedGenes Query Page’, ‘Browse genes’, search for
AGI numbers in list.
Chloroplast2010 http://www.plastid.msu.edu/ Has morphological and metabolic phenotype data for
>3,500 mutants in genes whose products are predicted to be chloroplast-targeted. In ‘News’ column
click ‘Here’ * In ‘Analysis Overview’ click ‘Here’ → Log in or sign up to get an account * In Search by
Query Term(s) area, search by AGI number, e.g. At4g25050, At1g10310 (be sure to avoid blank spaces or
empty lines) → Click on links to genes * See tabs for morphology, leaf amino acid profile, etc
RAPID http://rarge.gsc.riken.jp/phenome/ Phenotypic data in transposon-insertional mutants. Click ‘Line
list’ → search for AGI number, e.g. At2g48120 → copy line code 11-2389-1 * Click on ‘Search’ → Paste
line code into search box * Click ‘Search’ → displays image of albino seedling
BAPDB http://bioweb.ucr.edu/bapdb/ Compiles mutant screening data such as quantitative assays. Click
on ‘Search BAP DB by Gene identifiers’ * Enter AGI number in search box, e.g. At1g32230, click ‘gene
search’ * Click on links to screens, e.g. #99 → Displays root growth data in various environments.
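Several of the query boxes above are sensitive to blank spaces or empty lines, so it can be worth tidying a list of AGI numbers before pasting it. The sketch below is one way to do this. It assumes AGI numbers of the form At, a chromosome code (1 to 5, C or M), g, and five digits; the function name clean_agi_list is hypothetical.

```python
import re

# 'At' + chromosome (1-5, C, M) + 'g' + five digits, in any letter case.
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def clean_agi_list(text):
    """Split pasted text into AGI numbers, dropping blanks and normalizing case."""
    cleaned = []
    for token in re.split(r"[\s,;]+", text):
        if not token:
            continue  # skip empty strings left by leading/trailing separators
        if not AGI_PATTERN.match(token):
            raise ValueError(f"not an AGI number: {token!r}")
        cleaned.append("At" + token[2].upper() + "g" + token[4:])
    return cleaned


pasted = "At4g25050, \n\n at1g10310 \n"
agi_numbers = clean_agi_list(pasted)
print("\n".join(agi_numbers))
```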
USING GENOME RESOURCES TO FIND PLANT ENZYME GENES
This exercise demonstrates how to find Arabidopsis and maize genes encoding a metabolic enzyme, starting
from the sequence of a bacterial enzyme, 5,10-methylenetetrahydrofolate reductase, EC 1.5.1.20 (MetF).
Go to Swiss-Prot Enzyme, enter 1.5.1.20 → Click on link to E. coli MetF → Capture FASTA sequence *
Go to NCBI Protein BLAST search * Select Arabidopsis thaliana → Hits on MTHFR1 and MTHFR2
(At3g59970 and At2g44160) Note multiple entries for each gene → Capture full-length (about 590 residues)
FASTA text sequences, save to Word file * Align in Multalin to confirm their very high similarity.
To find maize homologs, go to Maizesequence.org, click ‘BLAST’ in header bar * Paste either Arabidopsis
sequence in search box * Select ‘peptide queries’, ‘peptide database’, ‘Filtered gene set peptides’,
‘BLASTP’, search sensitivity ‘no optimization’, click ‘Run’ * In output, if necessary turn on all columns,
select ‘E-val’ in Stats and <E-val in Sort By * To see alignments, click [A] → Very strong hit,
GRMZM2G347056 (593 residues) on chromosome 1; also second hit, truncated (382 residues),
GRMZM2G034278 on chromosome 5. (Third hit is a small fragment) * To capture protein sequences, click
on GRMZM identifiers, ‘Protein sequence’, save to Word file.
Sequence alignment indicates that GRMZM2G034278 is distinct from GRMZM2G347056 and both
Arabidopsis sequences in lacking ~200 residues at the C-terminus, and in having a very different N-terminal
region of ~80 residues. GRMZM2G034278 is thus almost certainly an incorrectly-called gene or a
pseudogene.
To check whether these genes are expressed, use GRMZM2G034278 and GRMZM2G347056 protein
sequences in tBLASTn against maize ESTs:
- 50 exactly match GRMZM2G347056 (allowing for imperfections characteristic of EST sequences)
- None appear to exactly match GRMZM2G034278
Therefore, since the predicted GRMZM2G034278 protein is truncated, and has no cognate ESTs (i.e. is not
transcribed), it is most probably a pseudogene. Note that ~85% of the maize genome consists of hundreds of
families of transposable elements. These are responsible for capture and amplification of many gene
>MetF 5,10-methylenetetrahydrofolate reductase [Escherichia coli str. K-12 substr. MG1655]
>MTHFR1 gi|15232215|ref|NP_191556.1| methylenetetrahydrofolate reductase 1 [Arabidopsis thaliana]
>MTHFR2 gi|18406468|ref|NP_566011.1| methylenetetrahydrofolate reductase 2 [Arabidopsis thaliana]
USING COMPARATIVE GENOMICS RESOURCES
Comparative genomics can:
● Find genes for functions (i.e., the function is known to exist, but the gene specifying it has not been
identified)
● Find functions for genes (i.e., the gene is known from the genome, but its function is not)
It operates on the ‘guilt by association’ principle – ‘Show me your friends and I’ll tell you who you are’ or
‘Birds of a feather flock together’
[Figure: types of evidence used in comparative genomics. Genomic evidence: gene clustering, gene fusion,
shared regulatory sites, phylogenetic occurrence. Post-genomic evidence: protein-protein (interactions),
organelle proteomes, essentiality & other phenome data, structures.]
• STRING http://string.embl.de/
Functional relationships between proteins can often be inferred from genomic associations between the genes
that encode them: groups of genes involved in the same pathway tend to be close together (clustered) in
prokaryote genomes (often in operons), to be involved in gene-fusion events, and to show similar species
STRING is a precomputed database to explore functional relationships between proteins (clustering, fusions,
co-occurrence etc). STRING gives an integrated confidence score for the associations it predicts. It is
often the best database to begin a comparative genomics project.
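The co-occurrence idea can be illustrated with presence/absence profiles: two genes that are present in the same set of genomes, and absent from the same set, are candidates for a functional link. The sketch below scores profile similarity with a simple Jaccard index. This is not STRING's scoring scheme, which is more elaborate, and the gene names and profiles are invented.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles over the same genomes."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


# Made-up profiles: 1 = gene present in a genome, 0 = absent (eight genomes).
occurrence = {
    "geneX": [1, 1, 0, 1, 0, 0, 1, 1],
    "geneY": [1, 1, 0, 1, 0, 0, 1, 0],
    "geneZ": [0, 0, 1, 0, 1, 1, 0, 0],
}

xy = jaccard(occurrence["geneX"], occurrence["geneY"])
xz = jaccard(occurrence["geneX"], occurrence["geneZ"])
print(f"geneX vs geneY: {xy:.2f}")  # similar distribution
print(f"geneX vs geneZ: {xz:.2f}")  # never found together
```

A score near zero, as for geneX and geneZ, can itself be informative: genes that never occur together may encode alternative solutions to the same step.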
Example 1 – Rediscovering Nudix enzyme FolQ in Lactococcus lactis: Enter via a protein name, view
associations among proteins with that name.
* FolP (dihydropteroate synthase, EC 2.5.1.15), a key enzyme of pterin and folate synthesis
* Select Lactococcus lactis MG1363 from organism list (results are similar but not identical using other
species) * Click Go!
* Displays ‘Evidence View’ - different line colors represent types of association (clustering on chromosome
(= Neighborhood), co-occurrence, co-expression, protein-protein interactions, etc)
* Spheres are gene products, can be dragged to disentangle networks * Filled spheres have protein structures,
all spheres are clickable for more information
* Click ‘Confidence View’ - stronger associations are represented by thicker lines
* Note strong associations between the set of folate synthesis enzymes – FolP, FolB, FolC, HPPK – & FolQ
(before FolQ was known it was included in the network as unknown protein YlgG)
* To see more interactions, Click ‘+ More’ button, or expand list in pull-down menu e.g. to 50 interactors
* In ‘Evidence View’ screen, click on bullets in the table for more information, e.g. HPPK (=FolK) bullet in
‘Neighborhood’ → shows linkage between FolP & FolK in diverse genomes.
Notes: STRING data come from many genomes, not just the one used to enter the system (L.
lactis in this case)
The more diverse the genomes, the more probable it is that the linkage represents a
* Gene fusions – for the HPPK protein, click on ‘Gene Fusion’ bullet → Shows HPPK is fused to FolP in
various organisms, including Arabidopsis
Example 2 – Discovering the E. coli equivalent of FolQ: E. coli does not have a close homolog of FolQ.
Enter via a protein sequence, that of E. coli FolE:
>E. coli GTP cyclohydrolase I (EC 3.5.4.16)
* Select E. coli K12 MG1655 (classical laboratory strain) from organism list * Click Go! * On FolE page,
* Note strong associations of FolE with FolP, FolK, FolB, FolC & NudB * Click on NudB sphere →
Annotation is dihydroneopterin triphosphate pyrophosphohydrolase, Nudix family
* Click on UniProt link to go to data page that includes 2007 publication from Bessman’s group (PubMed:
17698004) demonstrating activity
* Alternatively, go to EcoliWiki http://ecoliwiki.net/colipedia/index.php/Welcome_to_EcoliWiki →
Shows NudB = dihydroneopterin triphosphate pyrophosphohydrolase
Example 3 – Predicting possible functions for the At3g13050 protein:
>At3g13050 [Arabidopsis thaliana]
* Suppose that we are investigating NAD(P) biosynthesis. Using ATTED, we find that the expression of
NADP synthesis enzyme At3g21070 NADK1 (a cytosolic isoform of NAD kinase) is positively correlated
with expression of At3g13050, which encodes a protein of unknown function in Arabidopsis (annotated
‘transporter-related’ in GenBank).
* TMHMM search shows that At3g13050 has multiple membrane-spanning domains
* BLASTp search of bacteria in GenBank → Conserved domain search indicates At3g13050 is a major
facilitator superfamily (MFS) transporter → Best hits (e-50 or better) include Deinococcus geothermalis,
Deinococcus deserti, and Deinococcus radiodurans
* Go to STRING, BLAST, select Deinococcus geothermalis * Note clustering with 3 enzymes of NAD
synthesis and 5 enzymes of thiamin synthesis
(Similar but not identical results with D. deserti or D. radiodurans. Note that it is important to try
more than one organism as an entry point to the system.)
* Click on Neighborhood bullets to see the organisms in which the clustering with NAD or thiamin synthesis
* Therefore a functional prediction is that At3g13050 transports NAD or thiamin, or a precursor of NAD or
* At3g13050 and various bacterial homologs are now known to transport the NAD precursor nicotinic acid.
The Thermus thermophilus homolog is known to transport thiamin.
Example 4 – Predicting possible functions for the At4g26860 protein:
>At4g26860 [Arabidopsis thaliana]
* Suppose that we are investigating proline biosynthesis. Using ATTED, we find that the expression of
At1g23310 GGT1, a peroxisomal glutamate:glyoxylate aminotransferase that is functionally linked to
proline synthesis, is very strongly correlated with expression of At4g26860, which encodes a protein of
unknown function in Arabidopsis (annotated ‘pyridoxal phosphate binding, alanine racemase family protein,
or putative proline synthetase associated protein’ in GenBank).
* TargetP and Predotar indicate plastid targeting of At4g26860
* BLASTp search of bacteria in GenBank → Conserved domain search indicates uncharacterized member of
the alanine racemase family (pyridoxal phosphate-containing) → Best hits (e-40 or better), e.g. Geobacter
sp. M21, Vibrio harveyi
* Go to STRING, BLAST, select Geobacter sp. M21 or Vibrio harveyi * Note extremely strong clustering
with the proline biosynthesis enzyme pyrroline-5-carboxylate reductase, ProC
* Note that pyrroline-5-carboxylate reductase is reported to be plastidial in plants, i.e. that it is in the same
subcellular compartment as At4g26860.
* Therefore one possible functional prediction is that At4g26860 participates in proline biosynthesis. One
step in proline synthesis (the cyclization of glutamic acid γ-semialdehyde to give Δ1-pyrroline-5-carboxylate)
is considered to be spontaneous – could At4g26860 accelerate this reaction?
• The SEED
SEED http://www.theseed.org/wiki/Home_of_the_SEED The SEED is a versatile tool for investigating
functional relationships between genes. Unlike STRING, it is not rigidly precomputed; the user has more
control. To explore SEED, we will use the latter two examples above, i.e. At3g13050 and At4g26860.
Example – Predicting possible functions for the At3g13050 protein:
* Go to PubSEED, Click ‘Navigate’ tab, select BLAST search * Paste At3g13050 sequence into box →
Select Deinococcus geothermalis (either entry) * Best hit 2e-61* Click on link * Opens Annotation
Overview page (the ‘Facebook page’ for the gene) (protein is annotated ‘Niacin transporter NiaP’ – note that
this is a prediction) * Note links to KEGG, to Psi-BLAST etc
* The ‘Compare Regions’ tool displays the chromosome region around the D. geothermalis niaP query gene,
and those around the four closest homologs of the query gene
* Similar genes have the same color → Hover over to see annotations * Note that niaP is in a cluster of NAD
synthesis genes in D. geothermalis
* Click on ‘Advanced’, expand number of regions to 400, relax both the cutoffs to 1e-10, check ‘collapse
close genomes’, click ‘Draw’
* Regions around homologs of the query gene are displayed from hundreds of genomes. The genes are
numbered in order of decreasing frequency of occurrence, 1 being the query gene, 2 being the most often
clustered, 3 being the next most often etc.
* Note NAD-related gene clusters also in Pyrobaculum islandicum (NAD kinase) and Thermotoga spp.
(Transcriptional repressor for NAD biosynthesis)
* Note thiamin-related gene clusters in Thermus thermophilus (1/4 of the way down the page) and Pyrobaculum
islandicum (2/3 of the way down the page; second cluster, next to thiamin-related one)
* A phylogeny of selected bacterial and plant proteins (including the Bacillus subtilis and Acinetobacter sp.
NiaP proteins, which have been shown to transport niacin) places the plant genes closest to a gene clustered
with NAD genes:
Niacin Bacillus subtilis
Niacin Acinetobacter sp.
Niacin Pyrobaculum islandicum
Thiamine Thermus thermophilus
Choline Burkholderia xenovorans
Niacin Thermotoga maritima
At3g13050 Arabidopsis thaliana
Niacin Deinococcus geothermalis
Thiamine Pyrobaculum islandicum
* Therefore, as with STRING, SEED prediction favors niacin transport (or possibly thiamin or its
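The numbering used by the ‘Compare Regions’ display above (1 for the query gene, 2 for the gene most often clustered with it, and so on) can be sketched in a few lines. This is an illustration only, assuming each region has already been reduced to a list of gene-family labels. The toy regions and the rank_families helper are hypothetical and do not reproduce how SEED computes its display.

```python
from collections import Counter

# Hypothetical toy data: each region lists the gene families found around
# one homolog of the query gene (one region per genome).
regions = [
    ["niaP", "nadA", "nadB", "nadC"],
    ["niaP", "nadA", "nadC"],
    ["niaP", "nadA", "thiE"],
    ["niaP", "hypothetical"],
]


def rank_families(regions, query):
    """Number gene families by how many regions they occur in (query = 1)."""
    counts = Counter()
    for region in regions:
        # Count each family once per region and leave out the query itself.
        counts.update(set(region) - {query})
    # Decreasing frequency; ties are broken alphabetically for stable output.
    ordered = sorted(counts, key=lambda fam: (-counts[fam], fam))
    ranking = {query: 1}
    for number, fam in enumerate(ordered, start=2):
        ranking[fam] = number
    return ranking, counts


ranking, counts = rank_families(regions, "niaP")
for fam, number in sorted(ranking.items(), key=lambda item: item[1]):
    print(number, fam)
```

Families with low numbers are the ones most consistently found next to the query gene, which is the evidence the clustering argument relies on.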
Example – Predicting possible functions for the At4g26860 protein:
* Go to PubSEED, Click ‘Navigate’ tab, select BLAST search * Paste At4g26860 sequence into box *
Select Vibrio harveyi ATCC BAA-1116 * Best hit 2e-42 → Click on link * On Annotation Overview page
note descriptive annotation ‘Hypothetical protein YggS, proline synthase co-transcribed bacterial homolog’
* Note that it is adjacent to the proline biosynthesis gene proC, which encodes pyrroline-5-carboxylate reductase (EC 1.5.1.2) * Click
on ‘Advanced’, expand number of regions to 400, relax both the cutoffs to 1e-10, check ‘collapse close
genomes’, click ‘Draw’ * Note the clustering with proC in many diverse genomes.
Note that the Vibrio harveyi gene and other bacterial genes are in two SEED subsystems. Subsystems
correspond to particular pathways, processes, or gene clusters, and are curated by experts.
Subsystems are basically spreadsheets where columns are proteins involved in a pathway or process and
rows are organisms (arranged taxonomically). Numbers in cells link to annotation pages.
Subsystems can display clustering information and essentiality data. They summarize expert knowledge of
pathways and processes and may include predictions about gene functions.
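The spreadsheet layout described above (rows are organisms, columns are functional roles, cells hold genes) can be pictured as a nested mapping. The sketch below is a toy illustration under that assumption. The genome names, gene identifiers, and the build_spreadsheet and format_rows helpers are all hypothetical, and this is not the SEED data model or API.

```python
# Hypothetical annotation records: (genome, functional role, gene identifier).
records = [
    ("Genome A", "YggS", "gene_a1"),
    ("Genome A", "ProC", "gene_a2"),
    ("Genome B", "YggS", "gene_b1"),
    ("Genome B", "ProC", "gene_b2"),
    ("Genome C", "YggS", "gene_c1"),
]
roles = ["YggS", "ProC"]  # spreadsheet columns


def build_spreadsheet(records, roles):
    """Pivot flat records into {genome: {role: [genes]}}."""
    table = {}
    for genome, role, gene in records:
        if role not in roles:
            continue  # ignore roles that are not columns of this subsystem
        row = table.setdefault(genome, {r: [] for r in roles})
        row[role].append(gene)
    return table


def format_rows(table, roles):
    """Render the table as tab-separated lines, one row per genome."""
    lines = ["\t".join(["Genome"] + roles)]
    for genome in sorted(table):
        cells = [", ".join(table[genome][role]) or "-" for role in roles]
        lines.append("\t".join([genome] + cells))
    return lines


table = build_spreadsheet(records, roles)
print("\n".join(format_rows(table, roles)))
```

An empty cell (shown here as ‘-’) marks a genome in which no gene has been assigned to that role.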
* Open subsystem ‘A Hypothetical Protein Related to Proline Metabolism’ in new tab * Click on
‘Description’ tab. This is a very simple subsystem, with just the At4g26860 homolog (named for the E. coli
protein YggS) and pyrroline-5-carboxylate reductase.
* Click on ‘Functional roles’ tab – shows the proteins included in the subsystem * Click on ‘Subsystem
Spreadsheet’ tab * To display all genomes enter total number of genomes in ‘display __ items per page’ box,
hit return * Default is coloring by cluster; neighboring genes on the chromosome are highlighted in the same
color * Note clustering in diverse genomes
* To color by essentiality, click ‘coloring by attribute’ radio button * Select ‘Essential gene sets bacterial’
from drop-down menu * Click ‘Update’ * Note that the At4g26860 homolog is essential in Staphylococcus
aureus, Haemophilus influenzae, Helicobacter pylori
* Open subsystem ‘CBSS-630.2.peg.3360’ * Go to spreadsheet * This is a somewhat larger, experimental
subsystem, based solely on clustering, i.e. where function is not yet clear
* Display all genomes – Note clustering of additional genes with the At4g26860 homolog and pyrroline-5-
carboxylate reductase * One of these genes (yggU in E. coli) encodes a small protein, annotated ‘UPF0235
protein VC0458’ COG1872 * Click on link to this gene in E. coli K12 → Capture protein sequence
* Go to NCBI BLASTp, search Arabidopsis thaliana & Zea mays → Single gene in both plants, long N-
>YggU fig|83333.1.peg.2904 [Escherichia coli K12] [UPF0235 protein VC0458]
>At5g63440 gi|145334887|ref|NP_001078789.1| unknown protein [Arabidopsis thaliana]
>GRMZM2G099547 gi|223944751|gb|ACN26459.1| unknown [Zea mays]
* Run TargetP and Predotar predictions → Predict plastid location for Arabidopsis (i.e. same as At4g26860)
* Therefore a functional prediction is again that At4g26860 participates in proline biosynthesis in the plastid.
But with the help of the SEED subsystem approach, we can further predict that At4g26860 may work in
concert with At5g63440.
* Both the subsystems above are small ones. To review a more typical subsystem that covers proline
biosynthesis pathways click on ‘Navigate’ tab, select ‘Subsystems’ * Select ‘Proline Synthesis’ subsystem
* Display Spreadsheet * Note columns corresponding to all pathway enzymes, and that some genomes lack
key proline synthesis enzymes – These are proline auxotrophs
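Scanning a subsystem spreadsheet for genomes that lack pathway enzymes, as in the last step above, amounts to looking for empty cells. The sketch below assumes a simplified three-enzyme route from glutamate (ProB, ProA, ProC); the actual ‘Proline Synthesis’ subsystem may contain more columns. The genomes and the missing_roles helper are hypothetical.

```python
# Simplified set of roles, assumed for illustration: glutamate 5-kinase (ProB),
# glutamate-5-semialdehyde dehydrogenase (ProA), and
# pyrroline-5-carboxylate reductase (ProC).
REQUIRED_ROLES = ("ProB", "ProA", "ProC")

# Hypothetical toy spreadsheet: genome -> set of roles with an assigned gene.
spreadsheet = {
    "Genome A": {"ProB", "ProA", "ProC"},
    "Genome B": {"ProC"},
    "Genome C": set(),
}


def missing_roles(spreadsheet, required):
    """Return {genome: [absent roles]} for genomes lacking any required role."""
    report = {}
    for genome, present in spreadsheet.items():
        absent = [role for role in required if role not in present]
        if absent:
            report[genome] = absent
    return report


candidates = missing_roles(spreadsheet, REQUIRED_ROLES)
for genome, absent in candidates.items():
    print(genome, "lacks", ", ".join(absent))
```

An empty cell can also reflect a gene that has not been annotated or an alternative enzyme, so genomes flagged this way are best treated as candidate auxotrophs until checked.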