
BIOINFORMATICS
Surveys

Mark Gerstein, Yale University
bioinfo.mbb.yale.edu/mbb452a
(c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu
Large-scale Database Surveys (contents)

•   Fold Library
•   Parts Lists: homologs, motifs, orthologs, folds
•   Overall Sequence-structure Relationships, Annotation Transfer
•   Parts in Genomes, shared & common folds
•   Genome Trees
•   Extent of Fold Assignment: the Bias Problem
•   Bulk Structure Prediction
•   The Genomic vs. Single-molecule Perspective
•   Understanding Biases in Sampling
•   Relationship to experiment: LIMS, target selection
•   Function Classification
•   Cross-tabulation, folds and functions
Simplifying the Complexity of Genomes:
Global Surveys of a Finite Set of Parts from Many Perspectives

[Figure: numbered genes 1, 2, 3, … for human (~100000 genes) and for T. pallidum (~1000 genes), with ~1000 folds]

Same logic for sequence families, blocks, orthologs, motifs, pathways, functions....

Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGs, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....
Part = Homolog

Part = Motif

Part = Conserved Domains
Part = Ortholog
COGs - Orthologs

Ortholog ~ gene with precisely the same role in a different organism, directly related by descent from a common ancestor

vs. Paralog

Ortholog, homolog, fold

(Lipman, Koonin, NCBI)
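
The ortholog definition above can be illustrated with a small sketch. It uses the reciprocal-best-hit heuristic, a common simplification and not the COGs procedure itself. The gene names, the score dictionaries and both function names are hypothetical; the scores stand in for pairwise similarity scores from a search tool such as blast or fasta.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical scores: genome A searched against genome B, and the reverse.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a1"): 149, ("b2", "a2"): 95}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy input, a2's best hit is b2, but b2's best hit is a1, so a2 and b2 are not paired. That is the kind of case where a paralog may be involved and the simple heuristic declines to call an ortholog.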
What is an Ortholog?
The Parts List: A Library of Known Folds

[Figure: individual structures (labels Hb, Mb); Alignment of Individual Structures; Fusing into a Single Core Structure Template]

Statistics to Establish Relationships (P-values): P<.000001, P<.001, P~1
Fold Classifications

• Scop
   Chothia, Murzin (Cambridge)
   Manual classification, auto-alignments available
   Evolutionary clusters
• Cath
   Thornton (London)
   semi-automatic classification with alignments
   class, arch, topo., homol.
• FSSP
   Sander, Holm (Cambridge)
   totally automatic with DALI
   objective but not always interpretable clusters
• VAST
Part = Fold
Fold Library vs. Other Fundamental Data Structures

Parts List Database; statistical, rather than mathematical, relationships and conclusions

Larger than physics and chemistry. Similar to finance: an exact, finite number of objects (3,056 on NYSE by 1/98), described by standardized statistics (even abbreviations, e.g. INTC) and groups (sectors). Smaller than social surveys: indefinite number of people, not well-defined vocabulary and statistics.
Chothia & Lesk, 1986 -- 32 points

                 • EMBO J 5: 823 (1986)
                 • “The relation between the divergence of sequence and structure in proteins”
                 • 32 pairs of homologous proteins
                 • RMS, percent identity
                   D = 0.40 exp(1.87 H)
                 • Now redo with >16,000 pairs in scop + auto-alignments (pdb95d)....
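
As a rough illustration of the Chothia & Lesk relation, the sketch below evaluates D = 0.40 exp(1.87 H) for a given percent identity. It assumes that H is the fraction of residues that differ (1 minus the fractional identity) and that D is an RMS deviation in angstroms; the slide does not spell out either definition, and the function name is hypothetical.

```python
import math


def chothia_lesk_rms(percent_identity, a=0.40, b=1.87):
    """Evaluate D = a * exp(b * H) for a given percent sequence identity.

    Assumption: H = 1 - percent_identity / 100 (fraction of mutated residues).
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must lie between 0 and 100")
    h = 1.0 - percent_identity / 100.0
    return a * math.exp(b * h)


# At 100% identity H is 0, so the curve starts at the prefactor a.
rms_at_identical = chothia_lesk_rms(100.0)  # 0.40
rms_at_half = chothia_lesk_rms(50.0)
```

The default parameters are the published ones quoted on the slide; passing other values of `a` and `b` evaluates a refitted curve of the same form.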
End of class on 11.15
Chothia and Lesk, revisited: 16K points

   C&L ‘86:
     D = .4 exp(1.9 H)
   Here:
     D = .2 exp(1.3 H)
     D = .2 exp(1.9 H)
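
One simple way to obtain parameters of the form D = a exp(b H) is a straight-line least-squares fit to ln D against H. The sketch below does that in plain Python. It is an unweighted fit on made-up points, not the weighted fitting used for the curves in these slides, and the function name is hypothetical.

```python
import math


def fit_exponential(h_values, d_values):
    """Least-squares fit of ln(D) = ln(a) + b * H; returns (a, b)."""
    if len(h_values) != len(d_values) or len(h_values) < 2:
        raise ValueError("need at least two (H, D) pairs of equal length")
    logs = [math.log(d) for d in d_values]  # requires every D > 0
    n = len(h_values)
    mean_h = sum(h_values) / n
    mean_log = sum(logs) / n
    sxx = sum((h - mean_h) ** 2 for h in h_values)
    if sxx == 0:
        raise ValueError("H values must not all be identical")
    sxy = sum((h - mean_h) * (l - mean_log) for h, l in zip(h_values, logs))
    b = sxy / sxx
    a = math.exp(mean_log - b * mean_h)
    return a, b


# Synthetic points generated from D = 0.2 exp(1.3 H), so the fit
# should recover a close to 0.2 and b close to 1.3.
hs = [0.0, 0.25, 0.5, 0.75]
ds = [0.2 * math.exp(1.3 * h) for h in hs]
a_fit, b_fit = fit_exponential(hs, ds)
```

Fixing b at the published exponent and fitting only the prefactor would be a one-parameter variant of the same idea.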
Problems with RMS

  •      Dominated by worst-fitting atoms
  •      Trimming is arbitrary (50%)
  •      “Bunching up” between 20% and 0% identity

[Figure: two plots against % sequence identity (100 down to 0), one of RMS and one of RMS (50% trim). Series: mean RMS, median, Chothia & Lesk, weighted fits 1-3, CL Fit.]

Fitted curves:

                     RMS                       RMS (50% trim)
  Chothia & Lesk     D = 0.40 exp(1.87 H)      D = 0.40 exp(1.87 H)
  weighted fit 1     D = 0.78 exp(2.15 H)      D = 0.24 exp(1.41 H)
  weighted fit 2     D = .036 exp(6.40 H)      D = (6.6E-5) exp(12.6 H)
  weighted fit 3     D = 0.68 exp(2.47 H)      D = 0.22 exp(1.68 H)
  CL Fit             D = 0.96 exp(1.87 H)      D = 0.20 exp(1.87 H)

Weighted fit 1 and CL Fit are fit to percent sequence identity from 25-100%, fit 2 to % seq id from 17-25%, fit 3 to 17%-100%. CL Fit was calculated by varying the 0.40 parameter in the Chothia & Lesk function.
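
The first two problems can be seen in a toy calculation. The sketch below assumes the per-atom distances after superposition are already available, and that a 50% trim means keeping the best-fitting half of the atoms; the slides do not spell out the trimming rule, and the function names and distances are hypothetical.

```python
import math


def rms(distances):
    """Root-mean-square of per-atom distances (same units as the input)."""
    if not distances:
        raise ValueError("need at least one distance")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def trimmed_rms(distances, keep_fraction=0.5):
    """RMS over the best-fitting `keep_fraction` of atoms (smallest distances)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = max(1, int(len(distances) * keep_fraction))
    return rms(sorted(distances)[:keep])


# Nine well-fitting atoms and one badly fitting atom.
distances = [0.5] * 9 + [10.0]
full = rms(distances)             # pulled far above 0.5 by the single outlier
trimmed = trimmed_rms(distances)  # 0.5: the outlier is discarded
```

Changing `keep_fraction` changes the answer, which is the arbitrariness the slide points to: there is no principled reason to prefer 50% over any other cutoff.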
Structural Comp. Score vs. Smith-Waterman Score

overcomes zero bunching, trimming problem

Sstr = 100 (21 - 11 exp(-0.0054 SWS))
Problems with Structural Alignment Score

Different lengths give different scores.
Scores follow an equation of the form:
  y = An + Mx + B

Modern statistical language

~in TZ:
  Pstr = 10^-10 Pseq^.05
in TZ:
  Pstr = 10^-6 Pseq^.274

overcomes length dependency
Focus on Twilight Zone

• Sequence Sig. without structure signif.
   Protein motions
   small proteins
   low-res, NMR
• Struc. Sig. without Seq. signif.
   More in bottom-right than top-left
Relationship of Similarity in Sequence & Structure - Summary

                  Sequence     Structural   Features                       Limitations
                  Similarity   Similarity

  Traditional     Percent      RMS Cα       Well understood, in use        RMS depends most highly on
  Scores          sequence     separation                                  worst matches, requiring
                  identity                                                 arbitrary trimming

  Alignment       Sseq         Sstr         Analogous similarity scores,   Dependence on alignment
  Similarity                                Sstr depends most highly on    length
  Scores                                    best matches

  Modern          Pseq         Pstr         Statistical significance,      Not as familiar as RMS and
  Probabilistic                             unified framework for          percent identity, some residual
  Scores                                    different comparisons          length-dependency
Integrated Analysis System: X-ref Parts with Genomes

Folds: scop+automatic
Orthologs: COGs
“Families”: homebrew, ProtoMap

Finding parts in genome sequences: blast, ψ-blast, fasta, TM, low-complexity, &c (Altschul, Pearson, Wootton)

[Figure: part occurrence profiles]

One approach of many...
Much previous work on Sequence & Structure Clustering: CATH, Blocks, FSSP, Interpro, eMotif, Prosite, CDD, Pfam, Prints, VAST, TOGA…

Remington, Matthews ‘80; Taylor, Orengo ‘89, ‘94; Thornton, CATH; Artymiuk, Rice, Willett ‘89; Sali, Blundell, ‘90; Vriend, Sander ‘91; Russell, Barton ‘92; Holm, Sander ‘93+ (FSSP); Godzik, Skolnick ‘94; Gibrat, Bryant ‘96 (VAST); F Cohen, ‘96; Feng, Sippl ‘96; G Cohen ‘97; Singh & Brutlag, ‘98
Cross-Reference: Folds, Sequences, Organisms

[Figure: Individual Structures, Sequence Families (A1, A2), Folds]
(2) Match Sequences (fasta, blast)
(3) Organize Sequences by Genome or Taxon

[Figure: ten numbered sequences (1-10) with matches to parts A1, B1, C1, D1; an example sequence divided into numbered regions]

Legend:
  1  PDB Match (152)
  2  Low Complexity Region (116)
  3  TM helix (30)
  4  Linker Region (5)
  5  Coiled-Coil
  6  All-alpha or All-beta Region
     Structurally Uncharacterized (186)

  Abbrev.  Kingdom (subgroup)         Genome                     Num. ORFs  Reference
  EC       Bacteria (gram negative)   Escherichia coli           4290       Blattner et al.
  HI       Bacteria (gram negative)   Haemophilus influenzae     1680       TIGR
  HP       Bacteria (gram negative)   Helicobacter pylori        1577       TIGR
  MG       Bacteria (gram positive)   Mycoplasma genitalium      468        TIGR
  MJ       Archaea (Euryarchaeota)    Methanococcus jannaschii   1735       TIGR
  MP       Bacteria (gram positive)   Mycoplasma pneumoniae      677        Himmelreich et al.
  SC       Eukarya (fungi)            Saccharomyces cerevisiae   6218       Goffeau et al.
  SS       Bacteria (Cyanobacteria)   Synechocystis sp.          3168       Kaneko et al.

  class  Fold#  EC  SC  HI  SS  HP  MJ  MP  MG  total  Fam.  PDB  Rep. Struc.  Name
  α/β    18     60  46  23  40  19   7   4   3  202    16    183  1xel
  α/β    24     20  69  17  19  17  16  10  11  179    13    132  1gky
  α+β    31     37  28  18  16  12  40   3   3  157    23    160  1fxd
                                                                                                                                                                                                      -
                                                                                                                                                                                                      -
                                                                                                                                                                                                      -
                                                                                                                                                                                                                  NAD(P)-binding Rossmann Fold
                                                                                                                                                                                                                  P-loop Containing NTP Hydrolases
                                                                                                                                                                                                                  like Ferrodoxin

6 pairs

(4) Results in “Fold Table”
α/β 01    45  36  13  22  11  10   5   4   146   37   399   1byb   -           TIM-barrel
α/β 23    18  17   7   9   4   8   2   2    67    5    36   1pyd   a:2-181     Thiamin-binding
α/β 04    15  11   7  10   1   9   5   5    63   13   132   2tmd   a:490-645   FAD/NAD(P)-binding

A3
α+β 55     8   9   7   8   9   3   6   6    56    4    23   1sry   a:111-421   Class-II-aaRS/Biotin Synthetases
β 27       7  10   8   8   4   4   3   3    47    5    19   1fnb   19-154      Reductase/Elongation Factor Domain
β 24      13   7   4   3   3   3   3   3    39   18   177   1snc   -           OB-fold

("Superfold")
α+β 11    10   8   4   8   2   2   2   1    37   11    48   1igd   -           beta-Grasp
β 55       9  10   5   5   2   2   2   2    37    7    19   1bdo   -           Barrel-sandwich hybrid
α/β 15     5   5   4   4   5   6   3   3    35    3    22   2ts1   1-217       ATP pyrophosphatases


A4
α/β 05    10   4   2   4   2   2   2   3    29    4    35   1zym   a:          The "swivelling" beta/beta/alpha domain
α/β 60     5   7   4   6   3   2   1   1    29    3    18   3pmg   a:1-190     Phosphoglucomutase, first 3 domains
α+β 68     4   2   3   6   4   2   4   3    28    2     3   1mat   -           Creatinase/methionine aminopeptidase
α+β 39     6   4   3   4   4   1   1   1    24    3    42   1gad   o:149-312   like G3P dehydrogenase, Ct-dom
α+β 18     5   4   4   1   2   2   1   2    21    3    23   1fkd   -           FKBP-like
α/β 41     3   3   3   3   1   3   1   1    18    3    16   1opr   -           Phosphoribosyltransferases (PRTases)


                                                  B1                                                                              78
                                                                                                                               +b 10
                                                                                                                               +b 43
                                                                                                                                          1
                                                                                                                                          2
                                                                                                                                          4
                                                                                                                                               9
                                                                                                                                               2
                                                                                                                                               3
                                                                                                                                                    1
                                                                                                                                                    2
                                                                                                                                                    2
        …                         2   1   1   1   1     17     1   23  1oel a:(*)       GroEL, the ATPase domain
        …                         4   2   1   2   2     17     2    5  1dar 477-599     Ribosomal protein S5 domain 2-like
        …                         2   1   1   2   2     17     4   50  3grs 364-478     FAD/NAD-linked reductases, dimer-dom.
        class Fold#  EC  SC  HI  SS  HP  MJ  MP  MG  total  Fam.  PDB  Rep. Struc.      Name
        α+β   09      3   4   3   1   2   1   1   1     16     3   12  1kpa a:          HIT-like
        α/β   47      4   2   3   1   2   1   1   1     15     2   10  1ulb -           Purine and uridine phosphorylases
        α+β   33      3   1   3   3   2   1   1   1     15     2    3  1tig -           IF3-like
C1      α+β   26      2   3   1   2   2   1   1   1     13     3    4  1stu -           dsRBD & PDA domains
        α+β   29      2   5   1   1   1   1   1   1     13     3   26  1one a:1-141     like Enolase, Nt-dom.
        M     11      2   1   2   1   2   2   1   1     12     1    1  1ecl -           type I DNA topoisomerase

        α/β   18     60  46  23  40  19   7   4   3    202    16  183  1xel -           NAD(P)-bindin
1 pair
        β     23      1   3   1   1   1   1   1   1     10     1    1  1whi -           Ribosomal protein L14
        α/β   31      2   2   1   1   1   1   1   1     10     1   10  1trk a:535-680   Transketolase, Ct-dom.
        α/β   61      1   1   1   1   1   1   1   1      8     1    4  3pgk -           Phosphoglycerate kinase
C2      α/β   13     49   8  14  57  12   5       1    146    15  100  3chy -           Flavodoxin-like

        α/β   24     20  69  17  19  17  16  10  11    179    13  132  1gky -           P-loop Contai



        α/β   38     24  54  15  11   4       4   5    117    19  112  2rn2 -           Ribonuclease H-like motif
        α     02      7  18   6   9   4       5   5     54     4   33  1hdj -           Long alpha-hairpin
        β     21     14  13   3   3   2       2   1     38     2   44  1lep a:          GroES-like
        α/β   30      7  13   4  10   2       1   1     38     7   83  1srx -           Thioredoxin-like

        α+β   31     37  28  18  16  12  40   3   3    157    23  160  1fxd -           like Ferrodoxi
                                                                                                                                                                                                                                                               /b 01
        α/β   56      8   4   2   4   2   4   2         26     3  105  2at2 a:          Asp-carbamoyltransferase, Cat.-chain


                                                  D1                                                                           +b 70
                                                                                                                               /b 44
                                                                                                                                M 12
 (1) Structures in Folds (scop)

 Total   PDB    Range       Fold
   146   1byb   -           TIM-barrel
    67   1pyd   a:2-181     Thiamin-bindin
    63   2tmd   a:490-645   FAD/NAD(P)-
    56   1sry   a:111-421   Class-II-aaRS
    47   1fnb   19-154      Reductase/El
    39   1snc   -           OB-fold
    37   1igd   -           beta-Grasp
    24   1mxa   1-101       S-adenosylmethionine synthetase, MAT
    23   1vid   -           SAM-dependent methyltransferases
    22   1bgw   -           type II DNA topoisomerase
    21   1dkz   a:          like HSP70, Ct-dom.
    18   1bmf   a:24-94     like F1 ATP synthase, a & b sub., A-dom.
    17   1fha   -           Ferritin-like
    16   1xaa   -           Isocitrate/isopropylmalate dehydrogenases
    16   2pol   a:1-122     DNA clamp
    14   1bmf   a:380-510   Left-handed superhelix
    14   2ctb   -           Zn-dependent exopeptidases
    13   1cde   -           Glycinamide ribonucleotide transformylase
    11   1lxa   -           Single-stranded left-handed beta-helix
    10   1pkn   116-217     Pyruvate kinase beta-barrel domain
     9   1efu   a:297-393   EF-Tu, Ct-dom.
     9   1rlr   221-748     ribonucleotide reductase, R1 sub., Ct-dom.
     9   1mld   a:145-313   like LDH/MDH, Ct-dom.
     7   1bmf   g:          F1-ATPase, gamma subunit
     7   1ctf   -           Ribosomal protein L7/12, Ct-dom.
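In the fold table, each total appears to be the sum of eight per-genome counts on the same row. The following is a minimal sketch of that arithmetic for three rows whose counts survive intact on the slide (TIM-barrel, OB-fold, beta-Grasp). The genome order of the eight columns is not labeled here, so the counts are treated as anonymous positions, and the variable names are illustrative only.

```python
# Per-genome counts copied from three rows of the slide's fold table.
# The eight positions correspond to eight genomes; their order is not
# labeled on this slide, so no genome names are assumed here.
genome_counts = {
    "TIM-barrel": [45, 36, 13, 22, 11, 10, 5, 4],
    "OB-fold": [13, 7, 4, 3, 3, 3, 3, 3],
    "beta-Grasp": [10, 8, 4, 8, 2, 2, 2, 1],
}

# Total occurrences of each fold across all eight genomes.
totals = {fold: sum(counts) for fold, counts in genome_counts.items()}

# Rank folds by total, most frequent first, as the table is ordered.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for fold, total in ranked:
    print(f"{total:5d}  {fold}")
```

The computed totals (146, 39, 37) agree with the total column printed on the slide for these rows.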
    Large-scale Example: Census DB

•   9 Genome Comparison




•   1437 Relational Tables
•   442 Mb
•   Simple ASCII Layout
A Parts List Approach to Bike Maintenance
•   How many roles can these play?
•   How flexible and adaptable are they mechanically?
•   What are the shared parts (bolt, nut, washer, spring, bearing) and the unique parts (cogs, levers)?
•   What are the common parts, that is, the common types of parts (nuts & washers)?
•   Where are the parts located?
•   Which parts interact?
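    Shared Folds (of 339)

    [Venn diagrams; region counts listed below]

    E. coli / worm / yeast
        all three                      149
        E. coli and worm only           21
        worm and yeast only             42
        E. coli and yeast only          16
        E. coli only                    43
        worm only                       35
        yeast only                       8

    E. coli / yeast / T. pallidum
        all three                       93
        E. coli and yeast only          73
        E. coli and T. pallidum only    25
        yeast and T. pallidum only       5
        E. coli only                    39
        yeast only                      45
        T. pallidum only                 0

    E. coli / B. sub. / T. pallidum
        all three                      116
        E. coli and B. sub. only        71
        E. coli and T. pallidum only     2
        B. sub. and T. pallidum only     3
        E. coli only                    40
        B. sub. only                    18
        T. pallidum only                 2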
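The region counts in a three-genome "Shared Folds" comparison follow from plain set operations on the folds assigned to each genome. The sketch below assumes each genome is represented as a set of fold identifiers. The function name `venn_regions` and the toy fold labels (`f1` ... `f7`) are hypothetical and stand in for real fold assignments, which are not given on the slides.

```python
from typing import Dict, Set


def venn_regions(a: Set[str], b: Set[str], c: Set[str]) -> Dict[str, int]:
    """Count the folds in each of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy fold sets (hypothetical identifiers, not real fold assignments).
ecoli = {"f1", "f2", "f3", "f4"}
worm = {"f2", "f3", "f5", "f6"}
yeast = {"f3", "f4", "f6", "f7"}

regions = venn_regions(ecoli, worm, yeast)
for name, count in regions.items():
    print(f"{name:14s} {count}")

# The seven region counts shown on the E. coli / worm / yeast slide.
slide_region_counts = [43, 21, 16, 35, 149, 42, 8]
folds_in_any_genome = sum(slide_region_counts)
print(folds_in_any_genome, "of 339")
```

The seven regions are disjoint, so their counts add up to the size of the union. On the slide they sum to 314, which would leave 25 of the 339 folds unobserved in all three genomes, if "of 339" refers to the size of the whole fold library.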
  Patterns of Folds Usage in 8 Genomes

  Genome order within each pattern: EC SC HI SS HP MJ MP MG
  (1 = fold present, . = fold absent; (##) = number of folds with that pattern)

  11111111 (30)   .1...... (23)   1....... (19)   11111.11 (16)   111111.. (16)
  1111.... (09)   11111... (08)   1.1..... (08)   1.111.11 (06)   11...... (06)
  ...1.... (06)   1.11.... (05)   .1.1.... (05)   1.111... (04)   11.1.... (04)
  .1...1.. (04)   ..1..... (04)   111111.1 (03)   1111111. (03)   1111..11 (03)
  1111.1.. (03)   .....1.. (03)   1111.111 (02)
  1.11.1.. (02)   ..111... (02)   .1.11... (02)
  111..... (02)   .11..... (02)   ......1. (02)
                                                                                                                                      111...11
                                                                                                                                      1..1.1..
                                                                                                                                      ....1...
                                                                                                                                                 (02)
                                                                                                                                                 (02)
                                                                                                                                                 (02)
                                                                                                                                                        111.11..
                                                                                                                                                        1.1..1..
                                                                                                                                                        111..111
                                                                                                                                                                   (02)
                                                                                                                                                                   (02)
                                                                                                                                                                   (01)
                                                    111.1.11                (01)                  1.111..1   (01)   1.1111..   (01)   .1.1..11   (01)   .1.11.1.   (01)
                                            super   .11.1..1                (01)                  1....111   (01)   1..111..   (01)   1.1...11   (01)   1.1..11.   (01)




                                                                                                                                                                            32 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu
                            fold     fam.    fold   11....11                (01)                  11.1.1..   (01)   11.11...   (01)   111..1..   (01)   111.1...   (01)
                                                    .11...1.                (01)                  1.....11   (01)   1...11..   (01)   1.1.1...   (01)   ......11   (01)
                                                    ....1..1                (01)                  ...1.1..   (01)   ...11...   (01)   ..1.1...   (01)   .1....1.   (01)
     total in PDB            338      990      25
                                                    1....1..                (01)                  .......1   (01)
     in at least one of
     8 genomes               240      547      23
                                                                                                 120%
     present in this
     many genomes
                                                                                                                                                            superfold



                                                               Fraction of Total Known "Folds"
                       1        60    192       1                                                100%                                                       fold
                       2        32     82       4                                                                                                           family
                       3        23     54       3
                       4        27     53       3                                                80%
                       5        17     50       0
                       6        27     49       3
                       7        24     41       2
                                                                                                 60%
                       8        30     26       7

Sequence
                                                                                                 40%
 Families           Folds
                                     Superfold = fold
   A1                                that allows many                                            20%
   A2
                                     non-homologous
   A3                                                                                             0%
                ("Superfold")        seq. (Thornton)
   A4                                                                                                   0       1       2      3       4     5          6      7        8
   B1
                                                                                                                "Fold" Present in at Least this Many Genomes
    orthologs, homologs, folds, motifs
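The curves in the plot for this slide can be recomputed from the "present in this many genomes" counts in its table. The sketch below is only an illustration: it assumes the y-axis is normalised by the "total in PDB" row, and the names `present_in_exactly`, `total_in_pdb` and `fraction_in_at_least` are made up for this example.

```python
# Counts copied from the slide's table: number of folds, families and
# superfolds present in exactly k of the 8 genomes (k = 1..8).
present_in_exactly = {
    "fold":      [60, 32, 23, 27, 17, 27, 24, 30],
    "family":    [192, 82, 54, 53, 50, 49, 41, 26],
    "superfold": [1, 4, 3, 3, 0, 3, 2, 7],
}
# "total in PDB" row of the same table (assumed to be the denominator).
total_in_pdb = {"fold": 338, "family": 990, "superfold": 25}


def fraction_in_at_least(counts, total):
    """Return [f_1, ..., f_8], f_k = (# present in at least k genomes) / total."""
    fractions = []
    running = 0
    # Accumulate from k = 8 downwards, so running = count present in >= k genomes.
    for n in reversed(counts):
        running += n
        fractions.append(running / total)
    fractions.reverse()
    return fractions


curves = {
    name: fraction_in_at_least(counts, total_in_pdb[name])
    for name, counts in present_in_exactly.items()
}

for name, fractions in curves.items():
    print(name, " ".join(f"{100 * f:.0f}%" for f in fractions))
```

For folds, the first value is 240/338 (about 71%), matching the "in at least one of 8 genomes" row.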
                                                         Whole Genome Trees




        Cluster Trees Grouping Initial
      Genomes on Basis of Shared Folds

[Figure: Fold Tree and “Classic” Tree for 20 Genomes (Cele, Scer, Mthe, Mjan, Phor, Aful, Aaeo, Syne, Cpne, Ctra, Tpal, Bbur, Mpne, Mgen, Rpro, Hpyl, Mtub, Hinf, Ecol, Bsub); scale bar 0.1]

D = S/T
D = shared fold dist. betw. 2 genomes
S = # shared folds
T = total # folds in both
Example (Venn diagram with 20, 10 and 30 folds): D = 10/(20+10+30)
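A minimal sketch of the D = S/T measure defined on this slide, reading T as the size of the union of the two fold sets (which is what the 10/(20+10+30) example implies). The function name and the fold identifiers are hypothetical, chosen only to reproduce the Venn-diagram numbers.

```python
def shared_fold_measure(folds_a, folds_b):
    """D = S / T: S = number of shared folds, T = total folds in both genomes."""
    a, b = set(folds_a), set(folds_b)
    shared = len(a & b)   # S: folds found in both genomes
    total = len(a | b)    # T: folds found in either genome
    if total == 0:
        raise ValueError("both fold sets are empty")
    return shared / total


# Hypothetical fold identifiers reproducing the slide's example:
# 20 folds only in genome A, 10 shared, 30 only in genome B.
genome_a = {f"fold{i}" for i in range(0, 30)}    # 20 unique + 10 shared
genome_b = {f"fold{i}" for i in range(20, 60)}   # 10 shared + 30 unique

d = shared_fold_measure(genome_a, genome_b)
print(f"D = {d:.3f}")
```

Computing D for every pair of genomes gives the matrix from which a tree like the one on this slide could be clustered; the clustering method itself is not stated here.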
Distribution of Folds in Various Classes

[Figure: Fold Tree]

      Unusual distribution of all-beta folds
            Compare with Ortholog
            Occurrence Trees, another
             “partial-proteome” tree
            (based on COGs scheme of Koonin & Lipman, similar approaches by Dujon, Bork, &c.)

[Figure: Fold Tree beside Ortholog Tree]
 Compare with trees
    on spectrum of
 “levels”: single-gene
trees, whole-genome
   composition trees

[Figure panels: Single-gene Trees (TIM, ribosomal protein); Ortholog Tree; “Classic” Tree; Fold Tree; AA & di-NT Composition Trees (S Karlin)]
            Common Folds in Genome, Varies Betw. Genomes

       Depends on comparison method, DB, sfams v folds, &c
       (new top superfamilies via PSI-Blast, intersection of top-10 to
       get shared and common)

Top-10 Worm Folds
                                            num. matches     frac. all
                                            in worm genome   worm dom.    in     in
Fold                               Class    (N)              (F)          EC?    SC?
Ig                                 B        830              1.7%          18      4
Knottins                           SML      565              1.1%           0      3
Protein kinases (cat. core)        MULT     472              0.9%           1    142
C-type lectin-like                 A+B      322              0.6%           0      1
corticoid recep. (DNA-bind dom.)   SML      276              0.5%           1     10
Ligand-bind dom. nuc. receptor     A        257              0.5%           0      0
alpha-alpha superhelix             A        247              0.5%           6    114
C2H2 Zn finger                     SML      239              0.5%           0     78
P-loop NTP Hydrolase               A/B      235              0.5%          72    133
Ferredoxin                         A+B      207              0.4%          83    114

Rank    M. genitalium                  #    B. subtilis                    #    E. coli                        #
1       D P-loop hydrolase            60    D P-loop hydrolase           173    D P-loop hydrolase           191
2       = SAM methyltransferase       16      Rossmann domain            165      Rossmann domain            158
3         Rossmann domain             13      Phosphate-binding barrel    79      Phosphate-binding barrel    64
4         Class I synthetase          12      PLP-transferase             44      PLP-transferase             38
5         Class II synthetase         11      CheY-like domain            36      CheY-like domain            36
6         Nucleic acid binding dom.   11    = SAM methyltransferase       30      Ferredoxins                 35
Total ORFs                           479                                4268                                4268
with Common Superfamilies      105 (22%)                           465 (11%)                           458 (11%)

Rank    M. thermoautotrophicum         #    A. fulgidus                    #
1       D P-loop hydrolase            93    D P-loop hydrolase           118
2         Phosphate-binding barrel    54      Rossmann domain            104
3         Rossmann domains            53      Phosphate-binding barrel    56
4         Ferredoxins                 48      Ferredoxins                 49
5       = SAM methyltransferase       17    = SAM methyltransferase       24
6         PLP-transferases            15      PLP-transferases            18
Total ORFs                          1869                                2409
with Common Superfamilies      252 (14%)                           309 (13%)

Rank    S. cerevisiae                  #    C…
1       D P-loop hydrolase           249    x Pro…
2       x Protein kinase             123    D h…
3         Rossmann domain             90      Liga… N…
4         RNA-binding domain          75      C-…
5       = SAM methyltransferase       63      a… h…
6         Ribonuclease H-like         57      Ig s…
Total ORFs                          6218
with Common Superfamilies       560 (9%)
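The "intersection of top-N lists to get shared and common" step mentioned on this slide can be sketched as a plain set intersection. This is an illustration only: it uses the six ranked superfamilies visible per genome on this slide rather than a full top-10, the superfamily names are normalised by hand (for example "Rossmann domains" is treated as "Rossmann domain"), and the variable names are assumptions.

```python
from collections import Counter

# Top-ranked superfamilies per genome, copied from the slide (ranks 1 to 6),
# with spelling variants normalised by hand (an assumption of this sketch).
top_superfamilies = {
    "M. genitalium": ["P-loop hydrolase", "SAM methyltransferase", "Rossmann domain",
                      "Class I synthetase", "Class II synthetase",
                      "Nucleic acid binding dom."],
    "B. subtilis": ["P-loop hydrolase", "Rossmann domain", "Phosphate-binding barrel",
                    "PLP-transferase", "CheY-like domain", "SAM methyltransferase"],
    "E. coli": ["P-loop hydrolase", "Rossmann domain", "Phosphate-binding barrel",
                "PLP-transferase", "CheY-like domain", "Ferredoxins"],
    "M. thermoautotrophicum": ["P-loop hydrolase", "Phosphate-binding barrel",
                               "Rossmann domain", "Ferredoxins",
                               "SAM methyltransferase", "PLP-transferase"],
    "A. fulgidus": ["P-loop hydrolase", "Rossmann domain", "Phosphate-binding barrel",
                    "Ferredoxins", "SAM methyltransferase", "PLP-transferase"],
    "S. cerevisiae": ["P-loop hydrolase", "Protein kinase", "Rossmann domain",
                      "RNA-binding domain", "SAM methyltransferase",
                      "Ribonuclease H-like"],
}

# Superfamilies that appear in every genome's top list ("shared and common").
common = set.intersection(*(set(names) for names in top_superfamilies.values()))

# How many genomes list each superfamily among their top ranks.
genomes_per_superfamily = Counter(
    name for names in top_superfamilies.values() for name in set(names)
)

print("common to all:", sorted(common))
for name, n in genomes_per_superfamily.most_common():
    print(f"{n} of {len(top_superfamilies)} genomes: {name}")
```

With these six lists only the P-loop hydrolase and Rossmann domain superfamilies survive the intersection; a longer top-10 list, as the slide suggests, could admit more.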
 Common,
Shared Folds:
βαβ structure

336: 42
All share α/β structure with repeated
R.H. βαβ units connecting adjacent
strands or nearly so (18+4+2 of 24)

HI, MJ, SC vs scop 1.32

[Figure: P-loop hydrolase, Flavodoxin-like, Rossmann Fold, Thiamin Binding, TIM-barrel]
What are the most common folds:
 Overall? In plants? In animals?




                                                                                                                                           40 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu
                                                                                   Percent of Sequences




                                                                                                   Eubacteria
                                                                        Families
                                                                                                                Eukaryote




                                                                                           Virus
                                                                                   Total




                                                                                                                        Metazoan
                                                                                                                Plant


                                                                                                                                   Other
                                                                        Number
                                             Fold Name




                      Class
                 Pla nt Top-10                                                                        
                  /b    TIM-barrel                                       29       6       ³        7 20 2 13
                  O           like Ferredoxin                             17       4       2        2 17 ³                         8
                  /b         NTP Hydrolases containing P-loop             9       3       ³        5 3 2                          7
                  O           Protein Kinases (catalytic core)             1       4       3        ³           3        6         6
                  S           Small inhibitors, toxi ns, lectins          14       ³                            3        ³         ³
                  /b         Rossmann Fold (NAD binding)                 11       3       ³        7           3        1         3
                  O           RuBisCO (small subunit)                      1       ³                ³           2                  ³
                  b           like Concanavalin A                          6       ³       ³        ³           2        ³         2
                             like Hydrophobic Seed Protein                2       ³                            2
                  /b         like Ribonuclease H                         15       2       5        1           2        1         5

                 Metazoan Top-10                                                                                        
                 b     like Immunoglobulin                                32       13      ³        1           ³       25 ³
                  O           Protein Kinases (catalytic core)             1       4       3        ³           3        6         6
                             DNA-binding 3-helical bundle                13       3       ³        ³           2        5         ³
                             like Globin                                  3       2                1           ³        4         1
                  S           Classic Zinc Finger                              2   1       ³                             3         1
                  /b         NTP Hydrolases containing P-loop                 9   3       ³        5           3        2         7
                  b           Trypsin-like serine proteases                    4   1       1        ³                    2         ³
                             Cytochrome P450                                  1   1                ³           ³        2         1
                  S           like Glucocort. receptor (DNA-binding)           4   1       ³                             2         ³
                             EF-hand                                          3   1                ³           1        2         1
       At What Structural Resolution
         Are Organisms Different?
Levels, coarse to fine: person / plant; protein fold (Ig); super-secondary structure (bb, TM-TM, bb, ...); helix / strand; individual atom (C, H, O...)
Scale axis: 1 m, 100 Å, 10 Å, 1 Å
Practical Relevance of Structural Genomics
[Figure: fold occurrence rows for (human), folds 1, 2, 3 ... 20 ..., and (T. pallidum), folds 1, 2, 3 ... 15 ...]
(Pathogen-only folds as possible drug targets)

• OspA protein in Lyme-disease spirochete B. burgdorferi, previously identified as the antigen for a vaccine, has a novel fold (C Lawson)
    Know All Folds in a Genome: How are we doing on MG?

• MG smallest genome with 479 ORFs
• Separate PDB Match, TMs, LC (SEG), linkers
• How many residues in genome matched by known folds, in 1975, '76, '77...'00...'50
• The impact of PSI-blast in comparison to pairwise methods
    Two-way PSI-blast gives an improvement (genome vs. PDB, PDB vs. genome)
• Union of many sets of PDB matches finds >40% of a.a. and more than half the ORFs (242/479)
    (Eisenberg, Godzik, Bork, Koonin, Frishman)
• ~65% structurally characterized

[Figure: diagram with labels PDB Match (orig. '97 fasta; 1-way blast; 2-way; all matches), known func., no func., low-qual. TM, LC, link; sig. + link; low cplx.; TM. Prediction quality: Good; Poor or None.]
[Figure: Fraction of the MG Genome (by residue) with Structural Annotation over Time; y-axis 0% to 100%, x-axis years 74 to 98. Series: PDB matches; Good TMs, Low-complexity Regions.]
[Figure: regions along a sequence, labeled 1 4 3 3 2 5 6 1 4 2 4, with Structurally Uncharacterized (186). Key: 1 PDB Match (152); 2 Low Complexity Region (116); 3 TM helix (30); 4 Linker Region (5); 5 Coiled-Coil; 6 All-alpha or All-beta Region.]
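The "union of many sets of PDB matches" figure on this slide can be read as a per-residue coverage calculation. A minimal sketch, assuming each method reports matches as 1-based inclusive residue ranges per ORF; the function names, the input layout, and the toy ORFs are hypothetical and are not taken from the lecture:

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive residue ranges."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def union_coverage(orf_lengths, matches_by_method):
    """Return (fraction of residues matched, fraction of ORFs matched).

    orf_lengths: {orf_id: length in residues}
    matches_by_method: {method_name: {orf_id: [(start, end), ...]}}
    Both layouts are assumptions made for this sketch.
    """
    covered_residues = 0
    matched_orfs = 0
    for orf_id in orf_lengths:
        # Pool the hits from every method, then merge so that a residue
        # found by several methods is counted only once.
        pooled = []
        for hits in matches_by_method.values():
            pooled.extend(hits.get(orf_id, []))
        merged = merge_intervals(pooled)
        covered_residues += sum(end - start + 1 for start, end in merged)
        if merged:
            matched_orfs += 1
    total_residues = sum(orf_lengths.values())
    return covered_residues / total_residues, matched_orfs / len(orf_lengths)


# Toy input (hypothetical values, not MG data).
toy_lengths = {"orfA": 100, "orfB": 200, "orfC": 100}
toy_matches = {
    "pairwise": {"orfA": [(1, 50)]},
    "profile": {"orfA": [(40, 80)], "orfB": [(101, 150)]},
}
residue_fraction, orf_fraction = union_coverage(toy_lengths, toy_matches)
```

Merging before counting is what makes this a union: adding a second method can only raise coverage where it finds residues the first one missed.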
     Know All Folds in Genome: MG Optimistic Prediction

• Just use one pairwise method for matching
• Multiple, big genomes (e.g. SC)

[Figure: Fraction of a.a. in Genome vs. Year, 1987 to 1997; y-axis 25% to 55%. Series: FIT, SC, MJ, HI, MP, MG, EC, SS, HP.]
[Figure: Fraction of a.a. in Genome vs. Year, 1970 to 2050; y-axis 0% to 100%. Series: FIT, SC, MJ, HI, MP, MG, EC, SS, HP.]
[Figure: regions along a sequence, labeled 1 4 3 3 2 5 6 1 4 2 4, with Structurally Uncharacterized (186). Key: 1 PDB Match (152); 2 Low Complexity Region (116); 3 TM helix (30); 4 Linker Region (5); 5 Coiled-Coil; 6 All-alpha or All-beta Region.]
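The second chart on this slide extends the coverage-versus-year trend out to 2050. One simple way to produce such an extrapolation is a least-squares line, clamped to the range 0 to 1. This is only a sketch: the slides do not say how the FIT series was computed, and the sample points below are hypothetical placeholders, not values read from the figure.

```python
def fit_line(years, fractions):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(fractions) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, fractions))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(slope, intercept, year):
    """Predicted fraction for a year, clamped to the valid range [0, 1]."""
    return min(1.0, max(0.0, slope * year + intercept))


# Hypothetical placeholder points (NOT data from the lecture).
sample_years = [1990, 1991, 1992]
sample_fractions = [0.30, 0.32, 0.34]
slope, intercept = fit_line(sample_years, sample_fractions)
predicted_2000 = extrapolate(slope, intercept, 2000)
predicted_2050 = extrapolate(slope, intercept, 2050)
```

A straight line is the most optimistic simple choice, which matches the slide title. A saturating curve would approach full coverage more slowly.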
 TM-helix "prediction"

• TM prediction (KD, GES). Count number with 2 peaks, 3 peaks, &c.
• Divide Predictions into sure and marginal (Boyd & Beckwith's criteria)
• Similar conclusions to others: von Heijne, Rost, Jones, &c.

[Figure: Frequency in Genome (as a fraction of total number of sequences) vs. Number of TM Helices, 1 to 14; y-axis 0% to 25%. Series: Bacteria (HI), Eukaryote (SC), Archaeon (MJ).]
[Figure: Number of Worm ORFs, 0 to 2500, vs. Number of TM helices per ORF, 1 to 19. Series: sure, marginal.]
[Figure: Freq. in worm genome, 0.0% to 3.0%, vs. Min H value, -3.00 to 0.50, with Thresholds separating TM, Marginal, and Soluble.]
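The peak-counting idea above (score hydropathy, count peaks, split them into sure and marginal) can be sketched with a sliding window over the Kyte-Doolittle (KD) scale. Assumptions in this sketch: a 19-residue window, a "sure" cutoff of 1.6 (the commonly cited KD value), and a "marginal" cutoff of 1.0 that is purely illustrative. On KD, higher means more hydrophobic, so the comparison direction may differ from the slide's "Min H value" axis. This is not the lecture's procedure and does not implement Boyd & Beckwith's criteria.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(seq, window=19):
    """Mean KD hydropathy for every full window along the sequence."""
    # Unknown residue codes (e.g. X) are scored 0.0, an arbitrary choice.
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    averages = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        averages.append(total / window)
    return averages


def count_tm_segments(seq, window=19, sure=1.6, marginal=1.0):
    """Label each hydropathy peak as 'sure' or 'marginal'.

    A peak is a maximal run of consecutive windows whose average is at
    least `marginal`; it is 'sure' if its highest window reaches `sure`.
    The `marginal` cutoff is a hypothetical value for illustration.
    """
    averages = window_averages(seq, window)
    labels = []
    i = 0
    while i < len(averages):
        if averages[i] >= marginal:
            peak = averages[i]
            while i < len(averages) and averages[i] >= marginal:
                peak = max(peak, averages[i])
                i += 1
            labels.append("sure" if peak >= sure else "marginal")
        else:
            i += 1
    return labels


# Toy sequence: two long hydrophobic stretches separated by charged residues.
toy_labels = count_tm_segments("D" * 30 + "L" * 20 + "D" * 30 + "L" * 20 + "D" * 30)
```

The length of the returned list is the predicted helix count for one ORF. Tallying that count over all ORFs in a genome gives a histogram of ORFs by number of TM helices.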
                       Comparative Genomics of Membrane Proteins

• Yeast has more mem. prots., esp. 2-TMs
• Similar conclusions to others: von Heijne, Rost, Jones, &c.
• Overall, no strong preference for particular supersecondary structures
    Freq. of number of TM helices follows a Zipf-like law: F = 1/(5n^2)
• In detail, worm has a peak for 7-TMs and E. coli for 12-TMs

[Figure: Frequency in Genome (as a fraction of total number of sequences) vs. Number of TM Helices, 1 to 14; y-axis 0% to 25%. Series: Bacteria (HI), Eukaryote (SC), Archaeon (MJ).]
[Figure: Frequency (as a percentage of total sequences), log scale from 0.1 to 100. Series: FIT, SC, MJ, HI, MP, MG, EC, SS, HP.]
[Figure: Frac. of Genome ORFs. Series: worm, yeast, E. coli.]


                       2.0%


                       0.0%                                                                                                                                                                                                           0.01
                               1   2   3   4   5   6    7   8   9   10                                              11        12       13       14       15         16          17                                                           1                 10                   100
                                                       Number of TM helices                                                                                                                                                                            Number of TM Helices
2º Structure Prediction

• Bulk prediction of 2º struc. in genomes
• Same fraction of α and β (by element, half each)
• Both overall and only for unknown soluble proteins.
• Diff From PDB: 31% helical and 21% strand.
  Not expected since.…..
• Related results: Frishman

Fraction of residues predicted to be in...

  Genome   strand   helix
  EC        17%      39%
  HI        16%      41%
  HP        15%      42%
  MG        17%      39%
  MJ        19%      37%
  MP        17%      39%
  SC        17%      34%
  SS        16%      38%
  Avg       17%      39%
  SD         1%       2%

[Figure key: Structurally Uncharacterized (186); 1 = PDB Match (152); 2 = Low Complexity Region (116); 3 = TM helix (30); 4 = Linker Region (5); 5 = Coiled-Coil; 6 = All-alpha or All-beta Region. Region order shown: 1 4 3 3 2 5 6 1 4 2 4.]
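As a quick arithmetic check on the Avg and SD rows of the prediction table, the sketch below recomputes them from the eight per-genome percentages. It assumes a population standard deviation (`pstdev`); the slide does not say which form was used, and the helper name `summarize` is made up for this example.

```python
from statistics import mean, pstdev

# Percent of residues predicted in each state, copied from the slide's table.
strand = {"EC": 17, "HI": 16, "HP": 15, "MG": 17, "MJ": 19, "MP": 17, "SC": 17, "SS": 16}
helix = {"EC": 39, "HI": 41, "HP": 42, "MG": 39, "MJ": 37, "MP": 39, "SC": 34, "SS": 38}


def summarize(percent_by_genome):
    """Return (average, standard deviation) over genomes."""
    values = list(percent_by_genome.values())
    # Assumption: population SD; the sample SD rounds to the same integers here.
    return mean(values), pstdev(values)


strand_avg, strand_sd = summarize(strand)
helix_avg, helix_sd = summarize(helix)
print(f"strand: avg {strand_avg:.1f}%, SD {strand_sd:.1f}%")
print(f"helix:  avg {helix_avg:.1f}%, SD {helix_sd:.1f}%")
```

Rounded to whole percentages, the results agree with the Avg and SD rows of the table.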
Different Amino Acid Composition Should Give Different 2º Structure

Each a.a. has different propensity for local structure
->
Different Compositions (K from 4.4 in EC to 10.4 in MJ, Q too)
->
Different Local Structure (but compensation?)

Propensities from Regan (beta) and Baldwin (alpha)

               Amino Acid Composition (%)                          Propensity (kcal/mole)
Amino Acid     EC     HI     SS     SC     HP     MP     MG     MJ     TM-hlx   helix   strand
K              4.4    6.3    4.2    7.3    8.9    8.6    9.5   10.4      8.8    -1.5    -0.4
C              1.2    1.0    1.0    1.3    1.1    0.8    0.8    1.3     -2      -1.1    -0.8
R              5.5    4.5    5.1    4.5    3.5    3.5    3.1    3.8     12.3    -1.9    -0.4
N              4.0    4.9    4.0    6.1    5.9    6.2    7.5    5.3      4.8    -1      -0.5
Q              4.4    4.6    5.6    3.9    3.7    5.4    4.7    1.5      4.1    -1.3    -0.4
A              9.5    8.2    8.5    5.5    6.8    6.7    5.6    5.5     -1.6    -1.9     0
I              6.0    7.1    6.3    6.6    7.2    6.6    8.2   10.5     -3.1    -1.2    -1.3
H              2.3    2.1    1.9    2.2    2.1    1.8    1.6    1.4      3      -1.1    -0.4
S              5.8    5.8    5.8    9.0    6.8    6.5    6.6    4.5     -0.6    -1.1    -0.9
M              2.8    2.4    2.0    2.1    2.2    1.6    1.5    2.2     -3.4    -1.4    -0.9
P              4.4    3.7    5.1    4.3    3.3    3.5    3.0    3.4      0.2     3      >3.0
G              7.4    6.6    7.4    5.0    5.8    5.5    4.6    6.3     -1       0       1.2
F              3.9    4.5    4.0    4.5    5.4    5.6    6.1    4.2     -3.7    -1      -1.1
E              5.7    6.5    6.0    6.5    6.9    5.7    5.7    8.7      8.2    -1.2    -0.2
Y              2.9    3.1    2.9    3.4    3.7    3.2    3.2    4.4      0.7    -1.2    -1.6
V              7.1    6.7    6.7    5.6    5.6    6.5    6.1    6.9     -2.6    -0.8    -0.9
T              5.4    5.2    5.5    5.9    4.4    6.0    5.4    4.0     -1.2    -0.6    -1.4
D              5.1    5.0    5.0    5.8    4.8    5.0    4.9    5.5      9.2    -1       0.9
L             10.6   10.5   11.4    9.6   11.2   10.3   10.7    9.5     -2.8    -1.6    -0.5
W              1.5    1.1    1.6    1.0    0.7    1.2    1.0    0.7     -1.9    -1.1    -1

total propensity
α             -1.00  -1.02  -0.96  -1.00  -1.05  -1.03  -1.05  -1.01
β             -0.27  -0.33  -0.26  -0.36  -0.37  -0.38  -0.42  -0.36
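The "total propensity" rows look like composition-weighted averages of the per-residue propensities. That reading is an assumption, not something the slide states. The minimal sketch below applies it to the EC column and the helix propensities from the table; the function name `total_propensity` is hypothetical.

```python
# E. coli (EC) composition in percent and helix propensities in kcal/mole,
# copied from the table above.
EC_COMPOSITION = {
    "K": 4.4, "C": 1.2, "R": 5.5, "N": 4.0, "Q": 4.4, "A": 9.5, "I": 6.0,
    "H": 2.3, "S": 5.8, "M": 2.8, "P": 4.4, "G": 7.4, "F": 3.9, "E": 5.7,
    "Y": 2.9, "V": 7.1, "T": 5.4, "D": 5.1, "L": 10.6, "W": 1.5,
}
HELIX_PROPENSITY = {
    "K": -1.5, "C": -1.1, "R": -1.9, "N": -1.0, "Q": -1.3, "A": -1.9, "I": -1.2,
    "H": -1.1, "S": -1.1, "M": -1.4, "P": 3.0, "G": 0.0, "F": -1.0, "E": -1.2,
    "Y": -1.2, "V": -0.8, "T": -0.6, "D": -1.0, "L": -1.6, "W": -1.1,
}


def total_propensity(composition, propensity):
    """Composition-weighted mean propensity (assumed definition of the total)."""
    total = sum(composition.values())
    weighted = sum(composition[aa] * propensity[aa] for aa in composition)
    return weighted / total


ec_helix_total = total_propensity(EC_COMPOSITION, HELIX_PROPENSITY)
print(f"EC total helix propensity: {ec_helix_total:.2f} kcal/mole")
```

Under this assumption the EC helix value comes out close to the tabulated -1.00, which supports the weighted-average reading. The strand column would need a choice for proline, whose propensity is only given as a bound (>3.0).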
Supersecondary structure words

• Look at super-secondary patterns (“words” such as αα or ββ) in predictions
• Compare observed freq. with expected freq.
    odds = f(αβ)/f(α)f(β)
  (Freq. Words, Karlin)
• Do have differences between genomes (and PDB) here
    HI more αα, ααα, αααα ...
    SC more ββ, βββ, βββββ...
    MJ more ββ, ββ …

Super-Secondary     Maximum Difference      Relative Abundance (Odds Ratio)
Structure "Word"    between 3 Genomes       HI       MJ       SC       PDB

ββ                       26%               0.96     1.06     1.24     1.22
αα                       15%               0.97     0.85     0.83     0.85
β                        10%               1.09     1.09     0.99     0.95
β                         7%               0.98     1.00     0.93     0.99

βββ                      41%               0.96     1.15     1.46     1.62
ααα                      19%               1.01     0.83     0.84     0.92
β                        18%               1.04     1.03     0.87     1.16
β                        15%               1.03     0.97     0.89     0.70
ββ                       12%               1.15     1.24     1.10     1.19
β                        11%               0.93     0.87     0.83     0.78
ββ                        9%               0.90     0.94     0.99     0.82
ββ                        6%               0.97     0.98     1.03     0.80

ββββ                     54%               1.03     1.35     1.78     2.28
αααα                     29%               1.10     0.82     0.89     1.18
βββ                      25%               0.85     0.94     1.10     0.98
ββ                       23%               1.11     1.18     0.94     1.48
ββ                       21%               1.21     1.23     0.99     1.39
β                        21%               1.00     0.95     0.81     1.00
…                         …                  …        …        …        …
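The observed-versus-expected comparison can be sketched as follows, assuming the predictions have been reduced to a string of secondary-structure elements written as "a" (helix) and "b" (strand). The function name `word_odds`, the input encoding, and the use of overlapping windows are all assumptions made for this example.

```python
from collections import Counter


def word_odds(elements, word):
    """Odds ratio of a word: observed frequency / product of element frequencies.

    `elements` is a string of element labels (e.g. "abbab"); `word` is the
    pattern to score (e.g. "ab"). Overlapping windows are counted.
    """
    n, k = len(elements), len(word)
    if k == 0 or n < k:
        raise ValueError("element string is shorter than the word")
    windows = [elements[i:i + k] for i in range(n - k + 1)]
    observed = windows.count(word) / len(windows)

    single = Counter(elements)
    expected = 1.0
    for label in word:
        expected *= single[label] / n  # f(a), f(b), ... multiplied together
    if expected == 0:
        raise ValueError("word uses an element absent from the string")
    return observed / expected


alternating = "ababab"
print(word_odds(alternating, "ab"))
print(word_odds(alternating, "aa"))
```

An odds ratio above 1 means the word occurs more often than its element frequencies alone would predict; below 1 means it is depleted.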
An Issue with Fold Counting: Biases in the Databanks

• Over-representation of certain species and functions in the databanks (e.g. human v. plant globins, Ig’s)
     • Nevertheless HI top-10 like eubacterial top-10
• PDB small, biased sample of genome (6-12%)
• Diff. numbers with diff. comparison sensitivity
     • FASTA, HMM, &c
     • Some Correction with Seq. Weighting, Diff. Sampling
     • Uniform sampling is better than high sensitivity for some and low for others (PSI-BLAST problem)
     • Better to avoid FPs than FNs for Venn


Example                                                     Percentage of     Rank in
Structure   Fold                                            known folds       eubacterial
(PDB)       Name                                            in genome         Top-10

Top-10 in a bacterial genome (H. influenzae)
2HSD-A      Rossmann Fold (NAD binding)                         9.6               1
1AKE-A      NTP Hydrolases containing P-loop                    5.7               3
1RCF        Flavodoxin-like                                     5.1               4
6TIM-B      TIM-barrel                                          4.5               2
1FXD        Ferredoxin-like                                     4.2               5
2RN2        like Ribonuclease H                                 3.0              16
1SBP        like Periplasmic binding protein (class II)         3.0              11
2DRI        like Periplasmic binding protein (class I)          3.0              19
1SRY-*      Class II aaRS and biotin synthetases                2.7              50
1PYP        OB-fold                                             2.7               9




                                                                                        Same Issues with
                                                                                        Real US Census!!
                                                                                           Sampling
Using a Tree to Correct for Biases

• Databank has biases.
• Assuming "fair" distribution spreads sequences uniformly through "space", want to weight sequences:
     • over-represented, down (mammal)
     • under-represented, up (plant & NV)
• Weights derived from a tree
     • Length of an unshared branch is allotted directly to sequence
     • Length of a shared branch is divided proportionally among sequences

Other schemes (Argos, Sander)

[Figure: tree over sequences A, B, C, D on a vertical scale from 0% to 100%, with branch points labelled x, y, z and 1, 2, 3. Weights shown beneath the leaves: A = .8, B = .8, C = 1.1, D = 1.4.]
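The two branch-length rules above can be sketched as a recursive pass over a tree. The nested-list tree encoding, the function name `tree_weights`, and the reading of "proportionally" as "in proportion to the weight each sequence has accumulated so far" are assumptions made for this example, and the toy tree is not the one in the figure.

```python
def tree_weights(node, branch_length=0.0):
    """Return {sequence_name: weight} for the subtree rooted at `node`.

    A leaf is a string; an internal node is a list of (child, branch_length)
    pairs. `branch_length` is the length of the branch leading down to `node`.
    """
    if isinstance(node, str):
        # Unshared branch: its whole length goes to the one sequence below it.
        return {node: branch_length}

    weights = {}
    for child, length in node:
        weights.update(tree_weights(child, length))

    if branch_length:
        total = sum(weights.values())
        for name in weights:
            # Shared branch: split among the sequences below it, in proportion
            # to their current weights (equally if all are still zero).
            share = weights[name] / total if total else 1.0 / len(weights)
            weights[name] += branch_length * share
    return weights


# Toy tree: A and B are close relatives; C sits alone on a long branch.
toy_tree = [([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)]
toy_weights = tree_weights(toy_tree)
print(toy_weights)
```

In the toy tree the isolated sequence C ends up with more weight than either of the two close relatives, which is the down-weighting of over-represented groups described above. The weights sum to the total branch length, so they can be rescaled to any convenient mean.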
Different Perspectives on Protein Thermostability

In depth focus on single molecule vs. broad view of many (all?) proteins.
Anecdotal vs. Comprehensive (the genomic perspective)

[Figure panels: Ion pairs in GluDHs; Change in entropy of unfolded state in engineering of TLP (disulfides)]
Thermostability: Analyzing a few Factors with Genome Comparison

[Figure inset: tertiary (EK) and local (DK) charge pairs, marked _ _ + + over the example segment]
CEEEEHHHHHHHHHCCEEEEEEEEECC
CMEAPAGNIDIIKAGMKSPVQLTVKNDT

Organism                                                          Category                     Genome         # of       Physiological
                                                                                               Abbreviation   Proteins   condition
Pyrococcus horikoshii (Strain OT3) (Kawarabayasi et al., 1998)    archaea                      OT             2061       98 °C, anaerobe
Aquifex aeolicus (Deckert et al., 1998)                           eubacteria, gram negative    AA             1522       95 °C
Methanococcus jannaschii (Bult et al., 1996)                      archaea                      MJ             1735       85 °C, anaerobe
Archaeoglobus fulgidus (Klenk et al., 1997)                       archaea                      AF             2409       83 °C, anaerobe
Methanobacterium thermoautotrophicum (Smith et al., 1997)         archaea                      MT             1869       65 °C, anaerobe
Haemophilus influenzae (Fleischmann et al., 1995)                 eubacteria, gram negative    HI             1680       mesophilic temp.
Mycoplasma genitalium (Fraser et al., 1995)                       eubacteria, gram positive    MG              470       mesophilic temp.
Mycoplasma pneumoniae (Himmelreich et al., 1996)                  eubacteria, gram positive    MP              677       mesophilic temp.
Helicobacter pylori (Tomb et al., 1997)                           eubacteria, gram negative    HP             1590       mesophilic temp.
Escherichia coli (Blattner et al., 1997)                          eubacteria, gram negative    EC             4288       mesophilic temp.
Synechocystis sp. (Kaneko et al., 1996)                           cyanobacteria                SS             3168       mesophilic temp.
Saccharomyces cerevisiae (Goffeau et al., 1997)                   eukaryote, fungus            SC             6218       mesophilic temp.
Composition Analysis of the Proteome
More Charged Residues in Thermophiles, Suggestive of Salt Bridges

[Figure: charged-residue (+/-) composition for Mesophile vs. Thermophile, shown for the whole genome and for predicted helices in all ORFs]
1-4 Spacing of Charged Residues More than Expected in Thermophile Helices -> Salt Bridges

Quantify with LOD score
LOD = log (observed/expected)

For inst.,
expected[EK(4)] ~ f(E)*f(K)

LOD > 0, greater than expected

[Figure: LOD value (0.00 to 0.70) for EK(3) and EK(4) in each genome, ordered by physiological temperature in °C. Mesophile (10 to 45): MP, MG, EC, SC, HP, SS, HI. Thermophile: MT (65), MJ (85), AF (83), AA (95), OT (98).]

[Figure inset: tertiary (EK) and local (DK) charge pairs, marked _ _ + + over the example segment]
CEEEEHHHHHHHHHCCEEEEEEEEECC
CMEAPAGNIDIIKAGMKSPVQLTVKNDT
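A minimal sketch of the LOD calculation for a spaced residue pair such as EK(4), assuming the input is a list of helical segments given as one-letter sequences. The function name `lod_pair`, the natural logarithm, and counting only the E-then-K order are assumptions; the slide fixes none of them.

```python
import math


def lod_pair(sequences, first, second, spacing):
    """LOD = log(observed / expected) for `first` at i and `second` at i + spacing.

    observed: fraction of positions i (with i + spacing inside the sequence)
              that carry the pair.
    expected: f(first) * f(second), from overall residue frequencies.
    """
    residues = n_first = n_second = pairs = windows = 0
    for seq in sequences:
        residues += len(seq)
        n_first += seq.count(first)
        n_second += seq.count(second)
        for i in range(len(seq) - spacing):
            windows += 1
            if seq[i] == first and seq[i + spacing] == second:
                pairs += 1
    if not (windows and pairs and n_first and n_second):
        raise ValueError("pair never observed; LOD is undefined")
    observed = pairs / windows
    expected = (n_first / residues) * (n_second / residues)
    return math.log(observed / expected)  # assumption: natural log


# Made-up helical segment with E and K placed four residues apart.
example_lod = lod_pair(["EAAAKEAAAK"], "E", "K", 4)
print(f"LOD for EK(4): {example_lod:.2f}")
```

A positive value means the spaced pair occurs more often than the composition alone predicts, which is the signal the slide reports for thermophile helices.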
Sequence Length Doesn’t Completely Relate to Thermostability

[Figure: Frequency (as fraction of total sequences) (%), 0% to 16%, vs. Length. Series: mesophile, thermophile, mesophilic cog, thermophilic cog.]

Simple distributions of sequence length have thermophiles shorter (Eisenberg)

But this neglects special case of AA (eubacterial thermophile): archaeal sequences shorter
Controlling for Biases: Stratified Sample

Correct for duplications, repeats, unique families;
Extend COGs to get 52 ortholog families
(COGs, Lipman, Koonin)

[Figure: Stratified Sampling based on COGs. Panel ALL: Meso, MT, AF, OT, AA, MJ. Panel Ortho.: Meso, AA, MT, AF, OT, MJ.]
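One way to read the stratified comparison: treat each ortholog family as a stratum, average a statistic within the family for each group, and then average over families present in every group, so that duplications and group-specific families do not dominate. The sketch below follows that reading. The record layout, the function name `stratified_means`, and the family labels and numbers are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def stratified_means(records):
    """Per-group mean of a statistic, with each shared family counted once.

    `records` is an iterable of (family_id, group, value) tuples, e.g. a
    sequence length for one protein of one ortholog family.
    """
    by_family = defaultdict(lambda: defaultdict(list))
    groups = set()
    for family, group, value in records:
        by_family[family][group].append(value)
        groups.add(group)

    # Keep only families represented in every group (drops unique families).
    shared = [members for members in by_family.values() if set(members) == groups]
    if not shared:
        raise ValueError("no family is present in every group")

    # Average within a family first (absorbs duplications), then across families.
    return {g: mean([mean(members[g]) for members in shared]) for g in sorted(groups)}


# Invented lengths: FAM1 has a duplicated mesophile member, FAM3 is mesophile-only.
toy_records = [
    ("FAM1", "thermo", 300), ("FAM1", "meso", 320), ("FAM1", "meso", 340),
    ("FAM2", "thermo", 100), ("FAM2", "meso", 120),
    ("FAM3", "meso", 900),
]
toy_result = stratified_means(toy_records)
print(toy_result)
```

Here the mesophile-only family and the duplicated member no longer skew the comparison, whereas a plain average over all mesophile records would be pulled up by the 900-residue outlier.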
Controls II: Known Structures, Random Genomes

 Random Sampling: Make up random thermo. and meso. genomes and see
 what the distribution of each statistic is.

 [Figure: panels Original, Clustered, Skewed Composition, Uniform Composition; Meso. vs. Therm.; LOD]

 3D Structures

 For orthologs of known structure: map tertiary salt bridges onto the
 multiple alignment and look at conservation in Therm. vs. Meso.

                                      Therm.   Meso.
                                      Avg.     Avg.
  COG   Cat.                 PDB      SB       SB      Diff.
   49   J     ribosomal      1rss     5.6      3.1      3      +
   80   J     ribosomal      1aci     0.8      0.7      0.1
   81   J     ribosomal      1ad2     6.4      4.3      2.1    +
   91   J     ribosomal      1bxe     1.8      0.9      0.9
   93   J     ribosomal      1whi     3        1.9      1.1    +
   96   J     ribosomal      1sei     2        2.1     -0.1
   98   J     ribosomal      1pkp     0.6      1.7     -1.1    -
  184   J     ribosomal      1a32     1.8      1.9     -0.1
  186   J     ribosomal      1rip     0.4      0.9     -0.5
   16   J     synthetase     1pys     7.6      2.6      5      +
  124   J     synthetase     1ady     9.6      6.1      3.5    +
  162   J     synthetase     2ts1     3.8      3.3      0.5
   30   J     other          1yub     5        5.3     -0.3
  125   F     other          1tmk     0.8      0.4      0.4
  149   C     other          1btm     3        4.3     -1.3    -
  541   N     other          1fts     3.6      3.4      0.2
  112   E     other          1cj0     6.2      4.6      1.6    +
  552   N     other          1ffh     4.2      4.6     -0.4
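
The random-genome control can be sketched as a simple label-shuffling experiment. This is only an illustration under stated assumptions: the function names, the use of a difference of means as the statistic, and the two-sided empirical p-value are choices made here, not details given on the slide.

```python
import random
from statistics import mean

def random_genome_differences(values, n_thermo, n_trials=1000, seed=0):
    """Repeatedly deal the pooled proteins into a random 'thermo' genome of
    size n_thermo and a random 'meso' genome with the rest; return the
    difference of means for each trial."""
    rng = random.Random(seed)
    pool = list(values)
    diffs = []
    for _ in range(n_trials):
        rng.shuffle(pool)
        diffs.append(mean(pool[:n_thermo]) - mean(pool[n_thermo:]))
    return diffs

def empirical_p_value(observed, null_diffs):
    """Fraction of random trials at least as extreme as the observed
    difference (two-sided, with the usual +1 correction)."""
    extreme = sum(1 for d in null_diffs if abs(d) >= abs(observed))
    return (extreme + 1) / (len(null_diffs) + 1)

# Hypothetical per-protein statistic for two small groups.
thermo = [5.0, 6.5, 7.0, 8.0, 9.5]
meso = [3.0, 3.5, 4.0, 4.5, 5.0]
observed = mean(thermo) - mean(meso)
null = random_genome_differences(thermo + meso, n_thermo=len(thermo))
p = empirical_p_value(observed, null)
```

If the observed thermophile-mesophile difference sits far in the tail of the random-genome distribution, it is unlikely to be an artifact of how the proteins happened to be sampled.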
                                                         End of class on 11.27




How Representative are the Known Structures of the
Proteins in a Complete Genome? The Issue of Bias

Assess 2º, TM predictions
  (+) comprehensive, statistical
  (-) predictions inaccurate (~65%)
  (-) extrapolate from PDB (esp. TM), domain problem

Is prediction (extrapolation) based on known structures justified?

Length: Genome sequences are longer than those in Known Structures
  340 aa for avg. genome seq. (470 aa for yeast)
  205 aa for PDB chain
  ~160 aa for PDB domain

[Figure: sequence length distributions, one curve per genome; legend: FIT, SC, MJ, HI, MP, MG, EC, SS, HP; x-axis: Length; y-axis: Frequency (as fraction of total sequences), 0% to 12%]

[Figure: sequence length distributions; legend: genomes, PDB domains, whole chains; x-axis: Length; y-axis: Frequency (as fraction of total sequences), 0% to 30%]
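
A length comparison of this kind can be sketched as below. The bin width and the toy length lists are hypothetical; the code only shows how a histogram expressed as a fraction of total sequences, and an average length, would be computed for each set.

```python
from collections import Counter

def length_distribution(lengths, bin_width=50):
    """Return {bin_start: fraction of sequences whose length falls in the bin}."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    total = len(lengths)
    return {start: counts[start] / total for start in sorted(counts)}

def mean_length(lengths):
    """Average sequence length in residues."""
    return sum(lengths) / len(lengths)

# Hypothetical toy lengths (aa) for a genome and for a set of PDB chains.
genome_lengths = [120, 340, 560, 340]
pdb_chain_lengths = [90, 160, 205, 160]

genome_hist = length_distribution(genome_lengths)
pdb_hist = length_distribution(pdb_chain_lengths)
longer_in_genome = mean_length(genome_lengths) > mean_length(pdb_chain_lengths)
```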
Amino Acid Composition: How Representative are the Known
Structures of the Proteins in a Complete Genome?

Name   Soluble PDB  =  all-β   +   all-α
A        8.40%         6.8%        9.2%
C        1.72%         1.6%        1.4%
D        5.91%         5.9%        5.8%
E        6.29%         5.2%        7.3%
F        3.94%         4.2%        4.2%
G        7.79%         8.4%        6.4%
H        2.19%         2.1%        2.2%
I        5.54%         5.4%        5.1%
K        6.02%         5.6%        6.5%
L        8.37%         7.3%        9.6%
M        2.15%         1.7%        2.4%
N        4.57%         5.3%        4.4%
P        4.70%         5.1%        4.4%
Q        3.73%         3.5%        4.2%
R        4.78%         4.2%        5.4%
S        5.97%         7.2%        5.7%
T        5.87%         7.2%        5.2%
V        6.96%         7.6%        5.7%
W        1.46%         1.7%        1.5%
Y        3.64%         3.8%        3.5%

ABS.       K     I     C     Q     W     N     F     L     G     A     P     S     R     H     M     E     D     T     Y     V
EC        4.4   6.0   1.2   4.4   1.5   4.0   3.9  10.6   7.4   9.5   4.4   5.8   5.5   2.3   2.8   5.7   5.1   5.4   2.9   7.1
HI        6.3   7.1   1.0   4.6   1.1   4.9   4.5  10.5   6.6   8.2   3.7   5.8   4.5   2.1   2.4   6.5   5.0   5.2   3.1   6.7
SS        4.2   6.3   1.0   5.6   1.6   4.0   4.0  11.4   7.4   8.5   5.1   5.8   5.1   1.9   2.0   6.0   5.0   5.5   2.9   6.7
SC        7.3   6.6   1.3   3.9   1.0   6.1   4.5   9.6   5.0   5.5   4.3   9.0   4.5   2.2   2.1   6.5   5.8   5.9   3.4   5.6
HP        8.9   7.2   1.1   3.7   0.7   5.9   5.4  11.2   5.8   6.8   3.3   6.8   3.5   2.1   2.2   6.9   4.8   4.4   3.7   5.6
MP        8.6   6.6   0.8   5.4   1.2   6.2   5.6  10.3   5.5   6.7   3.5   6.5   3.5   1.8   1.6   5.7   5.0   6.0   3.2   6.5
MG        9.5   8.2   0.8   4.7   1.0   7.5   6.1  10.7   4.6   5.6   3.0   6.6   3.1   1.6   1.5   5.7   4.9   5.4   3.2   6.1
MJ       10.4  10.5   1.3   1.5   0.7   5.3   4.2   9.5   6.3   5.5   3.4   4.5   3.8   1.4   2.2   8.7   5.5   4.0   4.4   6.9
AVG       7.5   7.3   1.1   4.2   1.1   5.5   4.8  10.5   6.1   7.0   3.8   6.4   4.2   1.9   2.1   6.5   5.1   5.2   3.3   6.4
SD        2.3   1.4   0.2   1.3   0.3   1.2   0.8   0.7   1.0   1.5   0.7   1.3   0.9   0.3   0.4   1.0   0.3   0.7   0.5   0.6

Diff.  rms    K     I     C     Q     W     N     F     L     G     A     P     S     R     H     M     E     D     T     Y     V
EC      16   -25     8   -29    19     7   -15    -2    28    -6    13    -5    -3    16     3    28    -7   -14    -7   -22     1
HI      17     8    27   -38    24   -21     6    12    26   -15    -2   -20    -2    -6    -7    10     5   -17   -11   -14    -4
SS      20   -29    13   -39    49     9   -13     1    37    -6     1    11    -3     6   -15    -8    -2   -16    -6   -20    -4
SC      21    24    18   -21     5   -27    31    14    15   -36   -34    -7    51    -7    -2    -4     5    -4     0    -8   -20
HP      27    52    29   -34     0   -51    27    36    34   -26   -18   -29    14   -28    -4     2    11   -20   -25     1   -20
MP      28    45    18   -55    44   -17    35    41    24   -29   -20   -25     8   -27   -18   -28    -8   -17     2   -11    -7
MG      36    61    48   -50    27   -32    62    53    28   -41   -33   -36    11   -35   -28   -30    -8   -18    -8   -11   -12
MJ      38    77    88   -23   -61   -49    14     6    14   -19   -35   -28   -25   -20   -35     1    40    -8   -31    20    -2
AVG           26    31   -36    13   -23    19    20    26   -22   -16   -17     6   -13   -13    -4     4   -14   -11    -8    -9
RMS           45    39    38    35    31    30    28    27    25    24    23    21    21    18    18    16    15    15    15    11
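
One reading of the Diff. and rms columns is a relative difference in percent against the PDB composition, summarized per genome as a root mean square over the 20 amino acids. That interpretation is an assumption (the slide does not define the columns), but the EC row is consistent with it, as the sketch below shows. Function names are illustrative.

```python
import math

def relative_diff_percent(genome, reference):
    """Per-amino-acid relative difference, in percent, of a genome's
    composition from a reference composition (assumed definition of Diff.)."""
    return {aa: 100.0 * (genome[aa] - reference[aa]) / reference[aa]
            for aa in reference}

def rms(values):
    """Root mean square of a sequence of numbers."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))

# Diff. row for EC as printed on the slide, in column order K I C Q W N F L G A P S R H M E D T Y V.
ec_diff = [-25, 8, -29, 19, 7, -15, -2, 28, -6, 13,
           -5, -3, 16, 3, 28, -7, -14, -7, -22, 1]
ec_rms = rms(ec_diff)  # rounds to the 16 listed in the rms column for EC
```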
Composition of Different Regions of Genomes

 [Figure: example sequence divided into numbered regions (1 4 3 3 2 5 6 1 4 2 4), with the remainder labeled Structurally Uncharacterized (186)]
   1  PDB Match (152)
   2  Low Complexity Region (116)
   3  TM helix (30)
   4  Linker Region (5)
   5  Coiled-Coil
   6  All-alpha or All-beta Region

• Are composition differences uniform?
• Resampling
• Non-globular regions differ most in occurrence and composition
• Remove Repetitive Regions (SEG)

Statistics for Amino Acids
                            AVG      SD       EC        HI       HP       MG       MJ       MP       SC        SS
 Total Number              775998          1358465   505279   500616   170400   497968   237905  2900670   1033450

 Fraction Masked by...
  PDB Match                 8.7%    3.7%    11.1%    13.7%     8.8%    12.9%     7.1%     9.7%     6.2%      9.0%
  Non-globular Region      21.7%    6.9%    16.7%    13.9%    22.2%    28.2%    35.1%    24.7%    23.9%     20.5%
  TM-helix                  4.9%    1.4%     7.3%     6.1%     4.8%     3.8%     2.9%     4.5%     5.2%      5.9%
  Linker Region             5.1%    0.4%     5.3%     4.8%     4.8%     5.0%     5.0%     5.2%     4.6%      5.1%
 Fraction Remaining
  Uncharacterized          59.7%    8.9%    59.6%    61.5%    59.4%    50.2%    49.9%    55.8%    60.0%     59.6%

                            AVG      SD       EC        HI       HP       MG       MJ       MP       SC        SS
  Overall                    23%     10%      16%      17%      27%      36%      38%      28%      21%       20%
  PDB Match                  18%      9%      12%      14%      24%      27%      34%      20%      12%       15%
  Non Globular Region        36%     13%      32%      33%      39%      50%      52%      40%      42%       35%
  TM-helix                   49%     15%      55%      53%      55%      57%      55%      56%      56%       51%
  Linker Region              27%     10%      22%      24%      29%      39%      33%      35%      21%       25%
  Uncharacterized Region     23%      6%      15%      17%      26%      34%      32%      27%      20%       19%
Biophysical Proteins
Proteins that inform our view of the folding process -- as compared to the PDB.
• Shorter (116 v 161)
• Fewer hydrophobes

PDB    Select    length  class  name
1sty   -            137  β      Staph nuclease
1cgp   a:9-137      129  β      CAP
1bgh   -             85  β      Gene V protein
1pht   -             83  β      SH3 domain
1tpf   a:           250  α/β    TIM
1wsy   a:           248  α/β    Trp Synthase
8dfr   -            186  α/β    DHFR
2rn2   -            155  α/β    Ribonuclease H
1brs   d:            87  α/β    Barstar
1gbs   -            185  α+β    Hen Lysozyme
119l   -            162  α+β    T4 lysozyme
193l   -            129  α+β    alpha-Lactalbumin
7rsa   -            124  α+β    RNAse A
1brn   l:           108  α+β    Barnase
1fkd   -            107  α+β    FK506
9rnt   -            104  α+β    RNAse T1
1sha   a:           103  α+β    SH2 domain
1ubi   -             76  α+β    Ubiquitin
1cse   i:            63  α+β    CI-2 inhibitor
1igd   -             61  α+β    B1 domain
1mbd   -            153  α      Globin
1hrc   -            105  α      Cytochrome c
2wrp   r:           104  α      Trp Repressor
1lli   a:            89  α      Lambda Repressor
1cop   d:            66  α      Cro Repressor
1rpo   -             61  α      ROP
1myk   a:            47  α      Arc Repressor
2zta   a:            31  α      GCN4 zipper
1btl   -            263  M      beta-Lactamase
1bpi   -             58  S      BPTI
AVG                 116

Name   Hydroph./Polar   Soluble PDB (PS)   biophys. proteins (BP)   Rel. Diff. (BP/PS -1)
P      H                4.7%               3.7%                     -21%
F      H                4.0%               3.2%                     -19%
M      H                2.1%               1.8%                     -16%
D      P                6.0%               5.1%                     -16%
V      H                7.0%               6.2%                     -12%
C      H                1.7%               1.5%                      -9%
S      P                6.0%               5.7%                      -5%
G      .                7.8%               7.7%                      -1%
I      H                5.6%               5.5%                      -1%
N      P                4.6%               4.6%                       0%
W      H                1.4%               1.5%                       1%
T      P                5.8%               6.0%                       2%
L      H                8.4%               8.7%                       5%
A      .                8.4%               8.8%                       6%
Y      .                3.7%               3.9%                       6%
H      P                2.2%               2.4%                       6%
Q      P                3.7%               4.0%                       6%
R      P                4.8%               5.2%                       9%
E      P                6.2%               7.0%                      13%
K      P                5.9%               7.7%                      30%

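
The Rel. Diff. column is labelled BP/PS -1, i.e. the composition in the biophysical proteins relative to the soluble PDB. A minimal sketch of that calculation follows, assuming the rounded percentages printed in the table as input; because of that rounding, a result can differ by about a point from the value on the slide. The function name `rel_diff` is illustrative and not from the original.

```python
def rel_diff(ps_percent: float, bp_percent: float) -> float:
    """Relative difference BP/PS - 1, returned as a fraction."""
    return bp_percent / ps_percent - 1.0


# A few rows copied from the table: residue -> (PS %, BP %)
rows = {"P": (4.7, 3.7), "V": (7.0, 6.2), "L": (8.4, 8.7), "K": (5.9, 7.7)}

for residue, (ps, bp) in rows.items():
    # Negative values mean the residue is rarer in the biophysical proteins.
    print(f"{residue}: {rel_diff(ps, bp):+.1%}")
```

Negative values (P, V) mark residues depleted in the biophysical set, positive values (L, K) mark enriched ones.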
nesg.org
G Montelione

Finding Unusual Proteins for Expt. Structural Genomics

• Prospective Target Selection
• Identify Proteins in M. genitalium that are most atypical structurally (hardest)
• Characterize biophysically by CD (do they fold normally?)
L Regan

M. gen. ORFs (from TIGR): 483 (468, 479)
    PDB match, TM-region or low-complexity: 281
    Structurally Uncharacterized: 202
        Func. Annotated (TIGR, NCBI): 132
        No Functional Annotation: 70
            NOT Full-length: 47
            Full-length Domains: 23
                Clonable from homologs in another organism: 11
                    E. coli: 4+2
                    B. subtilis: 4+1
                No homologs: 12
                    Clonable (no Trp's): 3
                    Not Clonable: 9

Tracking Database

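
The target-selection counts can be checked for internal consistency: each group should equal the sum of the groups it splits into. The sketch below assumes the parent/child reading of the flowchart given in the comments, which is an interpretation of the flattened slide layout; the dictionary keys and the helper `inconsistent_splits` are illustrative names.

```python
# Counts read off the flowchart: name -> (total, sizes of the groups it splits into).
funnel = {
    "M. gen. ORFs": (483, [281, 202]),             # characterized vs. structurally uncharacterized
    "Structurally uncharacterized": (202, [132, 70]),  # annotated vs. no functional annotation
    "No functional annotation": (70, [23, 47]),    # full-length vs. not full-length
    "Full-length domains": (23, [11, 12]),         # homologs elsewhere vs. no homologs
    "Clonable from homologs": (11, [4 + 2, 4 + 1]),    # E. coli, B. subtilis
    "No homologs": (12, [3, 9]),                   # clonable vs. not clonable
}


def inconsistent_splits(tree):
    """Return the names of groups whose parts do not add up to the total."""
    return [name for name, (total, parts) in tree.items() if total != sum(parts)]


print(inconsistent_splits(funnel))  # prints [] when every split adds up
```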
Adding Structure to Functional Genomics, Function to Structural Genomics

Why Structure? Do we really need it?
1 Most Highly Conserved
2 Precisely Defined Modules
3 Seq. ↔ Struc. Clearer than Seq. ↔ Func.
4 Link to Chemistry, Drugs

[Figure labels: Function; Folds v. Genomes; %ID v. RMS (axes: RMSD, %ID); Purely Seq. Based Analysis -- e.g. EcoCyc, ENZYME, GenProtEC, COGs, MIPS; Drug]

GenProtEC - Functional Classification
the E. coli database
http://genprotec.mbl.edu/start

Functional Classification

• COGs (cross-org., just conserved, NCBI Koonin/Lipman)
• GenProtEC (E. coli, Riley)
• ENZYME (SwissProt Bairoch/Apweiler, just enzymes, cross-org.)
• “Fly” (fly, Ashburner), now extended to GO (cross-org.)
• MIPS/PEDANT (yeast, Mewes)
• Also: Other SwissProt Annotation; WIT, KEGG (just pathways); TIGR EGAD (human ESTs)

Hierarchy of Protein Functions

All of SCOP entries
    ENZYME
        1 Oxidoreductases
            1.1 Acting on CH-OH
                1.1.1 NAD and NADP acceptor
                    1.1.1.1 Alcohol dehydrogenase
                    1.1.1.3 Homoserine dehydrogenase
            1.5 Acting on CH-NH
        3 Hydrolases
            3.1 Acting on ester bonds
                3.1.1 Carboxylic ester hydrolases
                    3.1.1.1 Carboxylesterase
                    3.1.1.8 Cholinesterase
            3.4 Acting on peptide bonds
    NON-ENZYME
        1 Metabolism
            1.2 Nucleotide metab.
            1.1 Carb. metab.
                1.1.1 Polysach. metab.
                    1.1.1.2 Starch metab.
                    1.1.1.1 Glycogen metab.
        3 Cell structure
            3.1 Nucleus
            3.8 Extracel. matrix
                3.8.2 Extracel. matrix glycoprotein
                    3.8.2.1 Fibronectin
                    3.8.2.2 Tenascin

Precise functional similarity / General similarity / Functional class similarity

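
One way to make graded similarity concrete in a numbered hierarchy like this is to count how many leading levels two classification numbers share. The sketch below is an illustration under that assumption, not the survey's own procedure. The function name `shared_levels` is hypothetical, and how a given depth maps onto the three similarity labels is not specified here. Both numbers must come from the same tree (ENZYME or NON-ENZYME), since the two trees reuse the same digits.

```python
def shared_levels(a: str, b: str) -> int:
    """Count the leading hierarchy levels that two dotted numbers have in common."""
    count = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        count += 1
    return count


# Alcohol dehydrogenase vs. homoserine dehydrogenase: same 1.1.1 sub-subclass.
print(shared_levels("1.1.1.1", "1.1.1.3"))
# Alcohol dehydrogenase vs. carboxylesterase: different top-level classes.
print(shared_levels("1.1.1.1", "3.1.1.1"))
```

A larger count means the two entries diverge lower in the tree, so their functions are more closely related.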
Can we define FUNCTION?

Problems defining function:
• Multi-functionality: 2 functions/protein (also 2 proteins/function)
• Conflating of Roles: molecular action, cellular role, phenotypic manifestation.
• Non-systematic Terminology: ‘suppressor-of-white-apricot’ & ‘darkener-of-apricot’

Functional Classification (COGs, GenProtEC, ENZYME, “Fly”/GO, MIPS/PEDANT, WIT, KEGG, TIGR EGAD)
vs.
Fold, Localization, Interactions & Regulation are attributes of proteins that are much more clearly defined

Fold-Function Combinations #1

• Many Functions on the Same Fold -- e.g. the TIM-barrel
• Two Different Folds Catalyze the Same Reaction -- e.g. Carbonic Anhydrases (4.2.1.1)

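
Both situations fall out of a simple cross-tabulation of (fold, function) pairs. The sketch below uses a handful of illustrative pairs, not data from the survey. "fold A" and "fold B" are placeholders for the two carbonic anhydrase folds, and all variable names are hypothetical.

```python
from collections import defaultdict

# Illustrative (fold, function) pairs only; not the survey's data set.
pairs = [
    ("TIM-barrel", "5.3.1.1"),
    ("TIM-barrel", "4.2.1.11"),
    ("TIM-barrel", "3.2.1.1"),
    ("fold A", "4.2.1.1"),  # placeholder fold name
    ("fold B", "4.2.1.1"),  # placeholder fold name
]

functions_by_fold = defaultdict(set)
folds_by_function = defaultdict(set)
for fold, function in pairs:
    functions_by_fold[fold].add(function)
    folds_by_function[function].add(fold)

# Folds carrying more than one function, and functions found on more than one fold.
multifunctional_folds = {f for f, fns in functions_by_fold.items() if len(fns) > 1}
multifold_functions = {fn for fn, folds in folds_by_function.items() if len(folds) > 1}

print(multifunctional_folds)
print(multifold_functions)
```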
Fold-Function Combinations

• 91 Enzymatic Functions + Non-Enzyme
• 229 Folds
• 331 Observed
• ~20K (=92x229) Possible

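
The arithmetic behind these counts can be spelled out in a few lines. This is only a worked check of the numbers quoted above, with 92 taken as the 91 enzymatic functions plus one non-enzyme class; the variable names are illustrative.

```python
n_functions = 91 + 1  # 91 enzymatic functions plus one non-enzyme class
n_folds = 229
observed = 331

possible = n_functions * n_folds  # every fold paired with every function
fraction_observed = observed / possible

print(possible)
print(f"{fraction_observed:.1%}")
```

Only a small fraction of the possible fold-function combinations is actually observed.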
The Most Versatile Folds, Versatile Functions

Top Multifunctional Folds (table columns, left to right)
Functions per fold   16     9      6      6      6      5      4      4      4      3      3      3      3      3      3      3
PDB                  1byb   2ace   1xel   1gky   1fxd   1phc   3chy   1ama   1bdo   1jbc   1snc   1lxa   3pte   1imf   1fha   1rie
Fold                 3.001  3.048  3.018  3.024  4.031  1.063  3.013  3.045  2.055  2.018  2.024  2.053  5.003  5.007  1.021  7.035

                                                                                                                                                                                                                                                                                                            NONENZ   0.0.0    22        5 40 666 374                         168              11 464 105                1       1       7 102




                                                                                                                                                                                                                                                                                                                     1.1.1   106     266
                                                                                                                                                                                                                                                                                                                     1.1.3     4                  1
                                                                                                                                                                                                                                                                                                                     1.10.2                                                                                                                             5
                                                                                                                                                                                                                                                                                                                     1.11.1        4
                                                                                                                                                                                                                                                                                                                     1.14.13                      3                                                                                             6
                                                                                                                                                                                                                                                                                                                     1.14.14 21                  50
                                   Top-4 Functions:                                                                                                                                                                                                                                                                  1.14.15                      2
                                                                                                                                                                                                                                                                                                      OX             1.14.99                      7
                                   Glycosidases, carboxy-                                                                                                   Top-5 Folds:                                                                                                                                             1.17.4
                                                                                                                                                                                                                                                                                                                     1.18.6               42
                                                                                                                                                                                                                                                                                                                                                                                                                                              36

                                                                                                                                                                                                                                                                                                                     1.3.1    15      82   3
                                   lyases, phosphoric                                                                                                       TIM-barrel (16),                                                                                                                                         1.3.99
                                                                                                                                                                                                                                                                                                                     1.6.5
                                                                                                                                                                                                                                                                                                                              10
                                                                                                                                                                                                                                                                                                                                               2
                                   monoester hydrolases,                                                                                                    alpha-beta hydrolase fold (9),                                                                                                                           1.6.99
                                                                                                                                                                                                                                                                                                                     1.9.3
                                                                                                                                                                                                                                                                                                                               8       2             4
                                                                                                                                                                                                                                                                                                                                                                                                                                                        6
                                                                                                                                                                                                                                                                                                                     2.1.3                     6            1
                                   linear monoester                                                                                                         Rossmann fold (6), P-loop                                                                                                                                2.3.1
                                                                                                                                                                                                                                                                                                                     2.6.1
                                                                                                                                                                                                                                                                                                                                   6
                                                                                                                                                                                                                                                                                                                                                       128
                                                                                                                                                                                                                                                                                                                                                                                                                        8
                                                                                                                                                                                                                                                                                                      TRAN
                                   hydrolases (3.2.1, 4.2.1                                                                                                 NTP hydrolase fold (6),                                                                                                                                  2.7.1
                                                                                                                                                                                                                                                                                                                     2.7.4               291 156
                                                                                                                                                                                                                                                                                                                                                           10

                                                                                                                                                                                                                                                                                                                     2.7.7                                                                                                              1
                                   3.1.3, 3.5.1)                                                                                                            Ferrodoxin fold (6)                                                                                                                                      3.1.1
                                                                                                                                                                                                                                                                                                                     3.1.2
                                                                                                                                                                                                                                                                                                                                 122
                                                                                                                                                                                                                                                                                                                                   3
                                                                                                                                                                                                                                                                                                                                                    12          1

                                                                                                                                                                                                                                                                                                                     3.1.3                                                                                                            77
                                                                                                                                                                                                                                                                                                                     3.1.31                                        4
                                                                                                                                                                                                                                                                                                                     3.1.4     4
                                                                  A                                       B                                                               A/B                                                                 A+B                                 MULTI SML                          3.2.1   170                              121
                                                                                                                                                                                                                                                                                                                     3.2.3     3
                                                                                                                                                                                                                                                                                                      HYD            3.4.11        2
                                                                                                                                                                                                                                                                                                                     3.4.16        4                                                                                           1
Top Multifold Functions

Folds   Function (EC number)
  150   0.0.0 (NONENZ)
    7   3.2.1
    7   4.2.1
    6   3.1.3
    6   3.5.1
    5   1.11.1
    5   1.9.3
    5   2.7.1
    5   3.6.1
    5   4.1.1
    4   1.14.13
    4   2.4.2
    4   2.5.1
    4   3.1.1
    4   3.2.2
    3   1.1.1
    3   1.3.1
    3   1.6.99

[Figure: matrix of these functions (rows) against folds (columns). Columns are labelled by representative structure (d1mmog_ through 2sil) and by fold number (1.004 through 7.029).]
Most Versatile Folds – Relation to Interactions




Similar results: Martin et al. (1998)

The number of interactions for each fold = the number of other folds it is found to contact in the PDB
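
The counting rule above can be sketched in a few lines. This is only a minimal illustration, assuming the fold-fold contacts have already been extracted as pairs of fold identifiers; the function name, the fold names and the `contacts` list are hypothetical and are not data from the PDB.

```python
from collections import defaultdict

def interactions_per_fold(contacts):
    """For each fold, count the distinct *other* folds it is seen to contact."""
    partners = defaultdict(set)
    for fold_a, fold_b in contacts:
        if fold_a == fold_b:
            continue  # a fold touching itself is not an "other" fold
        partners[fold_a].add(fold_b)
        partners[fold_b].add(fold_a)
    return {fold: len(others) for fold, others in partners.items()}

# Hypothetical contact list (illustrative names only)
contacts = [
    ("TIM-barrel", "Rossmann"),
    ("TIM-barrel", "Ferredoxin-like"),
    ("Rossmann", "TIM-barrel"),  # repeated pair, counted once
    ("Rossmann", "Rossmann"),    # self-contact, ignored
]
counts = interactions_per_fold(contacts)
print(counts)
```

Using a set per fold means repeated observations of the same fold pair do not inflate the count, which matches the "number of other folds" wording.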
Fold-Function Combinations: Cross-Tabulation

             A     B   A/B   A+B  MULTI   SML   sum
NONENZ      34    30    14    28      4    26   136
OX          13     5    17     3      4     5    47
TRAN         3     3    16     8      5     -    35
HYD          4    11    30    18      4     -    67
LY           2     3    13     5      -     -    23
ISO          1     2     7     4      2     -    16
LIG          -     1     2     3      1     -     7
sum         57    55    99    69     20    31   331

(- = empty cell)
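
The marginal sums of the cross-tabulation can be recomputed directly from the cell counts. The sketch below is a minimal check, assuming that the empty cells of the table stand for zero; the variable names are illustrative.

```python
CLASSES = ["A", "B", "A/B", "A+B", "MULTI", "SML"]

# Counts transcribed from the cross-tabulation; empty cells assumed to be 0
TABLE = {
    "NONENZ": [34, 30, 14, 28, 4, 26],
    "OX":     [13, 5, 17, 3, 4, 5],
    "TRAN":   [3, 3, 16, 8, 5, 0],
    "HYD":    [4, 11, 30, 18, 4, 0],
    "LY":     [2, 3, 13, 5, 0, 0],
    "ISO":    [1, 2, 7, 4, 2, 0],
    "LIG":    [0, 1, 2, 3, 1, 0],
}

# Row sums: combinations per functional category
row_sums = {func: sum(row) for func, row in TABLE.items()}
# Column sums: combinations per structural class
col_sums = {c: sum(row[i] for row in TABLE.values()) for i, c in enumerate(CLASSES)}
total = sum(row_sums.values())

# Enzyme-only count in the A/B class (all rows except NONENZ)
ab = CLASSES.index("A/B")
enzyme_ab = sum(row[ab] for func, row in TABLE.items() if func != "NONENZ")

print(row_sums, col_sums, total, enzyme_ab)
```

Under the zero-for-empty assumption the recomputed margins agree with the "sum" row and column of the table, and 85 of the 99 A/B combinations fall in the enzyme rows.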
Summary Diagram

                          SCOP
                   A      B    A/B    A+B  MULTI    SML
        NONENZ   7.1    5.7    7.1    9.2    2.8    0.7
ENZYME  OX       3.5    2.1    9.2    2.1    0.7    0.7
        TRAN     0.7      -   10.6    1.4    1.4    0.7
        HYD      2.8    2.8    6.4    5.7    1.4      -
        LY       2.1      -    4.3      -      -      -
        ISO      0.7    1.4    2.8    0.7      -      -
        LIG        -      -    1.4    1.4      -      -

(- = empty cell; ENZYME covers rows OX through LIG)

[ Similar analysis in Martin et al. (1998), Structure 6: 875 ]
Compare Classifications and Genomes

Compare 1 Structure-Function Cross-Tab for Different Genomes and Different Functional & Structural Classifications for the Yeast Genome

SCOP

                   A      B    A/B    A+B  MULTI    SML
        NONENZ   7.1    5.7    7.1    9.2    2.8    0.7
ENZYME  OX       3.5    2.1    9.2    2.1    0.7    0.7
        TRAN     0.7      -   10.6    1.4    1.4    0.7
        HYD      2.8    2.8    6.4    5.7    1.4      -
        LY       2.1      -    4.3      -      -      -
        ISO      0.7    1.4    2.8    0.7      -      -
        LIG        -      -    1.4    1.4      -      -

CATH (Thornton)

                   A      B     AB
        NONENZ    10    9.0     15
ENZYME  OX       5.1    5.1     10
        TRAN       -    1.3     13
        HYD      2.6    1.3     14
        LY         -    2.6    1.3
        ISO      1.3    1.3    5.1
        LIG        -      -    1.3

MIPS YFC (Mewes) vs. SCOP

MIPS Functional Cat.            No.     A      B    A/B    A+B  MULTI    SML
metabolism                        1   3.5    2.3     10    4.5    1.3    0.8
energy                            2   1.1    1.2      5    1.5    0.3    0.2
growth, div., DNA syn.            3   4.9    3.6      4    4.5    1.8    1.2
transcription                     4   1.5    1.3    2.2    1.5    0.5    0.8
protein synthesis                 5     1    0.9    0.7    1.3    0.3    0.2
protein targeting                 6   1.2    1.7      2    1.6    0.5    0.3
transport facilitation            7   0.9    0.5    0.7    0.6    0.4      -
intracellular transport           8   1.8    2.1    1.6    0.6      1      -
cellular biogenesis               9   0.9    0.7    1.2    0.3    0.3    0.1
signal transduction              10     1      1    1.1    0.3    0.7    0.3
cell rescue, defense…            11   1.5      1    2.6    1.9    0.7    0.5
ionic homeostasis                13   0.5    0.3    0.4    0.4    0.2      -

(- = empty cell)

[Figure: four bar-chart panels (SwissProt, worm, Yeast, E. coli), each titled "Number of folds in the different functional categories"; x-axis: A, B, A/B, A+B, MULTI, SML; legend labels: Both, ENZ, nonENZ (also EC, nonEC)]

Table of fold counts shown with the charts:

             A     B   A/B   A+B  MULTI   SML  TOTAL
Both         4    13     9     6      2     1     35
ENZ         12     6    34    28     11     2     93
nonENZ      30    17     5    22      2    25    101
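
One way to put the SCOP and CATH cross-tabs side by side is to collapse the SCOP classes onto the three CATH columns. The sketch below does this for the NONENZ and OX rows only. It rests on two assumptions that are not stated on the slide: that SCOP A/B and A+B can be merged into the CATH "AB" column, and that MULTI and SML can be dropped for the comparison. The two tables may also be normalized differently, so only rough agreement should be expected.

```python
SCOP_CLASSES = ["A", "B", "A/B", "A+B", "MULTI", "SML"]

# Values transcribed from the SCOP and CATH tables (NONENZ and OX rows only)
SCOP_ROWS = {
    "NONENZ": [7.1, 5.7, 7.1, 9.2, 2.8, 0.7],
    "OX":     [3.5, 2.1, 9.2, 2.1, 0.7, 0.7],
}
CATH_ROWS = {  # columns: A, B, AB
    "NONENZ": [10, 9.0, 15],
    "OX":     [5.1, 5.1, 10],
}

def collapse_to_cath(row):
    """Map a SCOP row onto [A, B, AB]; assumes AB = A/B + A+B (illustrative)."""
    by_class = dict(zip(SCOP_CLASSES, row))
    merged_ab = round(by_class["A/B"] + by_class["A+B"], 1)
    return [by_class["A"], by_class["B"], merged_ab]

for func in SCOP_ROWS:
    print(func, "SCOP collapsed:", collapse_to_cath(SCOP_ROWS[func]),
          "CATH:", CATH_ROWS[func])
```

In both rows the merged alpha-beta column is the largest entry, which is the same ordering the CATH table shows.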
COGs vs SCOP: Different Structure-Function Relationships for Most Conserved Proteins

All Yeast COGs

                                           SCOP
                                     A      B    A/B    A+B  MULTI    SML
Metabolism                     C   2.2    2.6    4.8      3    0.4      -
                               E   2.2    1.1    7.4    2.6    0.7      -
                               F   1.1      -    3.7    1.8      -      -
                               G   0.4    0.4    3.3    0.7      -      -
                               H   1.1    0.7    4.8      3      -      -
                               I   0.7    0.7    2.2    0.4    0.4      -
Information Storage            J   2.2    1.8      3      3    0.4    0.4
& Processing                   K     -      -    1.1    0.4      -      -
                               L   1.1      -    1.5    1.1    1.1      -
Cellular Processes             M     -    0.4    0.4    0.7      -      -
                               N   1.8    0.7    0.4    0.7      -    0.4
                               O   1.5    1.1      3    2.2    0.4    0.4
                               P     -    0.4    1.1    0.7    0.4      -

Most Conserved COGs

                                           SCOP
                                     A      B    A/B    A+B  MULTI    SML
Metabolism                     C     -      -    7.2    2.9      -      -
                               E   1.4      -    1.4    1.4      -      -
                               F     -      -    2.9      -      -      -
                               G     -      -    4.3    1.4      -      -
                               H   1.4    2.9      -    1.4      -      -
                               I     -      -      -      -      -      -
Information Storage            J   8.7    7.2    7.2     10    1.4    1.4
& Processing                   K     -      -      -      -      -      -
                               L     -      -      -      -    1.4      -
Cellular Processes             M     -      -      -      -      -      -
                               N   1.4      -    1.4      -      -      -
                               O   2.9      -    7.2    2.9      -      -
                               P     -    1.4      -    2.9    1.4      -

(- = empty cell)

(SCOP: Murzin, Ailey, Brenner, Hubbard, Chothia; COGs: Tatusov, Koonin, Lipman)
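
The shift between the two COG tables can be made concrete by summing a row over all SCOP classes. The sketch below compares rows C and J in "All Yeast COGs" and "Most Conserved COGs". It is a minimal illustration, assuming empty cells are zero; the helper name `row_total` is hypothetical.

```python
# Rows transcribed from the two COG tables; columns A, B, A/B, A+B, MULTI, SML.
# Empty cells are assumed to be 0.
ALL_YEAST = {
    "C": [2.2, 2.6, 4.8, 3, 0.4, 0],
    "J": [2.2, 1.8, 3, 3, 0.4, 0.4],
}
MOST_CONSERVED = {
    "C": [0, 0, 7.2, 2.9, 0, 0],
    "J": [8.7, 7.2, 7.2, 10, 1.4, 1.4],
}

def row_total(row):
    """Sum one functional category over all structural classes."""
    return round(sum(row), 1)

for cog in ("C", "J"):
    print(cog, "all:", row_total(ALL_YEAST[cog]),
          "most conserved:", row_total(MOST_CONSERVED[cog]))
```

With these numbers category J (in the Information Storage & Processing group) rises from 10.8 to 35.9, while category C falls from 13.0 to 10.1.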
   Fold-Function
  Combinations #2
 Many Functions on the
 Same Fold




                                                                                                 83 (c) Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu
 -- e.g. the TIM-barrel
 at what degree of divergence?
 Sequence Diverg. (%ID, Pseq)
 Structural Diverg. (RMS, Pstr)
 Functional Diverg. (%SameFunc)
   Compare large number of pairs of sequences that have same fold but different functions.
   %ID     Organism      Protein             EC number    Slide annotation
   89%     Human         TP Isomerase        5.3.1.1      Same Exact Func.
           Chick         TP Isomerase        5.3.1.1
   45%     E coli        TP Isomerase        5.3.1.1      Both Class 5
           E coli        PRA Isomerase       5.3.1.24
           B ster.       Xylose Isomerase    5.3.1.5
   ~20%    E coli        Aldolase            4.1.3.3      Completely Different
           Yeast         Enolase             4.2.1.11
           Rat           K-channel B-sub.    NON-ENZ      Same Exact Func.
           Photobact.    Flavoprotein?       NON-ENZ
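
 The functional labels in the TIM-barrel table can be read off the EC numbers. Below is a minimal sketch, assuming each enzyme is given as a dotted EC string. The function name ec_relationship and its three return labels are illustrative and do not come from the slides. NON-ENZ entries have no EC number and are not handled here.

```python
def ec_relationship(ec_a, ec_b):
    """Compare two EC numbers given as dotted strings such as "5.3.1.1".

    Returns one of three illustrative labels:
      "same exact function" - all EC fields match
      "same class"          - only the first EC field (the class) is shared
      "different"           - even the first field differs
    """
    fields_a = ec_a.split(".")
    fields_b = ec_b.split(".")
    if fields_a == fields_b:
        return "same exact function"
    if fields_a[0] == fields_b[0]:
        return "same class"
    return "different"


# EC numbers taken from the table; the pairings are chosen for illustration.
tp_vs_tp = ec_relationship("5.3.1.1", "5.3.1.1")      # two TP isomerases
tp_vs_pra = ec_relationship("5.3.1.1", "5.3.1.24")    # TP vs PRA isomerase
tp_vs_aldolase = ec_relationship("5.3.1.1", "4.1.3.3")  # isomerase vs aldolase
```

 This is a deliberately coarse comparison. A finer scheme could also report how many leading EC fields two enzymes share.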
 Relationship of Similarity in Sequence & Structure to that in Function

 See at what %ID have diff. function (both broad & precise).
 Use 4 func. classifications -- ENZYME, FLY (+extra), MIPS, GenProtEC

 [Schematic: %SameFunc vs. %ID, with TZ marked]
 [Plot: % of pairs with same class or function (0 to 100) vs. % sequence identity (100 to 0). Curves: MIPS Class, MIPS Precise Func., GenProtEC Class, GenProtEC Func.]
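
 The %SameFunc vs. %ID curve can be computed by binning sequence pairs on percent identity and taking, in each bin, the percentage of pairs that share a function. Below is a minimal sketch, assuming each pair has already been reduced to a (percent identity, same function?) tuple. The function name same_function_by_identity and the 10% default bin width are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import defaultdict


def same_function_by_identity(pairs, bin_width=10):
    """Percent of pairs with the same function, per %ID bin.

    pairs: iterable of (percent_identity, same_function) tuples, with
           percent_identity in [0, 100] and same_function a bool.
    Returns {bin_lower_edge: percent_same}, highest-identity bin first.
    """
    n_bins = math.ceil(100 / bin_width)
    totals = defaultdict(int)
    same = defaultdict(int)
    for pct_id, is_same in pairs:
        # Pairs at exactly 100% identity go into the top bin.
        index = min(int(pct_id // bin_width), n_bins - 1)
        lower = index * bin_width
        totals[lower] += 1
        if is_same:
            same[lower] += 1
    return {lower: 100.0 * same[lower] / totals[lower]
            for lower in sorted(totals, reverse=True)}


# Toy input with made-up values (not data from the survey).
toy_pairs = [(95, True), (92, True), (100, True),
             (45, True), (42, False), (20, False)]
toy_curve = same_function_by_identity(toy_pairs)
```

 Running the same function once per classification (for example with the same-function flag taken from class-level or precise-function agreement) would give one curve per classification.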

								