Slide 1 - CABM Structural Bioinformatics Laboratory

					           Arabidopsis genome


                 John Markley


Eldon Ulrich (bioinformatics team leader)


    Center for Eukaryotic Structural Genomics (CESG)
Why CESG chose Arabidopsis thaliana as its initial target genome
• Genome is relatively uncomplicated, but likely
  to contain new folds (relatively unstudied)
    29,000 genes
• Complete expression profiles recently
  published (all ORFs)
• Large field of opportunity for elucidating new
  fold-function relationships
    Synergy with the NSF 2010 Program, which
    plans to determine the functions of all
    Arabidopsis proteins
• Genome codes for proteins that carry out
  fascinating biological and biochemical
  processes
    Cell signaling; tissue remodeling; novel
    defense mechanisms; regulation of the cycle
    of yearly cellular processes; biosynthesis of
    natural products
• Opportunity to collaborate with the large,
  highly organized, Arabidopsis community
Where does the Arabidopsis thaliana genome fit in?
           Arabidopsis: some round numbers

~27,000 ORFs

~ 6,000 “known” proteins with annotated function

~ 3,000 known to exist but function “unknown”

~ 10,000 “putative” supported by minimal evidence

~ 5,000 “hypothetical” (pure prediction)


Of the above: ~ 5,000 membrane proteins
              40-50% show homology to human proteins
              88% of Arabidopsis superfamilies in human
                 genome (C. Chothia)
CESG target selection goals
Maximize scientific impact
  – ‘Unique structures’ (NIH priority)
  – Fold-function relationships
  – Arabidopsis community requests

Minimize production challenges
   – PCR
   – Cloning
   – Expression
   – Solubility
   – Purification

Minimize sample challenges
   – Aggregation
   – Disorder
 Bioinformatics collaborations: target selection
• *Keith Dunker & Christopher Oldfield - PONDR protein disorder
  predictions (Indiana University)

• Dmitrij Frishman - PEDANT database ORF annotation (Institute for
  Bioinformatics, GSF)

• *Michal Linial, Elon Portugaly, & Ilona Kifer - PROTOMAP/ArabiNet
  protein clustering and new fold probabilities; domain analysis (The
  Hebrew University)

• Christine Orengo – New fold probabilities and domain predictions
  (University College, London – Midwest Structural Genomics
  Consortium)

• Sue Rhee – TAIR (The Arabidopsis Information Resource)

• Chris Town, Owen White & Steven Salzberg - TIGR – ORF
  predictions and annotations
     Analysis of clusters and their vacant surrounding
     volumes as a metric for novel fold identification

• Sequences are first clustered to form a net.
  Occupied clusters are those for which a
  structure is known.
• For a given cluster, the vacant surrounding
  volume (VSV) is the number of clusters
  traversed before encountering an occupied
  cluster.
• In the example shown, cluster “A” has a VSV
  of 11.
• All Arabidopsis clusters have been sorted
  according to their VSV (the VSV of an
  occupied cluster is zero).

              Michal Linial: ProtoNet, AraNet
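
The VSV definition above can be sketched as a breadth-first search outward from the occupied clusters. This is only an illustration, not the ProtoNet implementation: the adjacency dictionary, the cluster names, and the reading of "clusters traversed" as the number of hops to the nearest occupied cluster (so that an occupied cluster scores zero) are all assumptions.

```python
from collections import deque


def vacant_surrounding_volume(neighbors, occupied):
    """Return {cluster: hops to the nearest occupied cluster}.

    neighbors: dict mapping each cluster to the clusters adjacent to it in the net
               (hypothetical representation of the net).
    occupied:  set of clusters for which a structure is known (VSV 0).
    Clusters that cannot reach any occupied cluster are left out of the result.
    """
    vsv = {cluster: 0 for cluster in occupied}
    queue = deque(occupied)
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in vsv:  # first visit is the shortest path in a BFS
                vsv[nxt] = vsv[current] + 1
                queue.append(nxt)
    return vsv


# Toy net (made up): a chain S - B - C - A with a side branch C - D.
toy_net = {
    "S": ["B"],
    "B": ["S", "C"],
    "C": ["B", "A", "D"],
    "A": ["C"],
    "D": ["C"],
}
vsv = vacant_surrounding_volume(toy_net, occupied={"S"})

# Sort clusters by decreasing VSV, breaking ties by name.
ranked = sorted(vsv, key=lambda c: (-vsv[c], c))
```

Sorting by decreasing VSV puts the clusters farthest from any known structure first, which matches the idea of using VSV as a metric for novel folds.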
         CESG scores based on
        predictors of new folds

CESG Score   Vacant surrounding volume   Cluster members

    0                >4                  >2

    1                >2                  >2

    2                >2                  >1

    3                >2                  >0

    4                >1                  >0

    5                >0                  >0

    6                -1                  >0

    9                 0
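
One possible reading of the table above is a lookup in which the rows are checked from top to bottom and the first match wins. The sketch below assumes exactly that; the function name is hypothetical, the value -1 is treated as a literal entry because the slide does not explain it, and combinations the table does not cover return None rather than a guessed score.

```python
# (score, VSV must exceed, cluster members must exceed), checked top to bottom
NEW_FOLD_ROWS = [
    (0, 4, 2),
    (1, 2, 2),
    (2, 2, 1),
    (3, 2, 0),
    (4, 1, 0),
    (5, 0, 0),
]


def new_fold_score(vsv, members):
    """Map a cluster's VSV and member count to a score, per the table above."""
    if vsv == 0:  # occupied cluster: a structure is already known
        return 9
    if vsv == -1:  # literal table entry; its meaning is not given on the slide
        return 6 if members > 0 else None
    for score, min_vsv, min_members in NEW_FOLD_ROWS:
        if vsv > min_vsv and members > min_members:
            return score
    return None  # combination not covered by the table
```

For the cluster "A" example with a VSV of 11, any cluster with more than two members would receive the best score, 0, under this reading.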
      PONDR analysis of intrinsic disorder (ID) in proteins
(Predictor Of Natural Disordered Regions: neural net software)
                   (http://www.pondr.com)

[Figure: Arabidopsis. x-axis: Percent Disordered (0% to 100%); y-axis: Percent of Proteins (0% to 100%). Series: Disordered Proteins; X-ray Proteins w/ Disorder; Ordered Proteins.]

                                            C. Oldfield & A. K. Dunker, unpublished
PONDR: Predictors of Naturally Disordered Regions




    Dunker et al. J. Mol. Graph. Model (2001) 19:26
 Disorder prediction scoring

CESG Score   Predicted % disorder   Residues in longest disordered region

    0                ≥0                      -

    4                >30                     -

    5                >35                     -

    6                >40                     -

    7                >45                     -

    9                >50                     -

    9                >30                   ≥40
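
The scoring table above can be read as "the highest matching row applies". The sketch below assumes that rule, which is not stated on the slide, although it does reproduce the CESG disorder scores listed for the targets in the HSQC correlation table. The function name is hypothetical.

```python
# (score, predicted % disorder must exceed), highest score first
PERCENT_ROWS = [(9, 50), (7, 45), (6, 40), (5, 35), (4, 30)]


def disorder_score(percent_disorder, longest_region):
    """Return the disorder score for one ORF.

    percent_disorder: predicted percent of residues that are disordered.
    longest_region:   residues in the longest predicted disordered region.
    """
    # Last table row: moderate overall disorder plus one long disordered stretch.
    if percent_disorder > 30 and longest_region >= 40:
        return 9
    for score, threshold in PERCENT_ROWS:
        if percent_disorder > threshold:
            return score
    return 0
```

For example, At1g23750 (41% predicted disorder, longest segment 23) scores 6, and At5g42990 (35%, longest segment 37) scores 4 because 35 does not exceed the 35% threshold.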
1H-15N HSQC NMR spectra of At1g23750
       in folded and unfolded states




 His tag present         His tag removed
Correlations between predicted disorder and 1H-15N HSQC NMR
                        CESG      Consensus   Predicted   Number of     Longest
              HSQC     disorder    disorder    percent    disordered   disordered
  AGI-ID      result    score       score     disorder     residues     segment
  At5g42290      -         9         100          73          80           45
  At5g24660      -         9         100          71          67           53
  At2g25720      -         9         100          65          76           51
  At1g32310      -         9         50           63          63           22
  At5g12030      -         9         75           63          98           47
  At2g33690      -         9         100          63          45           32
  At3g60650      -         9         100          62          66           35
  At1g23750     +/-        6          0           41          56           23
  At2g20490      -         5         50           38          24           24
  At3g02790      -         5         100          38          40           15
  At5g42990      -         4         50           35          56           37
  At1g29250      ~         4         50           35          45           21
  At3g03410      +         0          0           26          34           13
  At5g25570      -         0          0           26          32           20
  At2g03870      -         0          0           20          20           13
  At1g65480      +         0          0           20          35           20
  At2g43510      +         0          0           17          15           14
  At3g01050      +         0          0           15          18           10
  At5g22580      +         0          0           10          11            7
  At1g24000      +         0          0           10          12            7
  At3g17210      +         0          0           8           9             7
  At3g51030      +         0          0           5           6             6
  At3g16450      +         0          0           5           16           12
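
To make the correlation in the table above easier to see, the HSQC results can be tallied separately for targets with a CESG disorder score of 0 and for those with a higher score. The sketch copies three columns from the table; the helper name is hypothetical, and the HSQC symbols (+, -, +/-, ~) are counted as given because the slide does not define them.

```python
from collections import Counter

# (AGI-ID, HSQC result, CESG disorder score), copied from the table above
ROWS = [
    ("At5g42290", "-", 9), ("At5g24660", "-", 9), ("At2g25720", "-", 9),
    ("At1g32310", "-", 9), ("At5g12030", "-", 9), ("At2g33690", "-", 9),
    ("At3g60650", "-", 9), ("At1g23750", "+/-", 6), ("At2g20490", "-", 5),
    ("At3g02790", "-", 5), ("At5g42990", "-", 4), ("At1g29250", "~", 4),
    ("At3g03410", "+", 0), ("At5g25570", "-", 0), ("At2g03870", "-", 0),
    ("At1g65480", "+", 0), ("At2g43510", "+", 0), ("At3g01050", "+", 0),
    ("At5g22580", "+", 0), ("At1g24000", "+", 0), ("At3g17210", "+", 0),
    ("At3g51030", "+", 0), ("At3g16450", "+", 0),
]


def tally_by_score_class(rows):
    """Count HSQC outcomes for score-0 targets and for targets scoring above 0."""
    tally = {"score 0": Counter(), "score > 0": Counter()}
    for _agi, hsqc, score in rows:
        key = "score 0" if score == 0 else "score > 0"
        tally[key][hsqc] += 1
    return tally


tally = tally_by_score_class(ROWS)
for label, counts in tally.items():
    print(label, dict(counts))
```

On these 23 rows, none of the 12 targets with a nonzero disorder score has a + result, while 9 of the 11 score-0 targets do.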
  Additional target list manipulations

“Black List”: ORFs determined to contain
 transposable elements (Ken Johnson and Lucia
 Alvarado; TIGR annotation)


Manual curation (Craig Newman and Jason Lee)
  − Size
  − cDNA results
  − Number of introns
  − Others
            CESG’s approach to ordering targets




1       0     0-8    0      0     0     0-1    <5     0     0*    ≤7     <5     ≤5        0
2                                       2-4           1    1-2*                 >5
3                           1           5-6           2     3*    >7
4                                       7-8    5     >2     4*           ≥5
7                    9      9     9
8                                        9     9      9
9       9     9                                                                           9



    * Positive prediction of a new fold leads to a higher priority (lower tier number),
      provided that the ORF is not otherwise in tier 7 or 8 and provided that a
      homologue of the ORF is not close to structure determination by another SG
      group (diffraction quality crystals or excellent HSQC map).
                   Scored Arabidopsis ORFs
AGI-ID   Total score           Scores for individual criteria
______________________________________________________________
at2g46100   '70000970138090'   7   0   0   0   0   9   7   0   1   3   8   0   9   0
at2g46110   '70700965158090'   7   0   7   0   0   9   6   5   1   5   8   0   9   0
at2g46130   '40700030059600'   4   0   7   0   0   0   3   0   0   5   9   6   0   0
at2g46140   '10000020005460'   1   0   0   0   0   0   2   0   0   0   5   4   6   0
at2g46150   '70091940358060'   7   0   0   9   1   9   4   0   3   5   8   0   6   0
at2g46160   '99091070957560'   9   9   0   9   1   0   7   0   9   5   7   5   6   0
at2g46170   '70099020355090'   7   0   0   9   9   0   2   0   3   5   5   0   9   0
at2g46180   '99000010258900'   9   9   0   0   0   0   1   0   2   5   8   9   0   0
at2g46190   '99000950059090'   9   9   0   0   0   9   5   0   0   5   9   0   9   0
at2g46200   '40000050358900'   4   0   0   0   0   0   5   0   3   5   8   9   0   0
at2g46210   '99099070058090'   9   9   0   9   9   0   7   0   0   5   8   0   9   0
at2g46220   '70000960058000'   7   0   0   0   0   9   6   0   0   5   8   0   0   0
at2g46230   '10000060005490'   1   0   0   0   0   0   6   0   0   0   5   4   9   0
at2g46240   '80000095158900'   8   0   0   0   0   0   9   5   1   5   8   9   0   0
at2g46250   '40000075158900'   4   0   0   0   0   0   7   5   1   5   8   9   0   0
at2g46260   '99000090058090'   9   9   0   0   0   0   9   0   0   5   8   0   9   0
at2g46270   '99000015158900'   9   9   0   0   0   0   1   5   1   5   8   9   0   0
                   Sorted Arabidopsis ORFs

AGI-ID   Total score           Scores for individual criteria
______________________________________________________________
at2g46140   '10000020005460'   1   0   0   0   0   0   2   0   0   0   5   4   6   0
at2g46230   '10000060005490'   1   0   0   0   0   0   6   0   0   0   5   4   9   0
at2g46200   '40000050358900'   4   0   0   0   0   0   5   0   3   5   8   9   0   0
at2g46250   '40000075158900'   4   0   0   0   0   0   7   5   1   5   8   9   0   0
at2g46130   '40700030059600'   4   0   7   0   0   0   3   0   0   5   9   6   0   0
at2g46220   '70000960058000'   7   0   0   0   0   9   6   0   0   5   8   0   0   0
at2g46100   '70000970138090'   7   0   0   0   0   9   7   0   1   3   8   0   9   0
at2g46150   '70091940358060'   7   0   0   9   1   9   4   0   3   5   8   0   6   0
at2g46170   '70099020355090'   7   0   0   9   9   0   2   0   3   5   5   0   9   0
at2g46110   '70700965158090'   7   0   7   0   0   9   6   5   1   5   8   0   9   0
at2g46240   '80000095158900'   8   0   0   0   0   0   9   5   1   5   8   9   0   0
at2g46180   '99000010258900'   9   9   0   0   0   0   1   0   2   5   8   9   0   0
at2g46270   '99000015158900'   9   9   0   0   0   0   1   5   1   5   8   9   0   0
at2g46260   '99000090058090'   9   9   0   0   0   0   9   0   0   5   8   0   9   0
at2g46190   '99000950059090'   9   9   0   0   0   9   5   0   0   5   9   0   9   0
at2g46160   '99091070957560'   9   9   0   9   1   0   7   0   9   5   7   5   6   0
at2g46210   '99099070058090'   9   9   0   9   9   0   7   0   0   5   8   0   9   0
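
In the two tables above, the total score is the 14 individual criterion scores written out as one string of digits, and the sorted list is in ascending order of that string. A minimal sketch of that step follows, using five rows copied from the tables. The function name is hypothetical, and treating the sort as a plain string comparison is an assumption, though one that is consistent with the order shown.

```python
def total_score(criteria):
    """Concatenate single-digit criterion scores into one sortable string."""
    if any(not 0 <= c <= 9 for c in criteria):
        raise ValueError("each criterion score must be a single digit")
    return "".join(str(c) for c in criteria)


# Individual criterion scores copied from the "Scored Arabidopsis ORFs" table
scored = {
    "at2g46100": [7, 0, 0, 0, 0, 9, 7, 0, 1, 3, 8, 0, 9, 0],
    "at2g46130": [4, 0, 7, 0, 0, 0, 3, 0, 0, 5, 9, 6, 0, 0],
    "at2g46140": [1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 5, 4, 6, 0],
    "at2g46200": [4, 0, 0, 0, 0, 0, 5, 0, 3, 5, 8, 9, 0, 0],
    "at2g46240": [8, 0, 0, 0, 0, 0, 9, 5, 1, 5, 8, 9, 0, 0],
}

# Because every criterion is one digit, string order equals
# left-to-right priority order of the criteria.
ranked = sorted(scored, key=lambda agi: total_score(scored[agi]))
```

Because the leftmost digit dominates the comparison, criteria placed earlier in the string take precedence over later ones, and later digits only break ties.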
      The Genie module of Sesame

• Primary module for tracking the progress of
  CESG targets
• Creates reports (XML and summary reports)
• Contains information on the annotated
  proteome
• ORFs can be organized into “Work Groups”
  of defined sizes (usually 96 ORFs) based on
  a variety of criteria (including both physical
  and annotation data)
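
As a rough illustration of the "Work Groups" idea, an ordered list of ORFs can be cut into consecutive groups of a fixed size. This is not Sesame's interface, which is not shown here; the function name and the placeholder ORF identifiers are hypothetical, and only the usual group size of 96 comes from the slide.

```python
def work_groups(orfs, size=96):
    """Split an ordered list of ORF identifiers into consecutive groups."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [orfs[i:i + size] for i in range(0, len(orfs), size)]


# Placeholder identifiers, assumed to be already ordered by the chosen criteria.
ordered_orfs = [f"orf{n:04d}" for n in range(200)]
groups = work_groups(ordered_orfs)
sizes = [len(g) for g in groups]
```

With 200 placeholder ORFs this gives two full groups of 96 and a final partial group of 8.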
                      Summary of all CESG targets (11/09/03)

[Figure: axis labeled CESG priority score, with ticks from 5,000 to 25,000 and bands marked Tier 1,2 & 3, Tier 4, Tier 7, Tier 8, and Tier 9. 3032 Targets. Legend: initiated, progressing, stalled, stopped, not selected.]

                      CESG pipeline targets (11/09/03)

[Figure: axis labeled CESG priority score, with ticks from 5,000 to 25,000 and bands marked Tier 1,2 & 3, Tier 4, Tier 7, Tier 8, and Tier 9. 1500 Targets. Legend: initiated, progressing, stalled, stopped, not selected.]

CESG’s bottlenecks in optimal target selection

 Maintaining up-to-date target sequences and annotation
   − links to TAIR and MIPS

  Incorporating new information from collaborators into target
        selection tools
     − Domain predictions
     − New fold predictions
     − Disorder predictions
     − Gene chip results

  Analyzing successes and failures to improve pipeline
        throughput
            Bioinformatic needs
• Accurate data defining the existence of ORFs and
  their sequences

• Protein-protein and/or protein-ligand data

• Protein domain definitions

• Data on techniques for promoting protein
  solubility and folding
 Future goals from our collaboration with
               Michal Linial

• Develop improved methods for predicting new
  folds

• Integrate the Arabidopsis proteome with TrEMBL
  and SwissProt to generate a better clustering of
  Arabidopsis proteins within protein
  superfamilies

• Carry out a new mapping on the basis of
  predicted protein domains

• Include known functional annotation
 Future goals from our collaboration with
               Keith Dunker

• Further improvements to disorder
  predictors

• Investigate predictions for molecular
  recognition sites

• Attempt to develop predictors for protein
  solubility
      Our answers to specific questions

Question 3: What about a Pfam approach? Pfam5000?


Big5000 is a good start, but many functions likely will be
  in other clusters. Another 5000 may need to be traced
  to understand protein function.


In Arabidopsis
2,194 PfamA (Steve Brenner)
25-30% of these have average disorder of >35% (high
  probability of being unfolded by HSQC)
1,450 targets in CESG’s top 3 tiers cover 640 PfamA
Question 4: How to combine the goals of high coverage
  of protein space with the goals of coverage of
  biomedically relevant proteins or complete coverage
  of single genomes?


CESG’s goal has not been “complete coverage” of the
  Arabidopsis genome. Rather our goal has been to
  seek unique folds and to extend our knowledge of
  fold-function relationships. As this genome becomes
  more extensively annotated, this leverage will
  increase. Proteins of biomedical interest should be
  included as targets under separate “rules”.
Question 5: What about membrane proteins and protein
  complexes?


A certain effort should be devoted to these important
  problems, say 10 to 20% of an interested center’s
  resources. Certain PSI centers may have the
  technology and/or intellectual resources to attack
  these problems in an efficient manner.
Question 6: How to foster interactions with the larger
  biology community?


Hold workshops for information exchange. Make
  protocols and materials readily available. Provide
  services, such as protein production and
  structures-on-demand.
Question 9: Do we expect to reach high-throughput
  production of eukaryotic proteins?


This is the goal of our center (CESG). We have
  experienced a long lag in establishing a workable
  pipeline, but we are confident that a high-throughput
  mode is possible with eukaryotic proteins.


Domain analysis will be key to high throughput with
  eukaryotic proteins.
       CESG’s approach to domain analysis

1. Pfam


2. ProtoNet (collaboration with Michal Linial)


3. “Domain salvage” from proteins with domains of
   known structure (collaboration with Christine Orengo
   & Midwest Structural Genomics Consortium)
      In Arabidopsis:
          19,882 domains map to CATH/Pfam, but they are
          in proteins that also contain 12,110 fragments
          (> 50 residues) that do not map to CATH/Pfam
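
The fragment count above implies a step that finds stretches of a protein longer than 50 residues lying outside every mapped domain. The sketch below shows one way to do that. It is not the CESG or CATH procedure: the function name, the 1-based inclusive coordinates, and the example protein are assumptions, and only the 50-residue cutoff comes from the slide.

```python
def unmapped_fragments(length, domains, min_len=50):
    """Return (start, end) stretches longer than min_len not covered by any domain.

    length:  protein length in residues.
    domains: list of (start, end) positions of mapped domains, 1-based, inclusive.
    """
    fragments = []
    cursor = 1  # first residue not yet covered by a domain
    for start, end in sorted(domains):
        if start - cursor > min_len:  # gap runs from cursor to start - 1
            fragments.append((cursor, start - 1))
        cursor = max(cursor, end + 1)  # max() handles overlapping domains
    if length - cursor + 1 > min_len:  # tail after the last domain
        fragments.append((cursor, length))
    return fragments


# Hypothetical 400-residue protein with two mapped domains.
example = unmapped_fragments(400, [(61, 180), (200, 260)])
```

In the hypothetical example, residues 1 to 60 and 261 to 400 are reported, while the 19-residue linker between the two domains is too short to count.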

				