Lecture 9.2:
Homology and Structural Similarity
(What to do when you have no structure ...)

Boris Steipe
boris.steipe@utoronto.ca                               http://biochemistry.utoronto.ca/steipe

Departments of Biochemistry and Molecular and Medical Genetics
Program in Proteomics and Bioinformatics
University of Toronto

(This lecture is based in part on a lecture held by Chris Hogue, Toronto, for CBW in 2002)




9.2                                                                                             1
  Concepts
1. Domains are folding units, functional units and units of
   inheritance.
2. Homologous domains have similar structure.
3. Structural similarity can be measured and similar
   domains can be retrieved from databases.
4. Detection of similar folds can provide mechanistic
   explanations.
5. Threading methods can sometimes find similar folds.
6. Ab initio predictions of structure are highly experimental.



Concept 1:

Domains are folding units, functional units, and units of inheritance.

Domains as units of inheritance - the PH domain story


Dotlet - a dotplot of pleckstrin (p47) reveals similarity between the N- and C-terminus!
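The principle of a dotplot can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: it marks identical residues only, whereas real dotplot programs typically score sliding windows with a substitution matrix. The function name `dotplot` and the toy sequence are hypothetical and are not taken from the lecture.

```python
def dotplot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per residue of seq_a; '*' marks identical residues."""
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


# Toy sequence with an internal repeat (hypothetical, not the real p47 sequence).
toy = "WKPMWVAAAWKPMWV"
for row in dotplot(toy, toy):
    print(row)
```

When a sequence is plotted against itself, an internal repeat appears as a diagonal running parallel to the main diagonal, which is the kind of signal the pleckstrin plot shows.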




                #   Matrix: EBLOSUM62
                #   Gap_penalty: 10.0
                #   Extend_penalty: 0.5
                #
                #   Length: 100
                #   Identity:      31/100 (31.0%)
                #   Similarity:    48/100 (48.0%)
                #   Gaps:           6/100 ( 6.0%)

                      6 IREGYLVKKGSVFNTWKPMWVVLLEDG--IEFYKKKSDNSPKGMIPLKGS    53
                        |::|.|:|:|.....||....:|.||. :.:|.......|.|.|.|:|.
                    245 IKQGCLLKQGHRRKNWKVRKFILREDPAYLHYYDPAGAEDPLGAIHLRGC   294

                     54 TLTSPCQDFGKRMF----VFKITTTKQQDHFFQAAFLEERDAWVRDINKA    99
                        .:||...:...|..    :|:|.|..:..:|.|||..:||..|::.|..|
                    295 VVTSVESNSNGRKSEEENLFEIITADEVHYFLQAATPKERTEWIKAIQMA   344



EMBOSS - optimal sequence alignment: 31% identity over ~100 amino acids.
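The identity and gap figures in such a report are simple column counts. The sketch below counts them for the first block of the alignment shown above; the function name `alignment_stats` is an assumption for illustration, and the similarity figure is left out because it would need the BLOSUM62 matrix.

```python
def alignment_stats(top: str, bottom: str) -> tuple[int, int, int]:
    """Return (identical columns, gap columns, alignment length)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    gaps = sum(1 for a, b in zip(top, bottom) if a == "-" or b == "-")
    return identical, gaps, len(top)


# First block of the alignment above (residues 6-53 against 245-294).
top = "IREGYLVKKGSVFNTWKPMWVVLLEDG--IEFYKKKSDNSPKGMIPLKGS"
bottom = "IKQGCLLKQGHRRKNWKVRKFILREDPAYLHYYDPAGAEDPLGAIHLRGC"
identical, gaps, length = alignment_stats(top, bottom)
print(f"{identical}/{length} identical ({100 * identical / length:.1f}%), {gaps} gap columns")
```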

[Figure: two copies of human p47 (N- to C-terminus), offset against each other to show the overlapping alignments]

Overlapping alignments may define domain boundaries! We can search a database with this knowledge ...

[Figure: database hits aligned along human p47 (N- to C-terminus); 486 hits ... etc.]

Hits are smoothly bounded and extend over the entire domain.



In contrast ...

[Figure: database hits aligned along human p47 (N- to C-terminus); yeast only, for clarity]

Hits extend over the entire domain. PSI-BLAST would be difficult ...




Concept 2:

Homologous domains have similar structure.

Homologous domains have similar structures




1PLS/2DYN: 23% ID
1PLS - PH domain (Human pleckstrin)
2DYN - PH domain (Human dynamin)



Homology and Structural Similarity

Proteins that diverge in evolution maintain their global fold!

               Russell et al. (1997) J Mol Biol 269: 423-439


Concept 3:

Structural similarity can be measured and similar domains can be retrieved from databases.

RMSD metric

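a_i and b_i are corresponding points of the coordinate sets A and B; d_{a_i,b_i} is the distance between them.

$$\mathrm{RMSD}_{coord}(A, B) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_{a_i,b_i}^{2}}$$

To calculate the RMSD, a pairwise correspondence of points has to be defined first.

RMSDopt

$$\mathrm{RMSD}_{opt} = \min(\mathrm{RMSD}_{coord})$$
$$\mathrm{RMSD}_{opt} = \mathrm{RMSD}_{coord}(A, R_s \times (B - T_s))$$

The translation vector $T_s$ and the rotation matrix $R_s$ define a superposition of the vector set B on A.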



An analytic solution of the superposition problem is available, but
not straightforward (involves an eigenvalue problem).
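Both quantities can be sketched in plain Python. The block below is one possible route and not necessarily the one the lecture has in mind: it uses the quaternion formulation of the superposition problem, in which the minimal RMSD follows from the largest eigenvalue of a symmetric 4x4 matrix, and it finds that eigenvalue by simple power iteration for brevity. The names `rmsd_coord`, `centered` and `rmsd_opt` and the test coordinates are assumptions for illustration.

```python
import math

Point = tuple[float, float, float]


def rmsd_coord(a: list[Point], b: list[Point]) -> float:
    """RMSD over corresponding points a[i], b[i], without any superposition."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def centered(points: list[Point]) -> list[Point]:
    """Translate a point set so that its centroid lies at the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [(p[0] - c[0], p[1] - c[1], p[2] - c[2]) for p in points]


def rmsd_opt(a: list[Point], b: list[Point], iterations: int = 500) -> float:
    """Minimal RMSD over all rigid-body superpositions of b on a."""
    a, b = centered(a), centered(b)
    # Correlation matrix s[j][k] = sum_i b_i[j] * a_i[k]
    s = [[sum(q[j] * p[k] for p, q in zip(a, b)) for k in range(3)] for j in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    n_mat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    ga = sum(x * x for p in a for x in p)
    gb = sum(x * x for p in b for x in p)
    shift = (ga + gb) / 2  # shifted matrix has no negative eigenvalues
    v = [1.0, 0.3, 0.2, 0.1]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(n_mat[r][c] * v[c] for c in range(4)) + shift * v[r] for r in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient: estimate of the largest eigenvalue of n_mat
    vv = sum(x * x for x in v)
    lam = sum(v[r] * sum(n_mat[r][c] * v[c] for c in range(4)) for r in range(4)) / vv
    return math.sqrt(max(0.0, (ga + gb - 2 * lam) / len(a)))


# Hypothetical coordinates: B is A rotated by 40 degrees about z and shifted.
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
t = math.radians(40.0)
B = [(x * math.cos(t) - y * math.sin(t) + 5.0,
      x * math.sin(t) + y * math.cos(t) - 3.0,
      z + 2.0) for x, y, z in A]
print(rmsd_coord(A, B), rmsd_opt(A, B))
```

Because B is an exact rigid-body copy of A, the coordinate RMSD is large while the optimal RMSD is close to zero. Power iteration converges slowly when the two largest eigenvalues are nearly equal, so a production implementation would use a proper eigenvalue solver.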

Superposition in practice
Prealigned structures
      • VAST    (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)
      • FSSP    (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
      • Homstrad     (http://www-cryst.bioc.cam.ac.uk/~homstrad/)
                            60        70        80        90        100
        1dro    ( 32    )   wdkVyMaAkAG-------rIsFykd-qkgyk----------snpelTfrg
        1btn    ( 23    )   whnVyCvin-------nqeMgFykd-aksaa----------sg--ipYh
        s1pls   ( 21    )   wkpmwVVLle-------dgIeFykk-ksdn---------------spk--
        1fgya   ( 281   )   wkrrwFiLTd-------ncLyYFey-ttdk---------------epr--
        1faoa   ( 181   )   wktrwFtLhr-------neLkYfkd-qmsp---------------epi--
        1qqga   ( 25    )   mhkrFFVLraaseaggparLEyYen-ekkwr----------hkssapk--
        1bak    ( 576   )   wqrryFyLfp-------nrlewrge----------------geap-----
        1dyna   ( 30    )   skeYwFvLta-------enLsWykd-deek---------------ekk--
        1dbha   ( 456   )   kherhIFLFd--------gLICCksnhgqprl--------pgasnaeyrL
        1b55a   ( 25    )   fkkrlFlLtv-------hkLsYyeydfe--r----------grrgskk--
        1mai    ( 37    )   rreRfYkLqe-----dcktIwqesr-kv-----------------mrspe
        1fhoa   ( 25    )   pKlRyVfLfr-------nkimFtEqd---ast--------s---ppsyth
        1foea   (1288   )   ePeLaAfVFk-------tAVVLVykdgskqkkklvgshrlsiyeewdpfr
                               bbbbbb         bbbbb




 Superposition in practice
 Web services
       • VAST      (http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml)

       • CE        (http://cl.sdsc.edu/ce.html)
       • LGA       (http://predictioncenter.llnl.gov/local/lga/lga.html)

       • Prosup (http://lore.came.sbg.ac.at:8080/CAME/CAME_EXTERN/PROSUP/)

          (Note: Click on "Rasmol" on the results page to return the alignment)




Usability and reliability of these services are variable. "Intelligent" algorithms
can superimpose without the need for user definition of correspondence.
The downside is that the user cannot define correspondences.

Superposition in practice - locally installed

Many molecular modeling programs have superposition features:

DeepView   (http://ca.expasy.org/spdbv/)

MolMol     (http://www.mol.biol.ethz.ch/wuthrich/software/molmol/)

O          (http://alpha2.bmc.uu.se/~alwyn/o_related.html)

WhatIf     (http://www.cmbi.kun.nl/whatif/)


When is RMSD misleading?

Rigid body movement of domains or subdomains ...

Internal coordinates as an alternative to superposition


[Figure: points a, b, c of one structure and the corresponding points a', b', c' of the other; pairs (a,a') (b,b') (c,c')]
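One common choice of internal coordinates is the set of distances within each structure. The sketch below compares two structures by the differences between their intramolecular distances, so no superposition is needed. This is an illustrative assumption about what "internal coordinates" can mean here, not a description of any particular server; the name `drmsd` and the coordinates are hypothetical.

```python
import math
from itertools import combinations


def drmsd(a, b) -> float:
    """Root-mean-square difference between all intramolecular distances."""
    pairs = list(combinations(range(len(a)), 2))
    total = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


# Hypothetical three-point structures.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = [(x + 10.0, y + 10.0, z + 10.0) for x, y, z in original]  # rigid translation
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # real change
print(drmsd(original, moved), drmsd(original, stretched))
```

A rigid-body move of the whole structure leaves all internal distances unchanged, so the measure responds only to genuine changes in shape.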




VAST - Database searches at MMDB




DALI ...




... and FSSP

The prealigned fold-tree




Workflow: MMDB ...
Open http://www.ncbi.nlm.nih.gov/
enter your search term ...




Workflow: MMDB ...
Choose "Structure" ...




Workflow: MMDB ...
Choose your protein of interest ...




... structure summary ...




... access domains similar to SH3 ...




... select, download ...




... display.




Concept 4:

Detection of similar folds can provide mechanistic explanations.

Protein Modules
Modular interactions between biomolecules are responsible for the inner workings of the cell.

There are far more modular interacting proteins than classical enzymes in the human genome - we have known this since S. cerevisiae.

(Pawson & Lin)

Protein Domains - an alphabet of functional modules


      14-3-3   ANK3   ARM     BH1        C1          C2      CARD




      Death    DED    EFH     EH         EVH        FYVE      PDZ




       PH       PTB     SAM        SH2        SH3     WD40    WW


Workflow for domain architectures

Starting from a citation ...




 ... access sequence ...




 ... display sequence ...




... link to domain architecture ...

(from CDD database - incl. SMART and Pfam)

... show domain relatives ...




 ... access domain information ...




 ... in CDD ...




 ... visualize in Cn3D.




Protein structure prediction

What to do when no structure is known and no homologues are found?

Three Paths to Protein Structure Prediction

 • Homology Modeling
 • Threading (Fold recognition)
 • Ab initio prediction



Concept 5:

Threading methods can sometimes find similar folds.

Fold recognition ("Threading")

[Figure: the Query Sequence placed at successive positions along the Template Structure]


Threading Database Search
• Premise is that most sequences match some 3-D
  structure that is already known (1/2)
• Given a database of known 3-D protein folds:
      • align the test sequence to each known protein
      • in real 3-D coordinate space (slow but exact)
      • in parameterized 1-D space (fast but approximate)
• optimize some scoring function
• sort out best sequence-structure alignment
• assess alignments - statistically significant?


Threading Statistics
• Z score (sequence composition correction)
      • number of standard deviations the found alignment is off from
        the mode of a randomized version of the structure or profile
• P value (sequence length correction)
      • Shuffle the sequence - make a distribution of random threads…
      • Is the unscrambled thread any better than a randomly
        optimized sequence…
      • Z score of Z scores
• Look for P values as a criterion for choosing a
  threading method...
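The shuffling idea behind these statistics can be sketched as follows. This is a minimal illustration under stated assumptions: the background distribution is summarized by its mean rather than its mode, the scoring function is a toy, and the names `z_score`, `toy_score` and `BURIED` are hypothetical. It does not reproduce the statistics of any particular threading program.

```python
import random
import statistics


def z_score(sequence: str, score, shuffles: int = 200, seed: int = 0) -> float:
    """Z score of score(sequence) against the scores of shuffled sequences."""
    rng = random.Random(seed)
    residues = list(sequence)
    background = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # keeps composition, destroys order
        background.append(score("".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.stdev(background)
    return (score(sequence) - mean) / sd if sd > 0 else 0.0


# Assumed buried positions of a template, for illustration only.
BURIED = {1, 2, 5, 6, 9}


def toy_score(seq: str) -> int:
    """Toy threading score: hydrophobic residues at buried template positions."""
    return sum(1 for i, aa in enumerate(seq) if i in BURIED and aa in "AVLIMFW")


query = "KLLKKVVKKIKK"
print(toy_score(query), z_score(query, toy_score))
```

Shuffling preserves the amino acid composition of the query, which is why a Z score of this kind corrects for composition.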


Database Searching...
• Sensitivity
      • High sensitivity implies finding all possible
        true positive matches in the database
• Specificity
      • High specificity implies finding no false
        positive matches in the search.
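The two terms can be put into numbers for a single search. In the sketch below, sensitivity is the fraction of true matches that were found; the "no false positives" criterion is expressed as the fraction of reported hits that are correct, which is usually called precision. The function name `search_quality` and the fold identifiers are hypothetical.

```python
def search_quality(hits: set[str], true_matches: set[str]) -> tuple[float, float]:
    """Return (sensitivity, precision) of a database search."""
    true_positives = len(hits & true_matches)
    sensitivity = true_positives / len(true_matches) if true_matches else 0.0
    precision = true_positives / len(hits) if hits else 0.0
    return sensitivity, precision


# Hypothetical fold identifiers, for illustration only.
hits = {"fold_01", "fold_02", "fold_03", "fold_04"}
true_matches = {"fold_01", "fold_09"}
print(search_quality(hits, true_matches))
```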



Threading as a Database Search Method

• Has INCREDIBLY poor sensitivity
      • 10-20% on a good day

• Has INCREDIBLY poor specificity.
      • 90% of hits are false positives

• So...

Interpret Threading Accordingly...
• In a ranked list of 10 matches, expect
  that only one might be correct
• Expect that none may be correct
• Expect that the top ranked hit is a false
  positive...



How then does Threading find things?
• If there is a true positive in a threading
  search hit list - People find it ...
• It is most often found by FUNCTIONAL
  similarity.
      • Similar enzymatic mechanisms
      • Motifs, DART ...
      • Similar roles, cellular distributions ...

Concept 6:

Ab initio predictions of structure are highly experimental.

Protein structure prediction is easy:

  The assumption:
  Native structure is a global energy minimum

  The algorithm:
  1. Reasonably generate all conformations
  2. Score with an appropriate scoring function
  3. Choose the one with best score

             reasonable:    search finishes in reasonable time
             appropriate:   monotonic with q (or at least ΔG),
                                 useful radius of convergence
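Taken literally, the three steps above amount to an exhaustive search. The sketch below applies them to a deliberately tiny toy problem; the three states, the scoring function and the names `STATES`, `toy_score` and `predict` are all hypothetical and stand in for a real conformation generator and a real energy function.

```python
from itertools import product

STATES = "HEC"  # three hypothetical per-residue states: helix, strand, coil


def toy_score(conformation: tuple[str, ...]) -> int:
    """Hypothetical score (lower is better): penalize every change of state."""
    return sum(1 for a, b in zip(conformation, conformation[1:]) if a != b)


def predict(n_residues: int) -> tuple[str, ...]:
    """Generate all conformations, score each one, return the best-scoring one."""
    return min(product(STATES, repeat=n_residues), key=toy_score)


print(predict(4))
```

The search is trivial for four residues, but the number of conformations triples with every added residue, which is why "reasonable" generation is the hard part.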

Why is structure prediction hard ?



• Appropriate scoring functions
• Reasonable structure generation
• Working approaches


Protein structure scoring functions

• Molecular Mechanics
• Empirical (Statistical)
• Combinations

The scoring function is the single most important component of any optimization!

Molecular Mechanics: bonds, angles, dihedrals, Van der Waals, Coulomb

Empirical (Statistical):

$$E_i = -kT \ln f_i - kT \ln Z$$

$E_i$: energy of state i; $f_i$: frequency; $Z$: partition function.

$$\Delta E_{ab}^{x} = -kT \ln \frac{f_{ab}^{x}}{\sum_{ab} f_{ab}^{x}}$$

$\Delta E_{ab}^{x}$: potential energy between a,b at separation x; numerator: frequency of observation of a,b at separation x; denominator: all observations of a,b.
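The inverse-Boltzmann step can be sketched numerically. The block below converts raw counts into an energy of the form -kT ln(f_obs / f_ref); which reference frequency to use is left to the caller, since that choice differs between published potentials. The function name `delta_e`, the value of kT and the counts are assumptions for illustration.

```python
import math

K_T = 0.593  # kcal/mol, approximate value near room temperature (assumed)


def delta_e(observed: int, total_observed: int, reference: int, total_reference: int) -> float:
    """Energy -kT ln(f_obs / f_ref) from raw counts; assumes non-zero counts."""
    f_obs = observed / total_observed
    f_ref = reference / total_reference
    return -K_T * math.log(f_obs / f_ref)


# Hypothetical counts: the pair is seen twice as often at this separation
# as the reference state would suggest.
print(delta_e(20, 100, 10, 100))
```

A pair that is observed more often than the reference predicts receives a negative (favourable) energy, and one that is observed less often receives a positive energy.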



Combinations: Usually combine potential energy and empirical solvation terms


Combinatorially large search spaces make enumeration impossible.

Consider:

100 residues
3 states:
$3^{100} \approx 5 \times 10^{47}$ conformations
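The size of this number is easy to check, and it can be turned into a rough time estimate. The evaluation rate used below is a purely hypothetical assumption chosen to be generous; only the count itself follows from the figures above.

```python
import math

residues, states = 100, 3
conformations = states ** residues  # exact integer arithmetic
print(f"{conformations:.2e} conformations")

rate = 1e12  # assumed: conformations evaluated per second (hypothetical)
years = conformations / rate / (3600 * 24 * 365)
print(f"about 10^{math.floor(math.log10(years))} years at {rate:.0e} per second")
```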
A Blind Golfer's view of global optimization: I




How do you hit a hole-in-one,
when you can't even see the hole ?

How do you hit 18 holes-in-one in a row ?

A Blind Golfer's view of global optimization: II




Change the shape of the golf course !


An analysis of why the Blind Golfer's strategy works



[Figure: search landscape with labels a (position) and b (energy)]




Local improvements in position (a) lead to
incremental improvements in energy (b) !!!

How does nature fold proteins ?
The funnel model reconciles the thermodynamic and the kinetic view !


[Figure: folding landscapes, axes q and ΔG]

In a flat folding landscape, a thermodynamic minimum is kinetically inaccessible.

But ...

An ideal funnel results in fast, two-state folding through many possible pathways.


Dill KA & Chan HS (1997) From Levinthal to pathways to funnels. Nature Struct Biol 4:10-19


How does nature fold proteins?

Real folding landscapes appear to be more complex - robust folding is possible, but so are populated intermediate states and kinetic traps.

What does this mean for promising computational strategies?
To the degree that folding is under thermodynamic control: direct inference of structure is possible.
To the degree that folding is under kinetic control: simulation of the folding pathway is required.

Dill KA & Chan HS (1997) From Levinthal to pathways to funnels. Nature Struct Biol 4:10-19


How to solve hard problems
 Simplification
 Brute force
 Branch and bound
 Heuristics
 • Local optimization
 • Simulated annealing
 • Genetic algorithms
 • Neural networks
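Of the heuristics listed, simulated annealing is short enough to sketch. The block below minimizes a one-dimensional toy landscape with Metropolis moves and a geometric cooling schedule. The landscape `rugged_funnel`, the function name `anneal` and all parameter values are hypothetical choices for illustration, not settings from any structure prediction program.

```python
import math
import random


def anneal(energy, start: float, steps: int = 20000, t_start: float = 2.0,
           t_end: float = 0.01, step_size: float = 0.5, seed: int = 0) -> float:
    """Minimize energy(x) with Metropolis moves and geometric cooling."""
    rng = random.Random(seed)
    x = best = start
    cooling = (t_end / t_start) ** (1.0 / steps)
    temperature = t_start
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        delta = energy(candidate) - energy(x)
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
            if energy(x) < energy(best):
                best = x
        temperature *= cooling
    return best


def rugged_funnel(x: float) -> float:
    """Hypothetical 1-D landscape: a broad funnel with many local traps."""
    return 0.1 * x * x + math.sin(5.0 * x)


best = anneal(rugged_funnel, start=8.0)
print(best, rugged_funnel(best))
```

At high temperature the search crosses the barriers between traps; as the temperature drops it behaves more and more like plain local optimization.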

Is structure prediction NP hard?

Not necessarily - nature does it in P.

A problem that is NP-hard in principle can be P in practice.
This is the significance of the protein folding funnel.
Search for local solutions - subproblems!

Ab initio prediction
Isites: Sequence-structure motifs
HMMSTR: Hidden Markov Model 2°-structure prediction
Rosetta: Monte Carlo fragment move based structure generation, Bayesian conditional probability scoring function

Bystroff, C. & Shao, Y. (2002) Fully automated protein structure prediction using ISITES, HMMSTR and ROSETTA. Bioinformatics 18 S1: S54-S61


 Ab initio prediction
  What can you expect ?

  ~ 50 % residues < 6Å RMSD
  ~ 20% of proteins globally topologically correct
  ~ 60 % of proteins with partially topologically correct substructures




[Figure: three example predictions, each labelled RMSD = 5.9Å]




Bystroff, C. & Shao, Y. (2002) Fully automated protein structure prediction using ISITES,
HMMSTR and ROSETTA. Bioinformatics 18 S1: S54-S61

An ab initio Prediction server on the WWW

http://robetta.bakerlab.org




Open Issues


• Scoring functions:
      radius of convergence ...
• Workflow:
      what will you do with the results ?
