# Protein Homology Modelling

```
Protein Homology Modelling
Thomas Blicher
Center for Biological Sequence Analysis

CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS TECHNICAL UNIVERSITY OF DENMARK DTU
Learning Objectives
After this lesson you should be able to:
– Explain the individual steps involved in
calculating a protein homology model.
– Identify suitable templates for modelling.
– Outline the principles behind ab initio protein
structure prediction.
– Describe the differences between homology
modelling and ab initio structure prediction.
– Describe the major pitfalls in protein modelling.

Outline
• Protein homology modelling
– Individual steps
– Caveats
– Pitfalls

• Ab initio protein structure prediction
– True ab initio methods

Why Do We Need Homology Modelling?

• Ab initio protein folding ("random" sampling):
– 100 aa, 3 conf./residue gives approximately 10^48 different overall conformations!

• Random sampling is NOT feasible, even if conformations can be sampled at picosecond (10^-12 s) rates.

How Is It Possible?
• The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough):
– prions
– pH, ions, cofactors, chaperones

• Structure is conserved much longer than sequence in evolution.
– Structure > Function >> Sequence

How Often Can We Do It?
• There are currently ~47000 structures in the PDB (but only ~4000 if you count only those that are at most 30% identical to each other and have a resolution better than 3.0 Å).

• An estimated 25% of all sequences can be modeled, and structural information can be obtained for ~50%.

Worldwide Structural Genomics
• Complete genomes
• Signaling proteins
• Disease-causing organisms
• Model organisms
• Membrane proteins
• Protein-ligand interactions

"Fold space coverage"

Structural Genomics in North America
• 10-year $600 million project initiated in 2000, funded largely by NIH.
• AIM: structural information on 10000 unique proteins (now 4000-6000); so far 1000 have been determined.
• Improve current techniques to reduce time (from months to days) and cost (from $100,000 to $20,000 per structure).
• 9 research centers currently funded (2005); targets are from model and disease-causing organisms (a separate project covers TB proteins).

Homology Modeling for Structural Genomics

Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)

How Well Can We Do It?

Sali, A. & Kuriyan, J. Trends
Biochem. Sci. 22, M20–M24 (1999)

How Is It Done?
• Identify template(s) – initial alignment
• Improve alignment
• Backbone generation
• Loop modelling
• Side chains
• Refinement
• Validation

Template Identification
• Search with sequence
– BLAST
– PSI-BLAST
– Fold recognition methods

• Use biological information

• Functional annotation in databases

• Active site/motifs

Alignment

1   2   3   4   5   6   7   8   9   10  11  12  13  14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS

    F   D   I   C   R   L   P   G   S   A   E   A   V   C
F   6  -2   0  -3  -2   2  -2  -3  -1  -2  -3  -2   0  -3
N  -3   2  -2  -2   0  -2  -2   0   2   0   1   0  -2  -2
V   0  -2   2  -2  -1   2  -1  -1  -1   0  -1   0   5  -2
C  -3  -2  -2   8  -2  -3  -3  -2  -1  -2  -3  -2  -2   8
R  -2  -2  -2  -2   5  -1   0   0   1  -1   0  -1  -1  -2
T  -2   0   0  -1   0   0   0  -1   2   0   1   0   0  -1
P  -2   0  -2  -3   0  -2   8   0   0   1   1   1  -1  -3
E  -3   2  -2  -3   0  -2   1   0   1   1   5   1  -1  -3
A  -2   0  -1  -2  -1  -1   1   0   1   5   1   5   0  -2
I   0  -3   5  -2  -2   2  -2  -2  -1  -1  -2  -1   2  -2
C  -3  -2  -2   8  -2  -3  -3  -2  -1  -2  -3  -2  -2   8

Improving the Alignment
1   2   3   4   5   6   7   8   9   10  11  12  13  14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS

From "Professional Gambling" by Gert Vriend
http://www.cmbi.kun.nl/gv/articles/text/gambling.html

Template Quality
• Selecting the best template is crucial!
• The best template may not be the one with the highest % id (best p-value…)
– Template 1: 93% id, 3.5 Å resolution
– Template 2: 90% id, 1.5 Å resolution

The Importance of Resolution

[Figure: resolution scale from 4 Å (low) through 3 Å and 2 Å to 1 Å (high)]

Evaluation of NMR Structures

Which regions of the structure are best defined?

Look at the PDB ensembles to see which regions are well-defined.

1RJH
Nielbo et al., Biochemistry, 2003

Ramachandran Plot
• Allowed backbone torsion angles in proteins

[Figure: amino acid residue, with backbone N and H atoms labelled]

Template Quality – Ramachandran Plot

[Figure, left panel: X-ray structure – good data. Right panel: NMR structure – low quality data…]

Backbone Generation

• Generate the backbone coordinates from the template for the aligned regions.

• Several programs can do this; most of the groups at CASP6 used Modeller:

http://salilab.org/modeller/modeller.html

Loop Modelling
• Knowledge based:
– Searches the PDB for fragments that match the sequence to be modelled (Levitt, Holm, Baker etc.).

• Energy based:
– Uses an energy function to evaluate the quality of the loop and minimizes this function by Monte Carlo (sampling) or molecular dynamics (MD) techniques.

• Combination

Loops – the Rosetta Method

• Find fragments (10 per amino acid) with the same sequence and secondary structure profile as the query sequence.

• Combine them using a Monte Carlo scheme to build the loop.

David Baker et al.

Side Chains

• If the sequence identity is high, the networks of side chain contacts may be conserved, and keeping the side chain rotamers from the template may be better than predicting new ones.

Predicting Side Chain Conformations
• Side chain rotamers are dependent on backbone conformation.

• Most successful method in CASP6 was SCWRL by Dunbrack et al.:
– Graph-theory, knowledge-based method to solve the combinatorial problem of side chain modelling.

http://dunbrack.fccc.edu/SCWRL3.php

Side Chains - Accuracy
• Prediction accuracy is high for buried residues, but much lower for surface residues.
– Experimental reasons: side chains at the surface are more flexible.
– Theoretical reasons: it is much easier to handle hydrophobic packing in the core than the electrostatic interactions, including H-bonds to waters.

Refinement
• Energy minimization
• Molecular dynamics

– Big errors like atom clashes can be removed, but force fields are not perfect and small errors will also be introduced – keep minimization to a minimum or matters will only get worse.

Error Recovery
• If errors are introduced in the model, they normally can NOT be recovered at a later step.
– The alignment cannot make up for a bad choice of template.
– Loop modeling cannot make up for a poor alignment.
• If errors are discovered, the step where they were introduced should be redone.

Validation
• Most programs will get the bond lengths and angles right.
• The Ramachandran plot of the model usually looks pretty much like the Ramachandran plot of the template (so select a high quality template).
• Inside/outside distributions of polar and apolar residues can be useful.
• Biological/biochemical data
– Active site residues
– Modification sites
– Interaction sites

Validation – ProQ Server
• ProQ is a neural-network-based predictor that estimates the quality of a protein model from a number of structural features.

• ProQ is optimized to find correct models, in contrast to other methods, which are optimized to find native structures.

Arne Elofsson's group: http://www.sbc.su.se/~bjorn/ProQ/

Structure Validation
• ProCheck
http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

• WhatIf server
http://swift.cmbi.kun.nl/WIWWWI/

Homology Modelling Servers

• Eva-CM performs continuous and automated analysis of comparative protein structure modeling servers.
• A current list of the best performing servers can be found at:

http://cubic.bioc.columbia.edu/eva/doc/intro_cm.html

Summary – Homology Modelling

• Successful homology modelling depends on the following:
– Template quality
– Modelling program/procedure (use more than one)

• Always validate your final model!

Fold Recognition and Ab Initio Protein Structure Prediction

Outline
• Ab initio methods
• Human intervention (what kind of knowledge can be used for alignment and selection of templates?)

• Meta-servers (the principle, 3D-Jury)
• Summary of take-home messages

• Compares a given sequence against known structures (folds).
• By using potentials that describe tendencies observed in known protein structures.

Example: Pair potentials
How normal is it to observe a pair of an alanine and a valine separated by 20 residues in the sequence and 3 Å in space? (X)
How normal is it to observe any pair of residues separated by 20 residues and 3 Å in space? (Y)
Potential: log (X/Y)

Potentials of Mean Force
Alignment score from structural fitness (pair potential)
Deletions
[Figure: schematic structure with positions numbered 1 to 10 and the threaded sequence .. A T N L Y K E T L ..]
How well does K fit the environment at P6? If P8 is acidic then fine; if P8 is basic then poor.

• Problem: No protein is average.
• Interactions in proteins cannot be described by pairs of amino acids alone.
• The information in the potentials is partly captured with sequence profiles.
• Today mostly used in HYBRID approaches in combination with profile-profile based methods.
• Potentials can be used to score models based on different templates or alignments.

Ab Initio Methods

• Aim is to find the fold of the native protein by simulating the biological process of protein folding.

• A VERY DIFFICULT task because a protein chain can fold into millions of different conformations.

• Use it only when no detectable homologues are available.

• Methods can also be useful for fold recognition in cases of extremely low homology (e.g. convergent evolution).

Fragment-based Ab Initio Modelling

• Rosetta method of the Baker group:
– Submit sequence to a number of secondary structure predictors.
– Compare fragments of 3 and 9 residues to a library from known structures.
– Use energy minimization techniques (Monte Carlo optimization) to calculate tertiary structure.

Potentials for Finding Good Models

• Use of energy potentials for scoring and computing models.
• Potentials should make models more "native-like".
• These can be based on contact potentials, solvation potentials, Van der Waals repulsion and attractive forces, hydrogen bond potentials.
• Globularity/radius of gyration (ab initio).

Problems with Empirical Potentials

Fragments with
correct local structure

Human Intervention

• The best methods use maximum knowledge of query proteins:
– Knowledge of function
– Cysteines forming disulfide bridges or binding e.g. zinc ions
– Proteolytic cleavage sites
– Other metal binding residues
– Antibody epitopes or escape mutants
– Ligand binding
– Results from CD or fluorescence experiments
– Knowledge of secondary structure

• Specialists can help to find a correct template and correct alignments.

Meta-Servers
• Democratic modeling
– The highest-scoring hit is often wrong.
– Many prediction methods have the correct fold among the top 10-20 hits.
– If many different prediction methods all have the same fold among the top hits, this fold is probably correct.

Example of a Meta-Server
• 3DJury http://bioinfo.pl/meta/

– Inspired by ab initio modeling methods
  • Average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure
– Find most abundant high scoring model in a list of predictions from several predictors
  1. Use output from a set of servers
  2. Superimpose all pairs of structures
  3. Similarity score based on # of Cα pairs within 3.5 Å
– Similar methods developed by A. Elofsson (Pcons) and D. Fischer (3D shotgun).

3DJury
• Because it is a meta-server it can be slow.
• If the queue is too long some servers are skipped.
• Output is only Cα coordinates.
• What to do with the rest of the structure?
• Use e.g. the MaxSprout server to build side chains and backbone atoms.
http://www.ebi.ac.uk/maxsprout/

Summary – Ab Initio Methods

• Hybrid methods using both threading methods and profile-profile alignments are the best.

• Use ab initio methods only if necessary, and be aware that the quality is really low!

• Try to use as much knowledge as possible for alignment and template selection in difficult cases.

• Use meta-servers when you can.

```
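
The estimate on the slide "Why Do We Need Homology Modelling?" (100 residues, 3 conformations per residue, picosecond sampling) can be checked with a few lines of arithmetic. The sketch below only reproduces the slide's numbers; the function name and the seconds-per-year constant are illustrative choices, not part of the lecture.

```python
import math

def random_search_estimate(n_residues=100, conformations_per_residue=3,
                           samples_per_second=1e12):
    """Return (number of conformations, years needed to visit each one once)."""
    total = conformations_per_residue ** n_residues   # exact integer, 3**100
    seconds = total / samples_per_second              # one conformation per picosecond
    years = seconds / (365.25 * 24 * 3600)
    return total, years

total, years = random_search_estimate()
print(f"conformations: about 10^{math.log10(total):.1f}")
print(f"exhaustive sampling: about 10^{math.log10(years):.1f} years")
```

The count comes out just under 10^48, and the sampling time is many orders of magnitude longer than the age of the universe, which is the slide's reason for ruling out random search.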
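
The two alternative alignments on the "Alignment" slides can be compared by summing the substitution scores of the aligned residue pairs. The sketch below copies only the needed entries from the scoring matrix shown on those slides (keyed as query residue, template residue) and ignores gap penalties, on the assumption that both alignments carry the same three-residue gap. The function and variable names are hypothetical.

```python
# Scores copied from the slide's matrix: (query residue, template residue) -> score.
SCORES = {
    ("F", "F"): 6, ("N", "D"): 2, ("V", "I"): 2, ("C", "C"): 8,
    ("R", "R"): 5, ("T", "L"): 0, ("P", "P"): 8, ("T", "S"): 2,
    ("P", "A"): 1, ("E", "E"): 5, ("A", "A"): 5, ("I", "V"): 2,
}

TEMPLATE    = "FDICRLPGSAEAVC"
ALIGNMENT_1 = "FNVCRTP---EAIC"   # gap after THR PRO
ALIGNMENT_2 = "FNVCR---TPEAIC"   # gap before THR PRO

def alignment_score(query, template, scores):
    """Sum substitution scores over aligned (non-gap) positions."""
    return sum(scores[(q, t)] for q, t in zip(query, template) if q != "-")

print(alignment_score(ALIGNMENT_1, TEMPLATE, SCORES))
print(alignment_score(ALIGNMENT_2, TEMPLATE, SCORES))
```

With these values the first alignment scores higher on sequence alone. That does not by itself make it the better alignment for modelling, since gap placement also has to be structurally plausible.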
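
The pair-potential slide defines the score as log (X/Y), where X is how often a specific residue pair (alanine and valine) is seen at a given sequence separation and spatial distance, and Y is how often any residue pair is seen there. A minimal sketch follows, assuming X and Y are simple relative frequencies; the counts are made up for illustration and do not come from any real structure database.

```python
import math

def pair_potential(specific_hits, specific_pairs, any_hits, any_pairs):
    """log(X/Y) as on the slide.

    X = fraction of the specific pairs (e.g. Ala-Val, 20 residues apart)
        that are found within the distance cutoff.
    Y = the same fraction for all residue pairs at that separation.
    """
    x = specific_hits / specific_pairs
    y = any_hits / any_pairs
    return math.log(x / y)

# Hypothetical counts, for illustration only.
score = pair_potential(specific_hits=30, specific_pairs=1000,
                       any_hits=2000, any_pairs=100000)
print(f"log(X/Y) = {score:.3f}")
```

With this sign convention a positive value means the pair is seen more often than average at that geometry, and a negative value means it is seen less often. Some methods report the negated value as a pseudo-energy.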
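
The consensus idea on the slide "Example of a Meta-Server" (superimpose all pairs of models, count Cα pairs within 3.5 Å, prefer the model that agrees most with the others) can be sketched as below. This is a simplified illustration, not the 3D-Jury implementation: it assumes the models are already superimposed and have equal length, and the server names and coordinates are invented.

```python
import math

CUTOFF = 3.5  # Å, the distance threshold quoted on the slide

def pair_similarity(model_a, model_b, cutoff=CUTOFF):
    """Number of equivalent C-alpha atoms closer than the cutoff."""
    return sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)

def jury_scores(models):
    """Score each model by its summed similarity to every other model."""
    return {
        name: sum(pair_similarity(coords, other)
                  for other_name, other in models.items() if other_name != name)
        for name, coords in models.items()
    }

# Toy C-alpha traces (three residues each); all values are hypothetical.
models = {
    "server_A": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    "server_B": [(0.5, 0.0, 0.0), (4.0, 0.5, 0.0), (7.9, 0.0, 1.0)],
    "server_C": [(0.0, 0.0, 0.0), (0.0, 3.8, 0.0), (0.0, 7.6, 0.0)],
    "server_D": [(0.0, 0.4, 0.0), (3.7, 0.0, 0.3), (7.6, 0.0, -2.6)],
}

scores = jury_scores(models)
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The outlier model (here `server_C`) collects the fewest votes, while the model closest to the majority collects the most.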