					                    G53BIO – Bioinformatics
              http://www.cs.nott.ac.uk/~jqb/G53BIO

                                Protein Structure Prediction
                   Dr. Jaume Bacardit – jqb@cs.nott.ac.uk
                 Prof. Natalio Krasnogor – nxk@cs.nott.ac.uk




Some material taken from Arthur Lesk, “Introduction to Bioinformatics”, 2nd edition, Oxford University
Press, 2005, and Anna Tramontano, “Introduction to Bioinformatics”
                   Outline
• Introduction and motivation
• PSP: A family of problems
• Prediction of structural aspects of protein
  residues
• Prediction of the 3D structure of proteins
• Assessment of PSP quality: CASP
• Summary
   Protein Structure: Introduction
• Proteins are molecules of primary importance for the
  functioning of life
   – Structural proteins (collagen, nails, hair, etc.)
   – Enzymes
   – Transmembrane proteins
• Proteins are polypeptide chains constructed by joining
  amino acids in a linear way
• The chain of amino acids, however, folds to create very
  complex 3D structures
• There is a general consensus that the end state of the
  folding process depends on the amino acid
  composition of the chain
             Motivation for PSP

• The function of a protein depends greatly on its
  structure
• The structure that a protein adopts is vital to its
  chemistry
• Its structure determines which of its amino acids
  are exposed to carry out the protein’s function
• Its structure also determines what substrates it
  can react with
• However, the structure of a protein is very
  difficult to determine experimentally and in
  some cases almost impossible
     Protein Structure Prediction
• That is why we have to predict it
• PSP aims to predict the 3D structure of a protein
  based on its primary sequence
                   Impact of PSP
• PSP is an open problem. The 3D structure
  depends on many variables
• It has been one of the main holy grails of
  computational biology for many decades
• The potential applications of better protein structure
  models are numerous
  –   Genetic therapy
  –   Synthesis of drugs for incurable diseases
  –   Improved crops
  –   Environmental remediation
           Prediction types of PSP
• There are several kinds of prediction problems within
  the scope of PSP
   – The main one, of course, is to predict the 3D coordinates of
     all atoms of a protein (or at least the backbone) based on
     its primary sequence
   – There are many structural properties of individual residues
     within a protein that can be predicted, for instance:
      • The secondary structure state of the residue
      • Whether a residue is buried in the core of the protein or exposed on the
        surface
   – Accurate predictions of these sub-problems can simplify
     the general 3D PSP problem
        Prediction types of PSP
• There is an important distinction between the
  two classes of prediction
• The 3D PSP is generally treated as an
  optimisation problem
• The prediction of structural aspects of protein
  residues is generally treated as a machine
  learning problem
                   Optimisation
• Given a problem for which you have a way of
  assessing how good each possible solution is
   – An evaluation function
• Optimisation is the process of finding the best
  possible solution
• Dynamic programming (as seen for sequence
  alignment) is an optimisation method
• Genetic Algorithms are another example of
  optimisation
• The key difference between them is how they
  explore the space of candidate solutions
           Machine Learning
• Machine learning: How to construct
  programs that automatically learn from
  experience [Mitchell 1997]
• ML is a Computer Science discipline, part of
  the Artificial Intelligence field
• Its goal is to construct automatically a
  description of some phenomenon given a set
  of data extracted from previous observations
  of the phenomenon because it would be
  beneficial to predict it in the future.
  Flow of data in machine learning
• Specifically, we are concerned with supervised
  learning, that is, learning when we know the solution
  for the training data

  Training Set -> Learning Method -> Theory
  Unknown instance -> Theory -> Class
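           Types of machine learning
    • Rule learning

  [Figure: instances plotted on the unit square (X and Y from 0 to 1); each rule assigns the class symbol of one region]

                           If (X<0.25 and Y>0.75) or
                             (X>0.75 and Y<0.25) then [class symbol]
                           If (X>0.75 and Y>0.75) then [class symbol]
                           If (X<0.25 and Y<0.25) then [class symbol]
                           Everything else             [class symbol]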
Other machine learning techniques
• Other methods that have also been used in
  PSP are
  – Artificial Neural Networks
  – Support Vector Machines
  – Hidden Markov Models
• If you are interested in the technology side of
  PSP a good book is “Bioinformatics: The
  Machine Learning Approach” by Baldi and
  Brunak
 Prediction of structural aspects of
          protein residues
• Many of these features are due to local interactions of an
  amino acid and its immediate neighbours
   – Can it be predicted using information from the closest
     neighbours in the chain?

             Ri-5    Ri-4    Ri-3    Ri-2    Ri-1   Ri     Ri+1    Ri+2    Ri+3    Ri+4    Ri+5
            SSi-5   SSi-4   SSi-3   SSi-2   SSi-1   SSi   SSi+1   SSi+2   SSi+3   SSi+4   SSi+5




                                Ri-1 Ri Ri+1 -> SSi
                                Ri Ri+1 Ri+2 -> SSi+1
                                Ri+1 Ri+2 Ri+3 -> SSi+2
   – In this simplified example, to predict the SS state of residue
     i we would use information from residues i-1, i and i+1.
     That is, a window of ±1 residues around the target
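The window construction above can be sketched in a few lines of Python. This is only an illustration: the function name `make_windows` and the padding symbol `X` used at the chain ends are assumptions, not part of the lecture material.

```python
def make_windows(residues, labels, half_width=1, pad="X"):
    """Build one (window, label) instance per residue.

    Each window holds the target residue plus half_width neighbours
    on each side. Positions beyond the chain ends are filled with
    a padding symbol (an assumption of this sketch).
    """
    padded = [pad] * half_width + list(residues) + [pad] * half_width
    instances = []
    for i, label in enumerate(labels):
        window = padded[i:i + 2 * half_width + 1]
        instances.append(("".join(window), label))
    return instances


# Toy chain of 5 residues with a made-up SS assignment
for window, ss in make_windows("MKVLA", "CHHHC"):
    print(window, "->", ss)
```

With `half_width=7` the same function would produce the 15-residue windows typically used for secondary structure prediction.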
What information do we include
       for each residue?
– Early prediction methods used just the primary
  sequence: the AA types of the residues in the
  window
– However, the primary sequence contains a limited
  amount of information
   • It does not contain any evolutionary information: it does not
     say which residues are conserved and which are not
– Where can we obtain this information?
   • Position-Specific Scoring Matrices, which are a product of a
     Multiple Sequence Alignment
     Position-Specific Scoring Matrices
                  (PSSM)
– For each residue in the query sequence compute
  the distribution of amino acids of the corresponding
  residues in all aligned sequences (discarding those
  too similar to the query)
– These distributions will tell us which mutations are
  likely and which mutations are less likely for each
  residue in the query sequence
– In essence, a PSSM is similar to a substitution matrix but
  tailored for the sequence that we are aligning
– A PSSM profile will also tell us which residues are
  more conserved and which residues are more
  subject to insertions or deletions
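As a rough illustration of the idea, the sketch below turns the columns of a toy alignment into log-odds scores. It is a simplification under stated assumptions: a uniform background frequency of 1/20, a fixed pseudocount, no sequence weighting and no filtering of sequences too similar to the query. Real PSSMs (for example those produced by PSI-BLAST) are computed more carefully, and the function name `column_profile` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column_profile(msa, query_index=0, pseudocount=1.0):
    """Return one dict of log-odds scores per residue of the query.

    msa is a list of aligned sequences of equal length, with '-' for gaps.
    The score of an amino acid is log2(observed frequency / background),
    with a uniform background of 1/20 (an assumption of this sketch).
    """
    query = msa[query_index]
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col, query_aa in enumerate(query):
        if query_aa == "-":
            continue  # only columns aligned to a query residue are kept
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        profile.append(scores)
    return profile


toy_msa = ["MKV", "MRV", "MKI"]
profile = column_profile(toy_msa)
print(round(profile[0]["M"], 2), round(profile[0]["A"], 2))
```

A fully conserved column (M in the toy alignment) gets a high score for the conserved residue and low scores for everything else, which is the conservation signal the primary sequence alone does not carry.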
   PSSM for the first 10 residues of 1n7lA
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3
   Secondary Structure Prediction
– The most usual way is to predict whether a
  residue belongs to an α helix, a β sheet, or is in
  coil state
– Several programs can determine the actual SS
  state of a protein from a PDB file. The most
  common of them is DSSP
– Typically, a window of ±7 amino acids (15 in
  total) is used
      Secondary Structure Prediction

 Primary sequence (R1 R2 R3 ... Rn-1 Rn)
   -> MSA -> PSSM profile of sequence (PSSM1 PSSM2 PSSM3 ... PSSMn-1 PSSMn)
   -> Windows generation -> Window of PSSM profiles (PSSMi-1 PSSMi PSSMi+1)
   -> Prediction method -> Prediction (SSi?)



 •The most popular public SS predictor is PSIPRED
    Coordination Number Prediction
• Two residues of a chain are said to be in contact if their
  distance is less than a certain threshold (e.g. 8Å)

  [Figure: the primary sequence folds into the native state, where a contact between two residues is marked]

• CN of a residue: count of contacts that a certain
  residue has
• CN gives us a simplified profile of the density of packing
  of the protein
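Given the coordinates of one representative atom per residue, the coordination number follows directly from the contact definition. The sketch below is a minimal version: the function name, the use of one point per residue and the `min_separation` parameter are assumptions, and published CN definitions differ in which atom they use and in whether chain neighbours are counted.

```python
import math


def coordination_numbers(coords, threshold=8.0, min_separation=1):
    """Count, for every residue, how many other residues are in contact.

    coords: one (x, y, z) tuple per residue, in Angstroms.
    Two residues i and j are in contact when their distance is below
    threshold. Pairs closer than min_separation positions in the
    chain are skipped (a choice of this sketch).
    """
    n = len(coords)
    cn = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cn[i] += 1
                cn[j] += 1
    return cn


# Four residues on a straight line, 3.8 Angstroms apart
line = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(coordination_numbers(line))
```

In a folded protein, buried residues end up with a high CN and surface residues with a low one, which is why CN works as a simplified packing-density profile.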
     – All AA types associated with the central residue are
       hydrophobic (core of a protein)
     – D and E consistently do not appear in the predicates.
       They are negatively charged residues (surface of a
       protein)
                     Other predictions
• Other kinds of residue
  structural aspects that can
  be predicted
   – Solvent accessibility: Amount
     of surface of each residue that
     is exposed to solvent
   – Recursive Convex Hull: A
     metric that models a protein
     as an onion and assigns each
     residue to a layer. Formally
     each layer is a convex hull of
     points
• These features (and
  others) are predicted in
  a similar way to
  SS or CN
          Contact Map prediction
• Predicting, given two residues
  from a chain, whether these
  two residues are in contact or
  not
• This problem can be
  represented by a binary
  matrix: 1 = contact, 0 = no
  contact
• Plotting this matrix reveals
  many characteristics of the
  protein structure

  [Figure: contact map with the patterns produced by helices and sheets highlighted]
        Contact Map Prediction
• Instead of a single window around the target
  now there are two windows around the pair
  of residues to be predicted to be in contact or
  not
• Many methods also use a third window,
  placed in the middle point in the chain
  between the two target residues
      Contact Map prediction at
            Nottingham
• For each position in these 3 windows we
  include:
  – PSSM profile
  – Predicted SS, SA, RCH and CN
• The whole connecting segment between the
  two targets is represented as
  – Distribution of AA and predicted SS, SA, RCH and
    CN
      Contact Map prediction at
            Nottingham
• Moreover, global protein information is also
  included
  – Sequence length
  – Separation between target residues
  – Contact propensity of target residues
  – Distribution of AA and predicted SS, SA, RCH and
    CN of the whole chain


• Each instance is represented by 631 variables
         Contact Map prediction at
               Nottingham
• Training Set of 1,400 proteins selected to represent a broad set of
  sequences
• These proteins contain 15.2 million pairs of residues (instances in the
  training set) with less than 2% of real contacts
• 50 samples of 300,000 examples are generated from the training set. Each
  sample contains two no-contact instances for each contact instance
• Our BioHEL GBML method (Bacardit et al., 2007) is run 25 times on each
  sample
• An ensemble of 1250 rule sets (50 samples x 25 seeds) performs the
  contact map predictions using simple consensus voting
• Confidence is computed based on the votes distribution in the ensemble.
  That is, the estimated probability that the predicted contact is a true
  contact
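The consensus step can be illustrated with a small sketch. The slides do not give the exact confidence formula, so taking the fraction of rule sets that agree with the majority is an assumption here, as is the function name `ensemble_vote`.

```python
def ensemble_vote(votes):
    """Combine the 0/1 predictions of all rule sets for one residue pair.

    Returns (predicted_class, confidence), where confidence is the
    fraction of rule sets that agree with the majority decision
    (an assumed formula, not taken from the lecture).
    """
    fraction_contact = sum(votes) / len(votes)
    predicted = 1 if fraction_contact > 0.5 else 0
    confidence = fraction_contact if predicted == 1 else 1.0 - fraction_contact
    return predicted, confidence


# 1250 rule sets, 1000 of which predict a contact
votes = [1] * 1000 + [0] * 250
print(ensemble_vote(votes))
```

Ranking the predicted contacts by this confidence lets a user keep only the pairs that the ensemble agrees on most strongly.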
    3D Protein Structure Prediction
•   Approaches for 3D PSP
•   Template-Based Modelling
•   Ab-Initio methods
•   State-of-the-Art methods
    – I-Tasser
    – Rosetta
         Approaches for 3D PSP

• Some PSP methods try to identify a template
  protein and then adapt the structure of the
  template to the target protein: Template-based
  Modelling
• Other methods try to generate the structure
  of the protein from scratch (Ab Initio
  Modelling) by optimizing some energy function
  that models the stability of the protein. They are
  used when no template can be identified
         Pipeline for Template-based
                  Modelling
• Typical steps
   1.   Identify the template (next slide)
   2.   Produce the final alignment between the residues of target and template
   3.   Determine main chain segments to represent the regions containing
        insertions and deletions (gaps in the alignment) and stitch them into the
        main chain of the template to create an initial model for the target
   4.   Replace the side chains of residues that have been mutated (mismatches
        in the alignment) although it is possible that the conformation in the
        template is still conserved
   5.   Examine the model to detect any serious atom collisions and relieve them
   6.   Refine the model by energy minimization. This stage is meant to adapt
        the stitched segments to the conserved structure and to adjust the side
        chains to find the most stable conformation
Loop remodelling
        Template identification
• Can we find a sequence with known structure
  and high sequence identity with the target?
  • Homology Modelling
• Otherwise, there may still be a template (structure similar to
  that of the target) that has poor sequence
  identity. We need to identify it by other means
  • Fold recognition
     • Profile-based methods
     • Threading methods
         Profile-based Methods
• Aim is to construct 1D representations
  (profiles) of the structures in our fold database
• Afterwards, when a target sequence comes,
  we construct its profile and check our
  database for the most similar profile
• That is, instead of aligning amino acid
  sequences, we align structural 1D profiles
    How to construct the profile?
• We choose a series of structural properties of
  residues
  – Most frequent secondary structure state
     • Alpha helix, Beta sheet, other
  – Solvent Accessibility
     • < 40Å², > 100Å², intermediate
  – Hydrophobic/polar
• For each amino acid, we decide to which
  category it belongs based on statistics
  computed on a large database of structures
     How to construct the profile?
               Alpha helix      Beta sheet       Other
<40Å²          Hydrophobic: a   Hydrophobic: b   Hydrophobic: c
               Polar: d         Polar: e         Polar: f

>100Å²         Hydrophobic: g   Hydrophobic: h   Hydrophobic: i
               Polar: j         Polar: k         Polar: l

intermediate   Hydrophobic: m   Hydrophobic: n   Hydrophobic: o
               Polar: p         Polar: q         Polar: r

• Now the sequence for each protein in our
  database will have a new structural
  representation
• We need to predict SS and Acc for the template
          Threading methods
• We start with compiling a catalogue of unique
  folds (filtering out repeats)
• Afterwards, we evaluate how likely it is that
  the target sequence adopts each of the folds,
  and how (alignment)
• The name is a metaphor taken from tailoring, as
  we are trying to fit the sequence (a
  thread) through a known structure
• We will choose the template (and alignment)
  that has the lowest (estimated) energy
           Threading methods
• Energy estimation needs to be simple and fast
  – As we need to evaluate all possible folds and
    alignments
• Energy is the result of all the pair-wise
  interactions occurring in a protein
• Thus, the energy estimation will be computed
  as the sum of the energy terms for every pair
  of residues in the protein
• How to compute the energy interaction for a
  given pair of amino acids?
     Pair-wise Energy estimation
• Boltzmann’s equation states that the
  probability of observing a given event depends
  on its energy
  – P(x) = e^{-E(x)/KT}
• If we invert this equation we get:
  – E(x) = -KT ln[ P(x) ]
• We can compute P(x), for each pair of amino
  acids from a database of known structures as
  the frequency in which these amino acids are
  observed to be in contact
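The inverted Boltzmann relation and the pair-wise sum can be sketched as follows. The contact probabilities in the example are invented for illustration, and summing only over the pairs that are in contact in the template is an assumption of this sketch. Real statistical potentials also normalise against a reference state, which is left out here.

```python
import math


def pair_energy(p_contact, kT=1.0):
    """E(x) = -kT ln P(x): frequent contacts get low (favourable) energy."""
    return -kT * math.log(p_contact)


def threading_energy(sequence, contacts, contact_prob, kT=1.0):
    """Sum the pair energies over the residue pairs in contact.

    sequence: target amino acids placed on the template positions.
    contacts: list of (i, j) position pairs in contact in the template.
    contact_prob: dict mapping a sorted amino acid pair to the observed
                  contact frequency (hypothetical values in the example).
    """
    total = 0.0
    for i, j in contacts:
        pair = tuple(sorted((sequence[i], sequence[j])))
        total += pair_energy(contact_prob[pair], kT)
    return total


toy_probs = {("L", "V"): 0.5, ("D", "K"): 0.25}
energy = threading_energy("LVDK", [(0, 1), (2, 3)], toy_probs)
print(round(energy, 4))
```

Because the total is a plain sum of table look-ups, it is cheap enough to evaluate for every fold and alignment under consideration.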
       Alignment within threading
• We still need to solve the problem of the correspondence of
  the residues in our template with those of the target
• This is a very difficult problem, as a change in an alignment
  can affect the interactions with many residues
• There is an exact (but costly) solution
• Instead, most methods adopt an approximate method called
  frozen approximation
• When evaluating the possibility of assigning one of the amino
  acids of the target to a certain position in the template,
  instead of computing the interactions with the rest of the
  target residues, we will use those of the template
Frozen Approximation
     Aligning target and template
• Crucial step before generating the initial model
• It is possible, especially for homology modelling, that
  the best sequence alignment does not correspond to
  the best structural alignment
   – That is, to the best correspondence between the
     coordinates of each amino acid of target and template
• In this case a better alignment needs to be
  produced. To do so, we can use
   – Information derived from the template’s structure
   – Structural information predicted for the target
      Aligning Target and Template




Correct alignment after shifting
Wrong alignment: some atoms are too close (big circle) and some atoms
are too far (small circle)
   The poor man’s approach to
     homology modelling
– To find templates
   • PSI-BLAST
   • 3D Jury. This program is a meta-server: it asks
     many other servers which templates they would choose
     and then produces a consensus decision based on
     their answers
– To produce a model of a protein given a template
   • MODELLER. Very popular homology modelling package.
     Free for academic use
– To refine the side-chain conformations
   • SCWRL
           Ab-Initio modelling

• In general this kind of modelling is still quite
  primitive when compared to homology
  modelling
• However, without a template it is the only choice
• Pure ab-initio modelling is still very costly and
  ineffective but hybrid homology/ab-initio
  methods such as fragment assembly have
  better performance
             Ab-Initio modelling
• The most advanced ab-initio method is fragment
  assembly
  – Consists of breaking up the sequence into small
    subsegments of 3 to 9 residues and generating structures
    for these segments based on a large library of known
    fragments
  – Decoys are generated from all possible combinations of
    fragments
  – An energy minimization process is applied to all decoys.
  – Decoys are clustered and the final models are selected
    from the center of the largest clusters
              Energy minimisation




Energy minimization is not easy. We may need to go uphill before we can
reach the lowest energy conformation
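One common way of allowing uphill moves is the Metropolis acceptance rule used in Monte Carlo minimisation. The sketch below applies it to a one-dimensional toy landscape. The energy function `rugged`, the step size and the temperature are all invented for illustration and have nothing to do with a real protein energy function.

```python
import math
import random


def metropolis_minimise(energy, start, steps=5000, step_size=0.5, kT=1.0, seed=0):
    """Random walk that always accepts downhill moves and accepts
    uphill moves with probability exp(-delta / kT)."""
    rng = random.Random(seed)
    current = start
    best = start
    for _ in range(steps):
        candidate = current + rng.uniform(-step_size, step_size)
        delta = energy(candidate) - energy(current)
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current = candidate
            if energy(current) < energy(best):
                best = current  # remember the lowest point visited
    return best


def rugged(x):
    """Toy landscape with several local minima (hypothetical)."""
    return 0.1 * x * x + math.sin(3 * x)


best = metropolis_minimise(rugged, start=4.0)
print(round(best, 3), round(rugged(best), 3))
```

A pure downhill search started at the same point would stop in the nearest local minimum. The occasional uphill step is what lets the walk cross barriers between minima.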
Energy functions for ab-initio methods
• Energy function needs to take into account the
  interactions of all atoms of all amino acids
• Many different types of energy sources
   –   Covalent bonds
   –   Angles and torsions of bonds between atoms
   –   Van der Waals interactions (repulsion/attraction)
   –   Energy of charged atoms
   –   Interactions with solvent
   –   Hydrogen bonds
• Exact formulas are very costly, so generally PSP
  methods use knowledge-based potentials, computed
  from a large database of structures
                   I-Tasser
• Prediction method from Zhang’s group
• Fully automated server, without any human
  intervention
• Steps
  – Template identification
  – Structure assembly
  – Atomic model construction
  – Model selection
  I-Tasser: Template Identification
• MUSTER fold recognition method, used both for whole
  proteins (TBM) and for fragments (Ab Initio)
• Profile-based fold recognition
   –   Secondary structure
   –   Structural fragment profile
   –   Solvent accessibility
   –   Backbone torsion angle
   –   Hydrophobicity
• For the most difficult targets, a meta-server that
  combines the outputs of various methods is used
    I-Tasser: Structure assembly
• Generation of a preliminary model with only
  coordinates for Cα and sidechain positions
• Using the template as starting point where
  possible and ab-initio methods for amino
  acids without alignment
• Two iterations of refinement
  – 1st based on templates
  – 2nd based on clustering the models of the previous
    iteration and using the centroids of each cluster as
    starting points
        I-Tasser energy function
• Knowledge-based statistics of
  – Cα – sidechain correlation
  – H-bonds
  – Hydrophobicity
• Spatial restraints of templates
• Contact Map prediction from SVMSEQ
  – 9 predictions included, combinations of
  – Contacts between Cα, Cβ or side chain centers
  – Contact cut-offs of 6, 7 or 8 Å
I-Tasser atomic model construction
• Full-atom models are constructed from the
  approximate models produced by the cluster
  centroids
• 1st the backbone is matched with a large
  library of template fragments with high
  resolution structure
• Then full-atom optimization occurs focusing
  on H-bonds, removing clashes and using the
  Charmm22 molecular dynamics force field
       I-Tasser model selection
• Several full-atom models are generated from
  each cluster centroid
• Models need to be ranked to select the best
  one
• I-Tasser uses a weighted sum of
  – Number of H-Bonds / target length
  – TM-score (metric to compare structures) between
    the full-atom model and the cluster centroid
                      Rosetta
• Predictor from David Baker’s group
• It uses a massive distributed computing infrastructure
  (Rosetta@home)
• For CASP7 in 2006 it claimed to dedicate up to 104 cpu
  years/target
• Template identification used a variety of methods
  depending on sequence identity between target and
  template
• Different protocols for Template-Based Modelling and
  Free Modelling (fragment assembly)
• 3 variants of TBM depending on degree of homology
  between target and template
                           Rosetta
• Full-atom refinement protocol
   – Energy function based on
      • Short-range interactions: Van der Waals energy, H-bonds and
        solvent accessibility
      • Long range interactions (dampening of electrostatic interactions)
   – Minimization through Monte Carlo with the following
     steps:
      • Perturbation of a randomly selected angle from the backbone
      • Optimisation of side-chain rotamer conformations
      • Optimisation of both backbone and sidechain torsion angles
                      PSP and CASP
• PSP has improved through the years. This improvement has been
  assessed mainly in CASP
• CASP = Critical Assessment of Techniques for Protein Structure
  Prediction
• It is a biennial (every two years) community exercise to evaluate the
  state-of-the-art in PSP
• Every day for about three months the organizers release some
  protein sequences for which nobody knows the structure (128
  sequences were released in CASP8 in 2008)
• Each prediction group is given three weeks to return their
  predictions. 24 hours are given to automated servers
• Then at the end of the year experts meet in a place close to the
  sea to discuss the results of the experiment 
                    CASP categories
• Several categories of experiments are assessed in CASP
   –   Template-Based Modeling (Homology and fold recognition)
   –   Free Modeling (no template i.e. ab initio)
   –   Contact Map prediction
   –   Functional sites prediction
   –   Domain prediction
   –   Disordered regions
   –   Quality assessment
• Categories have changed through time
   – SS prediction is not assessed anymore after CASP4
   – Homology modeling and fold recognition merged into TBM
        Progress through CASP
  (From Nick Grishin’s Humans vs Servers presentation in CASP8)

 1. Computers help structure prediction:
    no more paper models
 2. Knowledge-based potentials work better
 3. Local “threading” and fragment assembly
    (Baker)
 4. Averaging and consensus methods work:
    meta-servers (Ginalski-Rychlewski)
 5. Sequence profile methods are as powerful as
    (or more powerful than) threading (Söding)
 6. Jamming poorly similar templates together
    helps (Skolnick-Zhang)
         Assessment of 3D PSP
• How can we quantify how good is a model?
• That is, how similar is a model structure to the
  actual (native) one?
• We will see this in depth when we cover the
  protein structure comparison topic, later in the
  module
• Now we are just going to describe the most
  popular metric, GDT-TS
                    GDT-TS
• Global Distance Test – Total Score
• This measure tries to produce a balance
  between good local and global similarity of
  structures (unlike RMSD)
• If a measure only takes a global point of view,
  good models that only fail badly in a few
  amino acids could be discarded
                 GDT-TS steps
1. All segments of 3, 5 and 7 consecutive amino acids
   from the model are superimposed to the actual
   structure.
2. Each of them will be iteratively extended while they
   are good enough
3. Good enough = Distance between all residue pairs
   (represented by their Cα atoms) is less than a
   certain threshold
4. A final superposition includes the set of segments
   covering as many residues as possible
5. Segments do not need to be continuous
               GDT-TS metric
• The process of superposition is performed
  four times, using thresholds of 1, 2, 4 and 8 Å




• The reason for including 4 different thresholds
  is to have a metric which is good both for high
  accuracy models and for approximate models
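The final score is usually reported as the average, over the four thresholds, of the percentage of residues that fit under each threshold. The sketch below computes that average from a list of per-residue Cα distances. It assumes a single, fixed superposition, whereas the real procedure searches for the best superposition separately for each threshold, so treat it as an approximation. The function name `gdt_score` is hypothetical.

```python
def gdt_score(distances, thresholds=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues whose model-to-native distance
    is below each threshold.

    distances: one C-alpha distance (in Angstroms) per residue, taken
    from a single superposition (a simplification of this sketch).
    """
    n = len(distances)
    fractions = [sum(d < t for d in distances) / n for t in thresholds]
    return 100.0 * sum(fractions) / len(fractions)


toy_distances = [0.5, 1.5, 3.0, 6.0, 10.0]
print(gdt_score(toy_distances))                              # GDT-TS style
print(gdt_score(toy_distances, thresholds=(0.5, 1.0, 2.0, 4.0)))  # GDT-HA style
```

The residue that is 10 Å away lowers the score only by its own share, so a model that is good everywhere except in a few residues still scores well.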
                    GDT-HA
• HA = High Accuracy
• Set of thresholds in GDT-TS changed to 0.5, 1,
  2 and 4 Å
• For high-accuracy models GDT provides only a crude
  approximation (backbone), so other measures
  are also taken into account
  – H-bonds
  – Position and rotation of sidechains
  – Clashes of atoms
  Other CASP prediction categories
• Functional sites prediction
   – Predicting which residues of a given sequence are those that perform
     the chemistry of the protein
   – Bind to other proteins/compounds
   – Methods can use whatever information they can infer to perform this
     prediction
   – However, most predictions can be performed simply by homology 
• Domain prediction
   – Domains = quasi-independent subsets of a protein, that fold on their
     own
   – Their prediction follows a simple divide-and-conquer motivation
   – It is much easier to create separate models for the different domains
     of a protein
   Disordered regions prediction
• Regions of a protein that do not fold into a
  unique pattern (no coordinates in the PDB file)
• 75% of mammal signaling proteins are
  estimated to contain long (>30) disordered
  regions, and 25% of the total amount of
  proteins may be fully disordered
• Thus, it is useful to predict from the sequence
  if that is the case
Disordered protein 2K5K
   Quality assessment prediction
• Given a model, can we predict how good it is
  (without comparing it to the native structure)?
• Overall and per-residue model quality
• Prediction was done based on the models
  from the server category
• Two families of methods
  – That perform predictions for individual models
  – That take a set of models and give predictions
    based on consensus agreements
     Simple Homology Modelling
• We are going to use Modeller
• Free for academic use
• http://salilab.org/modeller/9v6/modeller9v6.exe
• Licence key: MODELIRANJE
• Modeller is a very sophisticated tool where you can
  control almost any aspect of the homology modelling
  process
• Here we are only going to use the simplest options
• Modeller has no graphical interface. To use it we have to write
  Python scripts
       Chain we are going to model
 ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGK
 VSLVVNVASDCQLTDRNYLGLKELHKEFGPSHF
 SVLAFPCNQFGESEPRPSKEVESFARKNYGVTF
 PIFHKIKILGSEGEPAFRFLVDSSKKEPRWNFWK
 YLVNPEGQVVKFWRPEEPIEVIRPDIAALVRQVII
 KKKEDL
      T0388 LOC493869A, Homo sapiens
     CASP target ID

This sequence was one of the targets of the CASP8 experiment
1st step: BLAST against PDB
            Selecting the template

• The perfect match
  exists, because
  right now the
  structure for this
  target is already
  public
• We are going to
  ignore it, and use
  chain A of pdb
  entry 2p31
  instead
   2nd step: Creating an alignment
• Modeller has a sophisticated alignment tool
   – Uses structural information from the template
   – Dynamic programming instead of the approximate method
     of BLAST
• To create the alignment you need to:
   1. Download the PDB file of the template
   2. Put your sequence in PIR format (example)
   3. Edit the alignment script to set the template and chain
   4. Call modeller: mod9v6.exe align.py
                       PIR file
• Just replace the sequence with your own one
• The last line in the sequence needs to end in *
• Do not touch anything else from the file, or
  the alignment script will not work
• File name: target.ali
>P1;target
sequence:target:::::::0.00: 0.00
ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKE
FGPSHFSVLAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLV
DSSKKEPRWNFWKYLVNPEGQVVKFWRPEEPIEVIRPDIAALVRQVIIKKKEDL*
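If you prefer to generate the file instead of editing it by hand, a small helper like the one below reproduces the layout shown above. The function name `pir_entry` and the 60-column wrapping are choices of this sketch. The header fields are copied from the example file, not taken from the Modeller documentation.

```python
def pir_entry(code, sequence, width=60):
    """Format a sequence as a PIR entry like the target.ali example.

    The sequence is terminated with '*' and wrapped at `width` columns.
    """
    terminated = sequence + "*"
    body = [terminated[i:i + width] for i in range(0, len(terminated), width)]
    header = [">P1;" + code, "sequence:" + code + ":::::::0.00: 0.00"]
    return "\n".join(header + body) + "\n"


# Short made-up sequence, only to show the layout
print(pir_entry("target", "ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGK"))
```

Writing the returned string to `target.ali` gives a file that can be passed to the alignment script.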
                                   Align.py
from modeller import *
from modeller.automodel import *

env = environ()
aln = alignment(env)

# Just change the value of these 2 lines to match your template
template='2p31'
chain='A'
tc=template+chain

mdl = model(env, file=template, model_segment=('FIRST:'+chain,'LAST:'+chain))
aln.append_model(mdl, align_codes=tc, atom_files=template+'.pdb')
aln.append(file='target.ali', align_codes='target')
aln.align2d()
aln.write(file='target-'+tc+'.ali', alignment_format='PIR')
aln.write(file='target-'+tc+'.pap', alignment_format='PAP')
  • Alignment is different from that produced by BLAST
  • Modeller has ignored the residues lacking structural
    information




 _aln.pos   10     20      30       40      50     60
2p31A     -----Q----DFYDFKAVNIRGKLVSLEKYRGSVSLVVNVASECGFTDQHYRALQQLQRDLGPHHFNV
target    ENLYFQSMINSFYAFEVKDAKGRTVSLEKYKGKVSLVVNVASDCQLTDRNYLGLKELHKEFGPSHFSV
 _consrvd * ** *     * ****** * ********* * ** * * * ** ** *

 _aln.pos 70      80     90      100       110   120     130
2p31A     LAFPCNQFGQQEPDSNKEIESFARRTYSVSFPMFSKIAVTGTGAHPAFKYLAQTSGKEPTWNFWKYLV
target    LAFPCNQFGESEPRPSKEVESFARKNYGVTFPIFHKIKILGSEGEPAFRFLVDSSKKEPRWNFWKYLV
 _consrvd ********* ** ** ***** * * ** * ** * *** * * *** ********

 _aln.pos 140   150   160   170
                        Creating the model
from modeller import *
from modeller.automodel import *

log.verbose()
env = environ()

template='2p31'
chain='A'
tc=template+chain

class MyModel(automodel):
    # Name the output models target_1.pdb, target_2.pdb, ...
    def get_model_filename(self, sequence, id1, id2, file_ext):
        return sequence+'_'+str(id2)+file_ext

    # Hook for extra restraints; none are added in this first run
    def special_restraints(self, aln):
        rsr = self.restraints

a = MyModel(env, alnfile='target-'+tc+'.ali',
            knowns=tc, sequence='target',
            assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 5
a.make()

• 5 models are created
• Each of them can be slightly different
• Models are going to be assessed using 2 different criteria
                   Results of the modelling
>> Summary of successfully produced models:
Filename                          molpdf DOPE score GA341 score
----------------------------------------------------------------------
target_1.pdb                    1280.53101 -19077.32812                1.00000
target_2.pdb                    1570.33606 -18480.83008                1.00000
target_3.pdb                     960.32550 -19365.79102                1.00000
target_4.pdb                    1415.41724 -18980.71094                1.00000
target_5.pdb                    1463.82593 -19077.91016                1.00000



  • According to the DOPE score, 3 is the best model
    and 2 the worst
  • The lower the DOPE score, the better
  • Let’s see how different the models are
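Picking the best model by hand is easy with five models, but the same choice can be made from the summary text. The sketch below parses the summary table printed above and returns the model with the lowest DOPE score. The function name `best_by_dope` is hypothetical, and the parsing assumes the four-column layout shown in this example.

```python
def best_by_dope(summary_text):
    """Return (filename, dope_score) of the model with the lowest DOPE.

    Only lines with four fields whose first field ends in .pdb are
    read: filename, molpdf, DOPE score, GA341 score.
    """
    models = []
    for line in summary_text.splitlines():
        fields = line.split()
        if len(fields) == 4 and fields[0].endswith(".pdb"):
            models.append((fields[0], float(fields[2])))
    return min(models, key=lambda m: m[1])


summary = """
Filename        molpdf      DOPE score    GA341 score
target_1.pdb    1280.53101  -19077.32812  1.00000
target_2.pdb    1570.33606  -18480.83008  1.00000
target_3.pdb     960.32550  -19365.79102  1.00000
target_4.pdb    1415.41724  -18980.71094  1.00000
target_5.pdb    1463.82593  -19077.91016  1.00000
"""
print(best_by_dope(summary))
```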
Viewing the two models from PyMOL
1. Open model 3 as usual
2. But then, instead of double-clicking model 2,
   open it from inside PyMOL using File > Open
3. The models are not aligned
    Type: align target_3,target_2
• The only differences are in the two ends of the
  chain
 So how does the model compare
     to the real protein 3CYN?
• The residues at both ends of the chain are
  wrong
        Can we do any better?
• We can give modeller information about the
  secondary structure of the target
• We can get these predictions from PSIPRED
   CCCCCCCCCCCEEEEEEECCCCCEECHHHHCCCEEEEEECC
   CCCCCCHHHHHHHHHHHHHHCCCCEEEEEEECCCCCCCCC
   CCHHHHHHHHHHCCCCCHHEEEEEECCCCCCCHHHHHHHH
   CCCCCCCCCCEEEEECCCCCEEEEECCCCCHHHHHHHHHHH
   HHHHHHHHHCCC

• Then, the modelling script needs to be
  modified
from modeller import *
from modeller.automodel import *

log.verbose()
env = environ()

template='2p31'
chain='A'
tc=template+chain

class MyModel(automodel):
    def get_model_filename(self, sequence, id1, id2, file_ext):
        return sequence+'_'+str(id2)+file_ext

    def special_restraints(self, aln):
        rsr = self.restraints
        # Predicted SS info (from PSIPRED)
        rsr.add(secondary_structure.strand(self.residue_range('12:', '18:')))
        rsr.add(secondary_structure.strand(self.residue_range('24:', '25:')))
        rsr.add(secondary_structure.alpha(self.residue_range('27:', '30:')))
        rsr.add(secondary_structure.strand(self.residue_range('34:', '39:')))
        rsr.add(secondary_structure.alpha(self.residue_range('48:', '61:')))
        rsr.add(secondary_structure.strand(self.residue_range('66:', '72:')))
        rsr.add(secondary_structure.alpha(self.residue_range('84:', '93:')))
        rsr.add(secondary_structure.alpha(self.residue_range('99:', '100:')))
        rsr.add(secondary_structure.strand(self.residue_range('101:', '106:')))
        rsr.add(secondary_structure.alpha(self.residue_range('114:', '121:')))
        rsr.add(secondary_structure.strand(self.residue_range('132:', '136:')))
        rsr.add(secondary_structure.strand(self.residue_range('142:', '146:')))
        rsr.add(secondary_structure.alpha(self.residue_range('152:', '171:')))

a = MyModel(env, alnfile='target-'+tc+'.ali',
            knowns=tc, sequence='target',
            assess_methods=(assess.DOPE, assess.GA341))
a.starting_model = 1
a.ending_model = 5
a.make()
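The residue ranges in the restraints above are simply the runs of H and E in the PSIPRED string. Rather than counting them by hand, they can be extracted with a short helper. This is a sketch that does not call Modeller itself: `ss_segments` and `restraint_lines` are hypothetical names, and the second function only prints lines in the same style as the script above so they can be pasted in.

```python
from itertools import groupby


def ss_segments(ss):
    """Return (state, first, last) for every run of H or E.

    Positions are 1-based and inclusive, matching the residue
    numbers used in the restraints.
    """
    segments = []
    position = 1
    for state, run in groupby(ss):
        length = len(list(run))
        if state in "HE":
            segments.append((state, position, position + length - 1))
        position += length
    return segments


def restraint_lines(segments):
    """Render each segment as a line of the modelling script."""
    kind = {"H": "alpha", "E": "strand"}
    template = "rsr.add(secondary_structure.%s(self.residue_range('%d:', '%d:')))"
    return [template % (kind[state], first, last) for state, first, last in segments]


# First 41 residues of the prediction, written as run lengths
prediction = "C" * 11 + "E" * 7 + "C" * 5 + "EE" + "C" + "HHHH" + "CCC" + "E" * 6 + "CC"
for line in restraint_lines(ss_segments(prediction)):
    print(line)
```

Very short segments, such as a two-residue helix, come straight from the prediction. Whether to keep them as restraints is a modelling decision that this helper does not make.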
     And here is the new model,
    compared to the real protein
• Now at least we got right one end of the
  protein
             Summary of topic
• Importance of PSP
• Many different types of prediction included in
  the PSP family
  – 3D PSP
  – Prediction of amino acid structural features
  – Others
• Families of 3D PSP
  – Template-based Modelling
  – Free modelling
• Basic practical homology modelling

				