Homology modeling workshop

					  Homology Modeling
• Introduction to protein structure & databases

• Structure prediction approaches
  – Ab-initio
  – Threading
  – Homology modeling

• Hands ON
    From Sequence to Structure
Protein structure is hierarchic:
•   Primary – sequence of covalently attached amino acids
•   Secondary – local 3D patterns (helices, sheets, loops)
•   Tertiary – overall 3D fold
•   Quaternary – two or more protein chains
  From Sequence to Structure
• All information about the native structure of a protein is
  encoded in its amino acid sequence together with its native solution

• Many possible conformations -> still only one or a few native
  folds are exhibited for each protein (Levinthal's paradox)

• Protein folding is driven by various forces:
   – Ionic forces
   – Hydrogen bonds
   – The hydrophobic effect
   – ...
     Protein 3D Structures
A protein's structure has a critical effect on its function:

                  1. Binding pockets

                                                     PDB ID 1nw7
     Protein 3D Structures
A protein's structure has a critical effect on its function:

   2. Areas of specific chemical/electrical properties
     Protein 3D Structures
A protein's structure has a critical effect on its function:

   3. Importance of the global fold for function
Motivation to Acquire a Structure
• Identifying active and binding sites

• Characterization of the protein's mechanism
  (catalysis & interactions)

• Searching for ligand of a given binding site

• Understanding the molecular basis of diseases

• Designing mutants

• Drug design

• And more...
      Determining Structure

• X-ray diffraction

• Electron Microscopy
Why predict protein structure if we
  can use experimental tools to
           determine it?
• Experimental methods are slow and expensive

• Some structures could not be solved experimentally

• A representative family structure can suffice to
  deduce the structures of all sequences in the family
Protein databases
            Protein Sequence
          & Structure Databases
        Some of the available databases:

• RCSB- the Protein Data Bank- all deposited structures

• UniProt- main sequence database
   – SwissProt
   – TrEMBL

• NCBI- lots of databases, including sequence and structures

• PDBsum- combines structural & sequence data
   UniProt- Protein Sequence
• UniProt is a collaboration between the
  European Bioinformatics Institute (EBI), the
  Swiss Institute of Bioinformatics (SIB) and the
  Protein Information Resource (PIR).

• In 2002, the three institutes decided to pool
  their resources and expertise and formed the
  UniProt Consortium.
       UniProt- Protein Sequence
• The world's most comprehensive catalog of information on proteins

• Sequence, function & more…

• Comprised mainly of the databases:

   – SwissProt – 516,081 entries – high quality annotation,
     non-redundant & cross-referenced to many other databases.

   – TrEMBL – 10,618,387 entries – computer translation of the
     genetic information from the EMBL Nucleotide Sequence
     Database -> many proteins are poorly annotated since
     only automatic annotation is generated
         Protein Data Bank (PDB)
• The PDB archive contains information about experimentally-
determined structures of proteins, nucleic acids, and complex assemblies

• The structures in the archive range from tiny proteins and bits
of DNA to complex molecular machines like the ribosome.

• There are currently 57,013 structures deposited in the PDB.
However, removing redundant sequences (e.g. at 90% sequence identity)
reduces the number of structures to 19,988…

• Each structure receives a unique 4-character ID
Protein Data Bank (PDB)

                  PDB ID: 3mht
Protein Data Bank (PDB)


                   The paper describing
                       the structure

                     Data concerning the
                    resolution, R-value….
PDBsum

• A database providing an overview of all biological
  macromolecular structures

• Connected to UniProt -> find the sequence accession of a
  known PDB ID

• Detailed description of many structure properties, e.g.:
  – EC number
  – Chains & ligands and their interactions
  – Clefts
  – Secondary structure
  – FASTA sequence of structure…

(Screenshot annotations: free text search; search by sequence; useful tabs; Chains &; Protein tab; secondary structure, taken from the PDB)
More Sequences Than Structures

• Discrepancy between the number of known sequences and
  solved structures:

             5,047,807 UniRef90 entries vs.
           25566 90% Non-redundant structures

Computational methods are needed to
      obtain more structures
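
As a rough illustration of the size of this gap, the two counts quoted above can be turned into a ratio. This is only arithmetic on the slide's numbers; the variable and function names are illustrative.

```python
# Counts quoted on the slide (a snapshot from when the workshop was written).
uniref90_entries = 5_047_807
nonredundant_structures = 25_566


def sequences_per_structure(n_sequences: int, n_structures: int) -> float:
    """Return how many sequence clusters exist per non-redundant structure."""
    return n_sequences / n_structures


ratio = sequences_per_structure(uniref90_entries, nonredundant_structures)
print(f"About {ratio:.0f} sequence clusters per solved structure")
```

In other words, for every non-redundant structure there are on the order of a couple of hundred sequence clusters without one.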
Structure prediction
 Structure Prediction Approaches
1. Homology (Comparative) Modeling
Based on sequence similarity with a protein for
which a structure has been solved.

2. Threading (Fold Recognition)
Requires a structure similar to a known structure

3. Ab-initio fold prediction
Not based on similarity to a sequence/structure
Structure prediction from “first principles”:

    Given only the sequence, try to predict the structure
            based on physico-chemical properties
                (energy, hydrophobicity etc.)

•   When all else fails -> works for novel folds

•   Shows that we understand the process
               The Force Field
                    (energy function)
    A group of mathematical expressions describing the
            potential energy of a molecular system

•   Each expression describes a different type of physico-
    chemical interaction between atoms in the system:

    •   Van der Waals forces
    •   Covalent bonds
    •   Hydrogen bonds

    •   Charges

    •   Hydrophobic effects
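
A minimal sketch of what "a group of mathematical expressions" can look like in practice: one function per interaction type, summed over atom pairs. The functional forms (Lennard-Jones, Coulomb, harmonic bond) are common textbook choices, but every parameter value below is an arbitrary placeholder and the whole thing is a toy, not the force field of any particular package.

```python
import math


def lennard_jones(r, epsilon=0.2, sigma=3.4):
    """Van der Waals term: 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, k=332.06):
    """Electrostatic term between two point charges (k in kcal*A/(mol*e^2))."""
    return k * q1 * q2 / r


def harmonic_bond(r, r0=1.5, k_bond=300.0):
    """Covalent bond stretch modelled as a spring around the ideal length r0."""
    return k_bond * (r - r0) ** 2


def total_energy(atoms, bonds):
    """Sum bonded terms over `bonds` and non-bonded terms over all other pairs.

    atoms: list of (x, y, z, charge); bonds: set of (i, j) index pairs with i < j.
    """
    energy = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            r = math.dist(atoms[i][:3], atoms[j][:3])
            if (i, j) in bonds:
                energy += harmonic_bond(r)
            else:
                energy += lennard_jones(r) + coulomb(r, atoms[i][3], atoms[j][3])
    return energy
```

Real force fields add further terms (angles, dihedrals, solvation) and use per-atom-type parameters rather than the single set assumed here.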
Approaches to Ab-initio Prediction
               1. Molecular Dynamics
• Simulates the forces that govern the protein within water.
• Since proteins usually fold spontaneously, this would lead to the
  native protein structure.

• Thousands of atoms
• Huge number of time steps to reach folded protein
  -> feasible only for very small proteins
Approaches to Ab-initio Prediction
                2. Minimal Energy

    Assumption: the folded form is the minimal energy
                 conformation of a protein

 Main principles:
 • Define an energy function.
 • Search for the 3D conformation that minimizes the energy.
• Current methods (e.g. Rosetta) primarily utilize the
  fact that although we are far from observing all
  protein folds, we probably have seen nearly all
  sub-structures.

• A library of known sub-structures
 (fragments less than 10 residues) is created.

• A range of possible conformations for
  each fragment in the query protein is selected.

                         Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
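
The "define an energy function, then search for the conformation that minimizes it" idea can be sketched with a toy Metropolis Monte Carlo search. Everything here is an assumption made for illustration: the one-dimensional `energy` landscape stands in for a real conformational energy, and the step size, temperature and seed are arbitrary. It is not how Rosetta itself is implemented.

```python
import math
import random


def energy(x):
    """Toy 1D energy landscape with two minima (hypothetical, for illustration)."""
    return x**4 - 3 * x**2 + x


def minimize(energy_fn, x0, steps=5000, step_size=0.5, temperature=0.5, seed=0):
    """Metropolis Monte Carlo: always accept downhill moves, sometimes uphill ones."""
    rng = random.Random(seed)
    x, e = x0, energy_fn(x0)
    best_x, best_e = x, e
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_new = energy_fn(candidate)
        # Uphill moves are accepted with Boltzmann probability, which lets
        # the search escape local minima.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / temperature):
            x, e = candidate, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


best_x, best_e = minimize(energy, x0=2.0)
print(f"lowest energy found: {best_e:.3f} at x = {best_x:.3f}")
```

Accepting occasional uphill moves is the design choice that matters: a purely greedy search would stop in the first local minimum it reaches.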
Ab-initio - Example

      Moult J. Philos. Trans. R. Soc. B. 361:453–458 (2006)
        Fold Recognition (Threading):
       Sequence to structure matching
 Given a sequence and a library of folds, thread the sequence
    through each fold. Take the one with the highest score.
• Method will fail if new protein does not belong to any fold in
the library.

• Score of the threading is computed based on known
  physical chemistry properties & statistics of amino acids.

• In practice, fold recognition methods are often mixtures
of sequence matching and threading.
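
A toy sketch of "thread the sequence through each fold and take the one with the highest score". The fold library, the buried/exposed encoding and the +1/-1 scoring rule are all invented for this example; real threading programs allow gaps and use statistical potentials derived from known structures.

```python
HYDROPHOBIC = set("AVLIMFWC")


def thread_score(sequence, fold_pattern):
    """Score a sequence placed on a fold given as 'B' (buried) / 'E' (exposed).

    +1 when a hydrophobic residue is buried or a polar one is exposed,
    -1 otherwise. No gaps are allowed in this simplified version.
    """
    score = 0
    for residue, environment in zip(sequence, fold_pattern):
        buried = environment == "B"
        hydrophobic = residue in HYDROPHOBIC
        score += 1 if buried == hydrophobic else -1
    return score


def best_fold(sequence, fold_library):
    """Thread the sequence through every fold and return the top scorer."""
    scores = {name: thread_score(sequence, pattern)
              for name, pattern in fold_library.items()}
    return max(scores, key=scores.get), scores


# Hypothetical three-fold library and query sequence.
library = {"fold_a": "BBEEBBEE", "fold_b": "EEEEBBBB", "fold_c": "BEBEBEBE"}
winner, scores = best_fold("LVKDAIEN", library)
print(winner, scores)
```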
      Structure Prediction Approaches
              Threading: example
1. sequence
    H bond donor
   H bond acceptor

2. Library of folds of known proteins
    Threading: example
H bond donor
H bond acceptor

      S=-2        S=5     S=20
      Z= -1       Z=1.5   Z=5
            Fold recognition (threading)
       Find best fold for a protein sequence:

       Potential fold:   1)    ...    56)    ...    n)
       Score:           -10    ...   -123    ...   20.5

We need a scoring (energy) function to distinguish native
structure from misfolded structures.

Ideally, each misfolded structure should have an energy
higher than the native energy, i.e.: E(misfolded) - E(native) > 0
                      Fold recognition: FFAS03

 • The FFAS03 server provides an interface to the third
 generation of the profile-profile alignment and fold recognition
 algorithm FFAS.

 • Profile-profile alignments utilize information present in
 sequences of homologous proteins to amplify the sequence
 conservation pattern defining the family.

 • The result: detection of remote homologies beyond the reach
 of other sequence comparison methods.

Jaroszewski, L., Rychlewski, L., Li, Z., Li, W. & Godzik, A. (2005) FFAS03: a server for profile-profile sequence
alignments. Nucl. Acids Res. 33, W284-W288
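
To make the notion of a sequence profile concrete, here is a small sketch that turns an MSA into per-column amino acid frequencies and compares two profile columns with a dot product. The dot-product score is a simplification chosen for this example; it is not claimed to be the scoring function FFAS03 actually uses.

```python
from collections import Counter


def build_profile(msa):
    """Return per-column amino acid frequencies for an MSA.

    msa: list of equal-length aligned strings; '-' marks a gap and is ignored.
    """
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def column_score(col_a, col_b):
    """Dot product of two frequency columns (1.0 if both are the same conserved residue)."""
    return sum(freq * col_b.get(aa, 0.0) for aa, freq in col_a.items())


# Tiny made-up alignment: two conserved columns and one variable column.
profile = build_profile(["ACD", "ACE", "AC-"])
print(profile)
print(column_score(profile[0], profile[0]), column_score(profile[2], profile[2]))
```

A conserved column scores highly only against a column conserved for the same residue, which is how a profile amplifies the family's conservation pattern compared with a single sequence.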
                       Fold recognition: HHPRED

                 Profiles are based on Hidden Markov Models:

(Figure: profile HMM state diagram with transition probabilities; each match state emits an amino acid)

Söding J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951-960.
                  Fold recognition: HHPRED

  • Profile Hidden Markov Models (HMMs) are similar to sequence
  profiles, but in addition to the amino acid frequencies they
  contain information about the frequency of inserts and deletions.

  • Using profile HMMs in place of simple sequence profiles should
  therefore further improve sensitivity.

  • HHpred was the first server to employ HMM-HMM comparison, based on a
  novel statistical method.

  • Using HMMs both on the query and the database side greatly
  enhances the sensitivity and selectivity over sequence-profile
  based methods.
Söding J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951-960.
    I-TASSER- Hybrid Approach

• In a recent community-wide blind experiment, I-TASSER
generated the best 3D structure predictions among
all automated servers.

• Based on secondary-structure threading
and the iterative implementation of the Threading
ASSEmbly Refinement (TASSER) program.
Homology Modeling
           Homology Modeling –
               Basic Idea
1.   A protein structure is defined by
     its amino acid sequence.

2.   Closely related sequences adopt
     highly similar structures, distantly
     related sequences may still fold
     into similar structures.

3.   The three-dimensional structure of
     proteins from the same family is
     more conserved than their
     primary sequences.

     Example: triosephosphate isomerases,
     44.7% sequence identity, 0.95 RMSD
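
The RMSD quoted in the example measures the average distance between equivalent atoms of two structures. A minimal sketch of the calculation, assuming the two coordinate sets are already superimposed and listed in matching order (the superposition step itself is not shown):

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superimposed and atoms are paired by index.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Two made-up atoms, each displaced by exactly one unit along z.
example = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])
print(example)
```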
  Homology modeling requires handling
        structures & sequences

• Query- only the protein sequence is available- usually found
  at the UniProt database

• Template- after identification, both structural and sequence-
  related data should be found- UniProt (or NCBI databases),
  RCSB and PDBsum
                           Homology modeling-
                           widespread technique
1. Inputs: query protein sequence and a homologous protein serving as the structural template
2. Align query & template protein sequences
3. Build model
4. Evaluate model
(e.g. Fiser et al., 2004; Petrey et al., 2005; Zhang, 2008)
               General Scheme
1.   Searching for structures related to the query sequence

2.   Selecting templates

3.   Aligning query sequence with template structures

4.   Building a model for the query using information from
     the template structures (Modeller)

5.   Evaluating the model
                     Fiser A et al. Methods in Enzymology 374: 461-491(2004)
General Scheme
    1. Searching For Structures
•   Sequence search against the PDB sequences

•   Sequence-profile search

•   Threading: sequence-structure fitness function
     1. Searching For Structures
If a BLAST search against the PDB fails to recognize adequate
templates, turn to fold recognition (threading) servers:

• FFAS03-


• HMAP (available through the FUDGE pipeline)-


These servers not only find candidate templates, but also suggest a
pairwise alignment and in some cases even construct the 3D model.
         2. Selecting Templates
      How to select the right template?
•   Higher sequence similarity - %ID

•   Close subfamily - phylogenetic tree

•   “Environment” similarity - solvent, pH, ligand,
    quaternary interactions

•   The quality of the experimentally determined
    structure

(Figure: phylogenetic tree of Seq. 1 to Seq. 6)

•   Purpose of modeling - e.g. protein-ligand model vs.
    geometry of active site
          2. Selecting Templates
             More than one template

•   Two ways to combine multiple templates:

    –   Global model – the templates align with different domains of
        the target, with little overlap between them

    –   Local model – the templates align with the same part of the target
      2. Selecting Templates
         More than one template

The more the merrier -
multiple structures with
the same fold:
        2. Selecting Templates
                 Trial and error
•   Generate a model for each candidate
    template and/or their combination.

•   Evaluate the models by an energy or
    any other scoring function.
    (will be discussed later…)
         3. Aligning query and
          template sequences

• All comparative modeling programs depend on a
  target-template alignment.

• When the sequence similarity between the template
  and target proteins is high, simple pairwise alignments
  are usually fine (e.g. Needleman-Wunsch global alignment)

• Gaps or low/medium sequence similarity indicate that
  we should improve the alignment...
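
For reference, a compact sketch of Needleman-Wunsch global alignment. It uses a simple match/mismatch score and a linear gap penalty, all chosen arbitrarily for the example; alignments of real proteins would normally use a substitution matrix (e.g. BLOSUM) and affine gap penalties instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic programming matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```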
          3. Aligning query and
           template sequences
1.   Create a multiple sequence alignment and extract the
     template-query pairwise alignment.
Pairwise alignments – not enough!
              3. Aligning query and
               template sequences
1.       Create a multiple sequence alignment and extract the
         template-query pairwise alignment.


     •     Visual inspection of alignments - difficult to teach…
           a matter of experience…
                3. Aligning query and
                 template sequences
1.      Create a multiple sequence alignment and extract the
        template-query pairwise alignment.

2.      Use secondary structure information to improve
        pairwise alignment- avoid gaps in these regions!

          3. Aligning query and
           template sequences
1.   Create a multiple sequence alignment and extract the
     template-query pairwise alignment

2.   Use secondary structure information to improve
     pairwise alignment- avoid gaps in these regions!

3.   Previous biochemical and structural data
           3. Aligning query and
            template sequences
                 Tips for MSA building
• Where? (to find homologues)
   • Structural templates- search against the PDB
   • Sequence homologues- search against SwissProt or
   Uniprot (recommended!)- usually using BLAST

• How many?
   • As many as possible, as long as the MSA looks good
   (next week…)
           3. Aligning query and
            template sequences
                 Tips for MSA building
• How long? (length of homologues)
   • Fragments- short homologues (less than 50-60% of the
   query's length) = bad alignment
   • Ensure your sequences exhibit the wanted domain(s)
   • N/C termini tend to vary in length between homologues
• How close? (distance from query sequence)
   • All too close- no information
   • Too many too far- bad alignment
   • Ensure that you have a balanced collection!
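
The "how long" and "how close" tips can be sketched as a simple filter over an existing alignment. The thresholds below (60% of the query length, 35% to 95% identity to the query) are assumptions picked for the example, not values prescribed by the workshop, and the sequences are made up.

```python
def ungapped_length(aligned_seq):
    """Number of residues in an aligned sequence, ignoring '-' gaps."""
    return sum(1 for c in aligned_seq if c != "-")


def percent_identity(aligned_a, aligned_b):
    """Identity over the columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def filter_homologues(msa, query_name, min_length_fraction=0.6,
                      min_identity=35.0, max_identity=95.0):
    """Keep the query plus homologues that are neither fragments,
    nearly identical to the query, nor too distant from it."""
    query = msa[query_name]
    query_len = ungapped_length(query)
    kept = {query_name: query}
    for name, seq in msa.items():
        if name == query_name:
            continue
        if ungapped_length(seq) < min_length_fraction * query_len:
            continue  # fragment: too short relative to the query
        if min_identity <= percent_identity(query, seq) <= max_identity:
            kept[name] = seq
    return kept


msa = {
    "query": "MKTAYIAKQR",
    "close": "MKTAYIAKQR",   # identical to the query: adds no information
    "frag":  "MKT-------",   # fragment
    "good":  "MKSAYLAKQK",   # 70% identical
}
kept = filter_homologues(msa, "query")
print(sorted(kept))
```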
           3. Aligning query and
            template sequences
                Tips for MSA building
• From whom? (which species the sequence belongs to)
   • Doesn't matter, all homologues are welcome
   • Orthologues/paralogues may be helpful
   • Sequences from distant/close species provide different
   types of information

• Which alignment method?
   • The best today are MUSCLE, T-Coffee and MAFFT. All
   available at
        3. Aligning query and
         template sequences
            Tips for MSA building
• Most importantly, make sure that both the query
and the selected template are included in the MSA.

• Sequences that are more distant than the template
need not be included in the alignment.
            3. Aligning query and
             template sequences
         Query-template alignment
      via a profile-to-profile approach:
1. Construct an MSA for the query, serving as a profile depicting
the protein family properties.

2. Align the profile to profiles of all proteins of the PDB, using,
e.g., FFAS03 or HHpred.

3. Compare pairwise alignments constructed via the different
methods – hope to get a consensus prediction…
        3. Aligning query and
         template sequences
Different levels of similarity between the template & query
        initiate various computational approaches:
                  4. Building a model
     Once you have an improved pairwise
  alignment between your query & template

           Use Modeller to build your model!

A. Sali & T.L. Blundell. Comparative protein modelling by satisfaction of spatial
restraints. J. Mol. Biol. 234, 779-815, 1993.
              4. Building a model

    Generation and Refinement
    Using satisfaction of spatial restraints
    Can perform additional tasks:
     de novo modeling of loops
     Optimization of models – using an objective function
     Multiple alignment
     Comparison of protein structures
                4. Building a model


• Other spatial features, such as
  hydrogen bonds, and dihedral angles,
  are transferred from the templates to
  the target.

• Thus, a number of spatial restraints
  on its structure are obtained.

• The 3D model is obtained by
  satisfying all the restraints as well as
  possible.
                  4. Building a model

• Distance and dihedral angle restraints on the target are
 calculated from its alignment with template.

• Restraints were also obtained from a statistical analysis of the
  relationships in a large database of pairs of homologous structures.

• Various correlations were obtained, e.g. correlations between
  Cα-Cα distances. These relationships can be used directly as spatial
  restraints.

• Restraints and CHARMM energy terms are then combined into an
  objective function, which is optimized in 3D space.
         5. Model Evaluation
• The accuracy of the model depends on its
  sequence identity with the template:
         5. Model Evaluation
    The model can be assessed in two levels:

•   Global- reliability of the model as a whole.
    *Useful when several models are generated and
    one should be chosen as the best one.
    *When different models were based on various
    templates, may help choose the best one.

•   Local- assessing the reliability of the different
    regions, even specific residues, of the model.
    *Useful to detect local mistakes, which
    often originate from alignment errors.
          5. Model Evaluation
        Examples of assessment approaches:

1. Assessment of the model's stereochemistry

2. Prediction of unreliable regions of the model -
   “pseudo energy” profile: peaks -> errors

3. Consistency with experimental observations

4. Consistency with evolutionary conservation rates
5 Basic Steps
Hands ON
            The Query Protein
Name: Dihydrodipicolinate reductase

Enzyme reaction:

Molecular process: Lysine biosynthesis (early stages)

Organism: E. coli

Sequence length: 273 aa
1. Searching For Structures
    1. Searching For Structures

                 Get your sequence

   1. Searching For Structures
Find templates with significant homology:

• BLAST against the sequences in the PDB

Find also more distant templates, using profile-to-
profile approach:

  • FFAS03 server
  • HHPRED server
1. Searching For Structures
         Blast against the PDB
1. Paste the sequence
2. Select the PDB database

1. Searching For Structures
            Use fold recognition - FFAS03
1. Paste the sequence
2. Select the PDB database
     1. Searching For Structures
                 Use fold recognition - HHPRED
1. Paste the sequence
2. Select the PDB database

2. Selecting templates
     Blast against the PDB
(Screenshot annotations: the real structure of our protein; closest homologous; the selected template: 1VM6, chain A)
    2. Selecting templates
            Use fold recognition - FFAS03
2. Selecting templates
  Use fold recognition - FFAS03

Scores below -9.5 -> significant
 2. Selecting templates
         Use fold recognition - HHPRED
2. Selecting templates
  Use fold recognition - HHPRED
2. Selecting templates
        Who is our template?

                                PDB ID 1VM6 is
                                 UniProt entry
3. Alignment
    3. Alignment
3. Alignment

               No model

           We will use ConSurf to
            get homologues and
               build an MSA
                3. Alignment

(Screenshot annotations: set to max- 500; min. identity; database: Swissprot/uniprot/)
3. Alignment

               Job name
3. Alignment
3. Alignment

                          PSIBLAST result

                          Filtered sequences

    MSA- download the file- right
       click on the mouse
              Easiest Using Bioedit

• Easy-to-use sequence alignment editor

• View and manipulate alignments up to 20,000 sequences.

•Four modes of manual alignment: select and slide, dynamic grab
and drag, gap insert and delete by mouse click, and on-screen
typing which behaves like a text editor.

•Reads and writes Genbank, Fasta, Phylip 3.2, Phylip 4, and
NBRF/PIR formats. Also reads GCG and Clustal formats
                Easiest Using Bioedit
• Find a specific sequence: “Edit -> Search -> Find in Titles”

• Erase/add sequences: “Edit -> cut/paste/delete sequence”

• “Sequence Identity matrix” under “Alignment”-
   useful for a rough evaluation of distances within the alignment.

• After taking out sequences, “Minimize Alignment” under
  “Alignment” takes out unnecessary gaps.

• Can save an image using:
  “File -> Graphic View” & then “Edit -> Copy page as BITMAP”

                    3. Alignment
        Extract query-template pairwise alignment

1. Open: Start -> Phylogeny -> BioEdit

2. Open the alignment: File -> Open -> 'query.aln'

3. Select the template:
          Edit -> Search -> Find in Titles -> “DAPB_THEMA”
         3. Alignment
Extract query-template pairwise alignment

                      3. Alignment
         Extract query-template pairwise alignment

4. Add the query to the template selection: ctrl + 'query'

5. Invert selection: Edit -> Invert title selection

6. Delete other sequences: Edit -> Cut Sequence(s)

7. Minimize gaps: Alignment -> Minimize Alignment

8. Save the pairwise alignment:
   File -> Save as (Fasta format) -> “DAPB_ECOLI_1VM6.fas”
                        3. Alignment
        Extract query-template pairwise alignment


                                       File name

Save in FASTA format!
     3. Alignment
  Use fold recognition - FFAS03

Scores below -9.5 -> significant
                         3. Alignment
                    Use fold recognition - FFAS03
             3. Alignment
         Use fold recognition - HHPRED
    3. Alignment
Use fold recognition - HHPRED
                   3. Alignment
       Inspect query-template pairwise alignment
• Generally speaking, in this step we would compare the
  pairwise alignments computed by the three approaches:
   • MSA-derived
   • FFAS03
   • HHpred

• We don't have the time/patience for that now….

• Thus, we will now edit the pairwise alignment from the MSA. Modeller
  requires a specific format, which we have to adjust manually
                       3. Alignment
            Edit query-template pairwise alignment
Header lines of the PIR file (each pair is followed by the aligned sequence):

>P1;DAPB_ECOLI
   The name of the query protein (this will be the name of the modeled PDB file)
sequence:DAPB_ECOLI:1:A:274:A ::::
   Start, end and chain

>P1;1VM6
   The PDB file of the template (rename DAPB_THEMA)
structureX:1VM6:1:A:212:A ::::

                       Save as “dapb_ecoli_1vm6.pir”
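
The manual edit can also be scripted. The sketch below formats an aligned query/template pair as PIR text with the ten colon-separated header fields Modeller expects. The helper name `fasta_pair_to_pir` and its defaults are hypothetical; it assumes the template residues are numbered continuously from `template_start`, and it leaves the optional fields of the query header empty.

```python
def fasta_pair_to_pir(query_name, query_aln, template_code, template_aln,
                      template_chain="A", template_start=1, template_end=None):
    """Format an aligned query/template pair as a Modeller-style PIR string."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    if template_end is None:
        # Assumption: template numbering is continuous from template_start.
        n_residues = sum(c != "-" for c in template_aln)
        template_end = template_start + n_residues - 1
    lines = [
        f">P1;{template_code}",
        f"structureX:{template_code}:{template_start}:{template_chain}:"
        f"{template_end}:{template_chain}::::",
        template_aln + "*",          # every PIR sequence ends with '*'
        "",
        f">P1;{query_name}",
        "sequence:" + query_name + ":" * 8,   # remaining fields left empty
        query_aln + "*",
    ]
    return "\n".join(lines) + "\n"


# Made-up four-column alignment, just to show the layout.
pir_text = fasta_pair_to_pir("DAPB_ECOLI", "MK-V", "1VM6", "MKAV")
print(pir_text)
```

The residue range and chain written for the template must agree with the PDB file, so check them against the actual 1VM6 file rather than relying on the continuous-numbering assumption.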
4. Model Building
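A script for Modeller - copy to a text file:
# Build one comparative model of DAPB_ECOLI from the template 1VM6
from modeller import *
from modeller.automodel import *

env = environ()   # create the Modeller environment

a = automodel(env,
              alnfile='dapb_ecoli_1vm6.pir',   # query-template alignment (PIR format)
              knowns='1VM6',                   # template code, as named in the PIR file
              sequence='DAPB_ECOLI')           # query code, as named in the PIR file
a.starting_model = 1   # index of the first model to build
a.ending_model = 1     # index of the last model (1 = build a single model)
a.make()               # run the modelling; without this call no model is written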

4. Model Building
 Get the template structure
1. Paste the PDB ID “1VM6”

                  4. Model Building
              Get the template structure: 1vm6 chain A

  Save as:

          4. Model Building
                 Running modeller:

1. Put the PDB file, PIR alignment and modeller
    script in a specific directory, e.g. c:\test
2. Desktop -> Modeller:
          4. Model Building
                Running modeller:

3. “cd c:\test”
4. “mod9v7 [modeller script name]”
          4. Model Building
                Running modeller:

5. The run completed successfully:
             4. Model Building
                    Running modeller:
6. Output files:
   • Model, e.g. “DAPB_ECOLI.B99990001.pdb” (named after the query entry in the PIR file)
   • Log file- very important- specifies the problems of
       the run
   • Other, not important, files

7. Open PyMOL and look at your model….

8. Evaluate it- tomorrow!
             4. Model Building
         Edit query-template pairwise alignment

Watch out! Modeller can fail owing to:

1. Non-matching start and end points of the template
   between the PIR alignment and the PDB template file

2. Small discrepancies between the template sequence in the
   PDB file and in the PIR alignment… you may have to
   manually edit the alignment a little…

This, and more, will be reported in the log file.
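
Before running Modeller, the second kind of problem can be checked with a few lines of code. This sketch compares the template row of the alignment (gaps removed) with the sequence actually present in the PDB file. How `pdb_sequence` is obtained from the PDB file is left out, and the helper name is hypothetical.

```python
def first_mismatch(aligned_template, pdb_sequence):
    """Compare the gap-free template row of the alignment with the PDB sequence.

    Returns None if they are identical, otherwise the 0-based index of the
    first position where they differ (or where one of them ends early).
    """
    ungapped = aligned_template.replace("-", "").rstrip("*")
    for i, (a, b) in enumerate(zip(ungapped, pdb_sequence)):
        if a != b:
            return i
    if len(ungapped) != len(pdb_sequence):
        return min(len(ungapped), len(pdb_sequence))
    return None


# Made-up sequences: a matching pair, a substitution, and a truncated alignment row.
print(first_mismatch("MK-AV*", "MKAV"))
print(first_mismatch("MKAV", "MKCV"))
print(first_mismatch("MKA", "MKAV"))
```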