Psi-blast by yurtgc548


Protein Structure and Function Prediction
    Predicting 3D Structure
      An outstanding, difficult problem

– Comparative modeling (homology)

– Fold recognition (threading)
       Comparative Modeling
Similar sequence suggests similar structure

Comparative structure prediction
produces an all-atom model of a
sequence, based on its alignment to one
or more related protein structures.
          Comparative Modeling
Modeling of a sequence based on known structures
Consists of four major steps:
1. Finding a known structure(s) related to the sequence
   to be modeled (template), using sequence comparison
   methods such as PSI-BLAST
2. Aligning sequence with the templates
3. Building a model
4. Assessing the model
        Comparative Modeling
• Accuracy of the comparative model is
  related to the sequence identity on which it is based:
  >50% sequence identity = high accuracy
  30%-50% sequence identity = 90% modeled
  <30% sequence identity = low accuracy (many errors)
• Similarity particularly high in core
  – Alpha helices and beta sheets preserved
  – Even near-identical sequences vary in loops
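
The identity thresholds above can be checked on a pairwise alignment with a short sketch like the one below. It assumes identity is counted only over columns where neither sequence has a gap (other definitions use a different denominator), and the function names and the "high"/"medium"/"low" labels are illustrative, not taken from any modeling package.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def accuracy_band(identity: float) -> str:
    """Map percent identity to the three bands quoted on the slide."""
    if identity > 50:
        return "high"      # >50% identity
    if identity >= 30:
        return "medium"    # 30%-50% identity
    return "low"           # <30% identity, many errors expected


if __name__ == "__main__":
    # Hypothetical target/template alignment, for illustration only.
    identity = percent_identity("ACDE-F", "ACDQWF")
    print(f"{identity:.1f}% identity -> {accuracy_band(identity)}")
```

The bands are a rule of thumb; the usable accuracy of a real model also depends on alignment quality and on how much of the sequence the template covers.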
Comparative Modeling Methods
MODELLER (Sali, Rockefeller/UCSF)
SCWRL (Dunbrack, UCSF)
                 Protein Folds

• A combination of secondary structural units
   – Forms basic level of classification
• Each protein family belongs to a fold
   – Estimated 1000–3000 different folds
   – A fold is shared among close and distant family members
• Different sequences can share similar folds
Protein Folds: sequential and spatial
arrangement of secondary structures

[Figure: example folds, hemoglobin and TIM]
Fold classification:
      All alpha
      All beta
Basic steps in Fold Recognition:
 Compare the sequence against a library of all known protein folds (finite number)

  Query sequence

  Goal: find to what folding template the sequence fits best

            Find ways to evaluate sequence-structure fit
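    Find best fold for a protein sequence:
         Fold recognition (threading)

[Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is scored against
each fold in the library, numbered 1) ... 56) ... n). Scores shown: -10 for
fold 1, -123 for fold 56, 20.5 for fold n. The label "Potential fold"
appears under fold 56.]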
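
The final selection step of fold recognition can be sketched as below. The sketch assumes each library fold has already been given a sequence-structure fit score and that a lower (more negative) score means a better fit, as the figure's scores suggest; how the scores are computed is the hard part and is not shown. The function name and the fold labels are illustrative.

```python
def best_fold(scores: dict[str, float]) -> tuple[str, float]:
    """Return the (fold, score) pair with the lowest score.

    Assumption: scores are energy-like, so lower means a better fit.
    """
    if not scores:
        raise ValueError("the fold library is empty")
    return min(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Scores taken from the figure; the fold labels are placeholders.
    scores = {"fold 1": -10.0, "fold 56": -123.0, "fold n": 20.5}
    fold, score = best_fold(scores)
    print(f"best-fitting template: {fold} (score {score})")
```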
    Programs for fold recognition
•   TOPITS (Rost 1995)
•   GenTHREADER (Jones 1999)
•   3D-PSSM
          Ab Initio Modeling
• Compute molecular structure from laws of
  physics and chemistry alone
  – Ideal solution (theoretically)
• Simulate process of protein folding
  – Apply minimum energy considerations
• Nearly impossible in practice
  – Exceptionally complex calculations
  – Biophysics understanding incomplete
          Ab Initio Methods
• Rosetta (Baker lab, Seattle)

• Undertaker (Karplus, UCSC)
         PART 2
Predicting Protein Function
    Inferring protein function:
• Based on the existence of known protein domains
• Based on homology
            Protein Domains
• Domains can be considered as building blocks
  of proteins.

• Some domains can be found in many proteins
  with different functions, while others are only
  found in proteins with a certain function.

• The presence of a particular domain can be
  indicative of the function of the protein.
DNA Binding domain
A protein domain can be defined by:
      • A motif
      • A profile (PSSM)
      • A Hidden Markov Model

Profile Scoring
• PROSITE is a database of protein domains that
  can be searched by either regular expression
  patterns or sequence profiles.
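
A motif written as a PROSITE-style pattern can be searched with an ordinary regular expression once the syntax is translated. The sketch below covers only the common elements: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for sequence ends. The function name is hypothetical, and the C2H2 zinc-finger-style pattern and the test sequence are for illustration only.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles a subset of the syntax; rarer constructs are not covered.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        if not element:
            continue
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat, e.g. x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        # "[...]" and single letters are already valid regex fragments.
        piece = core + ("{" + repeat + "}" if repeat else "")
        parts.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(parts)


if __name__ == "__main__":
    pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    regex = prosite_to_regex(pattern)
    print(regex)
    # Made-up sequence containing one occurrence of the motif.
    sequence = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
    hit = re.search(regex, sequence)
    print(hit.span() if hit else "no match")
```

A pattern gives a yes/no answer, which is why profiles and HMMs, which score every position, are used for more divergent domains.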

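   Profile HMM (Hidden Markov Model)
         An HMM is a probabilistic model of the MSA consisting
         of a number of interconnected states.

[Figure: profile HMM for alignment columns 16-19, with delete states
D16-D19, match states M16-M19, and insert states I16-I19 (each emitting
any residue X). Transition labels shown: 50% and 100%.]

Match-state emission probabilities:
   M16: D 0.8, S 0.2
   M17: P 0.4, R 0.6
   M18: T 1.0
   M19: R 0.4, S 0.6

Alignment rows shown (columns 16-19):
   DRTR, S--S, SPTR, DRTR, DPTS, D--S, D--S, D--S, D--R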
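
The match-state emission probabilities in the figure are residue frequencies taken from the alignment columns. The sketch below shows that calculation, assuming plain frequency counts with no pseudocounts or priors (real profile HMM builders add them). The ten-row alignment is a hypothetical toy, chosen so that the first two columns reproduce the values quoted for M16 and M17, and the function name is illustrative.

```python
from collections import Counter


def match_emissions(alignment: list[str], column: int, gap: str = "-") -> dict[str, float]:
    """Residue frequencies in one alignment column, ignoring gap characters."""
    residues = [row[column] for row in alignment if row[column] != gap]
    if not residues:
        return {}
    counts = Counter(residues)
    return {residue: n / len(residues) for residue, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy alignment of four columns (not the slide's data).
    alignment = [
        "DRTR", "DRTS", "S--S", "SPTR", "DRTR",
        "DPTS", "D--S", "D--S", "D--S", "D--R",
    ]
    for col in range(4):
        print(f"column {col}: {match_emissions(alignment, col)}")
```

Rows with a gap in a column do not contribute to that column's match state; in the HMM they pass through the corresponding delete state instead.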
• Pfam is a database that contains a large collection
of multiple sequence alignments and
profile hidden Markov models (HMMs).
• High-quality seed alignments are used to
build HMMs to which sequences are aligned

• The Pfam database is based on two
  distinct classes of alignments
  – Seed alignments which are deemed to be
    accurate and used to produce Pfam A
  – Alignments derived by automatic clustering of
    SwissProt, which are less reliable and give rise
    to Pfam B
  InterPro was built from protein
  classification databases, such as:
     • PROSITE
     • ProDom
     • SMART
     • Pfam
     • PRINTS

Uses UniProt (Swiss-Prot and TrEMBL)
     Database and Tools for protein
         families and domains
• InterPro - Integrated Resource of Protein Domains and Functional Sites
• PROSITE - A database of protein families and domains
• Pfam - Protein families db (HMM derived)
• PRINTS - Protein Motif fingerprint db
• ProDom - Protein domain db (Automatically generated)
• PROTOMAP - An automatic hierarchical classification of Swiss-Prot
• SBASE - SBASE domain db
• SMART - Simple Modular Architecture Research Tool
• TIGRFAMs - TIGR protein families db
Inferring protein function based on
        sequence homology
  Clusters of Orthologous Groups of proteins
   Classification of conserved genes according to their
   homologous relationships. (Koonin et al., NAR)
Homologs - Proteins with a common evolutionary origin

Orthologs - Proteins from different species that evolved by vertical
       descent (speciation).

Paralogs - Proteins encoded within a given species that arose from one or
       more gene duplication events.
Clusters of Orthologous Groups of proteins

Each COG consists of individual orthologous
  proteins or orthologous sets of paralogs from
  at least three lineages.

Orthologs typically have the same function,
 allowing transfer of functional information
 from one member to an entire COG.
       COGs - Clusters of Orthologous Groups
* All-against-all sequence comparison of the proteins encoded in
completed genomes (paralogs/orthologs)

* For a given protein “a” in genome A, if there are several similar
proteins in genome B, the most similar one is selected

* If, when protein “b” is used as the query, protein “a” in genome A
is selected as the best hit, then “a” and “b” can be included in a COG

* Proteins in a COG are more similar to other proteins in the COG
than to any other protein in the compared genomes

* A COG is defined when it includes at least three homologous
proteins from three distant genomes
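
The best-hit logic above can be sketched as follows. The sketch assumes the all-against-all comparison has already been run and reduced to a table of best hits per (query protein, target genome). It finds reciprocal best hits and then the smallest possible COGs: triangles of mutually best-hitting proteins from three different genomes. Merging triangles and handling paralogs are left out, and all names and data are hypothetical.

```python
from itertools import combinations


def reciprocal_pairs(best_hit: dict[tuple[str, str], str],
                     genome_of: dict[str, str]) -> set[frozenset[str]]:
    """Pairs (a, b) where b is a's best hit in b's genome and vice versa."""
    pairs = set()
    for (query, _target_genome), hit in best_hit.items():
        # Look up the best hit of `hit` back in the query's own genome.
        if best_hit.get((hit, genome_of[query])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs


def minimal_cogs(pairs: set[frozenset[str]],
                 genome_of: dict[str, str]) -> list[tuple[str, ...]]:
    """Triangles of reciprocal best hits spanning three different genomes."""
    proteins = sorted({p for pair in pairs for p in pair})
    cogs = []
    for trio in combinations(proteins, 3):
        three_genomes = len({genome_of[p] for p in trio}) == 3
        all_linked = all(frozenset(edge) in pairs for edge in combinations(trio, 2))
        if three_genomes and all_linked:
            cogs.append(trio)
    return cogs


if __name__ == "__main__":
    # Toy data: b2 is a second protein in genome B that is not a best hit of a1.
    genome_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
    best_hit = {
        ("a1", "B"): "b1", ("a1", "C"): "c1",
        ("b1", "A"): "a1", ("b1", "C"): "c1",
        ("c1", "A"): "a1", ("c1", "B"): "b1",
        ("b2", "A"): "a1", ("b2", "C"): "c1",
    }
    pairs = reciprocal_pairs(best_hit, genome_of)
    print(minimal_cogs(pairs, genome_of))
```

In the toy data, b2 is left out because a1's best hit in genome B is b1, so the a1-b2 relationship is not reciprocal.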
