From PSI-BLAST to HMMer

    Professor Mark Pallen

                  Stephanie Minnema University of Calgary
                     David Wishart University of Alberta
    Advanced BLAST Methods
•   The NCBI BLAST pages have several advanced BLAST
    methods available
•   All are powerful methods based on protein similarities
Position-Specific Iterated (PSI) BLAST
• Intuition
  -   Substitution matrices should be specific to a particular site.
      ‣   e.g. penalize alanine→glycine more in a helix
• Idea
  -   Use BLAST with high stringency to get a set of closely related sequences
  -   Align sequences to create a new substitution matrix for each position
  -   Use that matrix to find additional sequences
• Cycling/iterative method
  -   Gives increased sensitivity for detecting distantly related sequences
  -   Can give insight into functional relationships
  -   Very refined statistical methods
  -   Fast: still based on BLAST
  -   Simple to use
        PSI-BLAST Principle
• First, a standard blastp is performed
• The highest scoring hits are used to generate a multiple alignment
• A PSSM is generated from the multiple alignment.
  -   Highly conserved residues get high scores
  -   Less conserved residues get lower scores

• Another similarity search is performed, this time using the
  new PSSM
• Steps 2-4 can be repeated until convergence
  -   No new sequences appear after iteration
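The cycle above can be sketched on toy data. This is a minimal illustration, not the PSI-BLAST implementation: it assumes ungapped sequences of equal length, a uniform background frequency, a simple pseudocount, and a raw score threshold in place of an E-value. The first pass uses a PSSM built from the query alone as a stand-in for the initial blastp. All function names and sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, sequence):
    """Sum the position-specific scores of an ungapped, same-length sequence."""
    return sum(column[res] for column, res in zip(pssm, sequence))

def iterate(query, database, threshold, max_iterations=10):
    """Search, rebuild the PSSM from the hits, repeat until no new hits."""
    included = {query}
    for _ in range(max_iterations):
        pssm = build_pssm(sorted(included))
        hits = {seq for seq in database if score(pssm, seq) >= threshold}
        if hits | {query} == included:
            break  # convergence: no new sequences appear
        included = hits | {query}
    return included

# Toy database (hypothetical sequences, not real proteins)
toy_database = ["GKTTL", "GRSAL", "GKSSV", "GRTAV", "WWWWW"]
first_pass = build_pssm(["GKSTL"])
print("first-pass score of GRTAV:", round(score(first_pass, "GRTAV"), 2))
print("included after iterating:",
      sorted(iterate("GKSTL", toy_database, threshold=3.0)))
```

In this toy run the sequence GRTAV scores below the threshold against the query alone and is only picked up once closer relatives have been folded into the PSSM, which is the effect the iteration is meant to achieve.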
               Aminoacyl tRNA Synthetases
•   20 enzymes for 20 amino acids
•   Each is very different
    - Big, small, monomers, tetramers…
    - All bind to their appropriate tRNAs and amino acids, with high
      specificity

•   TrpRS and TyrRS share only 13% sequence identity
    - BUT, overall structures of TrpRS and TyrRS are similar
    - Structure → Function relationship
[Figure: TyrRS and TrpRS structures. Same SCOP family, based on catalytic domain; overall structure similarity noted]
So is there sequence similarity between TyrRS and TrpRS?
•   Given structural similarities, we would expect to
    find sequence similarity…
•   BUT!
    - blastp of E. coli TyrRS against bacterial sequences in SwissProt
      does NOT show similarity with TrpRS at an E-value cutoff of 10
No TrpRS!?
Try Using PSI-BLAST…
•   PSI-BLAST available from BLAST main page
•   Query form just like for blastp
    - BUT: one extra formatting option must be used
    - “Format for PSI-BLAST” – activate the tick box!
    - Second E-value cutoff used to determine which
      alignments will be used for PSSM build: “Threshold for
      inclusion”

•   First search using TyrRS as query
    - Db = SwissProt; limit = Bacteria [ORGN]
    - Threshold for inclusion = 0.005
After A Few Iterations…
TyrRS Similarity to TrpRS!
         Power of PSI-BLAST
•   We knew TyrRS and TrpRS were similar
    - Functionally and structurally
•   BLASTP gave no indication
    - PSI-BLAST was able to detect their weak sequence similarity
•   Words of caution:
    - be sure to inspect and think about the results included in
      the PSSM build
    - include/exclude sequences on basis of biological
      knowledge: you are in the driving seat!
    - PSI-BLAST performance varies according to choice of
      matrix, filter, statistics etc., just like BLASTP
      Why (not) PSI-BLAST
•   If the sequences used to construct the Position
    Specific Scoring Matrices (PSSMs) are all homologous,
    the sensitivity at a given specificity improves
•   However, if non-homologous sequences are included
    in the PSSMs, they are “corrupted.” Then they pull in
    more non-homologous sequences, and become
    worse than generic
•   Does the query really have a relationship with the
•   One way to check is to run the search in the opposite direction
    - …but maybe not reversible even when true homology
           PSI-BLAST caveats
•   Increased ability to find distant homologues

•   Cost of additional required care to prevent non-
    homologous sequences from being included in the PSSM
    -   When in doubt, leave it out!

    -   Examine sequences with moderate similarity carefully.

•   Be particularly cautious about matches to sequences with
    highly biased amino acid content
    -   Low complexity regions, transmembrane regions and coiled-coil regions often
        display significant similarity without homology

    -   Screen them out of your query sequences!
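One way to picture the screening step is a sliding-window entropy filter. The sketch below is a simplification for illustration only: it is not the SEG algorithm used by BLAST filtering, and the window size and entropy threshold are arbitrary assumptions. The function names and the example sequence are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace residues in low-entropy windows with 'X'.

    Assumption: window size and threshold are illustrative values only.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

# Hypothetical query with a poly-serine run in the middle
query = "MKTAYIAKQR" + "S" * 14 + "QISFVKSHFSRQ"
print(mask_low_complexity(query))
```

Residues flanking the biased run can also be masked, because any window that overlaps the run enough to fall below the threshold is masked in full.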
                      PSI-BLAST on the command line
•   as with simple BLAST searches, using PSI-BLAST on the
    command line gives the user more power
•   opens up additional options, e.g.
    - PSI-BLASTing over nucleotide databases
    - automating number of iterations
    - trying out lots of different settings in parallel
    - inputting multiple sequences
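As an illustration of trying several settings, the sketch below builds a small grid of command lines. It assumes the NCBI BLAST+ `psiblast` program (the slides may have been written for an older command-line tool) and uses its `-query`, `-db`, `-num_iterations`, `-inclusion_ethresh` and `-out` options. The file and database names are hypothetical, and the commands are only printed, not run.

```python
import itertools
import shlex

def psiblast_command(query, db, iterations, inclusion_ethresh, out):
    """Build one BLAST+ psiblast command line as an argument list."""
    return [
        "psiblast",
        "-query", query,
        "-db", db,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-out", out,
    ]

# Hypothetical input file and database name
commands = [
    psiblast_command("tyrRS.fasta", "swissprot", n, e, f"tyrRS_j{n}_h{e}.txt")
    for n, e in itertools.product([3, 5], [0.005, 0.001])
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list could be handed to `subprocess.run` on a machine where BLAST+ and a formatted database are installed.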
PHI-BLAST
•   Pattern Hit Initiated BLAST
•   PHI-BLAST principle:
    - Same method as PSI-BLAST
    - Starts first search with query sequence + pattern for a
      motif in the query

•   PHI-BLAST finds sequences containing the motif
    and having significant sequence similarity in the
    vicinity of the motif occurrence
    - Highly specific
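The motif-finding half of this idea can be sketched by converting a PROSITE-style pattern into a regular expression and locating it in a sequence. This is a minimal sketch: it covers only the common syntax elements (`x`, `[...]`, `{...}`, repeats, `<` and `>` anchors), and it does not perform the alignment around the motif that PHI-BLAST goes on to do. The pattern and sequence below are toy examples, not the real class-I signature.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        match = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if match:                        # x(3) or x(2,4) style repeats
            element, repeat = match.group(1), "{" + match.group(2) + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # {ABC}: any residue except A, B, C
            core = "[^" + element[1:-1] + "]"
        else:                            # single residue or [ABC] class
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# Toy pattern and sequence (hypothetical)
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]")
hit = re.search(regex, "MACAACHTHLK")
print(regex, hit.span() if hit else None)
```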
               Example: TyrRS
•   TyrRS contains the aaRS class-I signature
•   Want to find sequences containing that motif,
    and regional similarity to TyrRS
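•   First: get the Prosite pattern for the class-I signature
    - Prosite = database of protein families and domains

[Screenshot: Prosite entry (aminoacyl-transfer RNA synthetases class-I signature)]
[Screenshot: PHI-BLAST form, with callouts “Insert Query” and “Insert PHI Pattern”]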
           PHI-BLAST Results
•   After first search, PHI-BLAST functions same as PSI-BLAST
•   Result page is the same
•   Can iterate in same way

•   Try it later if you like…
        Building on PSI-BLAST
•   PSI-BLAST generates the multiple alignments to
    create PSSMs
    - Refines scoring in searches
•   Annotated collections of multiple alignments
    defining domains exist
    - Conserved domain database (CDD)
    - Contains 18039 alignments (10013 last year)
•   Can search the CDD using CD search
    - Uses RPS-BLAST
•   Reverse Position Specific BLAST (RPS-BLAST)
    - Opposite of PSI-BLAST: a query sequence is searched against a
      database of PSSMs
•   CDD multiple alignments converted to PSSMs
•   PSSMs are processed and turned into a searchable database
•   Queries are searched against PSSMs using RPS-BLAST
•   Output indicates conserved domains within the query
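The direction of the search can be illustrated with a toy library of PSSMs scanned against a single query. This is a rough sketch under strong assumptions: ungapped scoring, a uniform background, and a brute-force scan of every window, whereas RPS-BLAST uses BLAST heuristics and reports E-values. The domain names, alignments and query are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds PSSM from an ungapped alignment (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def best_window(pssm, query):
    """Return (score, start) of the best ungapped placement of the PSSM."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        total = sum(column[res] for column, res in zip(pssm, window))
        best = max(best, (total, start))
    return best

# Hypothetical "domain database": one PSSM per toy alignment
domain_library = {
    "toy_domain_A": build_pssm(["GKSTL", "GKTTL", "GRSAL"]),
    "toy_domain_B": build_pssm(["WCHHD", "WCHYD", "FCHHD"]),
}

query = "MAAGKSTLPPEEWCHHDQ"
for name, pssm in domain_library.items():
    score, start = best_window(pssm, query)
    print(name, "best start:", start, "score:", round(score, 2))
```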
Example: CRADD
•   Click on picture to see CDD multiple alignment
•   Click to see alignment with query
Profile Hidden Markov Models
•   statistical models of multiple sequence alignments
•   capture position-specific information about
    - how conserved each column of the alignment is
    - which residues are likely
•   use position-specific scores for amino acids (or nucleotides)
•   use position-specific penalties for opening and extending an
    insertion or deletion
Applications of profile HMMs
•   Database searching for weak homologies
    - Alternative to PSI-BLAST
•   Automated annotation of the domain structure of proteins
Applications of profile HMMs
• Useful for organizing sequences into evolutionarily related families
• Databases like Pfam constructed by distinguishing between
  - a stable curated “seed” alignment of a small number of
      representative sequences

  -   “full” alignments of all detectable homologs

• HMMER used to
  - make a model of the seed
  - search the database for homologs
  - automatically produce the full alignment by aligning every
      sequence to the seed consensus
Constructing a profile HMM
•   multiple sequence alignment is made of known members of a given
    protein family
    -   quality of alignment, number and diversity of the sequences are crucial

•   profile HMM of family built from the alignment
    -   model-building program uses the alignment together with its prior
        knowledge of the general nature of proteins

•   model-scoring program used to assign a score with respect to the
    model to any sequence of interest
    -   better the score, the higher the chance that query sequence is
        homologous to protein family in the model.

    -   each sequence in a database scored to find the members of the family
        present in the database.
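The first part of model building can be sketched as deciding which alignment columns become match states and then reading each aligned sequence as a path of match (M), delete (D) and insert (I) states. This is a simplified illustration, not HMMER's model-building procedure: the 50% gap rule is an assumption, transitions are pooled over all positions rather than kept position-specific, and no priors are applied. The alignment is a toy example.

```python
from collections import Counter

def match_columns(alignment, max_gap_fraction=0.5):
    """Columns with few enough gaps are treated as match states."""
    n = len(alignment)
    return [
        i for i, column in enumerate(zip(*alignment))
        if column.count("-") / n <= max_gap_fraction
    ]

def state_path(aligned_seq, match_cols):
    """Translate one aligned sequence into a string of M/D/I states."""
    match_set = set(match_cols)
    path = []
    for i, ch in enumerate(aligned_seq):
        if i in match_set:
            path.append("M" if ch != "-" else "D")
        elif ch != "-":
            path.append("I")  # residue in a non-match column
    return "".join(path)

def transition_counts(alignment, match_cols):
    """Count state-to-state transitions, pooled over all positions."""
    counts = Counter()
    for seq in alignment:
        path = state_path(seq, match_cols)
        counts.update(zip(path, path[1:]))
    return counts

# Toy alignment (hypothetical)
toy_alignment = ["GK-STL", "GKASTL", "GK-TTL", "G--STL"]
cols = match_columns(toy_alignment)
print("match columns:", cols)
print("paths:", [state_path(s, cols) for s in toy_alignment])
print("transitions:", dict(transition_counts(toy_alignment, cols)))
```

A full profile HMM would turn such counts into per-position probabilities, combined with prior knowledge about proteins, as the slide notes.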
        Profile HMM programs
•   developed by Sean Eddy
•   freely available under GNU General Public License
•   includes model-building and model-scoring programs
    relevant to homology detection
•   contains a program that calibrates a model by
    - scoring it against a set of random sequences
    - fitting an extreme value distribution to the resultant raw scores
    - parameters of this distribution then used to calculate
      accurate E-values for sequences of interest.
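The calibration idea can be sketched numerically. The example below is an assumption-laden illustration, not HMMER's code: the "scores against random sequences" are simulated directly from a Gumbel (extreme value) distribution, and the fit uses the method of moments for brevity rather than whatever fitting procedure HMMER applies. All names and parameter values are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta

def e_value(score, mu, beta, database_size):
    """Expected number of random sequences scoring at least `score`."""
    p_value = -math.expm1(-math.exp(-(score - mu) / beta))
    return database_size * p_value

def simulated_random_scores(n, mu, beta, seed=0):
    """Stand-in for scoring a model against n random sequences."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)
        scores.append(mu - beta * math.log(-math.log(u)))  # inverse Gumbel CDF
    return scores

scores = simulated_random_scores(5000, mu=10.0, beta=2.0)
mu, beta = fit_gumbel(scores)
print("fitted mu, beta:", round(mu, 2), round(beta, 2))
for s in (15, 20, 30):
    print("score", s, "E-value", e_value(s, mu, beta, database_size=100000))
```

The fitted parameters are then reused for every real search with that model, so a raw score can be converted to an E-value without rescoring random sequences each time.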
Programs in the HMMER 2 package
•   hmmalign
    -   Align sequences to existing model
•   hmmbuild
    -   Build a model from multiple sequence alignment.
•   hmmcalibrate
    -   Takes an HMM and empirically determines parameters used to make
        searches more sensitive by calculating more accurate E-values
•   hmmconvert
    -   Convert a model file into different formats, including a compact
        HMMER 2 binary format, and “best effort” emulation of GCG profiles.
•   hmmemit
    -   Emit sequences probabilistically from a profile
•   hmmfetch
    -   Get a single model from an HMM database.
•   hmmindex
    -   Index an HMM database.
•   hmmpfam
    -   Search an HMM database for matches to a query sequence.
•   hmmsearch
    -   Search a sequence database for matches to an HMM.
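A typical build, calibrate and search sequence with these programs might be assembled as below. This is a sketch that assumes the HMMER 2 argument order (`hmmbuild <hmmfile> <alignment>`, `hmmcalibrate <hmmfile>`, `hmmsearch <hmmfile> <sequence db>`). The file names are hypothetical, and the commands are only printed, not run.

```python
import shlex

def hmmer2_pipeline(alignment, hmm_file, sequence_db):
    """Return the build, calibrate and search steps as argument lists."""
    return [
        ["hmmbuild", hmm_file, alignment],     # model from the alignment
        ["hmmcalibrate", hmm_file],            # fit E-value parameters
        ["hmmsearch", hmm_file, sequence_db],  # score every database sequence
    ]

# Hypothetical file names
steps = hmmer2_pipeline("family.aln", "family.hmm", "swissprot.fasta")
for step in steps:
    print(shlex.join(step))
```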
     Advantages of HMMs
•   HMMs have a formal probabilistic basis
    - use probability theory to guide how all the scoring parameters
      should be set

    - can do things that more heuristic methods cannot do easily
      ‣ For example, a profile HMM can be trained from unaligned
         sequences, if a trusted alignment isn’t yet known

•   HMMs have a consistent theory behind gap and
    insertion scores
    Advantages of HMMs
•   In most details, profile HMMs are a slight improvement
    over a carefully constructed profile
    - but less skill and manual intervention are necessary to use
      profile HMMs

•   HMMs can produce true global alignments, unlike
Limitations of HMMs
•   do not capture any higher-order correlations
    - assumes that the identity of a particular position is
      independent of the identity of all other positions
      ‣ make poor models of RNAs because an HMM cannot
         describe base pairs.

    - c.f. protein “threading” methods
      ‣ which usually include scoring terms for nearby amino acids
         in a three-dimensional protein structure.

•   slower than and less user-friendly than PSI-BLAST
Summary
•   PSI-BLAST
    - fast, sensitive method for finding distant homology
    - iterative, uses PSSMs
    - but GIGO (garbage in, garbage out)!
    - alignments used to create CDD database
•   HMMer
    - alternative to PSI-BLAST for finding distant homologies
    - but more cumbersome
    - alignments used to create Pfam database