                                      Multiple Sequence Alignment (MSA)
Bioe 144/244
Introduction to Protein Informatics




                              Figure: SATCHMO alignment (and tree) for beta adrenergic receptors
                                                     Source material
          •      Creighton, Proteins. Chapter 3: Evolutionary and Genetic Origins of
                 Protein Sequences. pp 105-137
                   – Presents the evolutionary mechanisms by which protein families develop
                     new folds and functions, and the functional and structural roles of individual
                     positions in molecules. Foundational material for all aspects of protein
                     informatics.

          •      Recommended:
                   – Thompson, Plewniak and Poch, “A comprehensive comparison of multiple
                     sequence alignment programs” NAR 1999
                            • Presents the BAliBASE benchmark alignment dataset and results of different
                              methods. Since it was published almost 10 years ago, it doesn’t include methods
                              developed more recently, but it’s still relevant. BAliBASE was the first benchmark
                              for multiple alignment accuracy and is still in use.
                   – Geoffrey Barton, in Bioinformatics: A practical guide to the analysis of
                     genes and proteins (course text). Chapter 12: Creation and Analysis of
                     Protein Multiple Sequence Alignments. pp 325-336
                            • Excellent overview of many issues in multiple sequence alignment and
                              interpretation by one of the top people in the field.


                                      What is an MSA?




      An MSA is an assertion of homology across >2 nucleotide or amino acid
      sequences. An MSA asserts that:
      • Sequences have a common ancestor
      • Characters in the same column have descended from a common ancestral character
           • The exception: indel characters, which represent a deletion or insertion
      • The impact of these assertions not being supported by the alignment may be
      large or small, depending on the intended use
           • Alignment editing/masking is employed to ameliorate conflicts between the
           data and assumptions

      An MSA is a matrix M, with r rows and c columns.
      • M[i,j] = the character for sequence i at column j.
      • In an MSA for proteins, characters are drawn from the set {a, c, d, e, …, y, A, C,
      …, Y, ., -}
            • Lower-case characters may have different meanings from upper-case
            characters
            • Dot (.) and dash (-) are indel characters (and may have different meanings
            from each other in some cases)
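
      The matrix view can be made concrete with a short Python sketch. It is only
      an illustration: the function names (check_rectangular, column) are made up
      for this example, indices are 0-based rather than the 1-based convention a
      slide would normally imply, and the toy rows are the four-sequence example
      from the "Uses of sequence alignment" slide.

```python
# Toy MSA: one string per sequence (row); every row has the same width.
msa = ["GFKLP", "GYKLP", "GFRVP", "GF-LP"]

# Dot and dash are both treated as indel characters in this sketch.
INDELS = {".", "-"}


def check_rectangular(rows):
    """Return (r, c); raise ValueError if the rows differ in length."""
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError("all rows of an MSA must have the same number of columns")
    return len(rows), widths.pop()


def column(rows, j):
    """Characters M[i][j] for every sequence i at column j (0-based)."""
    return [row[j] for row in rows]


r, c = check_rectangular(msa)
# Columns in which at least one sequence carries an indel character.
gap_cols = [j for j in range(c) if any(ch in INDELS for ch in column(msa, j))]
```

      Storing rows as equal-length strings keeps the "same number of characters
      per sequence" requirement easy to check before any column-wise analysis.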
                                                          Uses of sequence alignment



      Example alignment:
            GFKLP
            GYKLP
            GFRVP
            GF-LP

      Uses:
      • Phylogenetic tree
      • Active and binding site prediction
      • Homology models
      • Profiles/HMM construction
      • Domain prediction
      • Substitution matrices
      • Function prediction by phylogenomic analysis
      • Secondary structure prediction
      • Subfamily identification
      • And more…
                                      Domain shuffling and gene fusion/fission events

      Figure labels: Leucine-Rich Repeat (LRR); Toll-Interleukin Receptor (TIR) domain

      Sequences containing domains found in different types of proteins complicate
      homolog detection and function prediction
                                      Protein superfamilies exhibit structural and functional diversity

                                      Local, glocal and global-global

      • Local-local
          Best for boosting remote homolog detection and for identifying
          evolutionary domains. Default protocol of BLAST and PSI-BLAST.

      • Global-local (aka glocal)
          Global to the query, potentially local to the hit. Best for gathering
          homologs to a structural domain.

      • Global-global
          Restricts sequences to those appearing to have the same domain
          architecture. Recommended for phylogenomic inference of molecular
          function. Default protocol of FlowerPower and PhyloBuilder.
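
      One rough way to tell the three cases apart after the fact is to look at
      how much of the query and of the hit a pairwise match covers. The Python
      sketch below is an illustration only: the function names and the 0.8
      coverage cutoff are assumptions made for this example, not settings taken
      from BLAST, FlowerPower or PhyloBuilder.

```python
def coverage(start, end, length):
    """Fraction of a sequence covered by an aligned region (1-based, inclusive)."""
    return (end - start + 1) / length


def alignment_mode(query_cov, hit_cov, min_cov=0.8):
    """Label a pairwise match by coverage.

    min_cov is an illustrative cutoff chosen for this sketch, not a published one.
    """
    if query_cov >= min_cov and hit_cov >= min_cov:
        return "global-global"
    if query_cov >= min_cov:
        # Global to the query, local to the hit.
        return "glocal"
    # Anything that fails to cover the query is treated as a local match here.
    return "local-local"


# A 100-residue query aligned over residues 1-95 to residues 10-104 of a
# 400-residue hit: nearly all of the query, under a quarter of the hit.
mode = alignment_mode(coverage(1, 95, 100), coverage(10, 104, 400))
```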
                                      SAM a2m format




      The UCSC SAM HMM software uses a specialized
      format for alignments, to describe how a sequence
      was emitted by an HMM (or, equivalently, aligns to an
      HMM). a2m format MSA columns are of two types:
      • Columns consisting of upper-case characters and
      dashes correspond to nodes in the HMM representing
      the consensus structure
            • Dashes are placed to indicate passage through
            an HMM skip/delete state
            • I.e., a dash indicates a sequence does not have
            the consensus structure at that position
      • Columns consisting of lower-case characters and
      dots correspond to residues emitted in HMM insert
      states, representing inserts between positions in the
      consensus structure
            • Dots are inserted post-hoc so that all
            sequences in the MSA have the same number of
            characters.
            • Dots in one sequence indicate that another
            sequence has inserted characters using an
            insert state at that position
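
      The two column types can be separated with a few lines of Python. This is a
      minimal sketch that follows only the rules stated above; the function names
      and the three toy rows are invented for the example, and it is not a parser
      for complete SAM a2m files (no header handling).

```python
def match_columns(a2m_row):
    """Keep only consensus columns: upper-case residues and dashes."""
    return "".join(ch for ch in a2m_row if ch.isupper() or ch == "-")


def inserted_residues(a2m_row):
    """Residues emitted from insert states: the lower-case letters."""
    return "".join(ch for ch in a2m_row if ch.islower())


# Hypothetical rows: four consensus columns (G, F/Y/-, L, P) with an insert
# region between the second and third consensus positions.
rows = ["GFkeLP", "GY..LP", "G-a.LP"]

consensus_only = [match_columns(row) for row in rows]
inserts = [inserted_residues(row) for row in rows]
```

      Dropping the lower-case and dot columns leaves every sequence with one
      character per HMM node, which is why the consensus-only rows all have the
      same length.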



                                      Some alignments are easy




                           Note: no gaps, high sequence identity.



                                      Gaps cause problems




                                         Caveats
         • Sequence “signal” guides the alignment
         • If the signal is weak, the alignment is likely to be poor
         • As proteins diverge from a common ancestor, their
           structures and functions can change
                  – All methods perform poorly when the input includes pairs with <25%
                    identity
                  – Even structural superposition can be challenging!
         • Unequal numbers of repeats, domain shuffling, and large
           insertions or deletions introduce significant alignment
           errors
         • Take care in selecting sequences
         • See Geoff Barton’s recommendations on iterative
           alignment editing and re-alignment (from text)
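
         A quick screen for the <25% identity problem is to compute pairwise
         percent identity from an existing alignment and list the pairs that fall
         below the cutoff. The Python sketch below makes one assumption that
         should be stated up front: identity is computed over columns where both
         sequences have a residue, which is only one of several conventions in
         use. The function names and toy sequences are invented for the example.

```python
from itertools import combinations

GAPS = {"-", "."}


def percent_identity(a, b):
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in GAPS and y not in GAPS]
    if not pairs:
        return 0.0
    return 100.0 * sum(x.upper() == y.upper() for x, y in pairs) / len(pairs)


def low_identity_pairs(msa, cutoff=25.0):
    """Name pairs whose identity in the alignment falls below the cutoff."""
    return [(n1, n2)
            for (n1, s1), (n2, s2) in combinations(msa.items(), 2)
            if percent_identity(s1, s2) < cutoff]


# Toy aligned sequences keyed by hypothetical names.
toy = {"a": "GFKLP", "b": "GYKLP", "c": "AW-MC"}
flagged = low_identity_pairs(toy)
```

         Pairs reported this way are candidates for closer inspection or removal
         before the alignment is used downstream.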
                                      Tree construction and alignment are closely linked

                                      Tree and alignment accuracy

            • Alignment methods are assessed for
              accuracy relative to structural alignment

            • Tree methods are assessed via
              simulation studies (which do not
              adequately assess the kind and
              degree of variability observed in protein
              families)

                                      Structural alignment is the gold standard

               • Structural superposition of two PDB structures
                 provides correspondences/equivalences between
                 residues
               • Since primary sequence diverges more rapidly than
                 3D structure, structural alignment is the gold standard
                 against which sequence alignment is assessed
               • Not all structural aligners agree on all pairs
               • However, clearly superposable pairs (within 2.5
                 Angstroms) are normally agreed upon by structural
                 aligners
               • Example structural aligners include: CE, DALI, VAST,
                 Structal
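
               The notion of "clearly superposable" positions can be expressed as
               the percentage of equivalenced C-alpha pairs that lie within a
               distance cutoff. The Python sketch below assumes the two structures
               have already been superposed and that residue equivalences are
               given as two parallel coordinate lists; finding the superposition
               and the equivalences is the job of the structural aligner and is
               not shown. The function name and toy coordinates are invented for
               the example.

```python
import math


def percent_superposable(coords_a, coords_b, cutoff=2.5):
    """Percent of equivalenced C-alpha pairs no farther apart than cutoff (Angstroms)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    close = sum(math.dist(p, q) <= cutoff for p, q in zip(coords_a, coords_b))
    return 100.0 * close / len(coords_a)


# Toy coordinates: the four pairs are 1, 2, 3 and 6 Angstroms apart.
a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
b = [(0, 0, 1), (1, 0, 2), (2, 0, 3), (3, 0, 6)]
pct = percent_superposable(a, b)
```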

                                      Structural alignment example


                                               ID     EC        Function
                                               1E9Y   3.5.1.5   Urease
                                               1J79   3.5.2.3   Dihydroorotase

                                               Identity              9.8%
                                               Equivalent residues   40%
                                      VAST Structural Alignment at NCBI




                  Type in the PDB structure ID of interest.

                               http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
                                      Protein superfamilies exhibit structural and functional diversity





      Then select PDB structures for which you want to see a structural alignment

                                      VAST alignment of selected structures




      VAST multiple alignment based on structure superposition:
      non-equivalent positions are in lower-case

      Figure: 1SN4 Scorpion neurotoxin (colored manually using PhyloFacts JMOL
      viewer to display non-equivalent positions)
                                      Sequence and structural divergence are correlated
         Accuracy of sequence alignment relative to structural alignment

                                        Pairwise alignment             MSA-pw
         %ID     #pair  %Superpos    BLAST    ClustalW  Tcoffee     ClustalW  MAFFT    M
         >70      107     90.6       0.954     0.955    0.955        0.955    0.954
         50-70     63     87.2       0.862     0.903    0.894        0.901    0.919
         40-50     46     83.4       0.824     0.872    0.855        0.856    0.862
         30-40     65     85.4       0.811     0.874    0.867        0.87     0.892
         25-30     41     82.1       0.779     0.782    0.788        0.795    0.837
         20-25     53     77.9       0.612     0.599    0.627        0.633    0.678
         15-20     84     73         0.381     0.451    0.457        0.49     0.496
         10-15    151     64.4       0.16      0.186    0.234        0.302    0.35
         5-10     204     50.4      -0.007    -0.014    0           -0.047    0.098
         0-5      122     39.5      -0.033    -0.049   -0.051       -0.034   -0.024

    Left three columns show results of structural alignment
    %ID: Structure pairs have been placed into bins based on sequence identity given the structural alignment
    #pair: number of pairs in each bin
    %Superpos: percent of positions that are within ~3 Angstroms RMSD (between backbone C-alpha carbons)

    The columns to the right give Cline Shift (CS) scores for sequence alignments relative to the structural
    alignment. The best CS score possible is 1; negative scores indicate incorrect over-alignment with very few
    (or no) correctly aligned residue pairs.
                  SATCHMO: Simultaneous Alignment and Tree Construction using
                                  Hidden Markov mOdels


                                                                         Xia Jiang
                                                                   Nandini Krishnamurthy
                                                                      Duncan Brown
                                                                       Michael Tung
                                                                    Jake Gunn-Glanville
                                                                        Bob Edgar




      Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using
      Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
                                      SATCHMO motivation
         • Structural divergence within a superfamily means that…
                  – Multiple sequence alignment (MSA) is hard
                  – The set of alignable positions varies according to degree of divergence
         • Current MSA methods are not designed to handle this
           variability
                  – Assume globally alignable, all columns (e.g. ClustalW)…
                           • Over-aligns, i.e. aligns regions that are not superposable
                  – …or identify and align only highly conserved positions (e.g., SAM
                    software with HMM “surgery”)

         • Challenge
                  – Different degrees of alignability in different sequence pairs,
                    different regions
                  – Masking protocols are lossy: loop regions may be variable across
                    the family but may be critical for function!
                                      SATCHMO algorithm




               • Input: unaligned sequences
               • Initialize: a profile HMM is constructed for each
                 sequence.
               • While (#clusters > 1) {
                        – Use profile-profile scoring to select clusters to join
                        – Align clusters to each other, keeping columns fixed
                        – Analyze joint MSA to predict which positions appear to be
                          structurally similar; these are retained, the remainder are masked.
                        – Construct a profile HMM for the new masked MSA
                        }
               • Output: Tree and MSA
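
               The control flow of the loop above can be sketched in Python. This
               is not the SATCHMO implementation: profile-profile scoring is
               replaced by a placeholder (mean identity between equal-length toy
               sequences), and the align, mask and HMM-construction steps are
               reduced to a comment. Only the "join the best-scoring pair until
               one cluster remains" logic and the recording of the tree are
               shown. All function names and sequences are invented for the
               example.

```python
from itertools import combinations


def mean_identity(members_a, members_b, seqs):
    """Placeholder for profile-profile scoring: mean identity between two clusters."""
    scores = [sum(x == y for x, y in zip(seqs[a], seqs[b])) / len(seqs[a])
              for a in members_a for b in members_b]
    return sum(scores) / len(scores)


def join_clusters(seqs):
    """Greedy agglomerative joining; returns a Newick-style topology string."""
    # Initialize: one cluster per input sequence, keyed by its tree label.
    clusters = {name: (name,) for name in seqs}
    while len(clusters) > 1:
        # Select the best-scoring pair of clusters to join.
        a, b = max(combinations(clusters, 2),
                   key=lambda p: mean_identity(clusters[p[0]], clusters[p[1]], seqs))
        # A full method would align the two clusters, mask positions that do not
        # appear structurally similar, and build a new profile HMM at this point.
        members = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = members
    return next(iter(clusters))


# Toy input: equal-length sequences so the placeholder score is defined.
toy = {"s1": "GFKLP", "s2": "GYKLP", "s3": "AWRVC"}
tree = join_clusters(toy)
```

               Because each join is recorded in the cluster label, the tree is a
               by-product of the order in which clusters are merged.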



                                 Alignment of proteins with
                                   different overall folds

                             Assessing sequence alignment
                         with respect to structural alignment

                             Xia Jiang     Duncan Brown     Nandini Krishnamurthy

      Figure: Alignment accuracy as a function of % ID (including homologs,
      full-length sequences). Y axis: average CS score (0 to 1). X axis: percent ID,
      in bins 10-15%, 15-20%, 20-25%, 25-30%, 30-35%, 35-40%. Series: CLUSTALW,
      MUSCLE, MAFFT, SATCHMO.

                                      Multiple alignment methods
         • Main classes of methods
                  – Progressive:
                           • Once a sequence is aligned, that alignment will not change
                           • Sequences are typically (but not always) aligned using some
                             pre-specified order (e.g., based on a guide tree);
                           • Examples: ClustalW, SATCHMO
                           • Note: not all progressive alignment methods use a guide tree to
                             determine the order in which sequences are aligned
                  – Iterative:
                           • Alignments are refined during each iteration; allows sequences
                             to adjust their alignments
                           • Example: FlowerPower (if sequences are realigned in each
                             iteration)
                  – Mixed (including progressive and iterative approaches)
                           • Examples: MUSCLE, ProbCons
                  – HMM approaches (e.g., SAM buildmodel and align2model)

                                        Alignment editing

      • Alignment editing
               – Based on the intended use of the alignment
               – Removing outlier sequences or fragments
                        • Removal of non-homologs is important for profile/HMM construction
                        • Some tree methods are sensitive to how gaps are handled
                                 – Including fragmentary matches can have unexpected impact
               – Removing columns
                        • Masking variable regions is important for phylogenetic tree
                          construction
                        • Deleting very gappy regions can be critical for HMM performance
               – Making the alignment non-redundant
                        • May be important for some profile/HMM methods
                                 – Sequence weighting can sometimes accomplish the same objective
                                 – E.g., UCSC SAM w0.5 handles these issues for the user
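
      Two of these editing steps, deleting very gappy columns and removing
      fragmentary sequences, can be sketched in Python. The 0.5 gap-fraction
      thresholds and the function names are assumptions made for illustration;
      suitable cutoffs depend on the intended use of the alignment.

```python
GAPS = {"-", "."}


def gap_fraction(rows, j):
    """Fraction of sequences with an indel character at column j."""
    return sum(row[j] in GAPS for row in rows) / len(rows)


def mask_gappy_columns(rows, max_gap_fraction=0.5):
    """Drop columns whose gap fraction exceeds the (illustrative) threshold."""
    keep = [j for j in range(len(rows[0]))
            if gap_fraction(rows, j) <= max_gap_fraction]
    return ["".join(row[j] for j in keep) for row in rows]


def drop_gappy_sequences(named_rows, max_gap_fraction=0.5):
    """Drop sequences (e.g., fragments) that are mostly indel characters."""
    return {name: row for name, row in named_rows.items()
            if sum(ch in GAPS for ch in row) / len(row) <= max_gap_fraction}


# Toy alignment: the third column is a gap in three of four sequences.
rows = ["GF-KLP", "GY-KLP", "GFARVP", "G--L-P"]
masked = mask_gappy_columns(rows)
kept = drop_gappy_sequences({"frag": "G----P", "full": "GFKLLP"})
```

      Removing sequences before masking columns usually matters, since a few
      fragments can make otherwise well-populated columns look gappy.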
                                      Master-slave alignment


               • One sequence is the “master”
               • Other sequences (“slaves”) are aligned to the
                 master
               • Examples: BLAST, PSI-BLAST
               • You can also get this result using HMM
                 methods
                        – Construct an HMM (e.g., using w0.5)
                        – Align putative homologs to the HMM (e.g., using
                          align2model)

                                      Task 1: Phylogenomic inference
 • Phylogenomic inference of protein function
          - FlowerPower for clustering sequences sharing same domain
            architecture (or manual selection)
          - Re-align using preferred method
                   - FlowerPower provides MUSCLE realignment
          - Mask variable or gappy columns
          - Construct tree

 • Or, use the PhyloBuilder software:
   http://phylogenomics.berkeley.edu/phylobuild/
                                      Task 2: MSA construction/analysis for remote
                                       homolog detection and structure prediction

 • To predict structural domains
          – Include locally aligning proteins in MSA
                   • Best: construct HMM for global homologs; align local matches using
                             local-local alignment (SAM -sw 2)
          – Examine alignments displayed in BLAST and PSI-BLAST
          – Submit sequences to domain prediction servers


      Figure: DAPHNE analysis of K+ channel alignment (including partial homologs).
      Y axis: log likelihood (-1 to 2). X axis: HMM nodes (1 to 661).
      Labeled groups: Calcium-activated K+ channels; Cyclic-nucleotide-gated K+
      channels; Kinases, catabolite gene activator protein, PDB structures
      (1CGPA, etc.)
                                      BLAST and CDD results can help identify conserved domains

                                         Compare results from BLAST and domain
                                      prediction servers to refine domain boundaries
                                               PFAM: TIR ends at 142 (also finds LRR; see next slide)
                                               SMART: TIR ends at 145
                                               PhyloFacts: TIR ends at 156 (finds LRR, but not NB-ARC)
                                               BLAST: N-terminal match ends at 169




      Figure panels: PhyloFacts results; NCBI CDD SMART results; BLAST hit

    NCBI CDD includes models from SMART and PFAM.
    Different resources represent domains differently (often crop domains to highly conserved regions).
                                      Include domain prediction results from other servers

                                      Using multiple alignment as a tool for structure prediction
      • Since structure is (largely) conserved, sequence homologs from the MSA can
        also be submitted to structure prediction servers
      • Alignment analysis to identify (apparently) critical residues for the family
        can be used to judge possible homologs in PDB (i.e., check the alignment of
        the PDB sequence against the consensus for the family)

      Karplus, Sjölander, Barrett, Cline, Haussler, Hughey, Holm and Sander.
      Predicting Protein Structure Using Hidden Markov Models.
      Proteins: Structure, Function and Genetics. Suppl. 1. (1997)
                                      Using sequences from an MSA for structure prediction


                                  Homolog identification

      • Since structure is (largely) conserved, sequence homologs
        from the MSA can also be submitted to structure prediction
        servers
      • Some of these homologs may function as intermediate
        sequences between the target and a solved structure
      • Servers to try: 3D-PSSM, PHYRE, Superfamily, PhyloFacts
      • Cautions/comments:
            – Restrict sequence homologs to the regions where they
              align to the target!
            – Not all servers are kept current with PDB, so not all
              structures will be represented by a server
                                      Task 3: Domain-specific alignment

      For an individual domain (found in multi-domain proteins)
      • Cluster and align sequences using master-slave approach
        (PSI-BLAST, SAM-T2K or FlowerPower)
      • Re-align subsequences using preferred alignment method
      • Edit to remove outlier sequences

      Phylogenetic tree construction
      • Mask variable or gappy columns

      Profile/HMM construction
      • Mask very gappy columns




                                  Task 4: Species tree construction




 • To reconstruct a species tree
          – Select orthologs from many
            species (manual)
          – Align using preferred
            alignment method
          – Alignment masking
          – N.B. Don’t rely on results
            from just one protein or gene
            family!




                                      Alignment Editing




            Belvu allows:
             • Coloring columns according to characteristics
             • Changing sequence order (by %ID, tree topology)
             • Deleting columns (specified range or characteristics)
             • Deleting sequences individually, or according to characteristics
               (fraction gaps, low %ID)
             • And more…
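
The sequence-deletion criteria listed above (fraction gaps, low %ID) can be illustrated outside of Belvu with a short Python sketch. This does not reproduce Belvu's own implementation or options. The function names, the choice of the first row as reference, and both thresholds are assumptions for the example.

```python
def percent_identity(a, b, gap_chars="-."):
    """Percent identity over columns where both aligned rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in gap_chars and y not in gap_chars]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def filter_rows(alignment, max_gap_fraction=0.5, min_identity=20.0, gap_chars="-."):
    """Keep rows that are not too gappy and not too dissimilar to the first row."""
    names = list(alignment)
    reference = alignment[names[0]]
    kept = {}
    for name in names:
        row = alignment[name]
        gap_fraction = sum(c in gap_chars for c in row) / len(row)
        if gap_fraction > max_gap_fraction:
            continue  # too many gaps
        if percent_identity(reference, row, gap_chars) < min_identity:
            continue  # too divergent from the reference
        kept[name] = row
    return kept


# Toy alignment (hypothetical names and sequences).
alignment = {
    "ref":   "MKVLATGHW",
    "close": "MKVLSTGHW",
    "gappy": "MK------W",
    "far":   "AQRPSDEYF",
}
print(list(filter_rows(alignment)))  # ['ref', 'close']
```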




                                            BAliBASE

                                         Julie Thompson,
                                         Frédéric Plewniak
                                      and Olivier Poch (1999)
                                      Bioinformatics, 15, 87-88




       http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE/
                                              Motivation
                                      (paraphrased from the authors)




         • The alignment of protein sequences is a crucial tool in
           molecular biology and genome analysis.
         • Historically, the quality of new alignment programs has
           been compared to previous methods using a small
           number of test cases selected by the program author.
           [italics mine]
         • Comparisons have been done using a set of alignments
           selected from structural databases.
                  – These databases assemble proteins into homologous families, but
                    alignments are not classified specifically for the systematic
                    evaluation of multiple alignment programs.


                                      BAliBASE Motivation (cont’d)
      • It has been shown (McClure et al., 1994) that the performance of
        alignment programs depends on
               – the number of sequences,
               – the degree of similarity between sequences and
               – the number of insertions in the alignment.
      • Other factors may also affect alignment quality
               –    length of the sequences
               –    existence of large insertions
               –    N/C-terminal extensions
               –    over-representation of some members of the protein family.
      • We have constructed BAliBASE (Benchmark Alignment dataBASE)
        containing high-quality, documented alignments to identify the strong
        and weak points of the numerous alignment programs now available.




                                      Reference alignment construction




      • The sequences included in the database are selected from alignments
        in either the FSSP or HOMSTRAD structural databases, or from
        manually constructed structural alignments taken from the literature.
      • When sufficient structures are not available, additional sequences are
        included from the HSSP database (Schneider et al., 1997).
      • The VAST Web server (Madej, 1995) is used to confirm that the
        sequences in each alignment are structural neighbours and can be
        structurally superimposed.
      • Functional sites are identified using the PDBsum database
        (Laskowski et al., 1997)
      • Alignments are manually verified and adjusted, in order to ensure that
        conserved residues are aligned as well as the secondary structure
        elements.
                                      Determining “core blocks”
         • A frequent problem encountered when using reference alignments
           has been the effect of ambiguous regions in the sequences which
           cannot be structurally superposed.

         • Very distantly related sequences often have only short conserved
           motifs in long regions of low overall similarity. These regions can
           only be aligned arbitrarily in the reference and may lead to a bias in
           the comparison of programs.

         • In BAliBASE, we have annotated the core blocks in the alignments
           that only include the regions that can be reliably aligned.

         • The blocks exclude regions where there is a possibility of ambiguity
           in the alignment. This may be an important factor affecting the
           significance of statistical comparisons of alignment programs.
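
To make the role of core blocks concrete, here is a simplified sum-of-pairs style score restricted to core columns. It is a sketch under stated assumptions, not BAliBASE's own scoring program: the function names are hypothetical, rows are assumed to be in the same order in both alignments, and `core_columns` is given as column indices of the reference alignment.

```python
from itertools import combinations


def residue_pairs(rows, columns=None, gap="-"):
    """Set of (i, res_i, j, res_j): residue res_i of row i aligned with residue res_j of row j.

    If columns is given, only pairs found in those alignment columns are returned.
    """
    counters = [0] * len(rows)  # next residue index for each sequence
    pairs = set()
    for col in range(len(rows[0])):
        present = []
        for i, row in enumerate(rows):
            if row[col] != gap:
                present.append((i, counters[i]))
                counters[i] += 1
        if columns is None or col in columns:
            for (i, a), (j, b) in combinations(present, 2):
                pairs.add((i, a, j, b))
    return pairs


def core_sp_score(reference, test, core_columns):
    """Fraction of reference residue pairs in the core columns that the test alignment reproduces."""
    ref_pairs = residue_pairs(reference, set(core_columns))
    if not ref_pairs:
        raise ValueError("no aligned residue pairs in the core columns")
    return len(ref_pairs & residue_pairs(test)) / len(ref_pairs)


# Toy alignments (hypothetical): the test differs from the reference only outside the core.
reference = ["MKV-LA", "MKVGLA", "MR--LA"]
test      = ["MKV-LA", "MKVGLA", "M-R-LA"]
core_score = core_sp_score(reference, test, core_columns=[0, 4, 5])
full_score = core_sp_score(reference, test, core_columns=range(6))
print(core_score, full_score)
```

Here the test alignment scores perfectly on the core columns but is penalized when the ambiguous middle columns are counted, which is the kind of bias the core-block annotation is meant to avoid.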

                                      Reference alignments
         •     Reference 1 contains alignments of < 6 equidistant sequences (i.e. the
               pairwise identity is within a specified range). All the sequences are of similar
               length, with no large insertions or extensions.

         •     Reference 2 aligns up to three "orphan" sequences (less than 25%
               identical) from reference 1 with a family of at least 15 closely related
               sequences.

         •     Reference 3 consists of up to 4 subgroups, with less than 25% residue
               identity between sequences from different groups. The alignments are
               constructed by adding homologous family members to the more distantly
               related sequences in reference 1.

         •     Reference 4 is divided into two sub-categories containing alignments of up
               to 20 sequences including N/C-terminal extensions (up to 400 residues),
               and insertions (up to 100 residues).




                                      Some results
                                      [Figure: benchmark results; annotations mark the <25% ID cases]
                                      Reference 3




                                      Reference 4




                                      Reference 5








                     http://bips.u-strasbg.fr/en/Products/Databases/BAliBASE2/
                                      BAliBASE Summary

          • BAliBASE was the first benchmark dataset with specific
            reference alignments to assess relative method performance
            under different types of conditions
          • While many of the sequences included in the reference
            alignments have been aligned structurally, alignments have
            been refined manually
          • It has been extended by the authors to include additional types
            of alignment tasks
          • It remains one of the most commonly used benchmark datasets
            for evaluating multiple sequence alignments
          • Nevertheless, some issues with the dataset exist (to be covered
            in class)







                                      ClustalW
                                      Lecture notes on ClustalW from Per Kraulis




                                      http://www.avatar.se/lectures/molbioinfo2001/multali-clustal.html
                                      The algorithm




                                      Summary of ClustalW method
            • For years, ClustalW was the best method available.
            • It’s still a solid performer, provided that sequences are
              closely related and do not have significant structural
              differences
            • Distinguishing characteristics:
                     – Progressive alignment based on a guide tree
                     – Gap parameters informed by hydrophobicity of amino acids and
                       by previously inserted gaps
                     – Amino acid substitution matrices derived from observed
                       sequence divergence (different matrices for different groups)
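
The idea of a guide tree fixing the order of a progressive alignment can be sketched as follows. Note the substitution: ClustalW builds its guide tree by neighbor-joining, whereas this sketch uses simpler average-linkage clustering as a stand-in. The function name, the distance values, and the sequence labels are all assumptions for illustration.

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return merges in the order a progressive aligner would perform them.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    Clusters are joined by average linkage (a stand-in for neighbor-joining).
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of members
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Join the closest pair of clusters next.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        na, nb = clusters[a], clusters[b]
        # Distance from the new cluster to every other cluster is the size-weighted mean.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
                ) / (na + nb)
        del clusters[a], clusters[b]
        clusters[merged] = na + nb
        merges.append((a, b))
    return merges


# Toy pairwise distances (hypothetical), e.g. 1 - fractional identity.
raw = {("A", "B"): 0.1, ("C", "D"): 0.2, ("A", "C"): 0.6,
       ("A", "D"): 0.7, ("B", "C"): 0.6, ("B", "D"): 0.7}
dist = {frozenset(pair): value for pair, value in raw.items()}
for left, right in guide_tree_order(["A", "B", "C", "D"], dist):
    print("align", left, "with", right)
```

Each merge corresponds to one sequence-sequence, sequence-profile, or profile-profile alignment step; the closest sequences are aligned first, so early errors among divergent sequences are deferred but, once made, are not revisited.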




                                      Multiple alignment summary

             • MSA methods vary in computational complexity
                      – Some are very slow (e.g., ProbCons, T-Coffee, FSA, SATCHMO)
                      – Some are very fast (e.g., MAFFT, MUSCLE)
                      – Is the extra time worth the trouble?
             • Some methods are optimized for local alignment, while others
               are optimized for global alignment
                      – Most methods assume input sequences are globally alignable
                      – Restrict sequences to the homologous regions
             • Selection of sequences, alignment method and subsequent
               editing protocols must be guided by the intended use
             • All methods have serious problems in accuracy when sequence
               identities drop below 25%
             • Iterative methods can have problems when outlier (orphan)
               sequences are included

