Your Federal Quarterly Tax Payments are due April 15th Get Help Now >>

kiran-jeff by wanghonghx


									           Multiple Sequence
A survey of the various programs available and application of MSA
            in addressing certain biological problems

                       Jeff Mower
                      Kiran Annaiah
               Sequence Comparison
   Aligning two sequences is the cornerstone of
      Sequence alignment is the basic step upon which
       everything else built.
   Sequence alignments are employed in
      predicting de novo secondary structure of proteins,
      The initial functional assignment
      knowledge-based tertiary structure predictions,
      Interpretation of data from genome sequencing
      Inference of phylogenetic trees and resolution of
       ancestral relationships between species.
               Sequence Comparisons
   Homology Searches
      Look for homologous sequences in databases using
       FASTA or BLAST      program

   Pattern Searches
      Used for searching short sequence patterns

   Multiple Sequence Alignment
     For aligning and comparing 3 or more related

   Profile Analysis
      A profile is created from a multiple sequence
                       MSA vs Pairwise
   PSA
    – based on seq. similarity we can identify unknown biological
   MSA
       Similar to PSA
       But also possible to identify conserved sub-patterns based on
        known biological relationship
   High seq similarity – functional & structural similarity
   Sequences with functional and structural similarity can
    differ in sequences
       PSA cannot detect this case
       Example : Haemoglobin
                    MSA vs Pairwise

 Structurally and functionally conserved molecules can
  differ in sequence – PSA cannot reveal conserved
 Comparing of 2 sequences with high similarity – patterns
  detection lost due to high similarity

   MSA – useful in revealing critical patterns from multiple
    related sequences
            Multiple Sequence Alignment

   Homology
      Homologous sequences, derive from common
      Inferred by sequence similarity
      MSA useful to demonstrate homology
          Weak similarity – non-significant in pairwise comparison
           could be highly significant if same residues are conserved
           in other distantly related sequences
              Multiple Sequence Alignment
 Global or Local Alignments
 Substitution Matrices and weighting gaps
       Best alignment is one that represents an evolutionary
       Mutational events in evolution considered in MSA
            Substitutions
            Insertions
            Deletions
       BLOSUM and PAM – based on evolutionary distances
       Affine gap penalty model
            p = a + bL
            p = a + b blog(L)
   Scoring MSA
       Sum of Pairs score (SP) for columns
               Which MSA method…?
   Global Methods
     Optimal Algorithms (MSA, MWT, MUSEQAL)
     Progressive (MULTALIN, PILEUP, CLUSTAL,
        MULTAL, AMULT, DFALIGN, T-Coffee, MAP,
        PRRP, AMPS)
   Local methods
        BlockMaker, Iteralign
 Statistical (HMMT, SAGA, SAM, Match Box)
 Parsimony (MALIGN, TreeAlign)
                   Progressive algorithm

   Alignment is only an approximate solution (heuristic
   Simplest and effective ways of MSA
   Less time and less memory
   Sequences are added one by one to the multiple
    alignment based on a pre-computed dendogram
   Sequence addition is by PSA algorithm
   Disadvantage
       Once a sequence is aligned, cant be changed even if it conflicts
        with later sequence additions
   Examples
       PileUp, ClustalW, MultAlign, T-Coffee, etc
                       Exact Algorithms

   Useful in cases where sequences are extremely
   Simultaneously aligns all sequences
   Disadvantage
       Need to generalize Needleman-Wunsch algorithm
          Only for a maximum of 3 sequences
          Causes exponential increase in time and memory as number of
           sequences increase

   Examples
       MSA (up to 10 closely related sequences)
                    Iterative algorithms

   Iterate over a existing sub-optimal solution, modifying it
    at each step, until a convergence point is met.
   Examples
       SAGA, AMPS, Praline, IterAlign
            Consistency-based Algorithms

   Given a set of sequences, an optimal MSA is one that
    agrees most with all possible optimal pairwise
       Do not depend on specific subs. Matrix
       Score associated with alignment of 2 residues depends on their
        indexes (position within protein sequence)
       Most consistent are often the ones close to truth

   Examples
       T-Coffee, DiAlign
                 Multiple Alignment by profile
                         HMM training
Given n sequences , consider the following cases:
1.      If the profile HMM is known, the following procedure can be applied:
                 Align each sequence S(i) to the profile separately.
                 Accumulate the obtained alignments to a multiple alignment.

2.      If the profile HMM is not known, one can use the following technique in
        order to obtain an HMM profile from the given sequences:
                 Choose a length L for the profile HMM and initialize the
                     transition and emission probabilities.
                 Train the model using the Baum-Welch algorithm, on all the
                     training sequences.
                 Obtain the multiple alignment from the resulting profile HMM,
                     as in the previous case.
Schematic of T-Coffee
                Testing the methods

   BAliBASE benchmark
      “Correct” Alignments
      Core Blocks of Conserved Motifs
      Typical “Hard Problem” Sets
BAliBASE reference sets, showing the number of alignments in each set

Reference                        Short         Medium (200-      Long
                                 (<100         300 residues)     (>400
                                 residues)                       residues)
Reference 1: equidistant
sequences of similar length
V1 (<25% identity)               7             8                 8
V2 (20-40% identity)             10            9                 10
V3 (>35% identity)               10            10                8
Reference 2: family versus       9             8                 7
Reference 3: equidistant         5             3                 5
divergent families
Reference 4: N/C-terminal        12
Reference 5: insertions          12
BAliBASE - Reference set 1
Scores based on core blocks(V1)   Scores based on full-length alignment(V1)
Median score for Reference set 2
Median score for aligning subgroups of sequences in Reference set 3
Median score for N/C terminal extensions – Reference set 4
Median score for internal insertions – Reference set 5
   Core blocks aligned well over long sequences – all programs
      Due to different patterns of conservation
   Alignment unreliable in the twilight zone
   Iterative method did well under distinct alignment conditions, but
    not in the presence of an orphan sequence
   Global algorithms – accurate and reliable for
      equidistant sequences
      divergent families of sequences and
      alignment of orphan sequence with a family

   Local algorithm – DiAlign
        Best for N/C terminal extensions
        Internal insertions
    ClustalW vs DiAlign vs T-Coffee vs POA

   ClustalW – global progressive method
     Guide tree created
     Successive pairwise alignment

   Poa – progressive using partially ordered graphs
     No tree to guide alignment of sequences
     2 most similar seqs are aligned and others are added to this one profile
      in a stepwise fashion
   DiAlign – local algorithm
       Aligns whole segments
       PSA performed and ungapped alignments used
   T-Coffee – progressive global & local
     Performs PSA – twice, once global (ClustalW) and local (LAlign)
     Results combined into primary library, then extension step
     Progressive alignment using info from library
Results of BAliBASE testing
CPU time consumed by each program to align sets of increasingly long sequences

   Poor performance by all programs with increase in
    evolutionary distance
   Increase in seq length – better alignment
   T-Coffee – best for low – moderate evolutionary
   DiAlign good for high evolutionary distances
       What Can We Do With MSAs?

   Motif / pattern identification
   Structural modeling
   Phylogenetic analysis
   Molecular evolutionary analyses
   Identification of conserved genomic regions
    across species
              Phylogenetic Analysis

   Visual representation of a MSA as a tree of
   Many methods:
      Distance: builds tree by clustering sequences
       according to their similarity
      Parsimony: finds tree that minimizes the number
       of changes required
      Maximum Likelihood: finds tree that maximizes
       likelihood given parameters
      Bayesian: Markov chain Monte Carlo simulation to
       calculate posterior probabilities of trees
  Determining Relationships

Phylogram             Cladogram

                           (Baldauf 2003)
    Rooted vs Unrooted Trees

Rooted                Unrooted

                                 (Baldauf 2003)
Many taxa (567),
 Few genes (3)

(Soltis et al. 1999)
 Few taxa (13),
Many genes (61)

                  (Goremykin et al. 2003)
Many taxa,
Many genes

             Coming Soon…
Detecting Gene Families

                          (Baldauf 2003)
Human, Gorilla, & Chimp
Glycophorin gene family

        (Wang et al. 2003)

                (Baldauf 2003)
       Molecular Evolution Background

   Types of changes in protein-coding DNA
      Silent (synonymous)
      Replacement (nonsynonymous)
      Based on degeneracy of the genetic code

   K – frequency of change between two sequences
      Ka: # of replacement changes per replacement site
          Driven by natural selection
          Reflect level of protein conservation
      Ks: # of silent changes per silent site
          Neutral, not affected by selective forces
          Used to estimate the neutral mutation rate
    Estimation of Substitution Rates

   Absolute Rate
      R = ½ K / T, where T = time since last common
      Rates are directly comparable
   Relative Rate
      Two species of interest and one outgroup
      Compare K from species 1 and outgroup against K
       from species 2 and outgroup
         Relative Rate

               K Rat-Kan =        +

Human          K Hum-Kan =          +

               K Rat-Kan - K Hum-Kan =   -
     Evaluation of Selective Pressures,
              The Ka / Ks Ratio

 Ka / Ks > 1 indicates positive selection
 Ka / Ks ≈ 1 indicates no selection
 Ka / Ks < 1 indicates purifying selection
 280 homologs,

  (Wang et al. 2003)
        Conserved Genomic Regions

 Need complete genomes or homologous
  genomic regions
 Identify exons from distantly related species
 Identify regulatory elements from more closely
  related species
(Thomas et al. 2003)
   Biological Sequence Analysis – R. Durbin, S. Eddy, A. Krogh & G. Mitchison
   Bioinformatics: Sequence, structure and databanks – D. Higgins & W. Taylor
   Recent progress in multiple sequence alignment: a survey.
   Quality assessment of multiple alignment programs.
   Multiple sequence alignment with the Clustal series of programs.
   MAVID multiple alignment server
   Multiple alignment of sequences and structures.
   Fast algorithms for large-scale genome alignment and comparison.

To top