Multiple Sequence Alignments

Multiple Sequence Alignments: Profiles and Progressive Alignment
Profiles for families of sequences
     can be built from MSAs
   Alignment (MSA)        Profile (column frequencies)
      1   2   3                  1      2      3
      C   G   —           A     50%    75%    25%
      A   A   T           C     25%     0%     0%
      A   A   A           T      0%     0%    25%
      —   A   —           G      0%    25%     0%
                          —     25%     0%    50%
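
The table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `build_profile` is an assumption made for illustration, and gaps are written as "-" and counted like any other character, as in the table.

```python
from collections import Counter

def build_profile(msa, alphabet):
    """Return one {character: frequency} dict per alignment column."""
    n_rows = len(msa)
    profile = []
    for column in zip(*msa):              # walk the alignment column by column
        counts = Counter(column)
        profile.append({ch: counts[ch] / n_rows for ch in alphabet})
    return profile

# The four-sequence alignment from the table ("-" marks a gap).
msa = ["CG-", "AAT", "AAA", "-A-"]
profile = build_profile(msa, alphabet="ACTG-")
for ch in "ACTG-":
    print(ch, [f"{col[ch]:.0%}" for col in profile])
```

Each column's frequencies sum to 1 because the gap character is included in the alphabet.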

       Note: While profiles can be used for any kind of
       sequence data, we’ll focus on protein sequences
• Profile: A table that lists the frequency of each
  amino acid at each position of a protein sequence.
• Frequencies are calculated from an MSA containing a
  domain of interest
• Allows us to identify a consensus sequence
• A derived scoring scheme allows us to align a new
  sequence to the profile
   – Profile can be used in database searches
   – Find new sequences that match the profile
• Profiles are also used to compute multiple alignments
   – Progressive alignment
      Profiles: Position-Specific
       Scoring Matrix (PSSM)
• To compare a sequence to a profile, need to
  assign a score for each amino acid
• The score of the profile for amino acid a at
  position p is

      M(p, a) = \sum_{b=1}^{20} f(p, b) \, s(a, b)

   – f(p,b) = frequency of amino acid b in position p
   – s(a,b) is the score of (a,b) (from, e.g., BLOSUM
     or PAM)
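
As a rough illustration of the PSSM formula, the sketch below computes M(p, a) for every residue a in a small alphabet. The names `pssm_column` and `toy_score` are hypothetical, and `toy_score` stands in for a real substitution matrix; its values are made up and are not BLOSUM or PAM scores.

```python
def pssm_column(freqs, alphabet, s):
    """Return {a: M(p, a)} for one profile column with frequencies f(p, b)."""
    return {a: sum(f * s(a, b) for b, f in freqs.items()) for a in alphabet}

def toy_score(a, b):
    # Hypothetical substitution scores, NOT real BLOSUM or PAM values.
    return 2 if a == b else -1

# Column frequencies f(p, b) for a two-column toy profile (no gaps here).
profile = [{"K": 0.75, "M": 0.25}, {"L": 0.75, "M": 0.25}]
pssm = [pssm_column(col, "KLM", toy_score) for col in profile]
for p, row in enumerate(pssm, start=1):
    print(p, row)
```

With a real matrix, `s` would look up the BLOSUM or PAM entry for the pair (a, b) and the alphabet would cover all 20 amino acids.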
  Profiles: PSSM

[Figure: example PSSM table; the insertion/deletion penalty is labeled.]
Gribskov et al., PNAS 84 (13): 4355 (1987)
 Profiles: Consensus Sequence
• A consensus residue C(p) is generated at each
  position of the profile to aid the display of
  alignments of target sequences with the profile
• The consensus residue c is the amino acid at p
  that has the highest score M(p,c).
  – c is the amino acid most mutationally similar to all
    the aligned residues of the probe sequences at p,
    rather than the most common one
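
The difference between "highest score" and "most common" can be seen in a small sketch. Everything here is an assumption made for illustration: the column frequencies, the function names, and a toy scoring function that treats L and I as similar. None of it comes from a real substitution matrix.

```python
SIMILAR = {frozenset("LI")}        # hypothetical pair of "similar" residues

def toy_score(a, b):
    # Made-up scores: identical 4, similar 3, everything else -2.
    if a == b:
        return 4
    if frozenset((a, b)) in SIMILAR:
        return 3
    return -2

def consensus_residue(freqs, alphabet, s):
    """Return the residue c with the highest M(p, c) = sum_b f(p, b) * s(c, b)."""
    def m(a):
        return sum(f * s(a, b) for b, f in freqs.items())
    return max(alphabet, key=m)

column = {"K": 0.40, "L": 0.35, "I": 0.25}      # f(p, b) at one position
most_common = max(column, key=column.get)
consensus = consensus_residue(column, "KLI", toy_score)
print(most_common, consensus)
```

Here K is the most frequent residue, but L scores higher because it is also similar to the I residues in the column.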
Aligning a sequence to a profile
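 MSA:
      1   2   3   4   5
      K   L   M   –   K
      K   L   K   L   K
      K   M   M   L   –
      M   L   –   L   M

 Profile (frequency of each character per column):
          1     2     3     4     5
      K  .75    0    .25    0    .50
      L   0    .75    0    .75    0
      M  .25   .25   .50    0    .25
      –   0     0    .25   .25   .25

 New sequence:  K K L L M

 Align with profile:
      New sequence:    K   K   L   –   L   M
      Profile column:  1   –   2   3   4   5

 Resulting MSA:
      K   K   L   –   L   M
      K   –   L   M   –   K
      K   –   L   K   L   K
      K   –   M   M   L   –
      M   –   L   –   L   M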
     Scoring a sequence-to-profile
• Score each column separately according
  to PSSM
• Each character contributes to the score,
  weighted by its frequency

 Profile:
          1     2     3     4     5
      K  .75    0    .25    0    .50
      L   0    .75    0    .75    0
      M  .25   .25   .50    0    .25
      –   0     0    .25   .25   .25

 Alignment:
      New sequence:    K   K   L   –   L   M
      Profile column:  1   –   2   3   4   5

 Column 1 score:
      0.75 s(K,K) + 0.25 s(K,M)
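
The column-by-column scoring can be sketched as follows. The scoring function is a made-up stand-in (match 2, mismatch -1, residue against gap -2, gap against gap 0), and treating the inserted column as an all-gap profile column is an assumption of this sketch rather than something the slides specify.

```python
from collections import Counter

GAP = "-"

def toy_score(a, b):
    # Hypothetical scores, not BLOSUM/PAM values.
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 2 if a == b else -1

def column_frequencies(msa):
    n = len(msa)
    return [{ch: c / n for ch, c in Counter(col).items()} for col in zip(*msa)]

def column_score(x, freqs, s):
    """Score of character x against one profile column: sum_b f(b) * s(x, b)."""
    return sum(f * s(x, b) for b, f in freqs.items())

msa = ["KLM-K", "KLKLK", "KMML-", "ML-LM"]
profile = column_frequencies(msa)

aligned_seq = "KKL-LM"
aligned_cols = [0, None, 1, 2, 3, 4]    # None = new column, all gaps in the profile

scores = []
for x, p in zip(aligned_seq, aligned_cols):
    freqs = {GAP: 1.0} if p is None else profile[p]
    scores.append(column_score(x, freqs, toy_score))
print(scores, sum(scores))
```

The first entry of `scores` is exactly 0.75 s(K,K) + 0.25 s(K,M) under the toy scores.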
Profile-to-sequence alignments
• Optimum alignment can be found by
  dynamic programming
  – Extension of Needleman-Wunsch
• Spaces are only added to the MSA, never removed
  – Once a gap, always a gap
• Can align profiles to profiles
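
A minimal sketch of the dynamic program, assuming linear gap costs and a caller-supplied scoring function `s(a, b)` that also accepts the gap character. The function name, the toy scores, and the tie-breaking are illustrative choices, not the method of any particular program.

```python
def align_to_profile(seq, profile, s, gap="-"):
    """Global alignment of seq to a profile (list of {char: freq} columns).
    Returns (score, aligned sequence, profile column index or None per position)."""
    def col(x, freqs):                       # frequency-weighted score of x vs a column
        return sum(f * s(x, b) for b, f in freqs.items())

    n, m = len(seq), len(profile)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]     # D[i][j]: seq[:i] vs first j columns
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                # residue opposite a new all-gap column
        D[i][0], back[i][0] = D[i - 1][0] + s(seq[i - 1], gap), "up"
    for j in range(1, m + 1):                # gap in seq opposite profile column j
        D[0][j], back[0][j] = D[0][j - 1] + col(gap, profile[j - 1]), "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j], back[i][j] = max(
                (D[i - 1][j - 1] + col(seq[i - 1], profile[j - 1]), "diag"),
                (D[i - 1][j] + s(seq[i - 1], gap), "up"),
                (D[i][j - 1] + col(gap, profile[j - 1]), "left"),
            )
    i, j, out_seq, out_cols = n, m, [], []   # trace back from the bottom-right cell
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            out_seq.append(seq[i - 1]); out_cols.append(j - 1); i -= 1; j -= 1
        elif move == "up":
            out_seq.append(seq[i - 1]); out_cols.append(None); i -= 1
        else:
            out_seq.append(gap); out_cols.append(j - 1); j -= 1
    return D[n][m], "".join(reversed(out_seq)), out_cols[::-1]

def toy_score(a, b):
    # Hypothetical scores: match 2, mismatch -1, residue vs gap -2, gap vs gap 0.
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 2 if a == b else -1

profile = [{"K": .75, "M": .25}, {"L": .75, "M": .25},
           {"M": .5, "K": .25, "-": .25}, {"L": .75, "-": .25},
           {"K": .5, "M": .25, "-": .25}]
score, aligned, cols = align_to_profile("KKLLM", profile, toy_score)
print(score, aligned, cols)
```

A `None` in `cols` marks a position where a new all-gap column would be added to the MSA. With these toy scores several alignments tie for the best score, so the reported alignment depends on tie-breaking.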
       Evolutionary Profiles
• Profiles just seen are called average profiles
• Generally perform well, but disregard some of
  the biology
  – How did each position evolve?
  – Amount of conservation varies from position to
    position
  – Type of conservation varies from position to
    position
• Alternative: Evolutionary profiles
  – Gribskov, M. and Veretnik, S., Methods in
    Enzymology 266, 198-212, 1996
       Evolutionary Profiles
• Idea: Fit a different model at each position
• For each position i :
  – For each possible ancestor b for position i
     • Try various evolutionary distances x (assume PAM
       model), and choose the one that minimizes the cross
       entropy

           H = -\sum_{a=1}^{20} f_a \ln p_a

        – f_a = observed frequency of a
        – p_a = predicted frequency of a assuming b is the ancestor
              and x is the distance
     • This generates 20 distributions for position i
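
The fitting loop can be sketched as below. Note the loud assumption: in place of a PAM model, this sketch uses a simple equal-rate substitution model over 20 residues, chosen only because it has a closed form. The function names and the grid of distances are also illustrative.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
N = len(ALPHABET)                       # 20 amino acids

def predicted(ancestor, x):
    """Stand-in for PAM: equal-rate model, x = expected substitutions per site."""
    decay = math.exp(-N * x / (N - 1))
    same = 1 / N + (N - 1) / N * decay  # probability the ancestor is unchanged
    other = 1 / N - decay / N           # probability of each other residue
    return {a: (same if a == ancestor else other) for a in ALPHABET}

def cross_entropy(f, p):
    """H = -sum_a f_a ln p_a (terms with f_a = 0 contribute nothing)."""
    return -sum(fa * math.log(p[a]) for a, fa in f.items() if fa > 0)

def best_distance(f, ancestor, distances):
    return min(distances, key=lambda x: cross_entropy(f, predicted(ancestor, x)))

observed = {"K": 0.75, "M": 0.25}       # f_a at one position
grid = [0.05 * k for k in range(1, 41)] # candidate distances x

# One fitted distance, and hence one distribution, per possible ancestor b.
fits = {b: best_distance(observed, b, grid) for b in ALPHABET}
h_k = cross_entropy(observed, predicted("K", fits["K"]))
h_l = cross_entropy(observed, predicted("L", fits["L"]))
print(fits["K"], h_k, h_l)
```

As expected, an ancestor that actually appears in the column (K) fits the observed frequencies better than one that does not (L).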
        Evolutionary Profiles
• For each position i
  – Compute a “mixture coefficient,” W_ai,
    measuring the likelihood that ancestral residue a
    generated the observed distribution (see text)
  – Profile is given (see text) in terms of

     • p_aij = frequency of residue j in the ancestral
       residue distribution a at position i
     • p_random,j = frequency of residue j in the database
 Progressive multiple alignment
• Feng & Doolittle 1987, Higgins and
  Sharp 1988
• Idea: Sequences to be aligned are
  phylogenetically related
  – These relationships are used to guide the alignment
• Popular implementations: CLUSTALW,
  PILEUP, T-Coffee
1. Perform pairwise alignments between all
   pairs of sequences (n(n-1)/2 pairs)
2. Generate distance matrix.
  •   Distance between a pair = number of mismatched
      positions in alignment divided by total number of
      matched positions
3. Generate a Neighbor-Joining ‘guide tree’
   from distance table
4. Use guide tree to progressively align
   sequences in pairs from tips to root of tree.
  •   Actually, align profiles
  •   “Once a gap, always a gap”
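
Steps 1 and 2 can be sketched as follows. Two assumptions are made here: "matched positions" is read as positions where both sequences have a residue (no gap), and the pairwise aligner is passed in as a function. The `no_op_align` placeholder only works because the toy sequences are already aligned; a real run would use a proper pairwise aligner.

```python
from itertools import combinations

GAP = "-"

def pair_distance(a, b):
    """Mismatched positions / positions where both sequences have a residue."""
    both = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not both:
        raise ValueError("no aligned residue pairs")
    return sum(x != y for x, y in both) / len(both)

def distance_matrix(seqs, pairwise_align):
    """Align every pair, then fill a symmetric distance table."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):          # n(n-1)/2 pairs
        a, b = pairwise_align(seqs[n], seqs[m])
        dist[n][m] = dist[m][n] = pair_distance(a, b)
    return dist

def no_op_align(a, b):
    # Placeholder for a real pairwise aligner; the toy inputs are already aligned.
    return a, b

seqs = {"s1": "KLMK", "s2": "KLKK", "s3": "MLMM"}
dist = distance_matrix(seqs, no_op_align)
for name, row in dist.items():
    print(name, row)
```

The resulting table is the input to the guide-tree step; the closest pair (here s1 and s2) would be joined first.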
          CLUSTALW Tree

Tree calculated from an alignment of more than 1100 ring finger
domains, using ClustalW 1.83.
          CLUSTALW heuristics
1. Individual weights are assigned to each sequence in a
   partial alignment in order to down-weight similar
   sequences and up-weight highly divergent ones.
2. Varying substitution matrices at different alignment
   stages according to sequence divergence.
3. Gaps
   • Positions in early alignments where gaps have been opened
     receive locally reduced gap penalties
   • Residue-specific gap penalties and locally reduced gap
     penalties in hydrophilic regions encourage new gaps in
     potential loop regions rather than regular secondary
     structure
Progressive Alignment: Discussion
• Strengths:
  – Speed
  – Progression biologically sensible (aligns
    using a tree)
• Weaknesses:
  – No objective function.
  – No way of quantifying whether or not the
    alignment is good
   Problems with CLUSTALW
• Local minimum problem:
  – Alignment depends on sequence addition order.
  – With each alignment some proportion of residues
    are misaligned
     • Worse for divergent sequences
  – Errors get “locked in” and propagate as sequences
    are added
  – Can result in arbitrary and incorrect alignments
• Clustal uses global alignment, which may not be
  accurate for all parts of the sequence
  – T-Coffee considers local similarity as well as global
         Iterative alignment
• To avoid local minima, realign subgroups of
  sequences and then incorporate them into a
  growing multiple sequence alignment
  – Improves overall alignment score.
  – May involve rebuilding the guide tree
  – May be randomized
• Programs:
  – MultAlin
  – PRRP
    Phylogenetic Alignment

Given a tree for a set of species S, find
ancestral species such that total distance is
minimized

[Figure: example tree labeled with sequences such as CTGG and GTGG]
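
For intuition only, here is a brute-force sketch of the simplest case: a star tree with a single ancestor and Hamming distance between equal-length sequences. Both simplifications are assumptions of this sketch, as is the third leaf, which is hypothetical. Exhaustive search like this is only practical for tiny inputs.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def best_ancestor(leaves, alphabet="ACGT"):
    """Try every candidate ancestor and keep the one with the smallest
    total distance to the leaves."""
    length = len(leaves[0])
    candidates = ("".join(c) for c in product(alphabet, repeat=length))
    best = min(candidates, key=lambda anc: sum(hamming(anc, leaf) for leaf in leaves))
    return best, sum(hamming(best, leaf) for leaf in leaves)

leaves = ["CTGG", "GTGG", "CTGC"]   # the third leaf is hypothetical
ancestor, total = best_ancestor(leaves)
print(ancestor, total)
```

On a tree with several internal nodes, the same objective sums the distances over all edges, and every ancestral sequence has to be chosen jointly.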

