Global Multiple Sequence Alignment

					             Multiple Sequence Alignment

Workshop on Developing Bioinformatics Programs

July, 2008


Hugh B. Nicholas Jr.
nicholas@psc.edu
Pittsburgh Supercomputing Center




             These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
              Sequence Analysis - Overview
    [Flowchart, reading top to bottom:]
    Get New Sequence / Get Sequences from Database / Literature Search
    Single Sequence Analysis
    Database Search
    Multiple Sequence Alignment
        Database Search for Distantly Related Sequences
        Examine Conserved Patterns
        Find Distinctive Subfamily Patterns
        Integrate Patterns with Structure


                        Focus: What to Remember
       What kinds of information are contained in the alignment.
       What is revealed by patterns of conservation within the
        alignment.
       How Multiple Sequence Alignments are scored.
       Major types of Global Multiple Sequence Alignment
        Algorithms.
       Limitations of each major type of alignment algorithm.
       Features that may indicate uncertainty in the alignment.




           Global Multiple Sequence Alignment

       Shows the relationships among a family of
        sequences.
       Detailed information about a family of genes and
        their protein products.
       Works on the entire length of the sequences.
       Assumes that the sequences are homologous.
       All available programs have both theoretical and
        practical limitations.
       Additional knowledge from structure, experiments,
        or other computations can be used to guide editing
        of the initial alignment.

       Multiple Sequence Alignment Problem
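    [Figure: a tree with an Unobserved Ancestral Sequence at the root, two
     Unobserved Descendants as internal nodes, and four observed sequences
     at the leaves.]

    Observed sequences (leaves):
        AGCATGATGCGC    AGCCTGCGC    ACTACATTG    AGCCTCATCTCA

    Alignment of the observed sequences:
        AGCATGATGCGC
        AGCCTCATCTCA
        AGCCTG...CGC
        ACT...ACATTG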


                     Multiple Sequence Alignment
                  140         *       160         *       180         *       200
bovbpi     :   KWK---ARKNFIKLGGNFDLSVEGISILAGLNLGYDPASGHSTVTCSSCSSHINSVHVHISKSK-VG                                                                :   178
humbpi     :   KWK---AQKRFLKMSGNFDLSIEGMSISADLKLGSNPTSGKPTITCSSCSSDIADVEVDMSGD--SG                                                                :   182
humlbpa    :   RWK---VRKSFFKLQGSFDVSVKGISISVNLLLGSES-SGRPTGYCLSCSSDIQNVELDIEGD--LE                                                                :   175
rablpb     :   SWK---VRKAFLRLKNSFDLYVKGLTISVHLVLGSES-SGRPTVTTSSCSSRIRDLELHVSGN--VG                                                                :   176
ratlbp     :   KWK---VRRSFVKLHGSFDLDVKSVTISVDLLLGVDP-SERPTVTASGCSNSFHKLLLHLQGEREPG                                                                :   177
humcetp    :   TLKYGYTTAWWLGIDQSIDFEID-SAIDLQINTQLTCDSGRVRTDAPDCYLSFHKLLLHLQGEREPG                                                                :   178
maccetp    :   TLKYGYTTAWGLGIDQSVDFEID-SAIDLQINTQLTCDSGRVRTDAPDCYLAFHKLLLHLQGEREPG                                                                :   178
rabcetp    :   TLNYSYTSAWGLGINQSVDFEID-SAIDLQINTELTCDAGSVRTNAPDCYLAFHKLLLHLQGEREPG                                                                :   162
hupltp     :   RRQ---LLYWFFYDGGYINASAEGVSIRTGLELSRDP-AGRMKVSNVSCQASVSRMHAAFGGT--FK                                                                :   162
mupltp     :   RRQ---LLYWFLYDGGYINASAEGVSIRTGLQLSQDS-SGRIKVSNVSCEASVSKMNMAFGGT--FR                                                                :   162
rrrya3     :   SGP----------LVGLLQLAAE-VNVSSKVALGMSP-RGTPILILKRCNT----LLGHISLT--SG                                                                :   173
rry2g5     :   KS-----------LIGFLDIAVE-VNITAKVRLTMDR-TGYPRLVIERCDT----LLGGIKVKLLRG                                                                :   164
g2599572   :   AP------------LHTVPMPVR-ISIRADLHVDMGP-DGNLQLLTSACRP-----TVQAQST----                                                                :   139
                                          i           g        C

                       *       220         *       240          *      260
bovbpi     :   WLIQLFHKKIESALRNKMNSQVCEKVTNSVSSKLQPYFQTLPVMTKLDKVAGVDYSLVAPPRATANN                                                                :   245
humbpi     :   WLLNLFHNQIESKFQKVLESRICEMIQKSVSSDLQPYLQTLPVTTKIDSVAGINYGLVAPPATTAET                                                                :   249
humlbpa    :   ELLNLLQSQIDARLREVLESKICRQIEEAVTAHLQPYLQTLPVTTEIDSFADIDYSLVEAPRATAQM                                                                :   242
rablpb     :   WLLNLFHNQIESKLQKVLESKICEMIQKSVTSDLQPYLQTLPVTTQIDSFAGIDYSLMEAPRATAGM                                                                :   243
ratlbp     :   WIKQLFTNFISFTLKLVLKGQICKEI-NVISNIMADFVQTRAASADIDTILGIDYSLVAAPQAKAQT                                                                :   243
humcetp    :   WIKQLFTNFISFTLKLVLKGQICKEI-NIISNIMADFVQTRAASILSDGDIGVDISLTGDPVITASY                                                                :   244
maccetp    :   WLKQLFTNFISFTLKLILKRQVCNEI-NTISNIMADFVQTRAASILSDGDIGVDISLTGDPIITASY                                                                :   244
rabcetp    :   WLKQLFTNFISFTLKLILKRQVCNEI-NTISNIMADFVQTRAASILSDGDIGVDISVTGAPVITATY                                                                :   228
hupltp     :   KVYDFLSTFITSGMRFLLNQQICPVLYHAGTVLLNSLLDTVPVRSSVDELVGIDYSLMKDPVASTSN                                                                :   229
mupltp     :   RMYNFFSTFITSGMRFLLNQQICPVLYHAGTVLLNSLLDTVPVRSSVDDLVGIDYSLLKDPVVSNGN                                                                :   229
rrrya3     :   LLPTPIFGLVEQTLCKVLPGLLCPVV-DSVLSVVNELLGATLSLVPLGPLGSVEFTLATLPLISNQY                                                                :   239
rry2g5     :   LLPNLVDNLVNRVLANVLPDLLCPIV-DVVLGLVNDQLGLVDSLVPLGILGSVQYTFSSLPLVTGEF                                                                :   230
g2599572   :   -REAESKSSRSILDKVVDVDKLCLDV-SKLLLFPNEQLMSLTALFPVTPNCQLQYLALAAPVFSKQG                                                                :   204
                         i      l    C                 t      d        l   P

                   Multiple Sequence Alignment

       Our best guess at the detailed evolutionary
        history of a family of related genes.
           Speciation events
           Gene duplications
           Mutations of nucleotides and selection of amino
            acids
           Insertion and deletion events
       Given a method of computation, it implies a
        specific phylogeny of the genes and species.



                      Multiple Sequence Alignment

       The pattern of variation and conservation at each
        position within the alignment provides information about
        the structural and functional role of that position in the
        gene or protein.
           Highly conserved positions are likely to be critical to the
            structure or function of the molecule.
           Positions that vary systematically (limited variation) between
            defined subgroups may reflect the evolution of new functions
            or the structural variation required to support a new function.
           Highly variable positions provide scaffolding or filler.
           Conserved hydrophobic residues are probably important to
            structure.
           Conserved polar residues may be important for catalytic or
            functional interactions.
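
As a rough illustration of how conservation might be measured column by column, the following Python sketch reports the most common residue in each column and the fraction of sequences that carry it. The function name `column_conservation` and the choice to count gaps against conservation are assumptions made for this example, not part of any program named in these slides. The input is the five-peptide "Desired" alignment shown later in the deck.

```python
from collections import Counter


def column_conservation(alignment, gaps="-."):
    """For each column, return (most common residue, fraction of sequences with it).

    Gaps are not counted as residues, but they do lower the fraction because
    the denominator is the total number of sequences (an assumption of this sketch).
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] not in gaps]
        if not residues:
            summary.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        summary.append((residue, count / len(alignment)))
    return summary


# The "Desired" alignment of the five example peptides.
example = ["kt-csg", "it-ctd", "lt-c-k", "it-c-g", "ltsc-g"]
for position, (residue, fraction) in enumerate(column_conservation(example), start=1):
    print(position, residue, f"{fraction:.2f}")
```

Columns 2 and 4 (t and c) come out fully conserved, which is the kind of signal the bullets above describe.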

    Three Dimensional Alignment Space
    [Figure: a cube with axes sequence i, sequence j, and sequence k,
     showing the 3-diagonal (i,j,k) and the 2-diagonals (i,j), (i,k),
     and (j,k).]

    A three dimensional alignment space (for aligning three
      sequences) showing the two and three dimensional
          path graphs associated with the alignments.

             DISTANCE and Analyzing Sequences

        Distance is a dissimilarity measure that has four
         defining properties. If d(a, b) is the distance
         between sequence a and sequence b, then:
            d(a, b) >= 0.
            d(a, b) = 0 only if sequence a and sequence b are the
             same sequence.
            d(a, b) = d(b, a).
            d(a, b) + d(a, c) >= d(b, c).
        This fourth property, called the triangle inequality, is
         particularly important if we wish to evaluate the
         relationship among more than two sequences at a
         time.
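
The four properties can be checked mechanically for any candidate measure. The sketch below does so for a simple mismatch count (Hamming distance) over the four aligned rows from the earlier "Multiple Sequence Alignment Problem" slide. The helper names `hamming` and `is_metric`, and the decision to treat a gap character as an ordinary mismatching symbol, are assumptions of this example.

```python
from itertools import permutations


def hamming(a, b):
    """Count mismatching columns between two aligned rows of equal length."""
    if len(a) != len(b):
        raise ValueError("rows must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_metric(items, d):
    """Check the four distance properties over every pair and triple in items."""
    for a in items:
        for b in items:
            if d(a, b) < 0:                      # property 1: non-negative
                return False
            if (d(a, b) == 0) != (a == b):       # property 2: zero only for identical items
                return False
            if d(a, b) != d(b, a):               # property 3: symmetric
                return False
    for a, b, c in permutations(items, 3):
        if d(a, b) + d(a, c) < d(b, c):          # property 4: triangle inequality
            return False
    return True


rows = ["AGCATGATGCGC", "AGCCTCATCTCA", "AGCCTG...CGC", "ACT...ACATTG"]
print(is_metric(rows, hamming))

# Squaring the mismatch count keeps properties 1 to 3 but can break property 4.
print(is_metric(["AA", "AB", "BB"], lambda a, b: hamming(a, b) ** 2))
```

The second call shows why the triangle inequality is listed separately: a measure can look reasonable pair by pair and still fail when three sequences are considered together.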
                         The Triangle Inequality


         d(a, b) + d(a, c) >= d(b, c)
        Is the algebraic equivalent of the Euclidean
         postulate that three sides form a triangle.
        It allows us to construct a map.
        A map is the simultaneous representation of the
         relationships among three or more objects.



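     d(a, b) + d(a, c) >= d(b, c)

     [Figure: three line segments of lengths d(a,b), d(a,c), and d(b,c).
      If the inequality is true, the segments close into a triangle with
      vertices A, B, and C. If it is false, the segments drawn from B and C
      do not meet, and A cannot be placed on the map.]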

                  Alignment Scoring Methods

     Sequences:
         Seq_1  A
         Seq_2  A
         Seq_3  C
         Seq_4  C

     Distance between sequences:
                  Seq_1   Seq_2   Seq_3   Seq_4
         Seq_1      0       0       1       1
         Seq_2      0       0       1       1
         Seq_3      1       1       0       0
         Seq_4      1       1       0       0




                         Evolutionary Score = 1



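     [Figure: a tree in which Seq_1 A and Seq_2 A join at an internal node
      labeled A, Seq_3 C and Seq_4 C join at an internal node labeled C,
      and one branch connects the two internal nodes.]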

                                     Star Score = 2


     [Figure: a star with one central node labeled A joined directly to
      Seq_1 A, Seq_2 A, Seq_3 C, and Seq_4 C.]


                          Sum of Pairs Score = 4

     [Figure: the four sequences Seq_1 A, Seq_2 A, Seq_3 C, and Seq_4 C
      compared pair by pair, with no internal nodes.]
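
The three scores quoted on these slides (1, 2, and 4) can be reproduced for the single column A, A, C, C with a few lines of Python. This is a minimal sketch, assuming a unit cost for any mismatch, a star center chosen from the residues present in the column, and a rooted version of the tree that pairs Seq_1 with Seq_2 and Seq_3 with Seq_4. The function names are illustrative, and the tree score uses Fitch's small-parsimony count as a stand-in for whatever method the slide author had in mind.

```python
from itertools import combinations


def mismatch(x, y):
    """Unit distance: 0 for identical residues, 1 otherwise."""
    return 0 if x == y else 1


def sum_of_pairs(column):
    """Add the distance for every pair of sequences in the column."""
    return sum(mismatch(x, y) for x, y in combinations(column, 2))


def star_score(column):
    """Distance from the best single center residue to every sequence."""
    return min(sum(mismatch(center, x) for x in column) for center in set(column))


def fitch(tree):
    """Return (possible ancestral residues, number of changes) for a nested
    tuple tree whose leaves are single residues."""
    if isinstance(tree, str):
        return {tree}, 0
    left_states, left_cost = fitch(tree[0])
    right_states, right_cost = fitch(tree[1])
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


column = ["A", "A", "C", "C"]              # Seq_1 .. Seq_4
tree = (("A", "A"), ("C", "C"))            # (Seq_1, Seq_2) and (Seq_3, Seq_4)

print("Evolutionary score:", fitch(tree)[1])      # 1
print("Star score:", star_score(column))          # 2
print("Sum of pairs score:", sum_of_pairs(column))  # 4
```

The same single substitution is counted once on the tree, twice from the star center, and four times when every pair is compared, which is why sum of pairs scoring weights a column more heavily as more sequences share a difference.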
         Multiple Sequence Alignment Algorithms
    Pairwise Progressive method
        Aligns a pair of sequences and successively adds more
         sequences to the extant alignment.
        Deals with many sequences very rapidly.
        Local minima problems.
    Multiple Dimensional Dynamic Programming
        Needleman-Wunsch with more than two sequences.
        Exact, best solution for specific scoring scheme.
        Requires a lot of memory, thus can align only a few sequences.
    Consistency
        Custom scores based on how often particular residues are
         aligned in a wide ranging set of pairwise alignments.
        Incorporates both global and local alignment information.
        Can incorporate other kinds of information from other kinds of
         alignments such as structural superpositions.
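
All three families build on pairwise dynamic programming, so a compact Needleman-Wunsch sketch may help fix ideas. The scoring values below (match +1, mismatch -1, linear gap -1) are placeholders chosen for this example; real programs use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a aligned against leading gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b aligned against leading gaps

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,              # gap in b
                score[i][j - 1] + gap,              # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("itcg", "itck"))
print(needleman_wunsch("ltscg", "itcg"))
```

A progressive method repeats this kind of pairwise step on sequences and partial alignments, while the multidimensional method extends the same table to one axis per sequence.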


                    Progressive Pairwise Alignment
                                ClustalW, PileUp, Multalign


     Five peptides and a tree showing their relative overall similarity




      itcg                             itck                     ltscg ktcsg                                                itctd

     Begin with the alignment of the most similar
                  pair of sequences




     itcg
     itck   } itc(k,g)
            itcg                          itck                           ltscg                         ktcsg                          itctd
 Progressively align the most similar sequences


                         (i,l,k)t(-,s,c)(t,s,c)(g,k,d)

     (i,l)t(-,s)c(g,k)
                                                                                                (k,i)tc(t,s)(g,d)
     itc(g,k)


 itcg                    itck                         ltscg                     ktcsg                          itctd


                    Resulting Alignment

     Actual                                                                         Desired
     ktcsg                                                                          kt-csg
     itctd                                                                          it-ctd
     lt-ck                                                                          lt-c-k
     it-cg                                                                          it-c-g
     ltscg                                                                          ltsc-g




               Progressive Pairwise Alignments
               Major Limitations and Strengths

    Local Minima problems that result from:
        Looking only at a subset of the data at any one time
        Cannot change the alignment of sequences that were
         aligned early on the basis of sequences introduced later.
    Most common error is failure to identify highly
     conserved residues because they are put into several
     columns of the alignment rather than into a single
     column.
    May avoid the spurious alignment of gaps into too few
     locations that is a side effect of sum of pairs scoring.
    Very fast, can run on personal computers and does
     not require much memory.


                   3 Dimensional Path Graph


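     [Figure: a three dimensional path graph with the sequences DNVQ, QGL,
      and DQLF along the three axes.]

     Alignment shown beside the graph:
          D - - Q - L F
          D N V Q - - -
          - - - Q G L -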




     Three Dimensional Alignment Space
     [Figure: axes i, j, and k with the 3-diagonal (i,j,k) and the
      2-diagonals (i,j), (i,k), and (j,k).]

                  A 3-diagonal and three 2-diagonals it contains.




         Path Graph of a pairwise alignment
     [Figure: path graph with GCTGGAAGGCAT along the horizontal axis and
      GCAGAGCACT along the vertical axis.]


     Path Graph projected from a multiple
             sequence alignment
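     [Figure: path graph with GCTGGAAGGCAT along the horizontal axis and
      GCAGAGCACT along the vertical axis, showing the pairwise path
      projected from a multiple sequence alignment.]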
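
Projection can be made concrete with a short sketch: take two rows of a multiple alignment, drop the columns where both rows have a gap, and read the remaining columns as a pairwise alignment and as a path through the two-sequence grid. The function names `project_pair` and `path` are invented for this example, and the input is the three-sequence alignment from the "3 Dimensional Path Graph" slide.

```python
def project_pair(msa, i, j, gap="-"):
    """Pairwise alignment of rows i and j induced by a multiple alignment.

    Columns in which both rows are gaps carry no information about the
    pair and are dropped.
    """
    kept = [(x, y) for x, y in zip(msa[i], msa[j]) if not (x == gap and y == gap)]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)


def path(top, bottom, gap="-"):
    """Grid coordinates visited by a pairwise alignment, starting at (0, 0).

    A diagonal step pairs two residues; a horizontal or vertical step
    places a residue opposite a gap.
    """
    i = j = 0
    points = [(0, 0)]
    for x, y in zip(top, bottom):
        i += int(x != gap)
        j += int(y != gap)
        points.append((i, j))
    return points


msa = ["D--Q-LF", "DNVQ---", "---QGL-"]
top, bottom = project_pair(msa, 0, 2)
print(top)
print(bottom)
print(path(top, bottom))
```

The projected path is generally not the optimal pairwise path for those two sequences, which is the comparison the following slide draws.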

Area scoring less than the projected alignment
     [Figure: path graph with GCTGGAAGGCAT along the horizontal axis and
      GCAGAGCACT along the vertical axis, with the area described in the
      title marked.]
     The projected areas form a three dimensional
              volume in alignment space
     [Figure: axes i, j, and k with the 3-diagonal (i,j,k) and the
      2-diagonals (i,j), (i,k), and (j,k).]

                    A 3-diagonal and three 2-diagonals it contains.




         Multidimensional Dynamic Programming
             Major Limitations and Strengths
    Spuriously aligns gaps that are the result of different
     insertion or deletion events because of rigorous sum of
     pairs scoring.
    Finds a well defined, rigorous, sum of pairs optimal
     alignment for the region of alignment space that is
     examined.
    This has a good chance of being the absolute sum of pairs
     optimal alignment for the sequences given the set of
     scores.

                   MSA Alignment Strategy




        Seryl tRNA Synthetase Alignments

Name          Length      Memory (millions of bytes)      CPU time (seconds)
Sys_Bacsu     425 aa.
Sys_Ecoli     430 aa.
Sys_Human     514 aa.                0.58                          1.2
Sys_Yeast     462 aa.                0.70                          5.3
Sys_Mycge     417 aa.               17.4                         120.4
Sys_Theth     421 aa.               45.7                         550.5
Sys_Halma     460 aa.           10,926.                       27,818.*
       * Alignment less than 20% complete on DEC 8400
                All runs used MSA release 2.1
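
The steep growth in the table is in line with the size of the full dynamic programming lattice, which has one cell for every combination of positions in the sequences. The sketch below computes that cell count for the first k sequences, assuming (the slide does not say so) that each row reports a run on that sequence together with all those listed above it. Because the alignment is found only within the region of alignment space that is examined, these counts are an upper bound on the work and are not meant to predict the measured memory or time.

```python
from math import prod

# Sequence lengths in amino acids, copied from the table above.
LENGTHS = [
    ("Sys_Bacsu", 425),
    ("Sys_Ecoli", 430),
    ("Sys_Human", 514),
    ("Sys_Yeast", 462),
    ("Sys_Mycge", 417),
    ("Sys_Theth", 421),
    ("Sys_Halma", 460),
]


def full_lattice_cells(lengths):
    """Number of cells in the complete N-dimensional lattice.

    Each axis has length + 1 positions because the path may start
    before the first residue of a sequence.
    """
    return prod(n + 1 for n in lengths)


for k in range(2, len(LENGTHS) + 1):
    last_name = LENGTHS[k - 1][0]
    cells = full_lattice_cells([length for _, length in LENGTHS[:k]])
    print(f"{k} sequences (through {last_name}): {cells:.3e} cells")
```

Each added sequence multiplies the lattice by a factor of roughly 400 to 500, which is why restricting the search region matters so much in practice.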

             T-COFFEE (consistency) Algorithm

        Generate global pairwise alignments using the
         Needleman-Wunsch algorithm from ClustalW
         with its heuristic rules.
        Generate the 10 best nonintersecting local
         alignments using the Waterman-Eggert extension
         to the Smith-Waterman algorithm.
        Generate initial scores from these alignments.
        Extend the scores by examining alignments from
         triplets of sequences for consistency.
        Align all of the sequences using a progressive
         pairwise strategy.


          Computation Grid for Smith-Waterman


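     [Figure: a computation grid with columns a1 to a6 and rows b1 to b5.
      Two of the candidate values for the cell at (a4, b4) are labeled:]

         SW3,3 + s(a4,b4)
         SW4-l,4 + g;  l = 1, 2, or 3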
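
The two labels on the grid correspond to the diagonal term and the gap term of the Smith-Waterman recurrence. The sketch below fills the whole grid using those terms, with the gap term taken over every possible gap length in either sequence. The scoring function and the gap penalty are hypothetical values chosen only so the example runs; they are not taken from the slide.

```python
def smith_waterman(a, b,
                   score=lambda x, y: 2 if x == y else -1,
                   gap=lambda length: -2 - (length - 1)):
    """Return (best local score, grid cell where it ends).

    sw[i][j] is the best score of a local alignment ending with a[i-1]
    and b[j-1] as its last column, or 0 if nothing positive ends there.
    """
    sw = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            candidates = [0, sw[i - 1][j - 1] + score(a[i - 1], b[j - 1])]
            # Gap of length l in b: step back l cells along a.
            candidates += [sw[i - l][j] + gap(l) for l in range(1, i + 1)]
            # Gap of length l in a: step back l cells along b.
            candidates += [sw[i][j - l] + gap(l) for l in range(1, j + 1)]
            sw[i][j] = max(candidates)
            if sw[i][j] > best:
                best, best_cell = sw[i][j], (i, j)
    return best, best_cell


print(smith_waterman("xxitcgyy", "zzitcgww"))
```

Checking every gap length makes this cubic in the sequence length; it is written that way to mirror the slide, not for speed.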

                                    T-Coffee: Pairwise Alignments
     [Figure: three path plots, each with axes running from 0 to 200:
      GT26_SCHMA against GTM1_HUMAN, GT26_SCHMA against GTP_HUMAN, and
      GTM1_HUMAN against GTP_HUMAN. The legend distinguishes Global
      Alignment from Local Alignment.]

         Consistency Among Three Sequences

Observed alignments between a sequence and two others
     GTM1_HUMAN => (187) FEGLEKISAYMKSSRFLPRP (206)
     GTP_HUMAN  => (170) LDAFPLLSAYV..GRLSARP (187)
and
     GTM1_HUMAN => (187) FEGLEKISAYMKSSRFLPR (205)
     GT26_SCHMA => (170) LNEFPKLVSFKKCIEDLPQ (188)

Implies an alignment between the two other sequences:
     GTP_HUMAN  => (170) LDAFPLLSAYV..GRLSAR (186)
     GT26_SCHMA => (170) LNEFPKLVSFKKCIEDLPQ (188)

Only part of the implied alignment is actually observed:
     GTP_HUMAN  => (170) LDAFPLLSAYV (180)
     GT26_SCHMA => (170) LNEFPKLVSFK (180)
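
The bookkeeping in this example can be reproduced directly from the alignment strings. The sketch below turns each pairwise alignment into a map of residue numbers, composes the two maps through GTM1_HUMAN, and counts how many implied pairs also appear in the observed GTP_HUMAN and GT26_SCHMA alignment. The function names are invented for this illustration and are not T-Coffee's own routines.

```python
def aligned_pairs(top, top_start, bottom, bottom_start, gaps="-."):
    """Map residue numbers in the top sequence to residue numbers in the
    bottom sequence for every column where both rows hold a residue."""
    pairs = {}
    i, j = top_start, bottom_start
    for x, y in zip(top, bottom):
        x_is_residue = x not in gaps
        y_is_residue = y not in gaps
        if x_is_residue and y_is_residue:
            pairs[i] = j
        i += int(x_is_residue)
        j += int(y_is_residue)
    return pairs


def implied_pairs(a_to_b, a_to_c):
    """Pairs (b, c) that are both aligned to the same residue of sequence a."""
    return {a_to_b[a]: a_to_c[a] for a in a_to_b if a in a_to_c}


gtm1_to_gtp = aligned_pairs("FEGLEKISAYMKSSRFLPRP", 187, "LDAFPLLSAYV..GRLSARP", 170)
gtm1_to_gt26 = aligned_pairs("FEGLEKISAYMKSSRFLPR", 187, "LNEFPKLVSFKKCIEDLPQ", 170)

implied = implied_pairs(gtm1_to_gtp, gtm1_to_gt26)
observed = aligned_pairs("LDAFPLLSAYV", 170, "LNEFPKLVSFK", 170)
supported = {b: c for b, c in implied.items() if observed.get(b) == c}

print(len(implied), "implied pairs;", len(supported), "also observed")
# prints: 17 implied pairs; 11 also observed
```

The eleven supported pairs are GTP_HUMAN 170 to 180 against GT26_SCHMA 170 to 180, matching the last block above; the remaining six implied pairs lie beyond the observed alignment.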

              T-Coffee Strengths and Weaknesses
    Empirically observed to yield highly accurate alignments
     on problems where even MSA gives flawed results.
    Can solve very large problems -- 153 GSTs.
        T-Coffee = 36 hours; ClustalW = 6 minutes; MSA = not feasible
    Can be customized to include a wide variety of
     information.
    Can require a lot of computer resources.




     ProbCons Consistency in a Probabilistic,
              HMM Framework

    Convert Needleman-Wunsch to use the probabilities
     that residue xi should be aligned with residue yj when
     aligning sequences X and Y rather than log-odds scores.
        This is known as the Viterbi algorithm.
        This is a dynamic programming algorithm, as is the
         Needleman-Wunsch algorithm.
        Because it uses probabilities rather than log-odds scores it is
         easily represented with a Hidden Markov Model of pairwise
         alignment.
        Each cell of the two-way alignment table contains the
         probability that xi will be juxtaposed with yj in the best
         alignment of sequences X and Y.


                                              ProbCons

    The consistency information can be computed straightforwardly by
     multiplying together appropriate sets of these two dimensional,
     pairwise alignment tables or matrices.

             P′xy ← (1/|S|) ∑(z in S) Pxz Pzy

             (S is the set of sequences being aligned; z ranges over its members.)

    Use these probabilities as scores for computing a progressive
     pairwise alignment.
    Use iterative refinement: split the alignment into two arbitrary
     subalignments and then realign them.
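
The consistency update is a sum of matrix products, which the following pure-Python sketch spells out. It assumes that the sum runs over every sequence in the set, including x and y themselves, and that the table of a sequence against itself is the identity matrix; both points follow the usual description of ProbCons but are not stated on the slide. The function names and the toy one-residue sequences are illustrative only.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(posteriors, lengths):
    """Apply P'xy = (1/|S|) * sum over z of Pxz Pzy to every stored pair.

    posteriors[(x, y)] is a len(x) by len(y) table of pairing probabilities;
    lengths maps each sequence name to its length.
    """
    names = sorted(lengths)

    def table(x, y):
        if x == y:
            return identity(lengths[x])          # assumption: Pxx is the identity
        if (x, y) in posteriors:
            return posteriors[(x, y)]
        return transpose(posteriors[(y, x)])     # Pyx is the transpose of Pxy

    updated = {}
    for x, y in posteriors:
        total = [[0.0] * lengths[y] for _ in range(lengths[x])]
        for z in names:
            product = matmul(table(x, z), table(z, y))
            total = [[t + p for t, p in zip(t_row, p_row)]
                     for t_row, p_row in zip(total, product)]
        updated[(x, y)] = [[value / len(names) for value in row] for row in total]
    return updated


# Toy case: three one-residue sequences. X and Y pair weakly with each other
# but each pairs strongly with Z, so consistency raises the X-Y value.
lengths = {"X": 1, "Y": 1, "Z": 1}
posteriors = {("X", "Y"): [[0.5]], ("X", "Z"): [[1.0]], ("Y", "Z"): [[1.0]]}
updated = consistency_transform(posteriors, lengths)
print(updated[("X", "Y")])
```

In the toy case the X-Y value rises from 0.5 to two thirds because the third sequence supports the pairing.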




                         Accuracy of Alignments
                        Homstrad      Prefab      BaliBase
         ClustalW          61.15       61.68         42.83
         T-Coffee          65.37       69.97         56.10
         ProbCons          66.41       70.54         58.24
         Dialign-T         57.92       62.05         44.59
         M-Coffee8         67.75       72.91         62.02

 Wallace, IM, O’Sullivan, O, Higgins, DG, Notredame, C. 2006. M-Coffee: combining multiple
 sequence alignment methods with T-Coffee. Nucleic Acids Research. 34:1692-1699

                                    Alignment Strategies

        Align all or a large, representative subset with ProbCons and T-Coffee.
        Align groups of sequences with ProbCons and T-Coffee and join the
         groups with ClustalW.
        Make several alignments in ClustalW with different scoring matrices
         and gap penalties.
        Edit the alignment making use of information from:
            Pattern finding programs
            Known structures in the family
            Known biochemistry
            Knowledge of the weaknesses of the alignment programs
        Be cautious – the human eye finds false patterns




				