					Sequence Alignments

   BIOL/CHEM 4900
                  Reading

   Chapter 2 in your textbook
            Sequence Alignments
   Alignment between characters found in two or
    more nucleotide or amino acid sequences
       Amino Acidsresidues; Nucleic Acids
       Similarity between sequences
   What can this tell you?
   In this chapter:
       How do we align two or more sequences?
       How do we evaluate these alignments?
       What conclusions can we make based on these
        alignments?
                                 Dot Plots
   Used to visualize regions of similarity
   One sequence placed on the x-axis, the
    other on the y-axis
   Dots are placed in the plot where the two
    sequences are identical
   Diagonal lines in plot indicate regions of
    similarity
   Example: compare ATCG to GATC
   Advantages: easy, quick
   Disadvantages: only gives regions of
    similarity, not actual alignment
   What would the dot plot look like with
    longer sequences?
   [Figure: dot plot with ATCG on the x-axis and GATC on the y-axis (bottom to top)]
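
   A minimal sketch of the ATCG vs. GATC example, assuming a plain-text plot in which "*" marks identical characters and the y-axis sequence runs from bottom to top. The function name dot_plot is hypothetical, not part of any course software.

```python
def dot_plot(x_seq, y_seq):
    """Return the lines of a text dot plot (y_seq runs bottom to top)."""
    rows = []
    for y_char in reversed(y_seq):
        # One cell per x-axis character: "*" where the two characters match.
        cells = ["*" if x_char == y_char else "." for x_char in x_seq]
        rows.append(y_char + " " + " ".join(cells))
    rows.append("  " + " ".join(x_seq))  # x-axis labels
    return rows


for line in dot_plot("ATCG", "GATC"):
    print(line)
```

   The shared ATC shows up as a diagonal run of three dots; the lone G is an isolated dot.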
                           Noise in Dot Plots
   Control by adjusting the following
        Window size
        Similarity cutoff
   Caution: removing too much noise might conceal small regions of similarity
   Example: GCTAGTCAGA and GATGGTCACA

   [Figure: two dot plots of GCTAGTCAGA (x-axis) against GATGGTCACA (y-axis, bottom to top).
    Left panel: window of 1, similarity cutoff of 1.
    Right panel: window of 4, similarity cutoff of 3 (exercise: complete this plot).]
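
   One possible way to apply a window and cutoff, sketched under an assumption: a dot is recorded at the start position of each pair of windows that share at least `cutoff` identical positions. Programs differ on where the dot is drawn (window start vs. window center), so treat windowed_dots as an illustrative helper only.

```python
def windowed_dots(x_seq, y_seq, window, cutoff):
    """Return (i, j) window start positions with at least `cutoff` identities."""
    dots = set()
    for i in range(len(x_seq) - window + 1):
        for j in range(len(y_seq) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(x_seq[i + k] == y_seq[j + k] for k in range(window))
            if matches >= cutoff:
                dots.add((i, j))
    return dots


x, y = "GCTAGTCAGA", "GATGGTCACA"
raw = windowed_dots(x, y, window=1, cutoff=1)       # every identity is a dot
filtered = windowed_dots(x, y, window=4, cutoff=3)  # noise reduced
print(len(raw), "dots with window 1, cutoff 1")
print(len(filtered), "dots with window 4, cutoff 3")
```

   With a window of 1 and cutoff of 1 this reduces to the plain identity plot; the larger window keeps mostly the dots along the main diagonal.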
                            Dot Plots in Excel
   [Figure: dot plot built in Excel for GCTAGTCAGA (x-axis) against GATGGTCACA (y-axis, bottom to top)]
           Try the DotPlot Program
   Download the program from this link
   It will automatically save the program and several files to your
    desktop
   Open DotPlot application
   Load sequences as FASTA text files
        File, Open Horizontal, Browse
        File, Open Vertical, Browse
   Parameters menu changes length and cutoff
   Draw, Identities shows plot
   Clear the screen after changing parameters so the new plot is visible
   Example: Bos taurus and porcine myoglobin mRNA sequences
    (sequences on course website)
                      Simple Alignments
   Molecular changes occur when organisms evolve
        Substitution (point mutation)
             Most common
        Insertion
        Deletion
   Gaps in alignments
        Added to account for insertions/deletions
        Goal: to obtain optimal alignment
             Most likely to represent the true relationship between homologous sequences
   Consider the following sequences: AATCTATA and AAGATA
   Either 2 insertions in first sequence or 2 deletions in second sequence
   What is the optimal alignment?
   If no gaps allowed, there are three ways the
    sequences can be aligned:

    AATCTATA             AATCTATA             AATCTATA
    || ||                 |                        |||
    AAGATA                AAGATA                AAGATA

   Which alignment is optimal?
   Scoring alignments
       Match score = credit for identical aligned pair
       Mismatch score = penalty for nonidentical residues
       Total score = sum of match and mismatch scores
       Higher score = better alignment
   If gaps are allowed, there are many more ways
    the sequences can be aligned
   Three examples:

    AATCTATA             AATCTATA             AATCTATA
    AAG-AT-A             AA-G-ATA             AA--GATA

   Scoring must now account for gaps
       Gap penalty = penalty for each residue aligned with “–”
       Total score = match + mismatch + gap penalty
    If match = 1, mismatch = 0, and gap penalty = -1,
     what are the scores for these three alignments?

      AATCTATA                  AATCTATA             AATCTATA
      AAG-AT-A                  AA-G-ATA             AA--GATA
      Score = 1                 Score = 3            Score = 3

   First alignment: least likely to represent evolutionary relationship
   Which of these alignments is better?
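
   The three scores can be checked with a short sketch. It assumes the two rows are already aligned to equal length and that "-" marks a gap; score_alignment is a hypothetical helper name.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum match, mismatch, and per-residue gap scores over aligned columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # residue aligned with a gap
        elif a == b:
            total += match      # identical pair
        else:
            total += mismatch   # nonidentical pair
    return total


for bottom in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
    print(bottom, score_alignment("AATCTATA", bottom))
```

   With a per-residue gap penalty alone, the second and third alignments tie, which motivates a second kind of gap penalty.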
                      Gap Penalties
   Is it more likely to have one longer insertion/deletion, or
    multiple smaller ones?
   Two types of gap penalties
   Length penalty
       Penalty for each residue aligned with “-”
   Origination penalty
       Penalty for presence of a gap
       Allows differentiation between alignments with many short gaps
        and those with fewer, longer gaps
       Further penalizes for rare insertion/deletion (indel) events
    If match = 1, mismatch = 0, length penalty = -1,
     and origination penalty = -2, what are the scores
     for these three alignments?

      AATCTATA                  AATCTATA              AATCTATA
      AAG-AT-A                  AA-G-ATA              AA--GATA
      Score = -3                Score = -1            Score = 1

   First alignment: least likely to represent evolutionary relationship
   Now, which of these alignments is better?
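
   A sketch of scoring with both penalties, assuming the origination penalty is charged once per run of consecutive gap characters in the same row and the length penalty once per gap character. The function name is hypothetical.

```python
def score_with_origination(top, bottom, match=1, mismatch=0,
                           length_penalty=-1, origination_penalty=-2):
    """Score an alignment using length and origination gap penalties."""
    total = 0
    previous_gap_row = None  # row ("top"/"bottom") holding the previous gap
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != previous_gap_row:
                total += origination_penalty  # a new gap starts here
            total += length_penalty           # every gap residue costs this
            previous_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            previous_gap_row = None
    return total


for bottom in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
    print(bottom, score_with_origination("AATCTATA", bottom))
```

   The single two-residue gap now scores higher than two separate one-residue gaps, so the tie between the second and third alignments is broken.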
              Terminal Gaps
   Might not actually be indels
   Data could be incomplete
   Sometimes ignored in scoring

    AATCTATAGC
    AAG--ATA--
                  Mismatch Penalties
   Different mismatch scores depending on
    particular nucleotide or amino acid that is
    mismatched
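       Penalize mismatches that are more likely to
        occur (common substitutions) less heavily
       Nucleotides
            Purines (A, G) vs. pyrimidines (C, T)
            Transitions (purine to purine, or pyrimidine to pyrimidine) vs. transversions (purine to pyrimidine, or the reverse)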
                         Scoring Matrices
   Show scores for all non-gap positions in alignment
   For nucleotide sequences:

     Identity (Sparse)            BLAST                     Transition/transversion
         A   T   C   G                A   T   C   G              A   T   C   G
     A   1   0   0   0            A   5  -4  -4  -4          A   1  -5  -5  -1
     T   0   1   0   0            T  -4   5  -4  -4          T  -5   1  -1  -5
     C   0   0   1   0            C  -4  -4   5  -4          C  -5  -1   1  -5
     G   0   0   0   1            G  -4  -4  -4   5          G  -1  -5  -5   1
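
   As an illustration, the transition/transversion matrix can be generated from the purine/pyrimidine classes instead of typed by hand. This is a sketch using the slide's values (match 1, transition -1, transversion -5); the names ts_tv_score and score_ungapped are hypothetical.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def ts_tv_score(a, b, match=1, transition=-1, transversion=-5):
    """Score one nucleotide pair with a transition/transversion scheme."""
    if a == b:
        return match
    # Same chemical class (both purines or both pyrimidines) = transition.
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


# Full 4 x 4 matrix as a dictionary keyed by (row, column).
matrix = {(a, b): ts_tv_score(a, b) for a in "ATCG" for b in "ATCG"}


def score_ungapped(seq1, seq2):
    """Sum matrix scores over the columns of an ungapped alignment."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


print(matrix[("A", "G")], matrix[("A", "T")])
print(score_ungapped("ATCG", "GTCA"))
```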
                 Matrices for Proteins

   Amino acids
    1. Structure and properties
           Substitution of similar AAs
            more likely to retain
            protein function
            (conservative substitution)
           [Figure: structures of leucine, isoleucine, and threonine]
    2. Genetic code
           Minimum number of
            nucleotide substitutions
            needed to convert a codon for one
            amino acid into a codon for another
            Matrices for Proteins

3. Actual observed substitution rates
      Point accepted mutation (PAM)
           Alignment constructed with high similarity (>85%)
           Calculate relative mutability (mj)
               Number of times one amino acid (j) is substituted by
                any other
           Calculate specific substitution (Aij)
               Number of times j is substituted by a specific amino
                acid i
           See Box 2.1 (page 40)
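
   For reference, the way these two quantities are usually combined in the standard Dayhoff formulation is sketched here. The notation may differ from Box 2.1, and the scaling constant \lambda is an assumption of this sketch: it is chosen so that 1% of residues change, and in Dayhoff's method m_j is also normalized by how often amino acid j occurs.

```latex
% m_j    : relative mutability of amino acid j
% A_{ij} : number of times j is substituted by the specific amino acid i
% M_{ij} : probability that j is replaced by i in one PAM unit
M_{ij} = \frac{\lambda \, m_j \, A_{ij}}{\sum_{i \neq j} A_{ij}} \quad (i \neq j),
\qquad
M_{jj} = 1 - \lambda \, m_j
```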
                   PAM Example
   Ambiguities:
       X = ambiguous amino acid
       B = Asn or Asp
       Z = Gln or Glu
       Some algorithms score ambiguity codes explicitly; some count
        them as identical; others ignore them
       If the sequence has many ambiguities, scores may not be
        reliable with certain types of software

   Identical amino acids = highest score
   Conservative substitution = next highest score
   Non-conservative substitution = lowest score
                       PAM Matrices
   A PAM matrix is normalized to represent substitution over
    a fixed period of evolutionary change
   PAM-1
       1 substitution per 100 residues
       Matrix represents probability of AA substitution in time it takes
        for 1% of all residues to be substituted
       Used to compare sequences that are closely related
   PAM-1000
       Used for sequences with distant relationships
   PAM-250
       Commonly used middle ground
                 BLOSUM Matrix
   Also derived from observing substitution rates in
    proteins
   Looks at clusters of amino acid sequences
   Lower numbered matrices used for more
    distantly related sequences
       BLOSUM-45 vs. BLOSUM-80
       BLOSUM-62 is the middle ground and default matrix
        in most protein alignment programs
             PAM and BLOSUM

   BLOSUM 80        BLOSUM 62        BLOSUM 45
   PAM 1            PAM 250          PAM 1000
   Less divergent                    More divergent
                   Types of Scores
   Raw Score
       Protein and nucleotide alignments
       Sum the scores for matches, mismatches, and gaps
   Percent identities
       Protein and nucleotide alignments
       Ratio of residues that match up in both sequences to total
        number of residues compared
   Percent positives
       Protein alignments only
       Matrix values greater than 0 are called positives
       Ratio of positive values to total number of residues compared
                     An Example
   Alignment of mouse and crayfish trypsin

Mouse      I   V   G   G   Y   N   C   E   E   N   S   V   P   Y   Q
Score      5   4   5   5  -3   2  -2  -2  -3   0   0  -1   6  10   4
Crayfish   I   V   G   G   T   D   A   V   L   G   E   F   P   Y   Q


   Raw score = 30
   % Identities = 7/15 = 47%
   % Positives = 8/15 = 53%
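
   The three values can be reproduced from the per-column scores in the example. This sketch assumes a column counts as a positive when its matrix value is greater than zero; the scores are copied from the example rather than looked up in a matrix file.

```python
mouse = "IVGGYNCEENSVPYQ"
crayfish = "IVGGTDAVLGEFPYQ"
# Per-column matrix values, copied from the example alignment.
scores = [5, 4, 5, 5, -3, 2, -2, -2, -3, 0, 0, -1, 6, 10, 4]

raw_score = sum(scores)
identities = sum(a == b for a, b in zip(mouse, crayfish))
positives = sum(s > 0 for s in scores)  # assumption: positive means > 0
compared = len(scores)

print("Raw score =", raw_score)
print(f"% Identities = {identities}/{compared} = {round(100 * identities / compared)}%")
print(f"% Positives = {positives}/{compared} = {round(100 * positives / compared)}%")
```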
        Algorithms for Alignments
   Global
       Dynamic programming
            Breaking a problem down into smaller subproblems, then rebuilding
       Needleman and Wunsch
       Aligns whole sequences
       All gaps accounted for (internal and terminal)
   Semiglobal
       Modification of the Needleman-Wunsch algorithm
       Aligns whole sequences
       Only internal gaps count
   Local
       Smith and Waterman
       Aligns localized regions of similarity
       Ignores residues outside the aligned region (gaps inside it are still penalized)
               Partial Scores Table
   Used to align sequences
   Top and left axes labeled with sequences
   Contains alignment scores for all alignment options
   Used to determine optimal alignment
   Example: alignment of ACTCG and ACAGTAG
   Rules for global alignment:
       Horizontal move = -1 (indicates gap in left axis)
       Vertical move = -1 (indicates gap in top axis)
       Diagonal move = +1 for match or 0 for mismatch
       First row and column are initialized with multiples of gap penalty
Initial Partial Scores Table
           A    C    T    C    G

      0    -1   -2   -3   -4   -5

  A   -1

  C   -2

  A   -3

  G   -4

  T   -5

  A   -6

  G   -7
     Start in the top-left empty box (row A, column A)
     Calculate the possible scores from diagonal, above, and left
     Put the LARGEST (best) score in the box
     Move across table to complete first row
     Move to second row, etc., until table is complete
   First box (row A, column A):
       Diagonal = 0 + 1 (match) = 1
       Top = -1 - 1 = -2
       Left = -1 - 1 = -2
       Largest score = 1
   Second box (row A, column C):
       Diagonal = -1 + 0 (mismatch) = -1
       Top = -2 - 1 = -3
       Left = 1 - 1 = 0
       Largest score = 0
    Completed Table
          A    C    T    C    G
     0   -1   -2   -3   -4   -5
A   -1    1    0   -1   -2   -3
C   -2    0    2    1    0   -1
A   -3   -1    1    2    1    0
G   -4   -2    0    1    2    2
T   -5   -3   -1    1    1    2
A   -6   -4   -2    0    1    1
G   -7   -5   -3   -1    0    2
   Now, trace the optimal path. Start at the bottom right, and move in
    the direction that gave that score. End at the top left.
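
   The table-filling rules can be sketched in a few lines, assuming the scores used in this example (match +1, mismatch 0, gap -1). The name fill_global_table is hypothetical.

```python
def fill_global_table(top, left, match=1, mismatch=0, gap=-1):
    """Fill a partial scores table for global alignment."""
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: multiples of the gap penalty.
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        table[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            diagonal = table[i - 1][j - 1] + pair
            vertical = table[i - 1][j] + gap    # gap in top axis
            horizontal = table[i][j - 1] + gap  # gap in left axis
            table[i][j] = max(diagonal, vertical, horizontal)
    return table


table = fill_global_table("ACTCG", "ACAGTAG")
for label, row in zip(" ACAGTAG", table):
    print(label, " ".join(f"{value:3d}" for value in row))
```

   The bottom-right value is the score of the optimal global alignment.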
    Completed Path (boxes on the optimal path are in brackets)
           A     C     T     C     G
    [0]   -1    -2    -3    -4    -5
A   -1    [1]    0    -1    -2    -3
C   -2     0    [2]    1     0    -1
A   -3    -1    [1]    2     1     0
G   -4    -2    [0]    1     2     2
T   -5    -3    -1    [1]    1     2
A   -6    -4    -2     0    [1]    1
G   -7    -5    -3    -1     0    [2]
   Now, write the alignment…
     Writing the Alignment from the
           Partial Scores Table
   Diagonal arrow means the two residues are aligned
   Vertical arrow means there is a gap in top axis
   Horizontal arrow means there is a gap in left axis
   Build the alignment from the right, one move at a time:
       G      CG     TCG    -TCG    --TCG    C--TCG    AC--TCG
       G      AG     TAG    GTAG    AGTAG    CAGTAG    ACAGTAG
   Final alignment:
       AC--TCG
       ACAGTAG
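
   A sketch of the traceback step, assuming the same scores (match +1, mismatch 0, gap -1). When several moves could have produced a score, this version prefers diagonal, then vertical, then horizontal; that tie-breaking order is an assumption, and other orders can give different alignments with the same score.

```python
def global_align(top, left, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_top, aligned_left) for a global alignment."""
    n, m = len(left), len(top)
    t = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        t[0][j] = j * gap
    for i in range(1, n + 1):
        t[i][0] = i * gap

    def pair(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t[i][j] = max(t[i - 1][j - 1] + pair(i, j),
                          t[i - 1][j] + gap,
                          t[i][j - 1] + gap)

    # Trace back from the bottom right to the top left.
    i, j = n, m
    top_out, left_out = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and t[i][j] == t[i - 1][j - 1] + pair(i, j):
            top_out.append(top[j - 1])      # diagonal: residues aligned
            left_out.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and t[i][j] == t[i - 1][j] + gap:
            top_out.append("-")             # vertical: gap in top axis
            left_out.append(left[i - 1])
            i -= 1
        else:
            top_out.append(top[j - 1])      # horizontal: gap in left axis
            left_out.append("-")
            j -= 1
    return t[n][m], "".join(reversed(top_out)), "".join(reversed(left_out))


score, aligned_top, aligned_left = global_align("ACTCG", "ACAGTAG")
print(score)
print(aligned_top)
print(aligned_left)
```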
            Semiglobal Alignments
   Only internal gaps count
   Do not penalize gaps at ends of sequence
   Rules for semiglobal alignment:
       Horizontal move = -1 (indicates gap in left axis) EXCEPT in
        bottom row
       Vertical move = -1 (indicates gap in top axis) EXCEPT in last
        column
       Diagonal move = +1 for match or 0 for mismatch
       First row and column are initialized to zero
   Example: align ACACTG and ACACTGATCG
        Initial Partial Scores Table
         A   C   A   C   T   G   A   T   C   G

    0    0   0   0   0   0   0   0   0   0   0

A   0

C   0

A   0

C   0

T   0

G   0
   Last box in row A (column G, the last column):
       Diagonal = 0 + 0 (mismatch) = 0
       Top = 0 - 0 (no penalty in last column) = 0
       Left = 0 - 1 = -1
       Largest score = 0

        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   0   0   1   0   0
C   0
A   0
C   0
T   0
G   0
   First box in row G (the bottom row), column A:
       Diagonal = 0 + 0 (mismatch) = 0
       Top = 0 - 1 = -1
       Left = 0 - 0 (no penalty in bottom row) = 0
       Largest score = 0

        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   0   0   1   0   0   0
C   0   0   2   1   2   1   0   0   1   1   0
A   0   1   1   3   2   2   1   1   0   1   1
C   0   0   2   2   4   3   2   1   1   1   1
T   0   0   1   2   3   5   4   3   2   1   1
G   0
            Completed Table
        A   C   A   C   T   G   A   T   C   G

    0   0   0   0   0   0   0   0   0   0   0

A   0   1   0   1   0   0   0   1   0   0   0

C   0   0   2   1   2   1   0   0   1   1   0

A   0   1   1   3   2   2   1   1   0   1   1

C   0   0   2   2   4   3   2   1   1   1   1

T   0   0   1   2   3   5   4   3   2   1   1

G   0   0   0   1   2   4   6   6   6   6   6
    Completed Path and Alignment (boxes on the optimal path are in brackets)
         A    C    A    C    T    G    A    T    C    G
   [0]   0    0    0    0    0    0    0    0    0    0
A   0   [1]   0    1    0    0    0    1    0    0    0
C   0    0   [2]   1    2    1    0    0    1    1    0
A   0    1    1   [3]   2    2    1    1    0    1    1
C   0    0    2    2   [4]   3    2    1    1    1    1
T   0    0    1    2    3   [5]   4    3    2    1    1
G   0    0    0    1    2    4   [6]  [6]  [6]  [6]  [6]

                                 ACACTGATCG
                                 ACACTG----
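
   A sketch of the semiglobal rules as stated in this example: first row and column are zero, horizontal moves are free in the bottom row, and vertical moves are free in the last column. The tie-breaking order in the traceback (diagonal, then vertical, then horizontal) is an assumption, and semiglobal_align is a hypothetical name.

```python
def semiglobal_align(top, left, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_top, aligned_left); terminal gaps are free."""
    n, m = len(left), len(top)
    t = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column stay zero

    def pair(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    def vertical_cost(j):
        return 0 if j == m else gap   # no penalty in the last column

    def horizontal_cost(i):
        return 0 if i == n else gap   # no penalty in the bottom row

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t[i][j] = max(t[i - 1][j - 1] + pair(i, j),
                          t[i - 1][j] + vertical_cost(j),
                          t[i][j - 1] + horizontal_cost(i))

    i, j = n, m
    top_out, left_out = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and t[i][j] == t[i - 1][j - 1] + pair(i, j):
            top_out.append(top[j - 1])
            left_out.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or t[i][j] == t[i - 1][j] + vertical_cost(j)):
            top_out.append("-")            # gap in top axis
            left_out.append(left[i - 1])
            i -= 1
        else:
            top_out.append(top[j - 1])     # gap in left axis
            left_out.append("-")
            j -= 1
    return t[n][m], "".join(reversed(top_out)), "".join(reversed(left_out))


score, aligned_top, aligned_left = semiglobal_align("ACACTGATCG", "ACACTG")
print(score)
print(aligned_top)
print(aligned_left)
```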
                    Local Alignments
   Used to find best matching subsequences within two sequences
   Rules for local alignment:
       Horizontal move = -1
       Vertical move = -1
       Diagonal move = +1 for match or -1 for mismatch
       First row and column are initialized to zero
       Place a zero in the table if all other scores are negative for that box
   When determining path, find highest number on table, and work
    back until you come to a zero
   Example: GCGATATA and AACCTATAGCT
            Completed Table
        A    A   C   C   T   A   T   A   G   C   T

    0   0    0   0   0   0   0   0   0   0   0   0

G   0   0    0   0   0   0   0   0   0   1   0   0

C   0   0    0   1   1   0   0   0   0   0   2   1

G   0   0    0   0   0   0   0   0   0   1   1   1

A   0   1    1   0   0   0   1   0   1   0   0   0

T   0   0    0   0   0   1   0   2   1   0   0   1

A   0   1    1   0   0   0   2   1   3   2   1   0

T   0   0    0   0   0   1   1   3   2   2   1   2

A   0   1    1   0   0   0   2   2   4   3   2   1
                            Alignment (boxes on the optimal path are in brackets)
         A    A    C    C    T    A    T    A    G    C    T
    0    0    0    0    0    0    0    0    0    0    0    0
G   0    0    0    0    0    0    0    0    0    1    0    0
C   0    0    0    1    1    0    0    0    0    0    2    1
G   0    0    0    0    0    0    0    0    0    1    1    1
A   0    1    1    0    0    0    1    0    1    0    0    0
T   0    0    0    0    0   [1]   0    2    1    0    0    1
A   0    1    1    0    0    0   [2]   1    3    2    1    0
T   0    0    0    0    0    1    1   [3]   2    2    1    2
A   0    1    1    0    0    0    2    2   [4]   3    2    1

   Start with highest value; continue until you reach zero

       TATA
       TATA
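
   A sketch of the local alignment rules from this example (match +1, mismatch -1, gap -1, scores floored at zero). It reports the first box holding the highest value and prefers diagonal moves when tracing back; both choices are assumptions, and local_align is a hypothetical name.

```python
def local_align(top, left, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_top, aligned_left) for the best local alignment."""
    n, m = len(left), len(top)
    t = [[0] * (m + 1) for _ in range(n + 1)]  # first row/column are zero
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A zero replaces any box whose candidate scores are all negative.
            t[i][j] = max(0,
                          t[i - 1][j - 1] + pair(i, j),
                          t[i - 1][j] + gap,
                          t[i][j - 1] + gap)
            if t[i][j] > best:
                best, best_pos = t[i][j], (i, j)

    # Start at the highest value and work back until a zero is reached.
    i, j = best_pos
    top_out, left_out = [], []
    while i > 0 and j > 0 and t[i][j] > 0:
        if t[i][j] == t[i - 1][j - 1] + pair(i, j):
            top_out.append(top[j - 1])
            left_out.append(left[i - 1])
            i, j = i - 1, j - 1
        elif t[i][j] == t[i - 1][j] + gap:
            top_out.append("-")
            left_out.append(left[i - 1])
            i -= 1
        else:
            top_out.append(top[j - 1])
            left_out.append("-")
            j -= 1
    return best, "".join(reversed(top_out)), "".join(reversed(left_out))


score, aligned_top, aligned_left = local_align("AACCTATAGCT", "GCGATATA")
print(score)
print(aligned_top)
print(aligned_left)
```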
              Next…BLAST!

   Let’s let the computer do the work…

				