# Sequence Alignment by wuyunqing

```
Sequence Alignments

BIOL/CHEM 4900

   Chapter 2 in your textbook
Sequence Alignments
   Alignment between characters found in two or more nucleotide or amino acid sequences
   Characters are amino acid residues or nucleotides
   Similarity between sequences
   What can this tell you?
   In this chapter:
   How do we align two or more sequences?
   How do we evaluate these alignments?
   What conclusions can we make based on these alignments?
Dot Plots
   Used to visualize regions of similarity
   One sequence placed on the x-axis, the other on the y-axis
   Dots are placed in the plot where the two sequences are identical
   Diagonal lines in plot indicate regions of similarity
   Example: compare ATCG to GATC

C           •
T       •
A   •
G               •
    A   T   C   G

   Disadvantages: only gives regions of similarity, not actual alignment
   What would the dot plot look like with longer sequences?
Noise in Dot Plots
   Control by adjusting the following
   Window size
   Similarity cutoff
   Removing too much noise might conceal small regions of similarity
   Example: GCTAGTCAGA and GATGGTCACA

[Figure: two dot plots of GCTAGTCAGA (x-axis) against GATGGTCACA (y-axis, read bottom to top)]
   Left: window of 1, similarity cutoff of 1
   Right: window of 4, similarity cutoff of 3 (complete this plot!)
Dot Plots in Excel
A
C
A
C                           •
T                       •
G                   •
G               •
T           •
A
G
    G   C   T   A   G   T   C   A   G   A
Try the DotPlot Program
   It will automatically save the program and several files to your desktop
   Open DotPlot application
   Load sequences as FASTA text files
   File, Open Horizontal, Browse
   File, Open Vertical, Browse
   Parameters menu changes length and cutoff
   Draw, Identities shows plot
   Clear the screen after changing parameters to visualize the new plot
   Example: Bos taurus and porcine myoglobin mRNA sequences (sequences on course website)
Simple Alignments
   Molecular changes occur when organisms evolve
   Mutation
   Most common
   Insertion
   Deletion
   Gaps in alignments
   Added to account for insertions/deletions
   Goal: to obtain optimal alignment
   Most likely to represent the true relationship between homologous sequences
   Consider the following sequences: AATCTATA and AAGATA
   Either 2 insertions in first sequence or 2 deletions in second sequence
   What is the optimal alignment?
   If no gaps are allowed, there are three ways the sequences can be aligned:

AATCTATA             AATCTATA             AATCTATA
||  ||                |                        |||
AAGATA                AAGATA                AAGATA

   Which alignment is optimal?
   Scoring alignments
   Match score = credit for identical aligned pair
   Mismatch score = penalty for nonidentical residues
   Total score = sum of match and mismatch scores
   Higher score = better alignment
   If gaps are allowed, there are many more ways the sequences can be aligned
   Three examples:

AATCTATA             AATCTATA             AATCTATA
AAG-AT-A             AA-G-ATA             AA--GATA

   Scoring must now account for gaps
   Gap penalty = penalty for each residue aligned with “–”
   Total score = match + mismatch + gap penalty
   If match = 1, mismatch = 0, and gap penalty = -1, what are the scores for these three alignments?

AATCTATA             AATCTATA             AATCTATA
AAG-AT-A             AA-G-ATA             AA--GATA

Score = 1            Score = 3            Score = 3

   The first alignment is least likely to represent the evolutionary relationship
   Which of the other two alignments is better?
Gap Penalties
   Is it more likely to have one longer insertion/deletion, or multiple smaller ones?
   Two types of gap penalties
   Length penalty
   Penalty for each residue aligned with “-”
   Origination penalty
   Penalty for presence of a gap
   Allows differentiation between alignments with many short gaps and those with fewer, longer gaps
   Further penalizes for rare insertion/deletion (indel) events
   If match = 1, mismatch = 0, length penalty = -1, and origination penalty = -2, what are the scores for these three alignments?

AATCTATA             AATCTATA             AATCTATA
AAG-AT-A             AA-G-ATA             AA--GATA

Score = -3           Score = -1           Score = 1

   The first alignment is least likely to represent the evolutionary relationship
   Now, which of the other two alignments is better?
Terminal Gaps
   Might not actually be indels
   Data could be incomplete
   Sometimes ignored in scoring

AATCTATAGC
AAG--ATA--
Mismatch Penalties
   Different mismatch scores depending on particular nucleotide or amino acid that is mismatched
   Reward mismatches that are more likely to occur (common substitutions)
   Nucleotides
   Purine vs. pyrimidine
   Transitions vs. transversions
Scoring Matrices
   Show scores for all non-gap positions in alignment
   For nucleotide sequences:

Identity (Sparse)         BLAST                     Transition/transversion

     A    T    C    G          A    T    C    G          A    T    C    G
A    1    0    0    0     A    5   -4   -4   -4     A    1   -5   -5   -1
T    0    1    0    0     T   -4    5   -4   -4     T   -5    1   -1   -5
C    0    0    1    0     C   -4   -4    5   -4     C   -5   -1    1   -5
G    0    0    0    1     G   -4   -4   -4    5     G   -1   -5   -5    1
Matrices for Proteins

   Amino acids
1. Structure and properties
   Substitution of similar AAs more likely to retain protein function (conservative substitution)
   [Figure: structures of leucine, isoleucine, and threonine]
2. Genetic code
   Minimum number of nucleotide substitutions needed to convert a codon for one amino acid into a codon for another
Matrices for Proteins

3. Actual observed substitution rates
   Point accepted mutation (PAM)
   Alignment constructed with high similarity (>85%)
   Calculate relative mutability (mj)
     Number of times one amino acid (j) is substituted by any other
   Calculate specific substitution (Aij)
     Number of times j is substituted by a specific amino acid i
   See Box 2.1 (page 40)
PAM Example
   Ambiguities:
   X = ambiguous amino acid
   B = Asn or Asp
   Z = Gln or Glu
   Some algorithms take ambiguities into account and score them; some count them as identical; others ignore them
   If the sequence has lots of ambiguities, scores may not be reliable with certain types of software

   Identical amino acids = highest score
   Conservative substitution = next highest score
   Non-conservative substitution = lowest score
PAM Matrices
   PAM matrix is normalized to represent substitution over a fixed period of evolutionary change
   PAM-1
   1 substitution per 100 residues
   Matrix represents probability of AA substitution in time it takes for 1% of all residues to be substituted
   Used to compare sequences that are closely related
   PAM-1000
   Used for sequences with distant relationships
   PAM-250
   Commonly used middle ground
BLOSUM Matrix
   Also derived from observing substitution rates in proteins
   Looks at clusters of amino acid sequences
   Lower numbered matrices used for more distantly related sequences
   BLOSUM-45 vs. BLOSUM-80
   BLOSUM-62 is the middle ground and default matrix in most protein alignment programs
PAM and BLOSUM

Less divergent                          More divergent
BLOSUM 80        BLOSUM 62        BLOSUM 45
PAM 1            PAM 250          PAM 1000
Types of Scores
   Raw Score
   Protein and nucleotide alignments
   Sum the scores for matches, mismatches, and gaps
   Percent identities
   Protein and nucleotide alignments
   Ratio of residues that match up in both sequences to total number of residues compared
   Percent positives
   Protein alignments only
   Matrix values greater than 0 are called positives
   Ratio of positive values to total number of residues compared
An Example
   Alignment of mouse and crayfish trypsin

Mouse      I   V   G   G   Y   N   C   E   E   N   S   V   P   Y   Q
Score      5   4   5   5  -3   2  -2  -2  -3   0   0  -1   6  10   4
Crayfish   I   V   G   G   T   D   A   V   L   G   E   F   P   Y   Q

   Raw score = 30
   % Identities = 7/15 = 47%
   % Positives = 8/15 = 53%
Algorithms for Alignments
   Global
   Dynamic programming
   Breaking a problem down into smaller subproblems, then rebuilding
   Needleman and Wunsch
   Aligns whole sequences
   All gaps accounted for (internal and terminal)
   Semiglobal
   Revised by Needleman and Wunsch
   Aligns whole sequences
   Only internal gaps count
   Local
   Smith and Waterman
   Aligns localized regions of similarity
   Ignores residues outside the aligned regions; gaps within them are still penalized
Partial Scores Table
   Used to align sequences
   Top and left axes labeled with sequences
   Contains alignment scores for all alignment options
   Used to determine optimal alignment
   Example: alignment of ACTCG and ACAGTAG
   Rules for global alignment:
   Horizontal move = -1 (indicates gap in left axis)
   Vertical move = -1 (indicates gap in top axis)
   Diagonal move = +1 for match or 0 for mismatch
   First row and column are initialized with multiples of gap penalty
Initial Partial Scores Table
          A    C    T    C    G
     0   -1   -2   -3   -4   -5
A   -1  [ ]
C   -2
A   -3
G   -4
T   -5
A   -6
G   -7
   Start in outlined box
   Calculate the possible scores from diagonal, above, and left
   Put the LARGEST (best) score in the box
   Move across table to complete first row
   Move to second row, etc., until table is complete
Diagonal = 0 + 1 (match) = 1
Top = -1 - 1 = -2
Left = -1 - 1 = -2

          A    C    T    C    G
     0   -1   -2   -3   -4   -5
A   -1  [ ]
C   -2
A   -3
G   -4
T   -5
A   -6
G   -7
Diagonal = -1 + 0 (mismatch) = -1
Top = -2 - 1 = -3
Left = 1 - 1 = 0

          A    C    T    C    G
     0   -1   -2   -3   -4   -5
A   -1    1
C   -2
A   -3
G   -4
T   -5
A   -6
G   -7
Completed Table
          A    C    T    C    G
     0   -1   -2   -3   -4   -5
A   -1    1    0   -1   -2   -3
C   -2    0    2    1    0   -1
A   -3   -1    1    2    1    0
G   -4   -2    0    1    2    2
T   -5   -3   -1    1    1    2
A   -6   -4   -2    0    1    1
G   -7   -5   -3   -1    0    2

   Now, trace the optimal path. Start at the bottom right, and move in the direction that gave that score. End at the top left.
Completed Path
          A    C    T    C    G
     0   -1   -2   -3   -4   -5
A   -1    1    0   -1   -2   -3
C   -2    0    2    1    0   -1
A   -3   -1    1    2    1    0
G   -4   -2    0    1    2    2
T   -5   -3   -1    1    1    2
A   -6   -4   -2    0    1    1
G   -7   -5   -3   -1    0    2

   Path from bottom right to top left: diagonal, diagonal, diagonal, vertical, vertical, diagonal, diagonal
   Now, write the alignment…
Writing the Alignment from the Partial Scores Table
   A diagonal arrow means the two residues are aligned
   A vertical arrow means there is a gap in top axis
   A horizontal arrow means there is a gap in left axis
   Build the alignment from right to left while following the path:

G      CG     TCG    -TCG    --TCG    C--TCG    AC--TCG
G      AG     TAG    GTAG    AGTAG    CAGTAG    ACAGTAG
Semiglobal Alignments
   Only internal gaps count
   Do not penalize gaps at ends of sequence
   Rules for semiglobal alignment:
   Horizontal move = -1 (indicates gap in left axis) EXCEPT in bottom row
   Vertical move = -1 (indicates gap in top axis) EXCEPT in last column
   Diagonal move = +1 for match or 0 for mismatch
   First row and column are initialized to zero
   Example: align ACACTG and ACACTGATCG
Initial Partial Scores Table
        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0
C   0
A   0
C   0
T   0
G   0
Last cell of the first row:
Diagonal = 0 + 0 (mismatch) = 0
Top = 0 - 0 (no penalty last column) = 0
Left = 0 - 1 = -1

        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   0   0   1   0   0
C   0
A   0
C   0
T   0
G   0
First cell of the bottom row:
Diagonal = 0 + 0 (mismatch) = 0
Top = 0 - 1 = -1
Left = 0 - 0 (no penalty last row) = 0

        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   0   0   1   0   0   0
C   0   0   2   1   2   1   0   0   1   1   0
A   0   1   1   3   2   2   1   1   0   1   1
C   0   0   2   2   4   3   2   1   1   1   1
T   0   0   1   2   3   5   4   3   2   1   1
G   0
Completed Table
        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   0   0   1   0   0   0
C   0   0   2   1   2   1   0   0   1   1   0
A   0   1   1   3   2   2   1   1   0   1   1
C   0   0   2   2   4   3   2   1   1   1   1
T   0   0   1   2   3   5   4   3   2   1   1
G   0   0   0   1   2   4   6   6   6   6   6
Completed Path and Alignment
        A   C   A   C   T   G   A   T   C   G
    0   0   0   0   0   0   0   0   0   0   0
A   0   1   0   1   0   0   0   1   0   0   0
C   0   0   2   1   2   1   0   0   1   1   0
A   0   1   1   3   2   2   1   1   0   1   1
C   0   0   2   2   4   3   2   1   1   1   1
T   0   0   1   2   3   5   4   3   2   1   1
G   0   0   0   1   2   4   6   6   6   6   6

   Path from bottom right: four horizontal moves along the bottom row (no penalty), then six diagonal moves to the top left

ACACTGATCG
ACACTG----
Local Alignments
   Used to find best matching subsequences within two sequences
   Rules for local alignment:
   Horizontal move = -1
   Vertical move = -1
   Diagonal move = +1 for match or -1 for mismatch
   First row and column are initialized to zero
   Place a zero in the table if all other scores are negative for that box
   When determining path, find highest number on table, and work back until you come to a zero
   Example: GCGATATA and AACCTATAGCT
Completed Table
        A   A   C   C   T   A   T   A   G   C   T
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   0   0   0   0   1   0   0
C   0   0   0   1   1   0   0   0   0   0   2   1
G   0   0   0   0   0   0   0   0   0   1   1   1
A   0   1   1   0   0   0   1   0   1   0   0   0
T   0   0   0   0   0   1   0   2   1   0   0   1
A   0   1   1   0   0   0   2   1   3   2   1   0
T   0   0   0   0   0   1   1   3   2   2   1   2
A   0   1   1   0   0   0   2   2   4   3   2   1
Alignment
        A   A   C   C   T   A   T   A   G   C   T
    0   0   0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   0   0   0   0   1   0   0
C   0   0   0   1   1   0   0   0   0   0   2   1
G   0   0   0   0   0   0   0   0   0   1   1   1
A   0   1   1   0   0   0   1   0   1   0   0   0
T   0   0   0   0   0   1   0   2   1   0   0   1
A   0   1   1   0   0   0   2   1   3   2   1   0
T   0   0   0   0   0   1   1   3   2   2   1   2
A   0   1   1   0   0   0   2   2   4   3   2   1

   Start at the highest value; continue until you reach zero

TATA
TATA
Next…BLAST!

   Let’s let the computer do the work…

```
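
The global alignment rules from the slides (diagonal move = +1 for a match or 0 for a mismatch, horizontal or vertical move = -1, first row and column initialized with multiples of the gap penalty) can be sketched in Python as follows. This is a minimal illustration and not the course's own software: the function name `global_alignment` and its parameter names are assumptions made for this example, and ties during traceback are broken by preferring the diagonal move, which is an arbitrary choice.

```python
def global_alignment(top, left, match=1, mismatch=0, gap=-1):
    """Fill a partial scores table and trace back one optimal global alignment.

    `top` labels the columns and `left` labels the rows, as on the slides.
    All names here are hypothetical and chosen only for this sketch.
    """
    rows, cols = len(left) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column are initialized with multiples of the gap penalty.
    for j in range(cols):
        table[0][j] = j * gap
    for i in range(rows):
        table[i][0] = i * gap

    # Each box takes the largest of the diagonal, vertical, and horizontal scores.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if left[i - 1] == top[j - 1] else mismatch
            diagonal = table[i - 1][j - 1] + step
            vertical = table[i - 1][j] + gap      # gap in the top-axis sequence
            horizontal = table[i][j - 1] + gap    # gap in the left-axis sequence
            table[i][j] = max(diagonal, vertical, horizontal)

    # Trace back from the bottom right to the top left.
    aligned_top, aligned_left = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        came_diagonally = False
        if i > 0 and j > 0:
            step = match if left[i - 1] == top[j - 1] else mismatch
            came_diagonally = table[i][j] == table[i - 1][j - 1] + step
        if came_diagonally:
            aligned_top.append(top[j - 1])
            aligned_left.append(left[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            aligned_top.append("-")
            aligned_left.append(left[i - 1])
            i -= 1
        else:
            aligned_top.append(top[j - 1])
            aligned_left.append("-")
            j -= 1
    return table, "".join(reversed(aligned_top)), "".join(reversed(aligned_left))


# The example from the slides: ACTCG on the top axis, ACAGTAG on the left axis.
table, top_row, left_row = global_alignment("ACTCG", "ACAGTAG")
for row in table:
    print(" ".join(f"{value:3d}" for value in row))
print(top_row)
print(left_row)
```

Run on the slide example, the table rows should agree with the completed partial scores table, and the traceback yields the same alignment the slides write out. Semiglobal and local alignment would need different initialization and move rules, which this sketch does not cover.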