# Lecture 5 Pairwise sequence alignment algorithms

Document Sample

```					           Lecture 5
Pairwise sequence alignment
algorithms

Rachel Karchin
BME 580.488/580.688 and CS 600.488/600.688
Spring 2009
Outline
• Fundamentals and basics
• Exact alignment algorithms
– Global – Needleman-Wunsch
– Local - Smith Waterman
• Heuristics for the real world
– BLAST
Fundamentals

• When we compare sequences, we are looking
for evidence that they have diverged from a
common ancestor by a process of mutation
and selection.

Durbin et al. Biological Sequence Analysis
Fundamentals
• Assume a simplified model of evolution
– Substitutions
– Deletions
– Insertions

Durbin et al. Biological Sequence Analysis
Pairwise alignment basics
Two sequences           GACCC
GACAA

What is the best way to match up their shared equivalent positions?
Pairwise alignment basics
GACCC
Two sequences         GACAA

What is the best way to match up their shared equivalent positions?

GACCC        GACCC--     GACC-C
GACAA        --GACAA     GACAA-

G-ACCC     GACCC-       GACCC-
GACAA-     GA-CAA       GAC-AA
Pairwise alignment basics
Two sequences   GACCC           Tabular representation
GACAA

G         A       C   A      A
G
A
C
C
C
Pairwise alignment basics
GACCC           Tabular representation
Two sequences   GACAA

G         A       C   A      A
G
A
C
C
C
Pairwise alignment basics
GACCC-           Tabular representation
G-ACAA

G   A        C   A      A
G
A
C
C
C
Pairwise alignment basics
GACCC-       Tabular representation
G-ACAA

G   A        C   A      A
G
A
C
C
C
Pairwise alignment
• Which of all of these alignments is the best
one?
Pairwise alignment
• Requires a scoring system that gives points or
penalties for matches and gaps
Pairwise alignment
• Also requires efficient search strategy to select
the best alignment or best k alignments out of
very large number of possible alignments
Scoring system
• “Match” points
• Indel costs
What do we want from a scoring
system?
What do we want from a scoring
system?
• Better scores for pairs of sequences that have
diverged from a common ancestor
• Worse scores for pairs that have not.
What do we want from a search
strategy?
DNA/RNA vs. protein sequences
• Alphabet size changes

4 nucleotide bases                                           20 amino acid residues

http://student.ccbcmd.edu/~gkaiser/biotutorials/dna/images/DNAbases.jpg
Scoring matrices
• “Score” (positive or negative) associated with each
possible amino acid substitution
Inuition of scoring matrices

• Over time, each amino acid residue has a
probability of evolving into any of the other 19
amino acid residues
• We can estimate that probability by counting
substitutions in multiple sequence alignments of
protein families.
Scoring matrices
• PAM matrices invented
by Margaret Dayhoff in
1978.
Scoring matrices
• Reconstructed “ancestral
sequence” for ~70 protein
families.
• Counted number of changes
for each amino acid and
divided by a normalization
factor of “exposure to
mutation” (time)
• Assumes that the observed
substitution is the result of
a single mutation event.
Scoring matrices

Do these look like probabilities?
Scoring matrices

 P  alignment i and j arose through evolution  
Log                                                  
      P  alignment i and j arose by chance     
                                                 

Do these look like probabilities?
Scoring matrices
• Henikoff and Henikoff
(1992)
• Based on amino acid
substitutions in
conserved amino acid
patterns from the
BLOCKS database
• BLOSUMn from blocks
which are n% identical
Scoring matrices
• BLOCKS found with
motif software which
finds patterns of amino
acids at arbitrary
intervals
• Protein families defined
by presence of a
common motif
• Motifs strung together
into ungapped blocks of
sequence
BLOCKS
http://blocks.fhcrc.org/
Nucleotide substitution matrices
• How might you design one?
Gap penalties
• Constant gap penalty
• Linear gap penalty
• Affine gap penalty
Constant Gap penalties
• Simplest approach
– Cost for opening a gap δ
– Every gap gets same penalty, regardless of length
Linear Gap penalties
• Cost for opening a gap δ
• Single cost for gap extenstion ε
Affine Gap penalties
• A linear transformation followed by a
translation
• Gap opening penalty δ and linear gap
extension penalty ε
• Gap of length k gets penalty δ + (k-1) ε
Gap penalties
• Constant gap penalty
• Linear gap penalty
• Affine gap penalty

Which works best?

How do we set parameters for δ and ε?
Constant Gap penalties
Scoring system
PIK3CA COW
First ~100 aa of both   Substitution BLOSUM62
PIK3CB HUMAN
Gap constant δ = -10
Constant Gap penalties
Scoring system
PIK3CA COW
First ~100 aa of both   Substitution BLOSUM62
PIK3CB HUMAN
Gap constant    δ = -5

What has changed?
Constant Gap penalties
Scoring system
PIK3CA COW
First ~100 aa of both   Substitution BLOSUM62
PIK3CB HUMAN
Gap constant     δ = 0

What has changed?
Constant Gap penalties
Scoring system
PIK3CA COW
First ~100 aa of both      Substitution BLOSUM62
PIK3CB HUMAN
Gap constant     δ = 0

What has changed?
How might we assess which
value of δ is best?
Linear Gap penalties
Scoring system
PIK3CA COW
First ~100 aa of both   Substitution BLOSUM62
PIK3CB HUMAN
Gap linear      δ = -10
ε = -1

What has changed?
Linear Gap penalties
• Each gap of one base (or residue) gets
penalized the same, regardless of whether or
not it is contiguous with other gaps
• Alignment with fewer gaps favored
• Penalty for large gap is same as many small
gaps

biological point of view?
Affine Gap penalties
• Biologically, more likely to see a single gap of
10 than 10 single gaps
• We can model this using a gap opening
penalty δ and a gap extension penalty ε
• Gap of length k gets penalty δ + (k-1)ε

Should ε be more or less negative than δ ?
Why?
Affine Gap penalties
• Biologically, more likely to see a single gap of
10 than 10 single gaps
• We can model this using a gap opening
penalty δ and a gap extension penalty ε
• Gap of length k gets penalty δ + (k-1)ε

Should ε be more or less negative than δ ?
Why?

Less. Because a few large gaps are more
realistic than many small gaps. Want to
encourage gap extension.
Gap penalties
pairwise sequence alignment.

http://fasta.bioch.virginia.edu/noptalign/

• Gap penalty functions often written as ω(.)
Dynamic programming for sequence
alignment
• Scoring system
• Search strategy
Dynamic programming for sequence
alignment
• Key insight is that we can align prefixes and
then extend them one position at a time
• Efficient search through the huge alignment
space is done by caching intermediate results
Dynamic programming for sequence
alignment
• General idea               A        C        A        A
0    -2       -2       -2       -2
G -2
Matrix
Initialization   A -2
C -2
Two sequences     C -2
GACCC
ACAA
C -2
Dynamic programming for sequence
alignment
• General idea               A        C        A        A
0    -2       -2       -2       -2
G -2
Matrix
Initialization   A -2
Why?
C -2
Two sequences     C -2
GACCC
ACAA
C -2
Dynamic programming for sequence
alignment
• General idea                                          A        C        A        A
Aligning each nucleotide
with the empty string is
equivalent to putting a gap at
0    -2       -2       -2       -2
the beginning of the alignment.

What kind of gap penalty is being used here?   G -2
Matrix
Initialization                             A -2
Why?
C -2
Two sequences                                 C -2
GACCC
ACAA
C -2
Dynamic programming for sequence
alignment
• General idea                        A        C        A        A
What is this?
0        -2       -2       -2       -2
G -2
Matrix
Initialization      A -2
C -2
Two sequences         C -2
GACCC
ACAA
C -2
Dynamic programming for sequence
alignment
• General idea                                         A        C        A        A
What is this?
Score for aligning two empty sequences     0    -2       -2       -2       -2
G -2
Matrix
Initialization                         A -2
C -2
Two sequences                               C -2
GACCC
ACAA
C -2
Dynamic programming for sequence
alignment
• General idea                                   A        C        A        A

Matrix                              0    -2       -2       -2       -2
Fill
Recurrence relation                    G -2
M(i,j) = max(
M(i-1,j-1) +S(i,j),
A -2
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    C -2
)

C -2
C -2
To fill in each M(i,j), need to look where?

• General idea                                        A          C         A        A

Matrix                              0         -2         -2        -2       -2
Fill
Recurrence relation                    G -2
M(i,j) = max(
M(i-1,j-1) +S(i,j),
A -2
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    C -2
)

C -2
C -2
M(i-1,j-1)   M(i-1,j)

To fill in each M(i,j), need to look where?        M(i,j-1) M(i,j)

• General idea                                        A          C         A                A

Matrix                              0         -2         -2        -2            -2
Fill
Recurrence relation                    G -2
M(i,j) = max(
M(i-1,j-1) +S(i,j),
A -2
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    C -2
)

C -2
C -2
Dynamic programming for sequence
alignment
• General idea                                   A        C        A        A

Matrix                              0    -2       -2       -2       -2
Fill
Recurrence relation                    G -2
M(i,j) = max(
M(i-1,j-1) +S(i,j),
A -2
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    C -2
)

C -2
What else do we need here?

C -2
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0        0
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0        0
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0        0        0
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0        0        0
M(i-1,j) + ω(gap in seq 2)
)
A -2
Scoring system:
Pur   Pyr
C -2
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2

C -2
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
Recurrence relation
A        C        A        A
M(i,j) = max(                            0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G -2   2        0        0        0
)
A -2   0
Scoring system:

S(i,j)
Pur   Pyr               C -2
Pur   2     -1

Pyr   -1    2                 C -2

ω(.)= -2                     C -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
Recurrence relation                                A        C        A       A
M(i,j) = max(
M(i-1,j-1) +S(i,j),             0    -2       -2       -2       -2
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)
)
G -2   2        0        0        0

Scoring system:                    A -2   0
Pur   Pyr
S(i,j)
Pur   2     -1
C -2
Pyr   -1    2
C -2
ω(.)= -2                     C -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0        0        0
M(i-1,j) + ω(gap in seq 2)
)
A -2   0        1        2        2
Scoring system:
Pur   Pyr
C -2   0        2        3        4
S(i,j)
Pur   2     -1
C -2   0        2        4        5
Pyr   -1    2

C -2   0        2        4        6
ω(.)= -2
constant
Dynamic programming for sequence
Matrix
alignment
Fill
A        C        A        A
Recurrence relation

M(i,j) = max(
0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),   G -2   2        0        0        0
M(i-1,j) + ω(gap in seq 2)
)
A -2   0        1        2        2
Scoring system:
Pur   Pyr
C -2   0        2        3        4
S(i,j)
Pur   2     -1
C -2   0        2        4        5
Pyr   -1    2

C -2   0        2        4        6
ω(.)= -2
constant
Dynamic programming for sequence
alignment
Only one “winner” is shown for each max
operation, but there are often ties, with the
winner selected at random. Thus, there are
many optimal alignments
and even more that are slightly less than
optimal (“sub-optimal alignments”)
A        C        A        A
M(i,j) = max(                                     0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)             G -2   2        0        0        0
)
A -2   0        1        2        2
Scoring system:

S(i,j)
Pur    Pyr                    C -2   0        2        3        4
Pur     2      -1

Pyr     -1     2                      C -2   0        2        4        5

ω(.)= -2                            C -2   0        2        4        6
constant
Dynamic programming for sequence
Matrix
alignment       Complexity?
Fill
Recurrence relation
A        C        A        A
M(i,j) = max(                            0    -2       -2       -2       -2
M(i-1,j-1) +S(i,j),
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G -2   2        0        0        0
)
A -2   0        1        2        2
Scoring system:

S(i,j)
Pur   Pyr               C -2   0        2        3        4
Pur   2     -1

Pyr   -1    2                 C -2   0        2        4        5

ω(.)= -2                     C -2   0        2        4        6
constant
Dynamic programming for sequence
alignment
Traceback
A        C        A        A
0    -2       -2       -2       -2
Where do we start?
G -2   2        0        0        0
A -2   0        1        2        2
C -2   0        2        3        4
C -2   0        2        4        5
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                            A        C        A        A
0    -2       -2       -2       -2
Are we doing a global
or local alignment here?   G -2   2        0        0        0
A -2   0        1        2        2
C -2   0        2        3        4
C -2   0        2        4        5
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                             A        C        A        A

Are we doing a global
0    -2       -2       -2       -2
or local alignment here?
G -2   2        0        0        0
A -2   0        1        2        2
C -2   0        2        3        4
Bottom right cell gives   C -2   0        2        4        5
us the optimal global
alignment score
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                                  A        C        A        A
0    -2       -2       -2       -2
Want to identify the alignment
with over the length of both
sequences                      G   -2   2        0        0        0
Backtrack over the winning   A -2       0        1        2        2
path

How?
C -2      0        2        3        4
C -2      0        2        4        5
C -2      0        2        4        6
Dynamic programming for sequence
alignment
Traceback                                  A        C        A        A

Want to recover the (an)
0    -2       -2       -2       -2
optimal alignment over the
length of both sequences         G -2   2        0        0        0
Backtrack over the winning       A -2   0        1        2        2
path

How?                             C -2   0        2        3        4
Have to identify the direction   C -2   0        2        4        5
into which each cell was
reached.
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                                  A        C        A        A

Want to recover the (an)
0    -2       -2       -2       -2
optimal alignment over the
length of both sequences         G -2   2        0        0        0
Backtrack over the winning       A -2   0        1        2        2
path

How?                             C -2   0        2        3        4
Have to identify the direction   C -2   0        2        4        5
into which each cell was
reached.
C -2   0        2        4        6
Forward pointers tell us
Dynamic programming for sequence
alignment
Traceback
A        C        A        A
Follow the path backwards         0    -2       -2       -2       -2
Identify the (a) winning path
G -2   2        0        0        0
A -2   0        1        2        2
C -2   0        2        3        4
C -2   0        2        4        5
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                                 A        C        A        A

0    -2       -2       -2       -2

Identify the (a) winning path   G -2   2        0        0        0
A -2   0        1        2        2
C -2   0        2        3        4
C -2   0        2        4        5
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                                 A        C        A        A
0    -2       -2       -2       -2

Identify the (a) winning path G   -2   2        0        0        0
What does our alignment     A -2       0        1        2        2
look like at this point?
C -2      0        2        3        4
C -2      0        2        4        5
C -2      0        2        4        6
Dynamic programming for sequence
alignment
Traceback
A        C        A        A

Follow the path backwards         0    -2       -2       -2       -2
Identify the (a) winning path   G -2   2        0        0        0
What does our alignment
look like at this point?        A -2   0        1        2        2
C -2   0        2        3        4
CCC
CAA               C -2   0        2        4        5
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback
A        C        A        A

Follow the path backwards         0    -2       -2       -2       -2
Identify the (a) winning path   G -2   2        0        0        0
A -2   0        1        2        2
C -2   0        2        3        4
ACCC
ACAA              C -2   0        2        4        5
C -2   0        2        4        6
Dynamic programming for sequence
alignment
Traceback                                 A        C        A        A

0    -2       -2       -2       -2

Identify the (a) winning path G   -2   2        0        0        0
A -2       0        1        2        2
C -2      0        2        3        4
GACCC          C -2      0        2        4        5
-ACAA

C -2      0        2        4        6
Global pairwise sequence alignment
• Global pairwise sequence alignment by
dynamic programming introduced by
Needleman and Wunsch in
• Known as the Needleman-Wunsch algorithm.
• In what situations might we want a global
sequence alignment?

• In what situations might this not be desirable?
Local pairwise sequence alignment
• General idea                                 C       C       A       A
Recurrence relation
0   0       0       0       0
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),           G 0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)
)                             A 0
Scoring system:
C 0
Pur   Pyr
S(i,j)
Pur   2     -1
C 0
Pyr   -1    2

C 0
ω(.)= -2
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0
)

Zero “win” => start of
A 0
a new alignment (reset)
C 0
Alignment can begin and
end anywhere in the matrix            C 0
C 0
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
C         C          A           A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0        0         0          0           0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0        0
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0       0
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0       0       2
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
C         C          A           A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0        0         0          0           0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0        0         0          2
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0       0       2
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0       0       2       2
)

Scoring system:                 A 0
Pur   Pyr
S(i,j)                               C 0
Pur   2     -1

Pyr   -1    2
C 0
ω(.)= -2                   C 0
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0       0       2       2
)

Scoring system:                 A 0   0       0       2       4
Pur   Pyr
S(i,j)                               C 0   2       2       2       4
Pur   2     -1

Pyr   -1    2
C 0   2       4       4       4
ω(.)= -2                   C 0   2       4       6       6
constant
Local pairwise sequence alignment
Recurrence relation
C       C       A       A
M(i,j) = max(
0,
M(i-1,j-1) +S(i,j),             0   0       0       0       0
M(i,j-1) + ω(gap in seq 1),
M(i-1,j) + ω(gap in seq 2)    G 0   0       0       2       2
)

Scoring system:                 A 0   0       0       2       4
Pur   Pyr
S(i,j)                               C 0   2       2       2       4
Pur   2     -1

Pyr   -1    2
C 0   2       4       4       4
ω(.)= -2                   C 0   2       4       6       6
constant
Local pairwise sequence alignment
Traceback
C       C       A       A
0   0       0       0       0
Want to recover the (an)
optimal alignment over the   G 0   0       0       2       2
length of both sequences

Backtrack over the winning   A 0   0       0       2       4
paths
C 0   2       2       2       4
How?

Where do we start?           C 0   2       4       4       4
C 0   2       4       6       6
Local pairwise sequence alignment
Traceback
C       C       A       A
0   0       0       0       0
Want to recover the (an)
optimal alignment over the     G 0   0       0       2       2
length of both sequences

Backtrack over the winning     A 0   0       0       2       4
paths
C 0   2       2       2       4
How?

Where do we start?             C 0   2       4       4       4
Largest value in the matrix
is score of the best local    C 0   2       4       6       6
alignment between
subsequences
Local pairwise sequence alignment
Traceback
C       C       A       A
0   0       0       0       0
G 0   0       0       2       2
A 0   0       0       2       4
C 0   2       2       2       4
C 0   2       4       4       4
C 0   2       4       6       6
Local pairwise sequence alignment
Traceback
C       C       A       A
0   0       0       0       0
G 0   0       0       2       2
A 0   0       0       2       4
C 0   2       2       2       4
C 0   2       4       4       4
C 0   2       4       6       6
Local pairwise sequence alignment
Traceback
C       C       A       A
0   0       0       0       0
G 0   0       0       2       2
A 0   0       0       2       4
C 0   2       2       2       4
C 0   2       4       4       4
C 0   2       4       6       6
Local pairwise sequence alignment
Traceback
C       C       A       A
What does this alignment
look like?
0   0       0       0       0
G 0   0       0       2       2
A 0   0       0       2       4
C 0   2       2       2       4
C 0   2       4       4       4
C 0   2       4       6       6
Local pairwise sequence alignment
Traceback
C       C       A       A
What does this alignment
look like?
0   0       0       0       0
G 0   0       0       2       2
A 0   0       0       2       4
C 0   2       2       2       4

CCC          C 0   2       4       4       4
CCA
C 0   2       4       6       6
Local pairwise sequence alignment
Traceback
C       C       A       A
For this to work, what is
requirement for expected
0   0       0       0       0
value of a random
match?                      G 0   0       0       2       2
Why?                        A 0   0       0       2       4
C 0   2       2       2       4
C 0   2       4       4       4
C 0   2       4       6       6
Local pairwise sequence alignment
• Elegant theory
– Smith-Waterman (1981)
– Gotoh (affine gap scores) (1982)
• Trend in past ten years has been towards
heuristic, approximation approaches
Motivation for heuristic alignments
• For two sequences of length m and n
– Time to get optimal alignment score is Θ(mn)
• Θ notation like O notation but provides upper and
lower bounds on asymptotic behavior (O is only for
upper)
– Space required to recover the alignment (naively Θ(mn)
but can reduce to Θ(m+n)
• Not realistic for large scale problems
– Sequence (or sequence database size) in the billions
– We are now in the era of whole-genome alignment
Big picture strategy for
heuristic alignments
• Do fast preprocessing to identify regions
where more expensive algorithms can be used

Generate candidate       Filter candidates by
patterns that indicate   seeking a high scoring
presence of a high       alignment near                Repeat with
scoring alignment        each one                      progressively
more
sensitive
algorithms

Courtesy of WU lecture 1
Big picture strategy for
heuristic alignments

– What are good candidate patterns?
– How do we generate and find them efficiently?
– How do we eliminate redundant candidates?
– How do we search for alignments near
candidates?
– How do we score (and filter) alignments?
Courtesy of WU lectures on alignment problems
BLAST
Pairwise
sequence alignments   Score/Pval
Target
S1
S2
S3
Database                           S4
S5
S6
S7
S8
S9
BLAST
Biologically real alignments very likely
to contain a short stretch of identities
or very high scoring matches.

Target

Database

Parallel computing for bioinformatics and computational biology, Zomaya
BLAST

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

Q   ACTGA       D   GACTGC

Word List
ACT
CTG
TGA

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

Q   ACTGA       D   GACTGC

Word List   Seeds?
ACT
CTG                What are they?
TGA

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

Q   ACTGA       D     GACTGC

Word List   Seeds?
ACT           ACT
CTG           CTG
TGA

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

Q     ACTGA                   D        GACTGC

Word List                  Seeds?
ACT                       ACT
CTG                       CTG
TGA

What kind of data structures might you use to store the words?
The word/seed associations?      Parallel computing for bioinformatics and computational biology,   Wu and Tseng,
Editor: Zomaya
BLAST

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

HSP
(high-scoring pair)

GACTGC                           GACTGC
ACT                              ACTG
CTG
D     GACTGC

Seeds?
ACT
CTG

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

• Combine consistent HSPs
– They can’t overlap
– They must be in same order in Q and D
sequences
– These “consistent local aligments” are assessed
for statistical significance

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
BLAST

• Parameterization includes choice of W (word
length) and T (threshold for scores in the
initial word list)

How will make W and T longer or shorter impact the final
list of statistically significant local alignments produced by BLAST?

Parallel computing for bioinformatics and computational biology, Wu and Tseng,
Editor: Zomaya
What you should know
• What alignments mean in terms of a model of
sequence evolution
• Start to develop intuition about dynamic
programming and how to use it
• Basic ideas behind heuristic alignment
approach such as BLAST
References
• Materials for this lecture were taken from many
sources including the following web pages, which
you may want to explore on your own
http://www.life.umd.edu/labs/delwiche/bsci348s/lec/PAMmatrices.html

http://web.cecs.pdx.edu/~ps/CapStone03/dynvis/SimilarityApplet.html
http://fasta.bioch.virginia.edu/noptalign/

```
DOCUMENT INFO
Shared By:
Categories:
Stats:
 views: 15 posted: 6/1/2010 language: English pages: 117
How are you planning on using Docstoc?