# The Needleman-Wunsch algorithm for sequence alignment

```
Pooja Anshul Saxena
Engr 692: Special Topics – Computational Biology
Global Sequence Alignment
 The Needleman–Wunsch algorithm
performs a global alignment on two
sequences
 It is an example of dynamic programming,
and was the first application of dynamic
programming to biological sequence
comparison
 Suitable when the two sequences are of
similar length, with a significant degree of
similarity throughout
 Aim: The best alignment over the entire
length of two sequences
Three steps in Needleman-
Wunsch Algorithm
 Initialization
 Scoring
 Trace back (Alignment)
 Consider the two DNA sequences to be
globally aligned:
ATCG (x=4, length of sequence 1)
TCG (y=3, length of sequence 2)
Scoring Scheme

 Match Score = +1
 Mismatch Score = -1
 Gap penalty = -1
 Substitution Matrix

     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Initialization Step
 Create a matrix with x + 1 rows and y + 1
columns
 The 1st row and the 1st column of the
score matrix are filled with multiples of the
gap penalty
          T    C    G
     0   -1   -2   -3
A   -1
T   -2
C   -3
G   -4
Scoring
 The score of any cell C(i, j) is the
maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
where S(i, j) is the substitution score for
the two letters compared at cell (i, j), and g
is the gap penalty
Scoring ….
 Example:
The calculation for the cell C(2, 2), counting
rows and columns from 1 (A compared with T):
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = -1 + (-1) = -2
scoreleft = C(i, j-1) + g = -1 + (-1) = -2
C(2, 2) = max(-1, -2, -2) = -1
          T    C    G
     0   -1   -2   -3
A   -1   -1
T   -2
C   -3
G   -4
Scoring ….
 Final Scoring Matrix

          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2

Note: The last (bottom-right) cell always holds
the optimal global alignment score; here it is 2
Trace back
 The trace back step determines the
actual alignment(s) that result in the
maximum score
 There may be several alignments with the
same maximum score
 Trace back starts from the last cell, i.e.
the bottom-right corner of the matrix
 Gives alignment in reverse order
Trace back ….
 There are three possible moves: diagonally
(toward the top-left corner of the matrix),
up, or left
 Trace back takes the current cell and looks
to the neighbor cells that could be direct
predecessors. This means it looks to the
neighbor to the left (gap in sequence #1),
the diagonal neighbor (match/mismatch),
and the neighbor above it (gap in sequence
#2). A neighbor is a possible predecessor if
the current score can be obtained from it by
the scoring rule. Trace back chooses one of
the possible predecessors as the next cell
in the path
Trace back ….
          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2

 For the last cell (score 2), the only possible predecessor is the
diagonal match/mismatch neighbor (1 + 1 = 2). If more than one possible
predecessor exists, any can be chosen. This gives us a current alignment of
Seq 1: G
       |
Seq 2: G
Trace back ….
 Final Trace back
          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2

Path from the last cell: 2, 1, 0, -1 (three diagonal moves), then up to 0

Best Alignment:
ATCG
 |||
_TCG
Local Sequence Alignment
 The Smith-Waterman algorithm performs
a local alignment on two sequences
 It is an example of dynamic programming
 Useful for dissimilar sequences that are
suspected to contain regions of similarity or
similar sequence motifs within their larger
sequence context
 Aim: The best alignment over the
conserved domain of two sequences
Differences in Needleman-
Wunsch and Smith-Waterman
Algorithms:
 In the initialization stage, the first row
and first column are all filled in with 0s
 While filling the matrix, if a score
becomes negative, put in 0 instead
 For the trace back, start from the cell that
has the highest score and work back
until a cell with a score of 0 is reached
Three steps in Smith-Waterman
Algorithm
 Initialization
 Scoring
 Trace back (Alignment)
 Consider the two DNA sequences to be
locally aligned:
ATCG (x=4, length of sequence 1)
TCG (y=3, length of sequence 2)
Scoring Scheme

 Match Score = +1
 Mismatch Score = -1
 Gap penalty = -1
 Substitution Matrix

     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Initialization Step
 Create a matrix with x + 1 rows and y + 1
columns
 The 1st row and the 1st column of the
score matrix are filled with 0s

          T    C    G
     0    0    0    0
A    0
T    0
C    0
G    0
Scoring
 The score of any cell C(i, j) is the
maximum of:
scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
and
0
(here S(i, j) is the substitution score for
the two letters compared at cell (i, j), and g
is the gap penalty)
Scoring ….
 Example:
The calculation for the cell C(2, 2), counting
rows and columns from 1 (A compared with T):
scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
scoreup = C(i-1, j) + g = 0 + (-1) = -1
scoreleft = C(i, j-1) + g = 0 + (-1) = -1
C(2, 2) = max(-1, -1, -1, 0) = 0
          T    C    G
     0    0    0    0
A    0    0
T    0
C    0
G    0
Scoring ….
 Final Scoring Matrix

          T    C    G
     0    0    0    0
A    0    0    0    0
T    0    1    0    0
C    0    0    2    1
G    0    0    1    3

Note: The last cell does not necessarily hold
the maximum score; the maximum can occur
anywhere in the matrix (in this example it
happens to be the last cell, with score 3)
Trace back
 The trace back step determines the
actual alignment(s) that result in the
maximum score
 There may be several alignments with the
same maximum score
 Trace back starts from the cell with
maximum value in the matrix
 Gives alignment in reverse order
Trace back ….
   There are three possible moves: diagonally
(toward the top-left corner of the matrix), up, or
left
 Trace back takes the current cell and looks to
the neighbor cells that could be direct
predecessors. This means it looks to the
neighbor to the left (gap in sequence #1), the
diagonal neighbor (match/mismatch), and the
neighbor above it (gap in sequence #2). Trace
back chooses one of the possible predecessors
as the next cell in the path. This continues
until a cell with value 0 is reached.
Trace back ….
          T    C    G
     0    0    0    0
A    0    0    0    0
T    0    1    0    0
C    0    0    2    1
G    0    0    1    3

 For the cell with the maximum score (3), the only possible predecessor
is the diagonal match/mismatch neighbor (2 + 1 = 3). If more than one
possible predecessor exists, any can be chosen. This gives us a current
alignment of
Seq 1: G
       |
Seq 2: G
Trace back ….
 Final Trace back
          T    C    G
     0    0    0    0
A    0    0    0    0
T    0    1    0    0
C    0    0    2    1
G    0    0    1    3

Path from the maximum cell: 3, 2, 1 (diagonal moves), stopping at 0

Best Alignment:
TCG
|||
TCG

```
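The three Needleman-Wunsch steps from the slides (initialization, scoring, trace back) can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `needleman_wunsch`, its parameter names, and the tie-breaking order in the trace back (diagonal, then up, then left) are assumptions. It uses the slides' scoring scheme (match +1, mismatch -1, gap -1), puts sequence 1 on the rows and sequence 2 on the columns, and writes a gap as `_`.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment sketch: seq1 runs down the rows, seq2 across the columns."""

    def sub(a, b):
        # Substitution score S: +1 on the diagonal of the matrix, -1 elsewhere.
        return match if a == b else mismatch

    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: first column and first row are multiples of the gap penalty.
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap

    # Scoring: each cell is the maximum of the diagonal, up, and left candidates.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + sub(seq1[i - 1], seq2[j - 1])
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)

    # Trace back from the last cell to the top-left corner.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                score[i][j] == score[i - 1][j - 1] + sub(seq1[i - 1], seq2[j - 1]):
            top.append(seq1[i - 1])      # diagonal: match or mismatch
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])      # up: gap in sequence 2
            bottom.append("_")
            i -= 1
        else:
            top.append("_")              # left: gap in sequence 1
            bottom.append(seq2[j - 1])
            j -= 1

    # The alignment was collected in reverse order.
    return score, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ATCG", "TCG")
print(score[-1][-1])  # optimal global score, stored in the last cell
print(top)
print(bottom)
```

For ATCG and TCG this reproduces the final scoring matrix from the slides, with 2 in the last cell, and the alignment ATCG over _TCG. When several predecessors are possible, a different tie-breaking order can return a different alignment with the same score.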
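The Smith-Waterman variant differs from Needleman-Wunsch in the three ways the slides list: a zero border, scores floored at 0, and a trace back that starts at the highest-scoring cell and stops at a 0. A minimal Python sketch under the same scoring scheme (match +1, mismatch -1, gap -1) is given below. The function name `smith_waterman`, its parameters, and the tie-breaking order (diagonal, then up, then left) are assumptions made for illustration, not the author's code.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment sketch: seq1 runs down the rows, seq2 across the columns."""

    def sub(a, b):
        # Substitution score S: +1 for identical letters, -1 otherwise.
        return match if a == b else mismatch

    rows, cols = len(seq1) + 1, len(seq2) + 1
    # Initialization: the first row and first column stay at 0.
    score = [[0] * cols for _ in range(rows)]

    # Scoring: like the global case, but a negative score is replaced by 0.
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + sub(seq1[i - 1], seq2[j - 1])
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the maximum cell until a cell with score 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(seq1[i - 1], seq2[j - 1]):
            top.append(seq1[i - 1])      # diagonal: match or mismatch
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])      # up: gap in sequence 2
            bottom.append("_")
            i -= 1
        else:
            top.append("_")              # left: gap in sequence 1
            bottom.append(seq2[j - 1])
            j -= 1

    # The alignment was collected in reverse order.
    return score, best, "".join(reversed(top)), "".join(reversed(bottom))


score, best, top, bottom = smith_waterman("ATCG", "TCG")
print(best)
print(top)
print(bottom)
```

For ATCG and TCG the maximum score is 3 and the local alignment is TCG over TCG; the leading A of sequence 1 is left out because the trace back stops at the first 0.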