# CS420 lecture one Problems_ algorithms

Document Sample

```					         CS575dl
Smith-Waterman
Sequence Alignment in CUDA
Doug Hains, Wim Bohm
CS, CSU
Sequence Alignment
did you mean occurrence?
using a measure of similarity of strings based
on gaps and mismatches:
o-currance
occurrence
1 gap, 1 mismatch
Bio Informatics
• Genome: string of DNA {A,C,G,T} characters
• Amino acid: string of proteins (up to 20 characters)
• Important question: How "nearly similar" is string X
compared to string Y?
• 1970-s Needleman and Wunsch (NW) defined
"similarity" and a Dynamic Programming solution
• 1980-s Smith and Waterman (SW) slightly changed
similarity definition and hence algorithm
• There is a CUDA implementation CUDASW++ of SW
that we improved
Formulation
• X: x1x2...xm      Y:y1y2...yn
• A matching is a set of ordered pairs (i,j), i from
{1,...,m} and j from {1,...,n}, so that each i and j occur
in at most one pair: stop vs tops (2,1) (3,2) (4,3)
• An alignment is a matching without crossings: if (i,j)
and (i',j') are in the alignment then i<i' implies j<j'

j-s
i-s
Formulation
• X: x1x2...xm      Y:y1y2...yn
• A matching is a set of ordered pairs (i,j), i from
{1,...,m} and j from {1,...,n}, so that each i and j occur
in at most one pair: stop vs tops (2,1) (3,2) (4,3)
• An alignment is a matching without crossings: if (i,j)
and (i',j') are in the alignment then i<i' implies j<j'

j' < j
j-s
i-s
i < i'
Optimal Sequence Alignment
Let M be an alignment between X and Y, then
1) There is a parameter g, defining the gap penalty,
for each position in X or Y not matched.
2) There is a match/mismatch scoring matrix Spq for
lining up p with q.
3) Matches, mismatches and gaps infer an alignment
score of M. Match scores are positive, mismatches
and gaps are negative.
4) We want to maximize the score.
example
• Two alignments for aabc and abb
aabc
-abb g + Saa + Sbb + Scb

aabc
abb-     Saa + Sab + Sbb + g

best alignment determined by the values of S and g
Structure
Either (Xm,Yn), the last chars of X and Y, are in the
optimal alignment, or not. If not, Xm and Yn
cannot both be in another alignment, because
that would cause a crossing. Eg g in X and c in Y
cannot both match with t in

a a     t   g
a a     t   c
Structure
• Therefore, in an optimal alignment M, 1 of the
following holds (we don't align two gaps):
(m,n) in M aligned, either matched or mismatched
m not in M (mth pos of X not matched)
n not in M (nth pos of Y not matched)
• This gives rise to a recurrence
Cij= max( Ci-1,j-1+SXi,Yj, Ci-1,j+g, Ci,j-1+g)
Ci0= i* gap penalty
C0j= j* gap penalty
Dynamic programming solution
Cij= max( Ci-1,j-1+ SXi,Yj, Ci-1,j+g, Ci,j-1+g)
Ci0= i * g
C0j= j * g
We build C bottom up, starting from row Ci0 and col C0j
and filling in elements in dataflow order, obeying the
data dependences: Ci-1,j-1 Ci-1,j

Ci,j-1 Ci,j
What viable orders would we have?
Global Sequence Alignment
• The Needleman–Wunsch dynamic programming
algorithm performs a global alignment on two
sequences, matching the complete sequences
from beginning to end

• Three steps
– Initialization
– Scoring
• Based on the recurrence
– Trace back (Alignment)
Scoring: similarity and gap penalty

• Using a similarity matrix S and gap penalty g,
eg (Wikipedia) g = -5, S =
A      G       C      T
A      10     -1      -3     -4
G       -1     7       -5     -3
C      -3     -5      9      0
T      -4     -3      0      8
Initialization Step
• Aligning eg ATCG with TCGA; create a score
matrix with 5 rows and 5 columns, with the
1st row and column multiples of gap penalty

T     C     G     A
0     -5    -10   -15   -20
A     -5
T     -10
C     -15
G     -20
Max score

• The score of cell C[i, j] is the maximum of:

scorediag = C[i-1, j-1] + S[Xi, Yj]
scoreup = C[i-1, j] + g
scoreleft = C[i, j-1] + g
Score matrix
• C                       T         C          G         A
0          -5        -10        -15       -20
A        -5         -4        -8         -11       -5
T       -10         3         -4         -9        -10
C       -15         -2        12         7          2
G       -20         -7        7          19        14

• S       A          G         C          T
g = -5
A   10         -1        -3         -4
G   -1         7         -5         -3
C   -3         -5        9          0
T   -4         -3        0          8
Trace back
• The trace back step determines the actual
alignment(s)
– There are likely to be multiple maximal alignments
• Trace back starts from the bottom right cell
• Gives alignment in reverse order
• Considers the three possible steps back and
picks one that yields the (best) score in
current cell.
Trace back from bottom right

T    C     G     A
0          -5   -10   -15   -20
A        -5         -4   -8    -11   -5
T        -10        3    -4    -9    -10
C        -15        -2   12    7     2
G        -20        -7   7     19    14

Best Alignment:
ATCG_
_ T C GA
Local Sequence Alignment

• The Smith-Waterman dynamic programming
algorithm performs a local alignment on two
sequences, finding a maximum alignment score
for a subsequence of X and a subsequence of Y
• It is useful for dissimilar sequences with regions
of similarity
Needleman-Wunsch vs Smith-Waterman
• Same alignment structure, very similar recurrence:
Cij= max( 0, Ci-1,j-1+ S[Xi,Yj], Ci-1,j+g, Ci,j-1+g)
Ci0= C0j= 0
• Consequence: while filling the score matrix, if a score
becomes negative, put a 0 instead
highest score and work back until a cell with a score
of 0 is reached
Example GACG vs AACA
A     A       C    A
0    0     0       0     0
G     0    0     0       0     0
A     0    10   10       5    10
C     0    5     7    19      14
G     0    0     4    14      18

• S:         A    G    C    T
g=-5
A     10   -1   -3   -4
While building the
G     -1   7    -5   -3
table, keep track of
C     -3   -5   9    0
max score and its
T     -4   -3   0    8
position
Gap penalty
• The standard SW algorithm makes a difference
between the penalty of starting a gap (-ρ) and
of extending a gap (-σ).
• This is expressed using 3 recurrences
(implemented using 3 tables):
Hij= max(0,Eij,Fij,Hi-1,j-1+S[Xi,Yj])
Eij= max(Ei,j-1-σ,Hi,j-1-ρ)
Fij= max(Fi-1,j-σ,Hi-1,j-ρ)

```
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
 views: 11 posted: 7/13/2011 language: English pages: 21