Docstoc

CS420 lecture one Problems_ algorithms

Document Sample
CS420 lecture one Problems_ algorithms Powered By Docstoc
					         CS575dl
     Smith-Waterman
Sequence Alignment in CUDA
     Doug Hains, Wim Bohm
            CS, CSU
          Sequence Alignment
• When you Google "ocurrance" Google says:
  did you mean occurrence?
  using a measure of similarity of strings based
  on gaps and mismatches:
   o-currance
   occurrence
  1 gap, 1 mismatch
                Bio Informatics
• Genome: string of DNA {A,C,G,T} characters
• Amino acid: string of proteins (up to 20 characters)
• Important question: How "nearly similar" is string X
  compared to string Y?
• 1970-s Needleman and Wunsch (NW) defined
  "similarity" and a Dynamic Programming solution
• 1980-s Smith and Waterman (SW) slightly changed
  similarity definition and hence algorithm
• There is a CUDA implementation CUDASW++ of SW
  that we improved
                   Formulation
• X: x1x2...xm      Y:y1y2...yn
• A matching is a set of ordered pairs (i,j), i from
  {1,...,m} and j from {1,...,n}, so that each i and j occur
  in at most one pair: stop vs tops (2,1) (3,2) (4,3)
• An alignment is a matching without crossings: if (i,j)
  and (i',j') are in the alignment then i<i' implies j<j'

    j-s
    i-s
                   Formulation
• X: x1x2...xm      Y:y1y2...yn
• A matching is a set of ordered pairs (i,j), i from
  {1,...,m} and j from {1,...,n}, so that each i and j occur
  in at most one pair: stop vs tops (2,1) (3,2) (4,3)
• An alignment is a matching without crossings: if (i,j)
  and (i',j') are in the alignment then i<i' implies j<j'

                              j' < j
    j-s
    i-s
                            i < i'
    Optimal Sequence Alignment
Let M be an alignment between X and Y, then
1) There is a parameter g, defining the gap penalty,
   for each position in X or Y not matched.
2) There is a match/mismatch scoring matrix Spq for
   lining up p with q.
3) Matches, mismatches and gaps infer an alignment
   score of M. Match scores are positive, mismatches
   and gaps are negative.
4) We want to maximize the score.
                    example
• Two alignments for aabc and abb
         aabc
    -abb g + Saa + Sbb + Scb

    aabc
    abb-     Saa + Sab + Sbb + g

best alignment determined by the values of S and g
                   Structure
Either (Xm,Yn), the last chars of X and Y, are in the
optimal alignment, or not. If not, Xm and Yn
cannot both be in another alignment, because
that would cause a crossing. Eg g in X and c in Y
cannot both match with t in

                    a a     t   g
                    a a     t   c
                            Structure
• Therefore, in an optimal alignment M, 1 of the
  following holds (we don't align two gaps):
   (m,n) in M aligned, either matched or mismatched
    m not in M (mth pos of X not matched)
    n not in M (nth pos of Y not matched)
• This gives rise to a recurrence
    Cij= max( Ci-1,j-1+SXi,Yj, Ci-1,j+g, Ci,j-1+g)
    Ci0= i* gap penalty
    C0j= j* gap penalty
   Dynamic programming solution
     Cij= max( Ci-1,j-1+ SXi,Yj, Ci-1,j+g, Ci,j-1+g)
     Ci0= i * g
     C0j= j * g
We build C bottom up, starting from row Ci0 and col C0j
and filling in elements in dataflow order, obeying the
data dependences: Ci-1,j-1 Ci-1,j

                    Ci,j-1 Ci,j
What viable orders would we have?
     Global Sequence Alignment
• The Needleman–Wunsch dynamic programming
  algorithm performs a global alignment on two
  sequences, matching the complete sequences
  from beginning to end

• Three steps
  – Initialization
  – Scoring
     • Based on the recurrence
  – Trace back (Alignment)
Scoring: similarity and gap penalty

• Using a similarity matrix S and gap penalty g,
  eg (Wikipedia) g = -5, S =
               A      G       C      T
        A      10     -1      -3     -4
       G       -1     7       -5     -3
        C      -3     -5      9      0
        T      -4     -3      0      8
            Initialization Step
• Aligning eg ATCG with TCGA; create a score
  matrix with 5 rows and 5 columns, with the
  1st row and column multiples of gap penalty

                  T     C     G     A
            0     -5    -10   -15   -20
      A     -5
      T     -10
      C     -15
      G     -20
                  Max score

• The score of cell C[i, j] is the maximum of:

   scorediag = C[i-1, j-1] + S[Xi, Yj]
   scoreup = C[i-1, j] + g
   scoreleft = C[i, j-1] + g
                    Score matrix
• C                       T         C          G         A
               0          -5        -10        -15       -20
      A        -5         -4        -8         -11       -5
      T       -10         3         -4         -9        -10
      C       -15         -2        12         7          2
      G       -20         -7        7          19        14

• S       A          G         C          T
                                                     g = -5
      A   10         -1        -3         -4
      G   -1         7         -5         -3
      C   -3         -5        9          0
      T   -4         -3        0          8
                  Trace back
• The trace back step determines the actual
  alignment(s)
  – There are likely to be multiple maximal alignments
• Trace back starts from the bottom right cell
• Gives alignment in reverse order
• Considers the three possible steps back and
  picks one that yields the (best) score in
  current cell.
      Trace back from bottom right

                             T    C     G     A
                  0          -5   -10   -15   -20
         A        -5         -4   -8    -11   -5
         T        -10        3    -4    -9    -10
         C        -15        -2   12    7     2
         G        -20        -7   7     19    14

Best Alignment:
                       ATCG_
                       _ T C GA
       Local Sequence Alignment

• The Smith-Waterman dynamic programming
  algorithm performs a local alignment on two
  sequences, finding a maximum alignment score
  for a subsequence of X and a subsequence of Y
• It is useful for dissimilar sequences with regions
  of similarity
Needleman-Wunsch vs Smith-Waterman
• Same alignment structure, very similar recurrence:
     Cij= max( 0, Ci-1,j-1+ S[Xi,Yj], Ci-1,j+g, Ci,j-1+g)
     Ci0= C0j= 0
• Consequence: while filling the score matrix, if a score
  becomes negative, put a 0 instead
• In the traceback, start with the cell that has the
  highest score and work back until a cell with a score
  of 0 is reached
           Example GACG vs AACA
                  A     A       C    A
             0    0     0       0     0
       G     0    0     0       0     0
       A     0    10   10       5    10
       C     0    5     7    19      14
       G     0    0     4    14      18


• S:         A    G    C    T
                                    g=-5
       A     10   -1   -3   -4
                                    While building the
       G     -1   7    -5   -3
                                    table, keep track of
       C     -3   -5   9    0
                                    max score and its
       T     -4   -3   0    8
                                    position
                Gap penalty
• The standard SW algorithm makes a difference
  between the penalty of starting a gap (-ρ) and
  of extending a gap (-σ).
• This is expressed using 3 recurrences
  (implemented using 3 tables):
       Hij= max(0,Eij,Fij,Hi-1,j-1+S[Xi,Yj])
       Eij= max(Ei,j-1-σ,Hi,j-1-ρ)
       Fij= max(Fi-1,j-σ,Hi-1,j-ρ)

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:11
posted:7/13/2011
language:English
pages:21