Document Sample

CS575dl Smith-Waterman Sequence Alignment in CUDA Doug Hains, Wim Bohm CS, CSU Sequence Alignment • When you Google "ocurrance" Google says: did you mean occurrence? using a measure of similarity of strings based on gaps and mismatches: o-currance occurrence 1 gap, 1 mismatch Bio Informatics • Genome: string of DNA {A,C,G,T} characters • Amino acid: string of proteins (up to 20 characters) • Important question: How "nearly similar" is string X compared to string Y? • 1970-s Needleman and Wunsch (NW) defined "similarity" and a Dynamic Programming solution • 1980-s Smith and Waterman (SW) slightly changed similarity definition and hence algorithm • There is a CUDA implementation CUDASW++ of SW that we improved Formulation • X: x1x2...xm Y:y1y2...yn • A matching is a set of ordered pairs (i,j), i from {1,...,m} and j from {1,...,n}, so that each i and j occur in at most one pair: stop vs tops (2,1) (3,2) (4,3) • An alignment is a matching without crossings: if (i,j) and (i',j') are in the alignment then i<i' implies j<j' j-s i-s Formulation • X: x1x2...xm Y:y1y2...yn • A matching is a set of ordered pairs (i,j), i from {1,...,m} and j from {1,...,n}, so that each i and j occur in at most one pair: stop vs tops (2,1) (3,2) (4,3) • An alignment is a matching without crossings: if (i,j) and (i',j') are in the alignment then i<i' implies j<j' j' < j j-s i-s i < i' Optimal Sequence Alignment Let M be an alignment between X and Y, then 1) There is a parameter g, defining the gap penalty, for each position in X or Y not matched. 2) There is a match/mismatch scoring matrix Spq for lining up p with q. 3) Matches, mismatches and gaps infer an alignment score of M. Match scores are positive, mismatches and gaps are negative. 4) We want to maximize the score. example • Two alignments for aabc and abb aabc -abb g + Saa + Sbb + Scb aabc abb- Saa + Sab + Sbb + g best alignment determined by the values of S and g Structure Either (Xm,Yn), the last chars of X and Y, are in the optimal alignment, or not. If not, Xm and Yn cannot both be in another alignment, because that would cause a crossing. Eg g in X and c in Y cannot both match with t in a a t g a a t c Structure • Therefore, in an optimal alignment M, 1 of the following holds (we don't align two gaps): (m,n) in M aligned, either matched or mismatched m not in M (mth pos of X not matched) n not in M (nth pos of Y not matched) • This gives rise to a recurrence Cij= max( Ci-1,j-1+SXi,Yj, Ci-1,j+g, Ci,j-1+g) Ci0= i* gap penalty C0j= j* gap penalty Dynamic programming solution Cij= max( Ci-1,j-1+ SXi,Yj, Ci-1,j+g, Ci,j-1+g) Ci0= i * g C0j= j * g We build C bottom up, starting from row Ci0 and col C0j and filling in elements in dataflow order, obeying the data dependences: Ci-1,j-1 Ci-1,j Ci,j-1 Ci,j What viable orders would we have? Global Sequence Alignment • The Needleman–Wunsch dynamic programming algorithm performs a global alignment on two sequences, matching the complete sequences from beginning to end • Three steps – Initialization – Scoring • Based on the recurrence – Trace back (Alignment) Scoring: similarity and gap penalty • Using a similarity matrix S and gap penalty g, eg (Wikipedia) g = -5, S = A G C T A 10 -1 -3 -4 G -1 7 -5 -3 C -3 -5 9 0 T -4 -3 0 8 Initialization Step • Aligning eg ATCG with TCGA; create a score matrix with 5 rows and 5 columns, with the 1st row and column multiples of gap penalty T C G A 0 -5 -10 -15 -20 A -5 T -10 C -15 G -20 Max score • The score of cell C[i, j] is the maximum of: scorediag = C[i-1, j-1] + S[Xi, Yj] scoreup = C[i-1, j] + g scoreleft = C[i, j-1] + g Score matrix • C T C G A 0 -5 -10 -15 -20 A -5 -4 -8 -11 -5 T -10 3 -4 -9 -10 C -15 -2 12 7 2 G -20 -7 7 19 14 • S A G C T g = -5 A 10 -1 -3 -4 G -1 7 -5 -3 C -3 -5 9 0 T -4 -3 0 8 Trace back • The trace back step determines the actual alignment(s) – There are likely to be multiple maximal alignments • Trace back starts from the bottom right cell • Gives alignment in reverse order • Considers the three possible steps back and picks one that yields the (best) score in current cell. Trace back from bottom right T C G A 0 -5 -10 -15 -20 A -5 -4 -8 -11 -5 T -10 3 -4 -9 -10 C -15 -2 12 7 2 G -20 -7 7 19 14 Best Alignment: ATCG_ _ T C GA Local Sequence Alignment • The Smith-Waterman dynamic programming algorithm performs a local alignment on two sequences, finding a maximum alignment score for a subsequence of X and a subsequence of Y • It is useful for dissimilar sequences with regions of similarity Needleman-Wunsch vs Smith-Waterman • Same alignment structure, very similar recurrence: Cij= max( 0, Ci-1,j-1+ S[Xi,Yj], Ci-1,j+g, Ci,j-1+g) Ci0= C0j= 0 • Consequence: while filling the score matrix, if a score becomes negative, put a 0 instead • In the traceback, start with the cell that has the highest score and work back until a cell with a score of 0 is reached Example GACG vs AACA A A C A 0 0 0 0 0 G 0 0 0 0 0 A 0 10 10 5 10 C 0 5 7 19 14 G 0 0 4 14 18 • S: A G C T g=-5 A 10 -1 -3 -4 While building the G -1 7 -5 -3 table, keep track of C -3 -5 9 0 max score and its T -4 -3 0 8 position Gap penalty • The standard SW algorithm makes a difference between the penalty of starting a gap (-ρ) and of extending a gap (-σ). • This is expressed using 3 recurrences (implemented using 3 tables): Hij= max(0,Eij,Fij,Hi-1,j-1+S[Xi,Yj]) Eij= max(Ei,j-1-σ,Hi,j-1-ρ) Fij= max(Fi-1,j-σ,Hi-1,j-ρ)

DOCUMENT INFO

Shared By:

Categories:

Tags:

Stats:

views: | 11 |

posted: | 7/13/2011 |

language: | English |

pages: | 21 |

OTHER DOCS BY hcj

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.