Generalized Planted (l,d)-Motif Problem with Negative Set
Presented by Marcel Schulz
The generalized (l,d)-motif problem Journal Club 15.11.2005

Outline
• The planted (l,d)-motif problem
    • Formulation & limitations
• The generalized (l,d)-motif problem with negative set
    • Formulation
• Solving both problems
    • Voting algorithms
• Experimental results

Motivation
• Transcription factor binding sites, microRNA target sites
• Algorithms for the discovery of short motifs in DNA are a prominent issue in bioinformatics research [1]

The planted (l,d)-motif problem
• Introduced in 2000 by Pavel Pevzner and Sing-Hoi Sze [2]
• Given:
    • T sequences of length n
    • one d-variant of M in every sequence (a length-l string that differs from M in at most d positions)
• Find the motif M of length l
[Figure: the T sequences, each with one planted variant; example with d = 1, l = 3, x = mismatch]

The planted (l,d)-motif problem
• The neighbourhood of a motif M:
  $N(M,d) = \sum_{i=0}^{d} \binom{l}{i} 3^{i}$
• Neighbourhood size for different values of d and l:

d \ l      3      5      9      15
0          1      1      1       1
2         37    106    352     991
3         64    376   2620   13276

The planted (l,d)-motif problem
• Expected number of length-l strings in T that have at least one d-variant of M (20 sequences of length 600)
[Plot: curves for l = 9, l = 15 and l = 30; the region above 1 is marked as unsolvable]
• (9,2) is a challenging problem
• Problems (9,≥3), (15,≥6) and (30,≥14) are unsolvable

The generalized planted (l,d)-motif problem
• Introduced in 2005 by Henry C.M. Leung & Francis Y. L. Chin [3]
• True set T: every sequence contains a d-variant of the motif
• False set F: no d-variant of a string in F is the motif
[Figure: true set T and false set F side by side; example with d = 1, l = 3, x = mismatch]

The generalized planted (l,d)-motif problem
• Expected number of length-l strings that don't have any d-variant of M in F
    • low d: more information in T
    • high d: more information in F
• Expected number of length-l strings that have at least one d-variant of M in T but no d-variant of M in F
[Plots: curves for l = 9 and l = 30, labelled "in T", "not in F" and "in T but not in F", with the threshold 1 marked]

The generalized planted (l,d)-motif problem
• All generalized problems for l ≤ 20 are solvable
• We have new challenging generalized (30,13)- and (30,14)-problems

Solving Both Problems
Depending on d we use a different strategy:

small d                          large d
search in the True set           search in the False set
filter with the False set        filter with the True set

Search with Voting Algorithms
• Idea 1: the motif M is a d-variant of all its d-variants
• Example (d = 1, l = 3): M = ACG; ACT is a 1-variant of M, and ACG is a 1-variant of ACT
• We know: the motif M gets 1 vote from every sequence!
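
The neighbourhood-size formula and Idea 1 can be checked with a few lines of Python. This is a minimal sketch, assuming the DNA alphabet ACGT (which gives the factor 3 per mismatch position); the function names are illustrative and do not come from the slides or from [3].

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"  # assumption: DNA alphabet, so 3 alternatives per position


def neighbourhood_size(l: int, d: int) -> int:
    """Number of length-l strings within Hamming distance d of a motif."""
    return sum(comb(l, i) * 3**i for i in range(d + 1))


def neighbourhood(s: str, d: int) -> set:
    """Enumerate all strings that differ from s in at most d positions."""
    result = set()
    for k in range(d + 1):
        # pick the k positions that will be changed
        for positions in combinations(range(len(s)), k):
            # at each chosen position, try every symbol except the original
            choices = [[c for c in ALPHABET if c != s[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(s)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                result.add("".join(chars))
    return result


# Sizes from the table on the neighbourhood slide
print(neighbourhood_size(9, 2))    # 352
print(neighbourhood_size(15, 3))   # 13276

# Idea 1: being a d-variant is symmetric
print("ACT" in neighbourhood("ACG", 1), "ACG" in neighbourhood("ACT", 1))
```

The explicit enumeration grows with the formula above, so it is only practical for small l and d; that growth is the reason the slides switch strategies for large d.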
Search with Voting from T
T also denotes the number of sequences in the true set; T_T is its last sequence.

    C = {}    # set with candidate motifs; V[s] and R[s] start at 0
    for i = 1 to T do
        for j = 1 to n - l + 1 do
            for each length-l string s in N(T_i[j…j+l-1], d) do
                if R[s] <> i then
                    V[s] = V[s] + 1
                    R[s] = i    # sequence i has now voted for s
    for j = 1 to n - l + 1 do
        for each length-l string s in N(T_T[j…j+l-1], d) do
            if V[s] = T then insert s into C

Example: d = 0, l = 2, T_1 = ATAC, T_2 = GATA, M = AT

Voting (each entry is V[s] / R[s] after the step):

i  j  window   AT     TA     AC     GA
-  -  -        0/0    0/0    0/0    0/0
1  1  AT       1/1    0/0    0/0    0/0
1  2  TA       1/1    1/1    0/0    0/0
1  3  AC       1/1    1/1    1/1    0/0
2  1  GA       1/1    1/1    1/1    1/2
2  2  AT       2/2    1/1    1/1    1/2
2  3  TA       2/2    2/2    1/1    1/2

Collecting the candidates from T_2 (T = 2):

j  window   V[s]   C
1  GA       1      {}
2  AT       2      {AT}
3  TA       2      {AT, TA}

Filter from False set F
F also denotes the number of sequences in the false set.

    C = {AT, TA}, C* = {}    # C* collects the candidates that survive the filter
    for a = 1 to |C| do
        true = 1
        for i = 1 to F do
            for j = 1 to n - l + 1 do
                if C_a is in the neighbourhood of s = F_i[j…j+l-1] then true = 0
        if true == 1 then insert C_a into C*

Example: d = 0, l = 2, F_1 = GGTA, F_2 = CCTA, M = AT
• At a = 2, i = 1, j = 3 the window F_1[3…4] = TA equals the candidate C_2 = TA, so TA is rejected
• Result: C* = {AT}
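
The two steps above, voting from the true set and filtering with the false set, can be sketched in Python as follows. This is a minimal sketch, not the implementation from [3]: it assumes the DNA alphabet ACGT, a non-empty true set, and the toy data of the example with l = 2 (the length of the strings AT, TA and AC in the trace). All function names are illustrative.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: DNA alphabet


def neighbourhood(s, d):
    """All strings that differ from s in at most d positions."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(s)), k):
            choices = [[c for c in ALPHABET if c != s[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(s)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                result.add("".join(chars))
    return result


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def windows(seq, l):
    """All length-l substrings of seq, left to right."""
    return [seq[j:j + l] for j in range(len(seq) - l + 1)]


def vote_from_true_set(true_set, l, d):
    votes = {}       # V[s]: number of sequences that voted for s
    last_voter = {}  # R[s]: index of the last sequence that voted for s
    for i, seq in enumerate(true_set, start=1):
        for w in windows(seq, l):
            for s in neighbourhood(w, d):
                if last_voter.get(s, 0) != i:  # at most one vote per sequence
                    votes[s] = votes.get(s, 0) + 1
                    last_voter[s] = i
    # a motif needs a vote from every sequence, so scanning the last one suffices
    candidates = set()
    for w in windows(true_set[-1], l):
        for s in neighbourhood(w, d):
            if votes.get(s, 0) == len(true_set):
                candidates.add(s)
    return candidates


def filter_with_false_set(candidates, false_set, l, d):
    """Keep candidates that are not within distance d of any window in F."""
    return {c for c in candidates
            if all(hamming(c, w) > d
                   for seq in false_set for w in windows(seq, l))}


true_set = ["ATAC", "GATA"]
false_set = ["GGTA", "CCTA"]
candidates = vote_from_true_set(true_set, l=2, d=0)
survivors = filter_with_false_set(candidates, false_set, l=2, d=0)
print(sorted(candidates))  # ['AT', 'TA']
print(sorted(survivors))   # ['AT']
```

The filter compares Hamming distances directly instead of enumerating neighbourhoods, which matches the |C| n F l cost of the filtering step: one length-l comparison per candidate and window.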
Search and Filtering
Cost of each step (the sum is the size of Neighbourhood(s,d)):

Step                                      Cost
Voting from T                             $nT \sum_{i=0}^{d} \binom{l}{i} 3^{i}$
Collecting the candidates C from T_T      $n \sum_{i=0}^{d} \binom{l}{i} 3^{i}$
Filtering the candidates with F           $|C| \, n F l$

• We can solve the (9,≤2)-, (15,≤5)- and the challenging (30,≤13)-problems

Solving Both Problems
Depending on d we use a different strategy:

small d                          large d
vote from the True set           vote from the False set
filter with the False set        filter with the True set

Search with Voting from F
• Find a length-l string that has no d-variant in F

    C = {}
    for i = 1 to F do
        for j = 1 to n - l + 1 do
            for each length-l string s in N(F_i[j…j+l-1], d) do
                if R[s] <> i then
                    V[s] = V[s] + 1
                    R[s] = i

• Cost: $nF \sum_{i=0}^{d} \binom{l}{i} 3^{i}$
• Not suitable for large d!
• Reduce d and l to values which have acceptable running time

Search with Voting from F
Example: consider a generalized (4,3)-problem with motif M = ATCG
• Vote from F with a (3,2)-problem; find the prefix ATC and the suffix TCG
• Recombine the candidate motifs to the motif M = ATCG
• Filter out false candidate motifs with T

Search with Voting from F
Using reduced generalized problems we can solve:

l      9     15    30
d      ≥3    ≥6    ≥20
d'     1     4     6

by first voting from F, then recombining overlapping candidate motifs, and finally filtering with T.

Experimental results
• T: yeast promoter sequences, each containing a d-variant of the motif
• F: randomly picked yeast promoter sequences
• d = 1
• Found the binding sites for all sets within one second

[1] Dan Corlan, Medline trend
[2] Pavel A. Pevzner, Sing-Hoi Sze, Combinatorial Approaches to Finding Subtle Signals in DNA Sequences, International Conference on Intelligent Systems for Molecular Biology 8 (2000) 269-278
[3] Henry C.M. Leung, Francis Y. L. Chin, Generalized planted (l,d)-motif problem with negative set, WABI 2005
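
Supplementary sketch for the "Search with Voting from F" example: the recombination of overlapping short candidates (prefix ATC, suffix TCG) followed by the filter with T could look like the Python below. This is only an illustration under stated assumptions. The slides do not say how the overlap is chosen, so `overlap` is a hypothetical parameter (at least 1), the two true-set sequences are made-up toy data, and d = 1 is used in the filter purely to keep the demonstration small.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def recombine(candidates, overlap):
    """Join a and b whenever the last `overlap` symbols of a equal the
    first `overlap` symbols of b (overlap >= 1 is assumed)."""
    return {a + b[overlap:]
            for a in candidates for b in candidates
            if a[-overlap:] == b[:overlap]}


def has_variant(candidate, seq, d):
    """True if some window of seq is within Hamming distance d of candidate."""
    l = len(candidate)
    return any(hamming(candidate, seq[j:j + l]) <= d
               for j in range(len(seq) - l + 1))


def filter_with_true_set(candidates, true_set, d):
    """Keep candidates that have a d-variant in every true-set sequence."""
    return {c for c in candidates
            if all(has_variant(c, seq, d) for seq in true_set)}


short_candidates = {"ATC", "TCG"}             # prefix and suffix from the slide
joined = recombine(short_candidates, overlap=2)
print(joined)                                  # {'ATCG'}

true_set = ["TTATCGTT", "GGATGGCC"]            # hypothetical toy sequences
kept = filter_with_true_set(joined | {"CCCC"}, true_set, d=1)
print(kept)                                    # {'ATCG'}
```

The extra candidate CCCC stands in for a false candidate that survives voting from F; it has no 1-variant in either toy sequence, so the filter with T removes it.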