Database Filtering
March 2006 Vineet Bafna

Project/Exam deadlines
• May 2
  – Send email to me with a title of your project
• May 9
  – Each student/group gives a 10 min. presentation on their proposed project.
  – Show preliminary computations. What is the test plan? What is the data like, and how much is there?
• Last week of classes:
  – A 20 min. presentation from each group
  – A written report on the project
  – A take-home exam, due electronically on the date of the final exam

Building better filters
• Better filters for ncRNA are an open and relatively unresearched problem.
• In contrast, filters for sequence searches have been extensively researched
  – Some non-intuitive ideas.
• We will digress into sequence-based filters to see if some of the principles can be exported to other domains.

Large Database Search
• Given a query of length m
  – Identify all sub-sequences in a database that align with a high score.
  – Imagine the database to be a single long string of length n
• The straightforward algorithm would employ a scan of the database. How much time would it take?
[Figure: query scanned against the sequence database]

D.P. computation
[Figure: dynamic programming matrix indexed by i and j]
• The entire computation is one large local alignment.
• S[i,j]: score of the best local alignment of prefix 1..i of the database against prefix 1..j of the query.

Large database search
[Figure: Database (n) against Query (m)]
• Database size n = 10M, query size m = 300.
• O(nm) = $3 \times 10^9$ computations

Filtering
• The goal of filtering is to reduce the search space to o(nm) using a fast filter
• How can we filter?

Observations
• Much of the database is random from the query's perspective
• Consider a random DNA string of length n.
  – Pr[A] = Pr[C] = Pr[G] = Pr[T] = 0.25
• Assume for the moment that the query is all A's (length k).
• What is the probability that an exact match to the query can be found?
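
The question on the Observations slide can be answered exactly for the all-A query. The sketch below is an illustration and not part of the original slides: the function name `prob_run_of_as` and its parameters are hypothetical, and it assumes independent letters with Pr[A] = 0.25. It tracks the length of the current run of A's and accumulates the probability that a run of length k has appeared somewhere in a random string of length n.

```python
def prob_run_of_as(n: int, k: int, p_a: float = 0.25) -> float:
    """Probability that a random string of length n contains k consecutive A's.

    Assumes each letter is drawn independently with Pr[A] = p_a.
    """
    if k <= 0:
        return 1.0
    # state[r] = probability that no run of k A's has been seen so far and
    # the string read so far ends in exactly r consecutive A's (0 <= r < k).
    state = [0.0] * k
    state[0] = 1.0
    hit = 0.0  # probability that a run of k A's has already occurred
    for _ in range(n):
        nxt = [0.0] * k
        for r, pr in enumerate(state):
            if pr == 0.0:
                continue
            nxt[0] += pr * (1.0 - p_a)   # a non-A letter resets the run
            if r + 1 == k:
                hit += pr * p_a          # the run reaches length k
            else:
                nxt[r + 1] += pr * p_a   # the run grows by one
        state = nxt
    return hit


if __name__ == "__main__":
    # Exact match probability for a query of 10 A's in a database of 100,000 letters.
    print(prob_run_of_as(100_000, 10))
```

The loop costs O(nk) time, so it is meant for checking small cases by hand, not for database-sized n.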

Basic probability
• Probability that there is a match starting at a fixed position i = $0.25^k$
• What is the probability that some position i has a match?
• Dependencies confound probability estimates.
• Related question: What is the expected number of hits?

Basic Probability: Expectation
• Q: Toss a coin: each time it comes up heads (probability p), you get a dollar
  – What is the money you expect to get after n tosses?
  – Let $X_i$ be the amount earned in the i-th toss:
    $E(X_i) = 1 \cdot p + 0 \cdot (1-p) = p$
  – Total money you expect to earn:
    $E\left(\sum_i X_i\right) = \sum_i E(X_i) = np$

Expected number of matches
• Let $X_i = 1$ if there is a match starting at position i, $X_i = 0$ otherwise
• $\Pr(\text{match at position } i) = p_i = 0.25^k$
• $E(X_i) = p_i = 0.25^k$
• Expected number of matches = $E\left(\sum_i X_i\right) = \sum_i E(X_i) = n \left(\tfrac{1}{4}\right)^k$

Expected number of exact matches is small!
• Expected number of matches = $n \cdot 0.25^k$
  – If $n = 10^7$, k = 10: expected number of matches = 9.537
  – If $n = 10^7$, k = 11: expected number of hits = 2.38
  – If $n = 10^7$, k = 12: expected number of hits ≈ 0.6 < 1
• Bottom line: An exact match to a substring of the query is unlikely just by chance.

Blast filter
• Take all m-k words of length k from the query.
• Filter: Consider only those sequences that match at least one of these words.
• Expected number of matches in a random database? $= (m-k)(n-k)\left(\tfrac{1}{4}\right)^k$
• Efficiency (the chance that a given query word matches at a given database position) = $\left(\tfrac{1}{4}\right)^k$
• A small increase in k shrinks this fraction considerably: each extra letter cuts the expected number of random hits by a factor of 4.
• What can we say about accuracy?

Observation 2: Pigeonhole principle
• Suppose we are looking for a database string with greater than 90% identity to the query (length 100)
• Partition the query into size 10 substrings.
• At least one must match the database string exactly: there are fewer than 10 mismatches, so they cannot fall in all 10 substrings.

Why is this important?
• Suppose we are looking for sequences that are 80% identical to the query sequence of length 100.
• Assume that the mismatches are randomly distributed.
• What is the probability that there is no stretch of 10 bp where the query and the subject match exactly?
    $\left(1 - \left(\tfrac{8}{10}\right)^{10}\right)^{90} \approx 0.000036$
• Rough calculations show that it is very low. Equivalently, the probability of an exact match of a short query substring to a truly similar subject is very high.
  – The above equation does not take dependencies into account
  – Reality is better because the matches are not randomly distributed

Combining the Facts
• Consider the set of all substrings of the query string of fixed length W.
  – Prob. of exact match to a random database string is very low.
  – Prob. of exact match to a true homolog is very high.
  – This filter is efficient and accurate. What about speed?
  – Keyword search (exact matches) is MUCH faster than sequence alignment

BLAST
[Figure: Database (n)]
• Consider all (m-W) query words of size W (default = 11)
• Scan the database for exact matches to all such words
• For all regions that hit, extend using a dynamic programming alignment.
• Can be many orders of magnitude faster than SW over the entire string

Why is BLAST fast?
• Assume that keyword searching does not consume any time and that alignment computation is the expensive step.
• Query m = 1000, random Db $n = 10^7$, no true positives (TP)
• SW = O(nm) = $1000 \times 10^7 = 10^{10}$ computations
• BLAST, W = 11
• E(#11-mer hits) = $1000 \times \left(\tfrac{1}{4}\right)^{11} \times 10^7 \approx 2384$
• Number of computations = $2384 \times 100 \times 50 = 1.192 \times 10^7$
• Ratio = $10^{10} / (1.192 \times 10^7) \approx 839$
• Further speed improvements are possible
[Figure: extension region around a hit, with both dimensions labeled 50]

Filter Speed: Keyword Matching
• How fast can we match keywords?
• Hash table/Db index? What is the size of the hash table, for words of length 11?
[Figure: example index entry, AATCA → 567]
• Suffix trees? What is the size of the suffix trees?
• Trie-based search. We will do this in class.

Dictionary Matching
  Dictionary: 1: POTATO, 2: POTASSIUM, 3: TASTE
  Database: POTASTPOTATO
• Q: Given k words ($s_i$ has length $l_i$), and a database of size n, find all matches to these words in the database string.
• How fast can this be done?

Dict. Matching & string matching
• How fast can you do it, if you only had one word of length m?
  – Trivial algorithm: O(nm) time
  – Pre-processing O(m), search O(n) time.
• Dictionary matching
  – Trivial algorithm: $(l_1 + l_2 + l_3 + \dots)\, n$
  – Using a keyword tree: $l_p n$ ($l_p$ is the length of the longest pattern)
  – Aho-Corasick: O(n) after preprocessing $O(l_1 + l_2 + \dots)$
• We will consider the most general case

Direct Algorithm
[Figure: dictionary patterns tried at successive start positions of the database string POPOPOTASTPOTATO]
Observations:
• When we mismatch, we (should) know something about where the next match will be.
• When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

The Trie Automaton
• Construct an automaton A from the dictionary
  – A[v,x] describes the transition from node v to a node w upon reading x.
  – A[u,'T'] = v, and A[u,'S'] = w
  – Special root node r
  – Some nodes are terminal, and labeled with the index of the dictionary word.
[Figure: trie for 1:POTATO, 2:POTASSIUM, 3:TASTE, with root r, nodes u, v, w, and terminal nodes labeled 1, 2, 3]

An $O(l_p n)$ algorithm for keyword matching
• Start with the first position in the db, and the root node.
• If successful transition
  – Increment 'current' pointer
  – Move to a new node
  – If terminal node, "success"
• Else
  – Retract 'current' pointer
  – Increment 'start' pointer
  – Move to root & repeat

Illustration
[Figure: pointers l (start) and c (current) on the database string POTASTPOTATO, with the current trie node v]

Idea for improving the time
• Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently. If some other pattern j is to match
  – Then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i
[Figure: POTASTPOTATO with pointers l and c; pattern i = POTASSIUM, pattern j = TASTE; dictionary 1:POTATO, 2:POTASSIUM, 3:TASTE]

Improving speed of dictionary matching
• Every node v corresponds to a string $s_v$ that is a prefix of some pattern.
• Define F[v] to be the node u such that $s_u$ is the longest proper suffix of $s_v$ that is also the string of some node
• If we fail to match at v, we should jump to F[v], and commence matching from there
• Let lp[v] = $|s_u|$
[Figure: trie for POTATO, POTASSIUM, TASTE with nodes numbered 1 to 11]

An O(n) alg. for keyword matching
• Start with the first position in the db, and the root node.
• If successful transition
  – Increment 'current' pointer
  – Move to a new node
  – If terminal node, "success"
• Else (if at root)
  – Increment 'current' pointer
  – Move 'start' pointer
  – Move to root
• Else
  – Move 'start' pointer forward
  – Move to failure node

Illustration
[Figure: pointers l and c on POTASTPOTATO, with the current trie node v]

Time analysis
• In each step, either c is incremented, or l is incremented
• Neither pointer is ever decremented (lp[v] < c-l).
• l and c do not exceed n
• Total time <= 2n
[Figure: pointers l and c on POTASTPOTATO]

Blast: Putting it all together
• Input: Query of length m, database of size n
• Select word-size, scoring matrix, gap penalties, E-value cutoff

Blast Steps
1. Generate an automaton of all query keywords.
2. Scan database using a "Dictionary Matching" algorithm (O(n) time). Identify all hits.
3. Extend each hit using a variant of the "local alignment" algorithm. Use the scoring matrix and gap penalties.
4. For each alignment with score S, compute the bit-score, E-value, and the P-value. Sort according to increasing E-value until the cut-off is reached.
5. Output results.

Can we improve the filter?
• For a query word of size M,
  – Consider a binary string Q of length M with W <= M ones.
  – Q 'matches' a substring as long as the 'ones' match
  – W = weight of the spaced seed
[Figure: seed 11010011010 (M = 11, W = 6) placed between two aligned sequences]

Can spaced seeds help?
• The 'spaced seed' for BLAST has W consecutive 1s.
• Efficiency?
  – Blast Expected(hits) = $n p^W$
  – For any (M,W), expected hits ≈ $n p^W$
• Accuracy?
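
To make the matching rule for a spaced seed concrete, here is a minimal sketch that is not taken from the slides. The function names `seed_key` and `spaced_seed_hits` are hypothetical, and a plain hash index stands in for whatever lookup structure a real tool would use. Each query window of length M is reduced to the letters under the seed's 1 positions; a database window is a hit when its reduced key equals that of some query window.

```python
from collections import defaultdict


def seed_key(text: str, start: int, ones: list[int]) -> str:
    """Letters of text under the seed's 1 positions, for a window at start."""
    return "".join(text[start + offset] for offset in ones)


def spaced_seed_hits(query: str, db: str, seed: str) -> list[tuple[int, int]]:
    """Return (db_position, query_position) pairs where the seed matches.

    seed is a string of '0'/'1' characters; only the '1' positions must agree.
    """
    m = len(seed)
    ones = [i for i, c in enumerate(seed) if c == "1"]

    # Index every query window by its reduced key.
    index = defaultdict(list)
    for j in range(len(query) - m + 1):
        index[seed_key(query, j, ones)].append(j)

    # Scan the database once and look up each window's key.
    hits = []
    for i in range(len(db) - m + 1):
        for j in index.get(seed_key(db, i, ones), ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Seed 101 ignores the middle letter, so AGG and ACG both match query ACG.
    print(spaced_seed_hits("ACG", "TTAGGACG", "101"))
```

With the all-ones seed `"1" * W` this reduces to the exact word matching used by the BLAST filter, which is why both kinds of seed have the same expected number of random hits for a given weight W.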

Accuracy
• Consider a 64bp sequence that is 70% similar to the query.
• Pr(an 11-mer matches) = 0.3
• Pr(a spaced seed 11101001.. matches) = 0.466
• This non-intuitive result leads to selection of spaced words that are an order of magnitude faster for identical specificity and sensitivity
• Implemented in PATTERNHUNTER

How to compute a spaced seed
• No good algorithm is known.
• Iterate over all (M choose W) seeds.
  – Use a computation to decide Pr(match)
  – Choose the seed that maximizes probability.

Prob. Computation for Spaced Seeds
• Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
  – Q is a binary string of length M, with W 1s
• We can assume that there is a probability p of a match at each position.
• We try to match Q against the binary 'match string' S, a random binary string of length L with probability p of a 1 at each position.
[Figure: example match string 1110111011101111110111100 over positions 1..L, with the seed Q of length M placed along it]
• $P_Q$ = Prob.(Q matches random S at some location)
• How can we compute $P_Q$?

Computing f(i,b)
• For a specific binary string b, define
• f(i,b) = Prob.(Q matches a random string S of length i, s.t. S ends in b)
[Figure: string S over positions 1..i, ending in b]

Why is it sufficient to compute f(i,b)
• $P_Q = f(L, \epsilon)$, where $\epsilon$ is the empty string
• For each b we have two possibilities:
  – $b \in B_1$: b is consistent with a suffix of Q.
  – $b \in B_0 = B - B_1$
[Figure: b = ...110001 aligned with the end of Q = 110001]

Computing f(i,b)
• Case $b \in B_0$: Q cannot match at the end of S, so drop the last bit
  – f(i,b) = f(i-1, b>>1)
• Case $b \in B_1$ and |b| = M
  – f(i,b) = 1
• Case $b \in B_1$ and |b| < M: condition on the bit that precedes b (the length i is unchanged)
  – f(i,b) = p f(i,1b) + (1-p) f(i,0b)
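
The three cases for f(i,b) translate almost directly into a memoized recursion. The sketch below is one possible reading of the slides, not the author's implementation: the names `hit_probability`, `compatible`, and `f` are hypothetical, b is stored as a Python string of '0'/'1' characters, b>>1 is taken to mean "drop the last bit", and the base case f(i,b) = 0 for i < M is an added assumption (a string shorter than the seed cannot contain a hit).

```python
from functools import lru_cache


def hit_probability(seed: str, length: int, p: float) -> float:
    """Probability that seed hits a random 0/1 match string of given length.

    seed is a string of '0'/'1'; each position of the match string is 1
    independently with probability p.
    """
    m = len(seed)

    def compatible(b: str) -> bool:
        # Align b with the last len(b) positions of the seed: every '1' in
        # the seed must sit over a '1' (a match) in b.
        tail = seed[m - len(b):]
        return all(s == "0" or c == "1" for s, c in zip(tail, b))

    @lru_cache(maxsize=None)
    def f(i: int, b: str) -> float:
        if i < m:
            return 0.0                      # too short to contain a hit
        if not compatible(b):
            return f(i - 1, b[:-1])         # no hit can end at position i
        if len(b) == m:
            return 1.0                      # the seed matches at the end
        # Condition on the bit just before b; the total length stays i.
        return p * f(i, "1" + b) + (1.0 - p) * f(i, "0" + b)

    return f(length, "")


if __name__ == "__main__":
    # Setting of the Accuracy slide: 64 positions, 70% similarity, 11 consecutive 1s.
    print(hit_probability("1" * 11, 64, 0.7))
```

Running it on the setting of the Accuracy slide gives a value that can be compared with the figure quoted there, and swapping in a spaced seed of the same weight shows how the hit probability changes with the seed's shape.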
