
Significance in protein analysis

Swapan 'Shop' Mallick
Bioinformatics Group, Institute of Biotechnology, University of Helsinki

Overview
• The need for statistics
• Example: BLOSUM
• What do the scores mean?
• How can you compare two scores?
• Example: BLAST
• Problems with BLAST
• Review of Distributions
• Distribution of random BLAST results
• P-values and e-values
• Statistics of BLAST
• Summary and Conclusion
• Exercise

The need for statistics
• Statistics is very important for bioinformatics.
  – It is very easy to have a computer analyze the data and give you back a result.
  – The problem is to decide whether the answer the computer gives you is any good at all.
• Questions:
  – How statistically significant is the answer?
  – What is the probability that this answer could have been obtained at random? What does this depend on?

Basics
[Figure: Population (N) and Sample (n, X, S), labelled "Descriptive statistics" and "Probability".]

Example: BLOSUM
The BLOSUM matrix assigns a probability score to each residue pair in an alignment, based on the frequency with which that pairing is known to occur within conserved blocks of related proteins.
• Simple, since size of population = size of sample.
• BLOSUM matrices are constructed from observations, which lead to observed probabilities.

BLOSUM substitution matrices
BLOSUM matrices are used in 'log-odds' form, based on actually observed substitutions.
The log-odds form is used because of:
• Ease of use: scores can simply be added (the raw probabilities would have to be multiplied).
• Ease of interpretation:
  – S = 0: the substitution is just as likely to occur as at random.
  – S < 0: the substitution is observed less often than expected at random.
  – S > 0: the substitution is observed more often than expected at random.

Substitution matrices
Score of amino acid a with amino acid b:

$$S(a,b) = \frac{1}{\lambda}\,\ln\frac{p_{ab}}{f_a f_b}$$

• $p_{ab}$ is the observed frequency with which residues a and b are correlated because of homology.
• $f_a f_b$ is the expected frequency of seeing residues a and b paired together, which is just the frequency of residue a multiplied by the frequency of residue b.
• $\lambda$ is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers.

Source: Where did the BLOSUM62 alignment score matrix come from? Eddy S., Nat. Biotech. 22 Aug 2004

Substitution matrices
Solving for the observed/expected (O/E) ratio, with the same $p_{ab}$, $f_a f_b$ and $\lambda$:

$$\frac{p_{ab}}{f_a f_b} = e^{\lambda S}$$

i) S = 0: O/E ratio = 1.
ii) Compare S = 5 and S = 10. The ratio is based on the exponential function: S = 5 gives O/E ≈ 5.7, S = 10 gives O/E ≈ 32.1.
iii) S = -10: O/E ratio = 0.031 ≈ 1/32.
iv) Ratio of scores S1, S2 in terms of probabilities of observed/random: $e^{\lambda S_1} / e^{\lambda S_2} = e^{\lambda (S_1 - S_2)}$.
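The O/E ratios quoted on the substitution-matrix slides can be reproduced from the log-odds formula. The following is a minimal Python sketch, assuming natural logarithms and the λ = 0.347 given in the slides. The function names and the example frequencies passed to `log_odds_score` are made up for illustration and are not BLOSUM data.

```python
import math

LAMBDA = 0.347  # scaling factor quoted in the slides


def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio p_ab / (f_a * f_b) implied by a log-odds score."""
    return math.exp(lam * score)


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Unrounded log-odds score S = (1/lambda) * ln(p_ab / (f_a * f_b))."""
    return math.log(p_ab / (f_a * f_b)) / lam


# O/E ratios for the scores discussed on the slide
for s in (0, 5, 10, -10):
    print(f"S = {s:>3}: O/E = {odds_ratio(s):.3f}")

# Adding scores corresponds to multiplying odds ratios
print(math.isclose(odds_ratio(5) * odds_ratio(5), odds_ratio(10)))

# Hypothetical frequencies: a pair seen twice as often as chance predicts
print(round(log_odds_score(p_ab=0.02, f_a=0.1, f_b=0.1), 2))
```

Because the ratio is exponential in the score, a difference of 5 score units always changes the O/E ratio by the same factor, whatever the starting score.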
Example: BLAST

Motivations
• Exact algorithms are exhaustive but computationally expensive.
• Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), so database scanning requires a heuristic alignment algorithm (at the cost of optimality).

Interpret BLAST results - Description
• ID (GI #, refseq #, DB-specific ID #): click to access the record in GenBank.
• Gene/sequence Definition.
• Bit score: higher is better. Click to access the pairwise alignment.
• Expect value: lower is better. It indicates how likely it is that this is a random hit: the number of hits scoring at least this well that would be expected by chance.
• Links.

Problems with BLAST
• Why do results change?
• How can you compare results from different BLAST tools, which may report different types of values?
• How are results (e.g. E-value) affected by the query?
• There are many values reported in the output. What do they mean?

Example: Importance of BLAST statistics
But first, a review.

Review
What is a distribution?
  A plot showing the frequency of a given variable or observation.

Features of a Normal Distribution
• Symmetric distribution
• μ = mean: has an average or mean value at the centre
• Has a characteristic width called the standard deviation (S.D. = σ)
• Most common type of distribution known

Standard Deviations (Z-score)
Range         Fraction within    Tail           Fraction above
± 1.0 S.D.    0.683              > +1.0 S.D.    0.159
± 2.0 S.D.    0.954              > +2.0 S.D.    0.023
± 3.0 S.D.    0.9973             > +3.0 S.D.    0.0014
± 4.0 S.D.    0.99994            > +4.0 S.D.    0.00003
± 5.0 S.D.    0.9999994          > +5.0 S.D.    0.0000003

Mean, Median & Mode
[Figure: distribution curve with the Mode, Median and Mean marked.]

Mean, Median, Mode
• In a Normal Distribution the mean, mode and median are all equal.
• In skewed distributions they are unequal.
• Mean: average value, affected by extreme values in the distribution.
• Median: the "middlemost" value, usually lying between the mode and the mean.
• Mode: most common value.

Different Distributions
[Figure: a unimodal and a bimodal distribution.]

Other Distributions
• Binomial Distribution
• Poisson Distribution
• Extreme Value Distribution

Binomial Distribution
The probabilities P(x) are the terms of the expansion of $(p + q)^n$. The coefficients form Pascal's triangle:

          1
         1 1
        1 2 1
       1 3 3 1
      1 4 6 4 1
    1 5 10 10 5 1

Poisson Distribution

$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$

[Figure: P(x) (proportion of samples) against x for μ = 0.1, 1, 2, 3 and 10.]

Review
What is a null hypothesis?
  A statistician's way of characterizing "chance." Generally, a mathematical model of randomness with respect to a particular set of observations. The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Review
Examples of null hypotheses:
• Sequence comparison using shuffled sequences.
• A normal distribution of log ratios from a microarray experiment.
• LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.

Empirical score distribution
• The picture shows a distribution of scores from a real database search using BLAST.
• This distribution contains scores from non-homologous and homologous pairs.
• High scores come from homology.
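The first null hypothesis listed above, sequence comparison using shuffled sequences, can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the two sequences are invented, and `identity_score` (a count of matching positions) stands in for a real alignment score such as the one BLAST computes. Neither comes from the slides.

```python
import random


def identity_score(a, b):
    """Toy score (an assumption): number of positions where the sequences match."""
    return sum(x == y for x, y in zip(a, b))


def shuffled_null_scores(query, target, n_shuffles=1000, seed=0):
    """Score the query against shuffled copies of the target.

    Shuffling keeps the residue composition but destroys the order,
    so these scores model what chance alone produces.
    """
    rng = random.Random(seed)
    residues = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(identity_score(query, "".join(residues)))
    return scores


# Hypothetical sequences, differing at two positions
query = "MKTAYIAKQRQISFVKSHFSRQ"
target = "MKTAYIAKQRNISFVKAHFSRQ"

observed = identity_score(query, target)
null = shuffled_null_scores(query, target)

# Fraction of shuffles that score at least as well as the real comparison
fraction = sum(s >= observed for s in null) / len(null)
print(observed, max(null), fraction)
```

A histogram of `null` plays the role of an empirical null score distribution for this toy score. A real analysis would shuffle or randomize a whole database and use a proper alignment score.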
Empirical null score distribution
• This distribution is similar to the empirical score distribution above, but generated using a randomized sequence database.

Review
What is a p-value?
  The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. I.e., "How likely is this effect to occur by chance?"
  $\Pr(x > S \mid \text{null})$

Review
What is the name of the distribution created by sequence similarity scores, and what does it look like?
  Extreme value distribution, or Gumbel distribution. It looks similar to a normal distribution, but it has a larger tail on the right.
[Figure: histogram of similarity scores; x-axis: score, from <20 to >120; y-axis: count, from 0 to 8000.]

Statistics
BLAST scores (and also local alignment scores, i.e. Smith-Waterman, and BLAT scores) between random, unrelated sequences follow the Gumbel Extreme Value Distribution (EVD):

$$\Pr(s > S) = 1 - \exp\left(-K m n\, e^{-\lambda S}\right)$$

This is the probability of randomly encountering a score greater than S.
• S: alignment score
• m, n: query sequence length and length of the database, respectively
• K, λ: parameters depending on the scoring scheme and sequence composition

Bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

BLAST output revisited
[Figure: BLAST output with S', S, E, K, n and m marked. From: Expasy BLAST]

Review: EVD for random BLAST
Upper tail behaviour: for large S, $\Pr(s > S) \approx K m n\, e^{-\lambda S}$. This quantity is the EXPECT value (E-value).
[Figure: the same score histogram, with the upper tail labelled "E-value".]

Summary
• Want to be able to compare scores in sequences of different compositions or different scoring schemes.
• Score: S = sum(match) – sum(gap costs)
  – Score and bit score grow linearly with the length of the alignment.
• Bit score: $S' = (\lambda S - \ln K) / \ln 2$
• E-value of bit score: $E = m n\, 2^{-S'}$
  – E-value shrinks really fast as bit score grows.
  – E-value grows linearly with the product of target and query sizes.
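The relationships between raw score, bit score, E-value and P-value can be checked numerically. The sketch below is a hedged illustration rather than BLAST's own code: the function names are invented, and the values of K, λ, m, n and S are assumed for the example only. Real values of K and λ should be read from the BLAST output for the scoring scheme actually used.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S): expected number of chance hits."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2**(-S'), which needs no knowledge of K or lambda."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """Pr(s > S) = 1 - exp(-E); close to E itself when E is small."""
    return -math.expm1(-e)


# Assumed, illustrative parameters (not taken from the slides)
K, LAM = 0.041, 0.267
m, n = 250, 5_000_000   # query length, database length
S = 100                 # raw alignment score

bits = bit_score(S, K, LAM)
E = e_value(S, m, n, K, LAM)

print(f"bit score        = {bits:.1f}")
print(f"E from raw score = {E:.3g}")
print(f"E from bit score = {e_value_from_bits(bits, m, n):.3g}")
print(f"P-value          = {p_value(E):.3g}")
print(f"E, doubled db    = {e_value(S, m, 2 * n, K, LAM):.3g}")
```

The two E-values agree because $2^{-S'} = K e^{-\lambda S}$. That is why bit scores can be compared across scoring schemes while raw scores cannot.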
Doubling the target set size and doubling the query length have the same effect on the E-value, because E depends on the two sizes only through their product mn.

Conclusion
• You should now be able to compare BLAST results from different databases, converting values if they are reported differently (which happens frequently).
• You should now know why BLAST results might change from one day to the next, even on the same server.
• You should also understand the dependence of the E-value on query length.
• Statistical rankings are reported for (almost) every database search tool.
• When making comparisons between databases or between sequences, it is useful to know how the statistics are derived, in order to know whether the comparisons are meaningful.

THE END

Supplemental Section
Look through:
• Patterns in sequences (Searching for information within sequences) - Some common problems and their solutions: http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm
• What is the structure of my sequence? http://speedy.embl-heidelberg.de/gtsp/flowchart2.html (clickable!)

