       Significance in protein analysis



        Swapan 'Shop' Mallick




    Bioinformatics Group
    Institute of Biotechnology
    University of Helsinki
                          Overview
The need for statistics
Example: BLOSUM
   What do the scores mean?
   How can you compare two scores?
Example: BLAST
   Problems with BLAST
   Review of Distributions
   Distribution of random BLAST results
   P-values and e-values
   Statistics of BLAST
Summary and Conclusion
Exercise
           The need for statistics

• Statistics is very important for bioinformatics.
  – It is very easy to have a computer analyze the data
    and give you back a result.
  – The problem is deciding whether the answer the computer
    gives you is any good at all.
• Questions:
  – How statistically significant is the answer?
  – What is the probability that this answer could have
    been obtained by chance? What does this depend on?
             Basics

[Figure: a population (size N) and a sample (size n, with statistics X and S). The slide labels the diagram "Descriptive statistics" and "Probability".]
         Example: BLOSUM

The BLOSUM matrix assigns a probability-based score to each
residue pair in an alignment, based on the frequency with which
that pairing is known to occur within conserved blocks of
related proteins.
  This is simple, since size of population = size of sample.
BLOSUM matrices are constructed from observations, which
give observed probabilities.
   BLOSUM substitution matrices

BLOSUM matrices are used in
'log-odds' form, based on
actually observed substitutions.

This is because:
    Ease of use: scores can simply be
    added (the raw probabilities would
    have to be multiplied).
    Ease of interpretation:
       S = 0 : the substitution is observed just as
       often as expected by chance
       S < 0 : the substitution is observed less
       often than expected by chance
       S > 0 : the substitution is observed more
       often than expected by chance
                             Substitution matrices

Score of amino acid a with amino acid b:

$$S(a, b) = \frac{1}{\lambda} \ln \frac{p_{ab}}{f_a f_b}$$

p_ab is the observed frequency with which residues a and b are correlated because of homology.
f_a f_b is the expected frequency of seeing residues a and b paired together, which is just the frequency of residue a multiplied by the frequency of residue b.
Lambda (λ) is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers.

 Source: Where did the BLOSUM62 alignment score matrix come from?
 Eddy S., Nat. Biotech. 22 Aug 2004
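
As a rough illustration of the log-odds definition above, the sketch below computes a rounded score from an observed pair frequency and two background frequencies. The function name `log_odds_score` and the example frequencies are hypothetical, chosen only for illustration; the logarithm is assumed to be the natural logarithm, and λ = 0.347 is the value quoted on the slide.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide (assumed natural-log units)


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Rounded log-odds score: (1 / lambda) * ln(p_ab / (f_a * f_b))."""
    return round(math.log(p_ab / (f_a * f_b)) / lam)


# Hypothetical frequencies, not taken from any real BLOSUM table.
# Pair observed four times more often than expected by chance:
print(log_odds_score(p_ab=0.0100, f_a=0.05, f_b=0.05))
# Pair observed exactly as often as expected by chance:
print(log_odds_score(p_ab=0.0025, f_a=0.05, f_b=0.05))
```

A pair seen more often than chance predicts gets a positive score, and a pair seen as often as chance predicts scores zero.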
                         Substitution matrices

$$\frac{p_{ab}}{f_a f_b} = e^{\lambda S}$$

p_ab is the observed frequency with which residues a and b are correlated because of homology.
f_a f_b is the expected frequency of seeing residues a and b paired together, which is just the frequency of residue a multiplied by the frequency of residue b.
Lambda (λ) is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers.
i) S = 0: O/E ratio = 1.
ii) Compare S = 5 and S = 10. The ratio is based on an exponential function: the O/E ratio is 5.7 for S = 5 and 32.1 for S = 10.
iii) S = -10: O/E ratio = 0.031 ≈ 1/32.
iv) Ratio of scores S1, S2 in terms of probabilities of observed/random:

$$e^{\lambda S_1} / e^{\lambda S_2} = e^{\lambda (S_1 - S_2)}$$
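
The observed/expected ratios quoted above can be reproduced with a few lines. This is a minimal sketch assuming λ = 0.347 and natural exponentials; the function name `odds_ratio` is made up for the example.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slide


def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score: exp(lambda * S)."""
    return math.exp(lam * score)


for s in (0, 5, 10, -10):
    print(f"S = {s:>3}: O/E ratio = {odds_ratio(s):.3f}")

# The ratio between two scores depends only on their difference:
# exp(lambda * S1) / exp(lambda * S2) == exp(lambda * (S1 - S2))
print(odds_ratio(10) / odds_ratio(5), odds_ratio(10 - 5))
```

Because the relationship is exponential, doubling a score from 5 to 10 squares the odds ratio rather than doubling it.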
         Example: BLAST
Motivations
 Exact algorithms are exhaustive but
 computationally expensive.
 Exact algorithms are impractical for comparing
 a query sequence to millions of other sequences
 in a database (database scanning).
 Database scanning therefore requires a heuristic
 alignment algorithm (at the cost of optimality).
   Interpret BLAST results - Description

ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank.
Gene/sequence definition.
Bit score: higher is better. Click to access the pairwise alignment.
Expect value: lower is better. It indicates how likely it is that this is a random hit.
Links.
        Problems with BLAST

Why do results change?
How can you compare results from different
BLAST tools which may report different types of
values?
How are results (e.g. the E-value) affected by the query?
There are _many_ values reported in the output –
what do they mean?
Example: Importance of Blast statistics




But, first a review.
                    Review

What is a distribution?
  A plot showing the frequency of a given variable or
  observation.
 Features of a Normal Distribution
Symmetric distribution
Has an average or mean value at the centre (μ = mean)
Has a characteristic width called the standard deviation (S.D. = σ)
Most common type of distribution known
    Standard Deviations (Z-score)

 Range         Fraction within     Tail            Fraction above
 ± 1.0 S.D.    0.683               > + 1.0 S.D.    0.159
 ± 2.0 S.D.    0.954               > + 2.0 S.D.    0.023
 ± 3.0 S.D.    0.9973              > + 3.0 S.D.    0.0014
 ± 4.0 S.D.    0.99994             > + 4.0 S.D.    0.00003
 ± 5.0 S.D.    0.9999994           > + 5.0 S.D.    0.0000003
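
The fractions in this table can be recomputed from the error function. The sketch below uses only Python's standard `math` module; the helper names `fraction_within` and `fraction_above` are hypothetical.

```python
import math


def fraction_within(z):
    """Fraction of a normal distribution lying within +/- z standard deviations."""
    return math.erf(z / math.sqrt(2))


def fraction_above(z):
    """Fraction of a normal distribution lying more than z standard deviations above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for z in range(1, 6):
    print(f"{z}.0 S.D.  within: {fraction_within(z):.7f}  above: {fraction_above(z):.7f}")
```

The one-sided tail is half of what lies outside the symmetric range, so `fraction_within(z) + 2 * fraction_above(z)` equals 1.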
         Mean, Median, Mode

[Figure: a skewed distribution with the mode, median and mean marked.]

In a Normal Distribution the mean, mode and
median are all equal
In skewed distributions they are unequal
Mean - average value, affected by extreme values
in the distribution
Median - the “middlemost” value, usually half
way between the mode and the mean
Mode - most common value
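
To make the difference concrete, here is a small sketch using Python's standard `statistics` module on a made-up, right-skewed sample. The data values are hypothetical and chosen only so that the three measures differ.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mode = statistics.mode(data)      # most common value
median = statistics.median(data)  # middle value of the sorted data
mean = statistics.mean(data)      # average, pulled upwards by the extreme values

print("mode:", mode, "median:", median, "mean:", mean)
```

For this sample the mode is smallest and the mean largest, with the median in between, as expected for a distribution with a long right tail.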
 Different Distributions

[Figure: examples of a unimodal and a bimodal distribution.]
       Other Distributions
Binomial Distribution

Poisson Distribution

Extreme Value Distribution
       Binomial Distribution

$$P(x) = (p + q)^n$$

The coefficients of the expansion form Pascal's triangle:

                          1
                         1 1
                        1 2 1
                       1 3 3 1
                      1 4 6 4 1
                    1 5 10 10 5 1
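
The rows of the triangle are the binomial coefficients, and each term of the expansion of (p + q) to the power n is one binomial probability. A minimal sketch follows, assuming p is the probability of "success" and q = 1 - p; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def pascal_row(n):
    """Coefficients of the expansion of (p + q)**n, i.e. row n of Pascal's triangle."""
    return [math.comb(n, k) for k in range(n + 1)]


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials, each with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


for n in range(6):
    print(pascal_row(n))

# The terms of the expansion sum to (p + q)**n = 1 when q = 1 - p.
print(sum(binomial_pmf(k, 5, 0.3) for k in range(6)))
```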
                               Poisson Distribution

$$P(x) = \frac{\mu^x e^{-\mu}}{x!}$$

[Figure: Poisson distributions for μ = 0.1, 1, 2, 3 and 10; x-axis: x; y-axis: P(x), the proportion of samples.]
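
The curves on this slide can be tabulated directly from the Poisson formula, in which the probability of x events is the mean raised to the power x, times exp(-mean), divided by x factorial. The sketch below is illustrative only; the symbol `mu` for the mean and the function name `poisson_pmf` are assumptions.

```python
import math


def poisson_pmf(x, mu):
    """Probability of observing exactly x events when the mean count is mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)


# Means shown on the slide's plot.
for mu in (0.1, 1, 2, 3, 10):
    probs = [poisson_pmf(x, mu) for x in range(21)]
    peak = max(range(21), key=lambda x: probs[x])
    print(f"mean = {mu:>4}: most likely x = {peak}, P = {probs[peak]:.3f}")
```

As the mean grows, the peak moves to the right and the distribution becomes broader and more symmetric.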
                        Review

What is a distribution?
  A plot showing the frequency of a given variable or
  observation.
What is a null hypothesis?
  A statistician's way of characterizing "chance."
  Generally, a mathematical model of randomness with respect to
  a particular set of observations.
  The purpose of most statistical tests is to determine whether the
  observed data can be explained by the null hypothesis.
                    Review

Examples of null hypotheses:
  Sequence comparison using shuffled sequences.
  A normal distribution of log ratios from a microarray
  experiment.
  LOD scores from genetic linkage analysis when the
  relevant loci are randomly sprinkled throughout the
  genome.
     Empirical score distribution

The picture shows a distribution of scores from a real database
search using BLAST. This distribution contains scores from
non-homologous and homologous pairs.
[Figure callout: high scores from homology.]
  Empirical null score distribution

This distribution is similar to the previous one, but generated
using a randomized sequence database.
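
One way to build such an empirical null distribution on a small scale is to score a query against many shuffled copies of a target sequence. The sketch below is not BLAST: it swaps in a plain Smith-Waterman local alignment with a simple match/mismatch/gap scheme, and the sequences, scoring values and function names are all hypothetical.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def null_scores(query, target, trials=200, seed=0):
    """Scores of the query against shuffled copies of the target (same composition)."""
    rng = random.Random(seed)
    letters = list(target)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(local_score(query, "".join(letters)))
    return scores


# Hypothetical sequences, for illustration only.
query = "HEAGAWGHEEHEAGAWGHEE"
target = "PAWHEAEHEAGAWGHEEPAW"
scores = null_scores(query, target)
print("real score:", local_score(query, target))
print("null scores: max", max(scores), "mean", sum(scores) / len(scores))
```

Shuffling keeps the composition of the target but destroys any homology, so the resulting scores approximate what chance alone produces.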
                     Review

What is a p-value?
  The probability of observing an effect as strong or
  stronger than you observed, given the null hypothesis.
  I.e., “How likely is this effect to occur by chance?”
  Pr(x ≥ S | null)
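
Given a list of scores generated under the null hypothesis, a p-value can be estimated by counting. This is a minimal sketch; the function name `empirical_p_value` and the example numbers are hypothetical.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are as strong as, or stronger than, the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)


# Hypothetical null scores, e.g. from searches against randomized sequences.
null = [12, 15, 9, 20, 14, 11, 17, 13, 10, 16]
print(empirical_p_value(17, null))
```

The resolution of this estimate is limited by the number of null scores: with N of them, the smallest non-zero p-value it can report is 1/N, which is one reason an analytical distribution is useful.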
                        Review

What is the name of the distribution created by sequence
similarity scores, and what does it look like?
  Extreme value distribution, or Gumbel distribution.
  It looks similar to a normal distribution, but it has a
  larger tail on the right.
                        Review

What is the name of the          8000
distribution created by          7000

sequence similarity scores,      6000
                                 5000
and what does it look            4000
like?                            3000
                                 2000
  Extreme value distribution,
                                 1000
  or Gumbel distribution.
                                   0
  It looks similar to a normal          <20   30   40   50   60   70   80   90   100   110   >120

  distribution, but it has a
  larger tail on the right.
                            Statistics
BLAST scores (and also other local alignment scores, i.e. Smith-Waterman and BLAT)
between random, unrelated sequences follow the Gumbel Extreme
Value Distribution (EVD):

$$\Pr(s > S) = 1 - \exp\left(-K m n \, e^{-\lambda S}\right)$$

   This is the probability of randomly encountering a score greater than S.
   S: alignment score
   m, n: query sequence length and length of the database, respectively
   K, λ: parameters depending on the scoring scheme and sequence composition

Bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
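
The two formulas on this slide, the EVD tail probability 1 - exp(-K m n exp(-λS)) and the bit score (λS - ln K) / ln 2, can be sketched as follows. The function names are hypothetical, and the values of K, λ, m and n are illustrative assumptions rather than parameters of any particular BLAST run.

```python
import math


def evd_p_value(score, m, n, K, lam):
    """Pr(s > S) = 1 - exp(-K * m * n * exp(-lam * S)) under the Gumbel EVD."""
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


def bit_score(score, K, lam):
    """Normalised score S' = (lam * S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


# Illustrative values only (assumptions, not read from a real BLAST report).
K, lam = 0.13, 0.32
m, n = 250, 50_000_000  # query length and database length

for raw in (60, 80, 100):
    print(raw, round(bit_score(raw, K, lam), 1), evd_p_value(raw, m, n, K, lam))
```

The bit score folds K and λ into the score itself, which is what makes scores from different scoring schemes comparable.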
                BLAST output revisited

[Figure: BLAST output annotated to show where S', S, E, K, n and m are reported.]
From: Expasy BLAST
                               Review

EVD for random BLAST scores
Upper tail behaviour:

$$\Pr(s > S) \sim K m n \, e^{-\lambda S}$$

This is the EXPECT value = E-value.

[Figure: histogram of scores, binned from <20 to >120, with the vertical axis running from 0 to 8000.]
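
The upper-tail approximation can be motivated from the full EVD expression, Pr(s > S) = 1 - exp(-K m n e^{-λS}), by a series expansion. The short derivation below is a sketch, writing E for the quantity K m n e^{-λS}:

```latex
\begin{align*}
E &= K m n \, e^{-\lambda S} \\
\Pr(s > S) &= 1 - e^{-E}
            = E - \frac{E^2}{2} + \frac{E^3}{6} - \dots \\
           &\approx E \qquad \text{for } E \ll 1
\end{align*}
```

For example, E = 0.01 gives a probability of about 0.00995, so for significant hits the P-value and the E-value are nearly identical. For large E the two diverge: the probability saturates at 1 while E keeps growing.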
                       Summary

Want to be able to compare scores in
sequences of different compositions or
different scoring schemes

Score: S = sum(match) – sum(gap costs)
Bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}$$

E-value of bit score:

$$E = m n \, 2^{-S'}$$

Score and bit score grow linearly with the length of the alignment.
E-Value shrinks really fast as bit score grows.
E-Value grows linearly with the product of target and query sizes.
Doubling target set size and doubling query length have the same effect on e-value.
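
The summary points can be checked numerically using the relation that the E-value equals m times n times 2 to the power of minus the bit score. This is a sketch under assumed values: the function names `e_value` and `rescale_e_value`, and the lengths used, are hypothetical.

```python
def e_value(bit_score, m, n):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bit_score)


def rescale_e_value(e, n_old, n_new):
    """Convert an E-value reported for a database of length n_old to one of length n_new."""
    return e * n_new / n_old


m, n = 300, 5_000_000  # hypothetical query length and database length

print(e_value(40, m, n))      # baseline
print(e_value(41, m, n))      # one extra bit halves the E-value
print(e_value(40, 2 * m, n))  # doubling the query length ...
print(e_value(40, m, 2 * n))  # ... has the same effect as doubling the database
print(rescale_e_value(e_value(40, m, n), n, 2 * n))
```

Because E is linear in the database length, the same hit gets a different E-value as the database grows, and rescaling by the ratio of database lengths makes E-values from different databases comparable.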
                           Conclusion
You should now be able to compare BLAST results from different
databases, converting values if they are reported differently (which
happens frequently).
You should now know why BLAST results might change from one day to
the next, even on the same server.
You should also understand the dependence of the E-value on query length.
Statistical rankings are reported for (almost) every database search tool.
When making comparisons between databases or between sequences, it is
useful to know how the statistics are derived, in order to know whether
the comparisons are meaningful.
THE END
Supplemental Section
Look through: Patterns in sequences (Searching for information
within sequences) - Some common problems and their solutions:
  http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm
What is the structure of my sequence?
  http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
  (clickable!)

				