# Raw Score Bit Score E Value Bioinformatics

```
Significance in protein analysis

Swapan 'Shop' Mallick

Bioinformatics Group
Institute of Biotechnology
University of Helsinki
Overview
The need for statistics
Example: BLOSUM
What do the scores mean?
How can you compare two scores?
Example: BLAST
Problems with BLAST
Review of Distributions
Distribution of random BLAST results
P-values and e-values
Statistics of BLAST
Summary and Conclusion
Exercise
The need for statistics

• Statistics is very important for bioinformatics.
– It is very easy to have a computer analyze the data
and give you back a result.
– The problem is deciding whether the answer the computer
gives you is any good at all.
• Questions:
– How statistically significant is the answer?
– What is the probability that this answer could have
been obtained by chance? What does this depend on?
Basics

[Figure: a population (size N) and a sample (size n) drawn from it, labelled "Descriptive statistics" and "Probability"]
Example: BLOSUM

The BLOSUM matrix assigns a probability
score for each residue pair in an alignment
based on:
the frequency with which that pairing is known to
occur within conserved blocks of related proteins.
Simple since size of population = size of sample
BLOSUM matrices are constructed from
probabilities
BLOSUM substitution matrices

BLOSUM matrices are used in
'log-odds' form based on
actually observed substitutions.

This is because:
Ease of use: 'scores' can just be added
(probabilities would have to be multiplied)
Ease of interpretation:
S = 0: the substitution is observed just as often as expected at random
S < 0: the substitution is observed less often than expected at random
S > 0: the substitution is observed more often than expected at random
Substitution matrices

S(a, b) = (1/λ) · log( p_ab / (f_a f_b) )

S(a, b) is the score of aligning amino acid a with amino acid b.
p_ab is the observed frequency that residues a and b are correlated because of homology.
f_a f_b is the expected frequency of seeing residues a and b paired together, which is just
the frequency of residue a multiplied by the frequency of residue b.
λ (lambda) is a scaling factor equal to 0.347, set so that the scores can be rounded off
to sensible integers.

Source: Where did the BLOSUM62 alignment score matrix come from?
Eddy S., Nat. Biotech. 22 Aug 2004
Substitution matrices

Equivalently, the observed/expected (O/E) ratio is an exponential function of the score:

p_ab / (f_a f_b) = e^(λ S)

with the same definitions of p_ab, f_a f_b and λ (= 0.347) as above.
i) S = 0: O/E ratio = 1
ii) Compare S = 5 and S = 10. The ratio is based on an exponential function:
O/E = 5.7 for S = 5 and O/E = 32.1 for S = 10.
iii) S = -10: O/E ratio = 0.031 ≈ 1/32.
iv) Ratio of scores S1, S2 in terms of probabilities of observed/random:
e^(λ S1) / e^(λ S2) = e^(λ (S1 - S2))
Example: BLAST
Motivations
Exact algorithms are exhaustive but
computationally expensive.
Exact algorithms are impractical for comparing
a query sequence to millions of other sequences
in a database (database scanning),
so database scanning requires a heuristic
alignment algorithm (at the cost of optimality).
Interpret BLAST results - Description

Each line of the description table contains:
ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank
Gene/sequence definition
Bit score: higher is better. Click to access the pairwise alignment
Expect value: lower is better. It indicates how likely it is that this is a random hit
Links
Problems with BLAST

Why do results change?
How can you compare results from different
BLAST tools which may report different types of
values?
How are results (e.g. the E-value) affected by the query?
There are many values reported in the output:
what do they mean?
Example: Importance of Blast statistics

But, first a review.
Review

What is a distribution?
A plot showing the frequency of a given variable or
observation.
Features of a Normal Distribution
Symmetric distribution
Has an average or mean value at the centre (μ = mean)
Has a characteristic width called the standard deviation (S.D. = σ)
Most common type of distribution known
Standard Deviations (Z-score)

Range            Fraction inside    Range              Fraction above
μ ± 1.0 S.D.     0.683              > μ + 1.0 S.D.     0.158
μ ± 2.0 S.D.     0.954              > μ + 2.0 S.D.     0.023
μ ± 3.0 S.D.     0.9973             > μ + 3.0 S.D.     0.0014
μ ± 4.0 S.D.     0.99994            > μ + 4.0 S.D.     0.00003
μ ± 5.0 S.D.     0.9999994          > μ + 5.0 S.D.     0.0000003
Mean, Median & Mode
[Figure: a distribution with the mode, median and mean marked]
Mean, Median, Mode

In a Normal Distribution the mean, mode and
median are all equal
In skewed distributions they are unequal
Mean - average value, affected by extreme values
in the distribution
Median - the “middlemost” value, usually lying
between the mode and the mean
Mode - most common value
Different Distributions

[Figure: examples of a unimodal and a bimodal distribution]
Other Distributions
Binomial Distribution

Poisson Distribution

Extreme Value Distribution
Binomial Distribution

P(x) = (p + q)^n

The coefficients of the expansion for n = 0 to 5 form Pascal's triangle:

          1
         1 1
        1 2 1
       1 3 3 1
      1 4 6 4 1
    1 5 10 10 5 1
Poisson Distribution

P(x) = μ^x e^(-μ) / x!

[Figure: Poisson distributions for μ = 0.1, 1, 2, 3 and 10; x-axis: x, y-axis: P(x), the proportion of samples]
Review

What is a null hypothesis?
A statistician's way of characterizing “chance.”
Generally, a mathematical model of randomness with respect to
a particular set of observations.
The purpose of most statistical tests is to determine whether the
observed data can be explained by the null hypothesis.
Review

Examples of null hypotheses:
Sequence comparison using shuffled sequences.
A normal distribution of log ratios from a microarray
experiment.
LOD scores from genetic linkage analysis when the
relevant loci are randomly sprinkled throughout the
genome.
Empirical score distribution

The picture shows a
distribution of scores
from a real database
search using BLAST.
This distribution
contains scores from
non-homologous and
homologous pairs.
High scores from homology.
Empirical null score distribution

This distribution is
similar to the previous
one, but generated
using a randomized
sequence database.
Review

What is a p-value?
The probability of observing an effect as strong or
stronger than you observed, given the null hypothesis.
I.e., “How likely is this effect to occur by chance?”
Pr(x > S|null)
Review

What is the name of the
distribution created by
sequence similarity scores,
and what does it look
like?
Extreme value distribution, or
Gumbel distribution.
It looks similar to a normal
distribution, but it has a
larger tail on the right.
[Figure: histogram of sequence similarity scores; x-axis: score bins from <20 to >120, y-axis: counts from 0 to 8000]
Statistics
BLAST scores (and also local alignment scores, i.e. Smith-Waterman, and BLAT scores)
between random, unrelated sequences follow the Gumbel Extreme
Value Distribution (EVD):
Pr(s > S) = 1 - exp(-K m n e^(-λS))
This is the probability of randomly encountering a score greater than S.
S: alignment score
m, n: length of the query sequence and length of the database, respectively
K, λ: parameters depending on the scoring scheme and sequence composition
Bit score: S' = (λS - ln K) / ln 2
BLAST output revisited

[Figure: BLAST output with the values S', S, E, λ, K, m and n marked]
From: Expasy BLAST
Review

EVD for random BLAST
Upper tail behaviour:
Pr(s > S) ~ K m n e^(-λS)
This is the EXPECT value (E-value).

[Figure: histogram of scores; x-axis: score bins from <20 to >120, y-axis: counts from 0 to 8000]
Summary

Want to be able to compare scores in
sequences of different compositions or
different scoring schemes
Score: S = sum(match) - sum(gap costs)
Bit score: S' = (λS - ln K) / ln 2
E-value of bit score: E = m n 2^(-S')

Score and bit score grow linearly with the length of the alignment.
E-value shrinks exponentially as the bit score grows.
E-value grows linearly with the product of target and query sizes.
Doubling the target set size and doubling the query length have the same
effect on the E-value.
Conclusion
You should now be able to compare BLAST results from different
databases, converting values if they are reported differently (which
happens frequently).
You should now know why BLAST results might change from one day to
the next, even on the same server.
You should also understand how the E-value depends on query length.
Statistical rankings are reported for (almost) every database search tool.
When making comparisons between databases or between sequences, it is
useful to know how the statistics are derived, to know whether the comparisons are
meaningful.
THE END
Supplemental Section
Look through: Patterns in sequences (Searching
for information within sequences) - Some
common problems and their solutions:
http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm
What is the structure of my sequence?
http://speedy.embl-heidelberg.de/gtsp/flowchart2.html
(clickable!)

```
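
The formulas in the slides can be checked numerically. The following is a minimal Python sketch, not part of the original slides. It converts a log-odds score to an observed/expected ratio, and a raw alignment score to a bit score, E-value and P-value. The function names are hypothetical. Only λ = 0.347 for the substitution-matrix scale is quoted from the slides; the values of λ, K, m and n in the usage lines are illustrative assumptions.

```python
import math

# Scaling factor quoted in the slides for the BLOSUM62 log-odds scores.
LAMBDA_BLOSUM62 = 0.347


def odds_ratio(score, lam=LAMBDA_BLOSUM62):
    """Observed/expected (O/E) ratio implied by a log-odds score: e^(lam * S)."""
    return math.exp(lam * score)


def bit_score(raw_score, lam, k):
    """Bit score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, m, n):
    """E-value from a bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """P-value from an E-value: 1 - exp(-E), i.e. Pr(s > S) under the EVD."""
    return -math.expm1(-e)


# Usage. All numbers below are assumed for illustration only.
lam, k = 0.267, 0.041      # hypothetical statistical parameters of a scoring scheme
m, n = 250, 100_000_000    # hypothetical query length and database length
raw = 100                  # hypothetical raw alignment score

bits = bit_score(raw, lam, k)
e = e_value(bits, m, n)
print("O/E for S=5 and S=10:", odds_ratio(5), odds_ratio(10))
print("bit score:", bits)
print("E-value:", e)
print("P-value:", p_value(e))
print("E-value with doubled database:", e_value(bits, m, 2 * n))
```

Because 2^(-S') equals K e^(-λS), the E-value computed from the bit score is the same as K m n e^(-λS) computed from the raw score, which is why bit scores can be compared across scoring schemes while raw scores cannot.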