Detecting selection at the sequence level
Alexey Kondrashov, NCBI, NIH
1. Selection prevents (if negative or balancing) or facilitates
(if positive) allele replacements.
2. DNA sequences are chronicles of past allele replacements,
mostly of single-nucleotide substitutions.
3. Ongoing selection can be revealed by data on within-
population polymorphism.
3 most popular methods of detecting selection
1. Comparison of the actual rate of evolution with the rate of
neutral evolution (which is equal to mutation rate).
Reduced (elevated) rate of evolution indicates negative
(positive) selection. The larger the number of sequences
compared, the better the resolution, in terms of the minimal
number of sites at which selection can still be detected.
Achievements:
a) a majority (80-95%) of nonsynonymous mutations, and
many intergenic mutations (10% in mammals, and perhaps
50% in Drosophila melanogaster), are deleterious enough
to never get fixed
b) a small number of amino-acid sites in some proteins are
under persistent positive selection
Problems:
a) Conservation tells us nothing about the strength of negative
selection, except that s > 1/Ne.
b) It is hard to detect occasional episodes of positive selection
at sites which are usually under negative selection (clade-
specific constraints can help a bit).
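A minimal sketch of why conservation only tells us that s > 1/Ne. It assumes the standard diffusion approximation for the substitution rate of a mutation with scaled selection coefficient S = 4Nes, relative to the neutral rate; the formula S / (1 - exp(-S)) and the function name are brought in for illustration and are not taken from these slides.

```python
import math

def relative_substitution_rate(S):
    """Substitution rate relative to the neutral rate, for S = 4*Ne*s.

    Assumption: diffusion approximation, genic selection, small s.
    Negative S means the new mutation is deleterious.
    """
    if abs(S) < 1e-9:
        return 1.0  # neutral limit
    # S / (1 - exp(-S)), written with expm1 for numerical accuracy
    return S / -math.expm1(-S)

for S in (-10, -4, -1, 0, 1, 4):
    print(S, round(relative_substitution_rate(S), 4))
```

Once a deleterious mutation has |S| well above 1, its rate is close to zero whatever the exact value of s, so conservation alone cannot separate moderately from strongly deleterious changes.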
2. Comparison of fractions of some events among replacements
vs. among polymorphisms (McDonald-Kreitman test)
Rationale: positive selection causes a lot of replacements but, at
each moment, only a small amount of polymorphism
(# of studied replacements) / (# of neutral replacements) ~ (# of studied polymorphisms) / (# of neutral polymorphisms)
Under positive selection, the ratio on the left exceeds the ratio on the right.
Advantage: positive selection can be detected even if the overall
replacement rate is below the neutral one.
Achievement: the fraction of amino-acid replacements in Drosophila
simulans driven by positive selection is 25% ± 20% (Bierne & Eyre-
Walker, Mol. Biol. Evol. 21, 1350, 2004).
Problems: 1) Due to many rare, deleterious polymorphisms this test
is way too conservative. 2) One cannot study individual sites.
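A minimal sketch of the comparison above, assuming that the "studied" events are nonsynonymous changes and the "neutral" events are synonymous ones. The estimator alpha = 1 - (Ds * Pn) / (Dn * Ps) is a commonly used summary of this test and is an assumption here, not something stated on the slide; the counts in the example are hypothetical.

```python
def mk_alpha(dn, ds, pn, ps):
    """Estimated fraction of nonsynonymous replacements driven by positive selection.

    dn, ds: nonsynonymous / synonymous fixed differences (replacements)
    pn, ps: nonsynonymous / synonymous polymorphisms
    """
    if dn <= 0 or ps <= 0:
        raise ValueError("dn and ps must be positive")
    return 1.0 - (ds * pn) / (dn * ps)

# Hypothetical counts, chosen only to illustrate the arithmetic.
print(mk_alpha(dn=40, ds=100, pn=20, ps=100))
```

Rare deleterious nonsynonymous polymorphisms inflate pn and so push this estimate down, which is the conservativeness noted under Problems.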
3. Analysis of polymorphisms
i) An excess of rare alleles reveals negative selection
ii) A high frequency of a young DNA segment (with few SNPs)
indicates an ongoing sweep, driven by positive selection. A local
cluster of frequent, derived SNPs indicates a recently completed
sweep (hitch-hiking).
iii) A local cluster of common SNPs may indicate balancing selection.
Achievements:
1) 10-20% of nonsynonymous polymorphisms are substantially
deleterious.
2) Several ongoing selective sweeps have been described.
Problem: only ongoing or rather recent selection can be studied
through SNPs alone.
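One common way to quantify an excess of rare alleles is Tajima's D. The slides do not name this statistic, so it is used here only as a stand-in; the function name and the toy inputs are hypothetical. Negative values indicate an excess of rare alleles, positive values an excess of common ones.

```python
import math

def tajimas_d(derived_counts, n):
    """Tajima's D from the number of derived copies at each segregating site.

    derived_counts: list with one entry per segregating site (1 .. n-1)
    n: number of sampled sequences (at least 4)
    """
    S = len(derived_counts)
    if n < 4 or S == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    # Mean pairwise difference (pi) and Watterson's estimator (S / a1)
    pi = sum(2.0 * k * (n - k) for k in derived_counts) / (n * (n - 1))
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

print(tajimas_d([1] * 10, n=10))  # ten singletons: excess of rare alleles
print(tajimas_d([5] * 10, n=10))  # ten intermediate-frequency SNPs
```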
What can we do, if we align:
1 sequence?
2 sequences?
3 sequences?
4 sequences?
5 sequences?
6 sequences?
Many sequences?
Our ability to detect past (distant or recent) selection
depends critically on the number of orthologous sequences
available for comparison. An extra sequence provides new
possibilities, some of which are underappreciated.
1 sequence
Volatility of a codon is the number of possible nonsynonymous
single-nucleotide substitutions in it.
It has been proposed recently (Plotkin, Dushoff & Fraser,
Nature 428, 942, 2004) that volatility is increased, due to the
appropriate choice of synonymous codons, at sites which encode
amino acids under strong positive selection.
The subject is rather controversial (e. g., Stoletzki et al., Mol. Biol.
Evol. 22, 2022, 2005).
I doubt that volatility is a very useful indicator of past positive
selection, except, perhaps, in the case of very strong selection.
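The definition of volatility given above can be sketched as follows, using the standard genetic code. One assumption is added: single-nucleotide changes that create a stop codon are not counted as nonsynonymous substitutions. The names CODE and volatility are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def volatility(codon):
    """Number of single-nucleotide changes that alter the encoded amino acid.

    Assumption: changes to a stop codon are excluded from the count.
    """
    aa = CODE[codon]
    count = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            new_aa = CODE[codon[:pos] + base + codon[pos + 1:]]
            if new_aa != aa and new_aa != "*":
                count += 1
    return count

# Two synonymous arginine codons with different volatility
print(volatility("CGA"), volatility("AGA"))
```

Synonymous codons for the same amino acid can differ in this count, which is what the proposed signal relies on.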
2 sequences
From a pairwise alignment, one can reliably detect selection only in
long enough segments of sequences. Thus, alignments of 2 sequences
are useful for detecting negative selection, but generally not suitable for
detecting positive selection since long segments which are consistently
under positive selection are very rare.
An often-overlooked fact: constant selection can both decrease (of
course!) and increase the rate of evolution (Eyre-Walker, 1992) and the
level of polymorphism (McVean & Charlesworth, 1999). This
counterintuitive phenomenon happens if weak, constant selection
favors more mutable allele(s).
Selection in Favor of Nucleotides G and C Diversifies Evolution
Rates and Levels of Polymorphism at Mammalian Synonymous
Sites
Kondrashov, Ogurtsov & Kondrashov, J. Theor. Biol., in press
Equal rates of evolution and levels of polymorphisms at mammalian
4-fold synonymous sites vs. neutral intron sites is just a coincidence,
hiding a complex pattern, caused by weak (4Nes ~ 1-2) selection
favoring nucleotides G and C at synonymous sites.
Let us subdivide all sites into 4 non-overlapping classes:
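1) nonCpG-prone (nonCpG) - not preceded by C and not followed by G
2) postC - preceded by C but not followed by G (C_nonG); G is mutable at such a site
3) preG - followed by G but not preceded by C (nonC_G); C is mutable at such a site
4) postCpreG - preceded by C and followed by G (C_G); both C and G are mutable at such a site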
This way, the class of a site depends only on its context, and not on
its own state.
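The four context classes can be assigned with a few lines of code. This is a sketch: the function name is hypothetical, and treating the first and last positions of a sequence as having no neighbour on the missing side is an assumption.

```python
def site_class(seq, i):
    """Context class of position i, based only on its two neighbours."""
    prev_is_c = i > 0 and seq[i - 1] == "C"
    next_is_g = i + 1 < len(seq) and seq[i + 1] == "G"
    if prev_is_c and next_is_g:
        return "postCpreG"
    if prev_is_c:
        return "postC"
    if next_is_g:
        return "preG"
    return "nonCpG"

seq = "ACGTCAG"
print([site_class(seq, i) for i in range(len(seq))])
```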
1. Synonymous sites of all classes are enriched by G and C, relative
to the corresponding intron classes, which are at mutation-drift
equilibrium.
2. CpG-prone sites are much more common among synonymous
sites than among intron sites, due to amino-acid composition of
proteins and the genetic code table.
3. CpG-prone sites (especially postCpreG) evolve faster and are
more polymorphic than nonCpG-prone sites, due to the elevated
mutability of CpG contexts.
4. NonCpG-prone synonymous sites evolve slower, and postCpreG
synonymous sites evolve faster, than the corresponding intron
sites (the same is true for their levels of polymorphism). BOTH
effects are due to weak selection in favor of G's and C's at
synonymous sites.
5. Synonymous sites in mammals are not neutral, as their overall rate
of evolution apparently suggests. They are under selection with
s ~0.00001.
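Point 4 above can be illustrated with a toy two-state site (GC vs. AT). This is a sketch under stated assumptions, not the model of the paper: substitution rates are taken as mutation rate times S / (1 - exp(-S)) with S = 4Nes, and the mutation rates are arbitrary illustrative numbers.

```python
import math

def fix_factor(S):
    """Fixation probability relative to neutral, for scaled selection S."""
    return 1.0 if abs(S) < 1e-9 else S / -math.expm1(-S)

def gc_site(mu_to_gc, mu_from_gc, S):
    """Equilibrium GC frequency and substitution rate at a two-state site.

    mu_to_gc, mu_from_gc: mutation rates AT->GC and GC->AT (arbitrary units)
    S: scaled selection favoring GC
    """
    q_gain = mu_to_gc * fix_factor(S)      # AT -> GC substitutions (favored)
    q_loss = mu_from_gc * fix_factor(-S)   # GC -> AT substitutions (disfavored)
    freq_gc = q_gain / (q_gain + q_loss)
    rate = 2.0 * q_gain * q_loss / (q_gain + q_loss)
    return freq_gc, rate

# Equal mutation rates (like nonCpG sites): selection slows evolution.
print(gc_site(1.0, 1.0, 0.0), gc_site(1.0, 1.0, 2.0))
# GC much more mutable (like CpG-prone sites): the same selection speeds it up.
print(gc_site(1.0, 10.0, 0.0), gc_site(1.0, 10.0, 2.0))
```

In both cases the GC frequency rises with S; only the direction of the change in rate depends on whether the favored state is the more mutable one.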
Intron sites outside transposable elements
               All        nonCpG          postC          preG           postCpreG
Number         52954891   33887011 (64%)  7978950 (15%)  8673858 (16%)  2415072 (5%)
Frequencies
  A            0.2827     0.2606          0.3007         0.3104         0.4331
  T            0.3111     0.2883          0.3373         0.3347         0.4588
  G            0.2097     0.2330          0.0423         0.3161         0.0538
  C            0.1965     0.2180          0.3197         0.0387         0.0543
Divergence     0.01064    0.00932         0.01319        0.01178        0.01663
Polymorphism   0.001294   0.001165        0.001573       0.001416       0.001767
4-fold synonymous sites
               All        nonCpG          postC          preG           postCpreG
Number         1949372    682032 (35%)    654300 (34%)   293573 (15%)   319467 (16%)
Frequencies
  A            0.2165     0.1478          0.2262         0.2039         0.3550
  T            0.2320     0.1597          0.2399         0.2169         0.3840
  G            0.2425     0.3479          0.0859         0.4751         0.1245
  C            0.3090     0.3446          0.4480         0.1042         0.1365
Divergence     0.01282    0.00831         0.01351        0.01182        0.02195
Polymorphism   0.001441   0.001051        0.001529       0.001251       0.002267
[Figure: frequency of G, rate of evolution, and level of polymorphism at nonCpG, postC, and postCpreG sites, plotted against S = 4Nes, the strength of selection for G and C (S from -4 to 10).]
3 sequences
1) We can now polarize replacements, i. e. distinguish the
reciprocal replacements A > B and B > A.
2) We can now assign the replacement to a particular branch
of the phylogenetic tree.
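Both points can be sketched with a simple parsimony rule: with two sister sequences and an outgroup, the state shared with the outgroup is taken as ancestral. The function name and the toy sequences are hypothetical, and sites where the outgroup matches neither sister are simply skipped.

```python
def polarize(seq_a, seq_b, outgroup):
    """List (site, branch, ancestral, derived) for sites where A and B differ.

    Assumption: the state shared with the outgroup is ancestral (parsimony).
    """
    events = []
    for i, (a, b, o) in enumerate(zip(seq_a, seq_b, outgroup)):
        if a == b:
            continue
        if o == b:
            events.append((i, "A", b, a))  # change b > a on the branch to A
        elif o == a:
            events.append((i, "B", a, b))  # change a > b on the branch to B
        # outgroup matches neither sister: direction is ambiguous, skip
    return events

print(polarize("AVLK", "AILK", "AVLR"))
```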
A universal trend of amino acid gain and loss in protein
evolution (Jordan et al., Nature 433, 633, 2005).
Distribution of the strength of selection against
amino acid replacements in human proteins
Yampolsky et al., Hum. Mol. Genet. 14, 3191, 2005.
We combined data on function-killing mutations, human
SNPs, and human-chimp substitutions - all polarized.
Together, these data ascertain the distribution of coefficients
of selection against amino-acid replacements as 4-bin
histograms.
[Figure: rate of evolution and level of variability as functions of log s, for Ne = 5x10^4 and Ne = 10^4.]
[Figure: histogram over log s (from -6 to 0); vertical axis from 0 to 0.3.]
[Figure: fraction of forbidden substitutions as a function of the difference in polarity (from -8 to 8).]
Table 2. M[log s], the mean values of the logarithms of selection coefficients associated
with 150 amino acid replacements.
Rows: source amino acid. Columns: destination amino acid.
  A C D E F G H I K L M N P Q R S T V W Y Mean
A x . -2.20 -2.63 . -3.86 . . . . . . -2.90 . . -3.41 -3.86 -3.64 . . -3.21
C . x . . -0.88 -2.37 . . . . . . . . -2.55 -2.74 . . -2.38 -1.90 -2.14
D -3.18 . x -3.35 . -3.06 -3.43 . . . . -3.43 . . . . . -2.17 . -2.47 -3.01
E -3.01 . -3.41 x . -3.11 . . -3.63 . . . . -3.65 . . . -2.42 . . -3.20
F . -2.23 . . x . . -2.87 . -3.23 . . . . . -2.88 . -2.79 . -3.10 -2.85
G -3.34 -3.00 -2.65 -2.91 . x . . . . . . . . -3.22 -3.57 . -2.32 -1.65 . -2.83
H . . -2.27 . . . x . . -3.18 . -3.14 -2.33 -3.24 -3.79 . . . . -3.04 -3.00
I . . . . -2.64 . . x -1.63 -3.17 -3.05 -2.01 . . -1.63 -2.56 -3.26 -5.00 . . -2.77
K . . . -3.20 . . . -3.17 x . -2.93 -3.12 . -3.28 -3.88 . -3.07 . . . -3.23
L . . . . -3.67 . -2.93 -3.46 . x -3.59 . -2.45 -2.58 -2.12 -3.07 . -3.61 -3.03 . -3.05
M . . . . . . . -3.34 -2.97 -3.31 x . . . -2.35 . -3.72 -4.40 . . -3.35
N . . -3.10 . . . -3.04 -2.90 -2.36 . . x . . . -4.35 -3.05 . . -2.66 -3.07
P -3.72 . . . . . -3.10 . . -3.54 . . x -3.21 -3.04 -3.61 -2.79 . . . -3.29
Q . . . -3.24 . . -3.74 . -2.98 -2.94 . . -2.73 x -4.00 . . . . . -3.27
R . -2.77 . . . -3.24 -2.90 -3.00 -4.94 -2.63 -2.53 . -2.11 -2.89 x -3.17 -3.02 . -3.06 . -3.02
S -3.34 -3.31 . . -3.65 -3.93 . -3.15 . -3.03 . -4.61 -3.25 . -2.87 x -3.41 . -2.93 -3.30 -3.40
T -4.38 . . . . . . -3.55 -3.07 . -3.43 -2.67 -2.88 . -3.18 -3.64 x . . . -3.35
V -3.39 . -2.29 -2.60 -2.45 -3.07 . -4.66 . -3.29 -3.71 . . . . . . x . . -3.18
W . -1.69 . . . -2.79 . . . -2.38 . . . . -3.04 -2.36 . . x . -2.45
Y . -2.64 -2.53 . -3.43 . -3.28 . . . . -3.14 . . . -1.62 . . . x -2.77
Mean -3.48 -2.61 -2.64 -2.99 -2.79 -3.18 -3.21 -3.34 -3.08 -3.07 -3.20 -3.16 -2.66 -3.14 -2.97 -3.08 -3.27 -3.29 -2.61 -2.74 -3.05
Bazykin et al. Nature 429, 558, 2004.
Positive selection at sites of multiple amino acid replacements
since rat-mouse divergence
At ~1% of codons, 2 nonsynonymous substitutions occurred since
rat-mouse divergence
Expected:                       25%  50%  25%
Observed:                       33%  35%  32%
Within conservative segments:   40%  22%  38%
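The expected 25% / 50% / 25% split is what one gets if each of the two substitutions falls independently, and with equal probability, on either of two lineages. That reading of the three columns (both in one lineage, one in each, both in the other) is an assumption, as is the function name below.

```python
def expected_patterns(p_first=0.5):
    """Expected fractions of 2-substitution codons under independence.

    p_first: probability that a given substitution falls on the first lineage
    Returns (both on first, one on each, both on second).
    """
    both_first = p_first ** 2
    one_each = 2 * p_first * (1 - p_first)
    both_second = (1 - p_first) ** 2
    return both_first, one_each, both_second

print(expected_patterns())
```

The observed deficit in the middle class, relative to this expectation, is the clumping of substitutions within one lineage.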
Clumps of non-synonymous substitutions cannot
be explained away by episodes of relaxed negative
selection
[Figure: simulation of switches of negative selection (red: negative selection on; green: negative selection off). Horizontal axis: waiting time for switching negative selection on (0.01 to 10). Vertical axes: fraction of substitutions occurring in different lineages, and fraction of 2-substitution codons. Curves are labelled b = 1, 2, 4, 8, 16, 32.]
Simulation of switches of negative selection. The probabilities of
a codon accepting two substitutions since the rat-mouse
divergence (broken lines) and of pattern 1 at a 2-substitution
codon (solid lines) are presented as functions of the expected
waiting times for switches of negative selection (b - fraction of
time when negative selection is on.)
4 sequences
1) There is more than one independent path (in fact, two)
on the phylogenetic tree.
2) We can detect heterogeneity of evolution rates at individual
sites (variance of variance), which opens new possibilities.
2: match vs. mismatch
3: 000 and 111, and six 2:1 configurations - still, only 2 levels of
similarity
4: 0000 and 1111; eight 3:1 configurations, AND six 2:2
configurations - two levels of heterogeneity
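The counts of configurations listed above can be checked by enumerating all two-state patterns at a site. A short sketch; the function name is illustrative.

```python
from collections import Counter
from itertools import product

def split_counts(n):
    """Count two-state site patterns of n sequences by (majority, minority) size."""
    counts = Counter()
    for pattern in product("01", repeat=n):
        ones = pattern.count("1")
        minority = min(ones, n - ones)
        counts[(n - minority, minority)] += 1
    return dict(counts)

for n in (2, 3, 4):
    print(n, split_counts(n))
```

Only with 4 sequences do two distinct kinds of variable sites (3:1 and 2:2) appear.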
Microheterogeneity of evolution rates, at the level of
individual sites, may be especially interesting for selectively
neutral sequences, where evolution rates equal mutation
rates.
How heterogeneous are mutation rates among sites? What
fraction of this heterogeneity is not due to immediate
contexts?
Is human-chimp divergence elevated at sites of
dog-cow divergence?
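The question can be posed directly on an alignment of four sequences: compare human-chimp divergence at columns where dog and cow differ with that at columns where they agree. The sketch below uses a hypothetical function name and toy sequences; a real analysis would also need to control for context, as noted above.

```python
def conditional_divergence(human, chimp, dog, cow):
    """Human-chimp divergence at dog-cow divergent vs. dog-cow identical sites."""
    at_divergent, at_identical = [], []
    for h, c, d, w in zip(human, chimp, dog, cow):
        (at_divergent if d != w else at_identical).append(h != c)

    def fraction(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return fraction(at_divergent), fraction(at_identical)

# Toy alignment (hypothetical data)
print(conditional_divergence("ACGTACGT", "ATGTACGA", "AGGTTCGC", "ACGTACGT"))
```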
A possibility for a stringent test of classical neutralism.
Kimura's claim: those amino-acid replacements which occur in
evolution are mostly neutral.
We do know that the fraction of sites (in multiple alignments)
where Kn >= Ks is rather low. However, this may be consistent
with neutrality of the occurring replacements. Suppose, for
example, that at a site Ala and Val confer the same fitness, and
other amino acids are forbidden. Then, Kn at such a site would be
only ~0.15Ks, but the replacements which occur (Ala > Val and
Val > Ala) are still neutral.
However, the availability of two independent paths makes it
possible to create a more stringent test of neutrality of the
occurring replacements.
If the amino-acid replacements are neutral, then
at a site where rat and mouse carry different amino acids
(say, Ala and Val), the same difference between dog and human
must occur, with the probability equal to dog-human neutral
divergence of the corresponding nature (transversion) - which is
~0.2.
Preliminary data show that, in fact, this probability is several
times lower (~0.04), which argues against neutrality of a majority
of amino-acid replacements.
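The test described above amounts to a simple count over aligned protein sites. A minimal sketch, with a hypothetical function name and toy sequences: among sites where rat and mouse differ, it reports the fraction at which dog and human differ by the same pair of amino acids.

```python
def same_pair_fraction(rat, mouse, dog, human):
    """Among rat/mouse-different sites, fraction where dog/human show the same pair."""
    informative = 0
    repeated = 0
    for r, m, d, h in zip(rat, mouse, dog, human):
        if r == m:
            continue
        informative += 1
        if d != h and {d, h} == {r, m}:
            repeated += 1
    return repeated / informative if informative else float("nan")

# Toy alignment (hypothetical data): sites 0 and 1 differ between rat and mouse
print(same_pair_fraction("AVLK", "VALK", "AALK", "VALK"))
```

Under neutrality this fraction should approach the corresponding neutral dog-human divergence; a much lower value is what the preliminary data are reported to show.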
5 sequences
Parallel evolution can be studied.
How rapid is divergence of independently evolving
orthologous sequences?
6 sequences
Convergent evolution can be studied.
Many sequences
Properties of individual sites can be estimated.
Thus, possibilities which first appear with 3 and 4 sequences
can be used better.
Bazykin and Kondrashov - "Distribution of nonsynonymous
substitutions in HIV phylogenies" (in progress)
Non-synonymous substitutions are relatively more common
towards the leaves of the tree, indicating that most of them
are under weak negative selection (Golding & Felsenstein,
J. Mol. Evol. 31, 511, 1990).
The role of positive selection, relative to that of random drift,
may be the highest at the most conservative sites - at least,
non-synonymous substitutions at such sites are clumped the
most.
1. Positive selection acting on proteins should often lead to 2
or 3 nonsynonymous replacements in a rapid succession
simply because a new optimal amino acid is often not a
one-substitution neighbor of the old one.
2. Clumping of nonsynonymous replacements is very
prominent at many overall conservative codons (this is not
really evident from the ((rat,mouse)human) comparison).
3. The role of positive selection, relative to that of random drift,
may be the highest at the most conservative codons. If such a
codon evolves, this must be for a reason.
4. It is much more interesting to study evolution at sites which
usually do not evolve - changes at such sites are the most
important ones.
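Point 1 can be checked against the standard genetic code by counting how many pairs of amino acids are connected by a single nucleotide substitution. A sketch with illustrative names; stop codons are ignored.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def one_step_pairs():
    """Unordered amino-acid pairs reachable by one nucleotide substitution."""
    pairs = set()
    for codon, aa in CODE.items():
        for pos in range(3):
            for base in BASES:
                other = CODE[codon[:pos] + base + codon[pos + 1:]]
                if other != aa and "*" not in (aa, other):
                    pairs.add(frozenset((aa, other)))
    return pairs

pairs = one_step_pairs()
total = 20 * 19 // 2
print(len(pairs), "of", total, "pairs are one-substitution neighbors")
```

The count of one-step pairs, taken in both directions, corresponds to the 150 replacements of Table 2; the remaining pairs need at least two substitutions.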
CONCLUSIONS:
3 - directionality, clumping
4 - independent paths, variability of variabilities
5 - parallel evolution
6 - convergent evolution
many - reliable characteristics of individual sites