Detecting selection at the sequence level





Alexey Kondrashov, NCBI, NIH

1. Selection prevents (if negative or balancing) or facilitates

(if positive) allele replacements.









2. DNA sequences are chronicles of past allele replacements,

mostly of single-nucleotide substitutions.









3. Ongoing selection can be revealed by data on within-

population polymorphism.

3 most popular methods of detecting selection



1. Comparison of the actual rate of evolution with the rate of

neutral evolution (which is equal to mutation rate).



Reduced (elevated) rate of evolution indicates negative

(positive) selection. The more sequences are compared, the

better the resolution, in terms of the minimal number of sites

at which selection can still be detected.

Achievements:

a) a majority (80-95%) of nonsynonymous mutations, and

many intergenic mutations (10% in mammals, and perhaps

50% in Drosophila melanogaster), are deleterious enough

to never get fixed

b) a small number of amino-acid sites in some proteins are

under persistent positive selection

Problems:

a) Conservation tells us nothing about the strength of negative

selection, except that s > 1/Ne.

b) It is hard to detect occasional episodes of positive selection

at sites which are usually under negative selection (clade-

specific constraints can help a bit).
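The threshold s > 1/Ne in problem (a) can be illustrated with a short sketch. It assumes the standard diffusion approximation for fixation probabilities, which the slide does not state: a new mutation with scaled selection coefficient S = 4Nes is fixed S / (1 - exp(-S)) times as often as a neutral one. The value of Ne below is hypothetical.

```python
import math

def relative_rate(S):
    """Substitution rate relative to the neutral rate, for scaled selection S = 4*Ne*s.
    Uses the standard diffusion approximation (an assumption, not taken from the slide)."""
    if abs(S) < 1e-12:
        return 1.0
    return S / (1.0 - math.exp(-S))

Ne = 10_000  # hypothetical effective population size
for s in (1e-6, 1e-5, 1e-4, 1e-3):
    S = -4 * Ne * s  # negative sign: selection acts against the mutation
    print(f"s = {s:g}  Ne*s = {Ne * s:g}  relative rate = {relative_rate(S):.3g}")
```

Once Ne*s exceeds about 1, the relative rate is already close to zero, so conserved sites with s slightly above 1/Ne and sites with much larger s look the same in an alignment.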

2. Comparison of fractions of some events among replacements

vs. among polymorphisms (McDonald-Kreitman test)

Rationale: positive selection causes a lot of replacements but, at

each moment, only a small amount of polymorphism

    # of studied replacements       # of studied polymorphisms
    -------------------------   ~   --------------------------
    # of neutral replacements       # of neutral polymorphisms

An excess on the left-hand side indicates positive selection.

Advantage: positive selection can be detected even if the overall

replacement rate is below the neutral one.

Achievement: the fraction of amino-acid replacements in Drosophila

simulans driven by positive selection is 25% ± 20% (Bierne &

Eyre-Walker, Mol. Biol. Evol. 21, 1350, 2004).

Problems: 1) Because of the many rare, deleterious polymorphisms, this

test is far too conservative. 2) One cannot study individual sites.
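A minimal sketch of the comparison, assuming synonymous changes serve as the neutral reference class and using a commonly used estimator, alpha = 1 - (Pn/Ps)/(Dn/Ds), for the fraction of replacements driven by positive selection. The counts are illustrative only, and the function name is hypothetical.

```python
def mk_test(dn, ds, pn, ps):
    """McDonald-Kreitman summary from four counts:
    dn, ds = nonsynonymous and synonymous replacements (fixed differences),
    pn, ps = nonsynonymous and synonymous polymorphisms.
    Synonymous changes are treated as neutral (an assumption)."""
    neutrality_index = (pn / ps) / (dn / ds)
    alpha = 1.0 - neutrality_index  # estimated fraction of adaptive replacements
    return neutrality_index, alpha

ni, alpha = mk_test(dn=7, ds=17, pn=2, ps=42)  # illustrative counts only
print(f"neutrality index = {ni:.3f}, alpha = {alpha:.3f}")

# Many deleterious nonsynonymous polymorphisms inflate pn and push alpha down,
# which is why the test is conservative.
print(mk_test(dn=10, ds=20, pn=30, ps=40))
```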

3. Analysis of polymorphisms

i) An excess of rare alleles reveals negative selection

ii) A high frequency of a young DNA segment (with few SNPs)

indicates an ongoing sweep, driven by positive selection. A local

cluster of frequent, derived SNPs indicates a recently completed

sweep (hitch-hiking).

iii) A local cluster of common SNPs may indicate balancing selection.





Achievements:

1) 10-20% of nonsynonymous polymorphisms are substantially

deleterious.

2) Several ongoing selective sweeps have been described.

Problem: only ongoing or rather recent selection can be studied

through SNPs alone.
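One standard way to quantify an excess of rare alleles is Tajima's D, which the slides do not name; it is used here only as an example of such a statistic. The sketch below computes it from an unfolded site frequency spectrum, and the two toy spectra are hypothetical.

```python
import math

def tajimas_d(sfs, n):
    """Tajima's D from an unfolded site frequency spectrum.
    sfs[i] = number of SNPs whose derived allele is carried by i of n sequences
    (index 0 is unused). Negative D means an excess of rare alleles."""
    S = sum(sfs[1:n])  # number of segregating sites
    # mean number of pairwise differences
    pi = sum(k * 2.0 * i * (n - i) / (n * (n - 1))
             for i, k in enumerate(sfs) if 0 < i < n)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

n = 10
rare = [0, 10, 0, 0, 0, 0, 0, 0, 0, 0]    # ten singletons
common = [0, 0, 0, 0, 0, 10, 0, 0, 0, 0]  # ten SNPs at frequency 5/10
print(tajimas_d(rare, n), tajimas_d(common, n))
```

A negative value is also produced by population expansion, so on its own it does not prove negative selection.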

What can we do, if we align:

1 sequence?

2 sequences?

3 sequences?

4 sequences?

5 sequences?

6 sequences?

Many sequences?

Our ability to detect past (distant or recent) selection

depends critically on the number of orthologous sequences

available for comparison. An extra sequence provides new

possibilities, some of which are underappreciated.

1 sequence



Volatility of a codon is the number of possible nonsynonymous

single-nucleotide substitutions in it.
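The definition can be made concrete with a short sketch that uses the standard genetic code. How to treat substitutions that create a stop codon is not specified on the slide; here they are excluded by default, which is an assumption controlled by the hypothetical `count_stops` flag.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def volatility(codon, count_stops=False):
    """Number of single-nucleotide substitutions in `codon` that change the amino acid.
    Substitutions to a stop codon are counted only if count_stops is True."""
    own = CODE[codon]
    total = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            aa = CODE[codon[:pos] + base + codon[pos + 1:]]
            if aa == "*":
                total += 1 if count_stops else 0
            elif aa != own:
                total += 1
    return total

# Synonymous codons for arginine differ in volatility.
for codon in ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG"):
    print(codon, volatility(codon))
```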



It has been proposed recently (Plotkin, Dushoff & Fraser,

Nature 428, 942, 2004) that volatility is increased, due to the

appropriate choice of synonymous codons, at sites which encode

amino acids under strong positive selection.



The subject is rather controversial (e. g., Stoletzki et al., Mol. Biol.

Evol. 22, 2022, 2005).



I doubt that volatility is a very useful indicator of past positive

selection, except, perhaps, in cases of very strong selection.

2 sequences



From a pairwise alignment, one can reliably detect selection only in

long enough segments of sequences. Thus, alignments of 2 sequences

are useful for detecting negative selection, but generally not suitable for

detecting positive selection since long segments which are consistently

under positive selection are very rare.



An often-overlooked fact: constant selection can both decrease (of

course!) and increase the rate of evolution (Eyre-Walker, 1992) and the

level of polymorphism (McVean & Charlesworth, 1999). This

counterintuitive phenomenon happens if weak, constant selection

favors the more mutable allele(s).
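The effect can be reproduced with a minimal two-allele sketch. It assumes the weak-mutation limit (a site is monomorphic most of the time) and the standard diffusion approximation for fixation rates. It is an illustration, not the model of the cited papers; the allele names, the function names, and the parameter values are hypothetical.

```python
import math

def fix(S):
    """Fixation rate relative to neutral for scaled selection S = 4*Ne*s."""
    return 1.0 if abs(S) < 1e-12 else S / (1.0 - math.exp(-S))

def two_allele_site(u, v, S):
    """Site with a favoured allele G and a disfavoured allele A.
    u = mutation rate A -> G, v = mutation rate G -> A, S = 4*Ne*s in favour of G.
    Returns (equilibrium frequency of G, rate of evolution relative to S = 0)."""
    gain = u * fix(S)    # substitution rate A -> G
    loss = v * fix(-S)   # substitution rate G -> A
    freq_g = gain / (gain + loss)
    rate = 2.0 * gain * loss / (gain + loss)  # = (1 - freq_g)*gain + freq_g*loss
    neutral = 2.0 * u * v / (u + v)
    return freq_g, rate / neutral

print(two_allele_site(u=1.0, v=1.0, S=2.0))   # equal mutability: evolution slows down
print(two_allele_site(u=1.0, v=10.0, S=1.0))  # favoured allele is more mutable: evolution speeds up
```

With equal mutation rates the relative rate is S / sinh(S), which is below 1. When the favoured allele is the more mutable one, selection keeps the site in the mutable state more often, and the relative rate can exceed 1.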

Selection in Favor of Nucleotides G and C Diversifies Evolution

Rates and Levels of Polymorphism at Mammalian Synonymous

Sites



Kondrashov, Ogurtsov & Kondrashov, J. Theor. Biol., in press



The equality of rates of evolution and levels of polymorphism at mammalian

4-fold synonymous sites and at neutral intron sites is just a coincidence,

hiding a complex pattern, caused by weak (4Nes ~ 1-2) selection

favoring nucleotides G and C at synonymous sites.



Let us subdivide all sites into 4 non-overlapping classes:

1) nonCpG-prone - not preceded by C and not followed by G

2) postC - C_nonG - G is mutable at such a site

3) preG - nonC_G - C is mutable at such a site

4) postCpreG - C_G - both C and G are mutable at such a site

This way, the class of a site depends only on its context, and not on

its own state.
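The classification can be sketched as follows. The function names are hypothetical, and the example sequence is made up; only the two flanking nucleotides are inspected, never the site itself.

```python
def site_class(prev_base, next_base):
    """Context class of a site, given the nucleotides immediately before and after it."""
    preceded_by_c = prev_base.upper() == "C"
    followed_by_g = next_base.upper() == "G"
    if preceded_by_c and followed_by_g:
        return "postCpreG"
    if preceded_by_c:
        return "postC"
    if followed_by_g:
        return "preG"
    return "nonCpG"

def classify_sites(seq):
    """Class of every interior site of seq (the two end sites lack a full context)."""
    return [site_class(seq[i - 1], seq[i + 1]) for i in range(1, len(seq) - 1)]

print(classify_sites("ACATGCAGT"))
```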

1. Synonymous sites of all classes are enriched in G and C, relative

to the corresponding intron classes, which are at mutation-drift

equilibrium.



2. CpG-prone sites are much more common among synonymous

sites than among intron sites, due to amino-acid composition of

proteins and the genetic code table.



3. CpG-prone sites (especially, postCpreG) evolve faster and are

more polymorphic than nonCpG-prone sites, due to the elevated

mutability of CpG contexts.



4. NonCpG-prone synonymous sites evolve slower, and postCpreG

synonymous sites evolve faster, than the corresponding intron

sites (the same is true for their levels of polymorphism). BOTH

effects are due to weak selection in favor of G's and C's at

synonymous sites.

5. Synonymous sites in mammals are not neutral, as their overall rate

of evolution apparently suggests. They are under selection with

s ~0.00001.

Intron sites outside transposable elements

              All        nonCpG     postC      preG       postCpreG
Number        52954891   33887011   7978950    8673858    2415072
                         (64%)      (15%)      (16%)      (5%)
Frequencies
  A           0.2827     0.2606     0.3007     0.3104     0.4331
  T           0.3111     0.2883     0.3373     0.3347     0.4588
  G           0.2097     0.2330     0.0423     0.3161     0.0538
  C           0.1965     0.2180     0.3197     0.0387     0.0543
Divergence    0.01064    0.00932    0.01319    0.01178    0.01663
Polymorphism  0.001294   0.001165   0.001573   0.001416   0.001767









4-fold synonymous sites

              All        nonCpG     postC      preG       postCpreG
Number        1949372    682032     654300     293573     319467
                         (35%)      (34%)      (15%)      (16%)
Frequencies
  A           0.2165     0.1478     0.2262     0.2039     0.3550
  T           0.2320     0.1597     0.2399     0.2169     0.3840
  G           0.2425     0.3479     0.0859     0.4751     0.1245
  C           0.3090     0.3446     0.4480     0.1042     0.1365
Divergence    0.01282    0.00831    0.01351    0.01182    0.02195
Polymorphism  0.001441   0.001051   0.001529   0.001251   0.002267
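The contrast between site classes is easier to see as ratios of synonymous to intron values. The short sketch below simply divides the divergence and polymorphism figures from the two tables above, class by class; the variable names are hypothetical.

```python
CLASSES = ["All", "nonCpG", "postC", "preG", "postCpreG"]

# Values copied from the two tables (intron sites vs. 4-fold synonymous sites).
intron_div = [0.01064, 0.00932, 0.01319, 0.01178, 0.01663]
syn_div = [0.01282, 0.00831, 0.01351, 0.01182, 0.02195]
intron_pol = [0.001294, 0.001165, 0.001573, 0.001416, 0.001767]
syn_pol = [0.001441, 0.001051, 0.001529, 0.001251, 0.002267]

# ratios[class] = (divergence ratio, polymorphism ratio), synonymous / intron
ratios = {
    c: (sd / idv, sp / ip)
    for c, sd, idv, sp, ip in zip(CLASSES, syn_div, intron_div, syn_pol, intron_pol)
}

for c in CLASSES:
    div_ratio, pol_ratio = ratios[c]
    print(f"{c:10s} divergence ratio = {div_ratio:.2f}  polymorphism ratio = {pol_ratio:.2f}")
```

Ratios below 1 at nonCpG sites and above 1 at postCpreG sites correspond to the slower and faster evolution of these two classes of synonymous sites.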

[Figure: frequency of G (0 to 0.5), rate of evolution, and level of polymorphism (0 to 15) as functions of S = 4Nes, selection for G and C (-4 to 10), with curves for nonCpG, postC, and postCpreG sites.]

3 sequences



1) We can now polarize replacements, i. e. distinguish the

reciprocal replacements A > B and B > A.



2) We can now assign the replacement to a particular branch

of the phylogenetic tree.
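Both steps can be sketched with a simple parsimony rule: the state shared with the outgroup is taken as ancestral. This rule is an assumption (it ignores multiple hits), and the function name and the example residues are hypothetical.

```python
def polarize(state1, state2, outgroup):
    """Polarize a difference between two sister sequences using an outgroup.
    Returns (lineage, ancestral, derived), or None if the site is uninformative."""
    if state1 == state2:
        return None  # no difference to polarize
    if outgroup == state1:
        return ("lineage 2", state1, state2)  # change on the branch to sequence 2
    if outgroup == state2:
        return ("lineage 1", state2, state1)  # change on the branch to sequence 1
    return None  # all three states differ: direction cannot be inferred

# Example: rat carries Val, mouse carries Ala, human (outgroup) carries Ala.
print(polarize("V", "A", "A"))
```

Here the result assigns the replacement Ala > Val to the first lineage (rat in this example).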

A universal trend of amino acid gain and loss in protein

evolution (Jordan et al., Nature 433, 633, 2005).

Distribution of the strength of selection against

amino acid replacements in human proteins



Yampolsky et al., Hum. Mol. Genet. 14, 3191, 2005.





We combined data on function-killing mutations, human

SNPs, and human-chimp substitutions - all polarized.



Together, these data ascertain the distribution of coefficients

of selection against amino-acid replacements as 4-bin

histograms.

[Figure: rate of evolution and level of variability (0 to 1) as functions of log s (-6 to -2), for Ne = 5x10^4 and Ne = 10^4.]

[Figure: histogram over log s (-6 to 0); vertical axis from 0 to 0.3.]

[Figure: fraction of forbidden substitutions (0 to 1) as a function of the difference in polarity (-8 to 8).]

Table 2. M[log s], the mean values of the logarithms of selection coefficients associated

with 150 amino acid replacements.





Rows: source amino acid. Columns: destination amino acid.

      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y      Mean
A     x      .      -2.20  -2.63  .      -3.86  .      .      .      .      .      .      -2.90  .      .      -3.41  -3.86  -3.64  .      .      -3.21
C     .      x      .      .      -0.88  -2.37  .      .      .      .      .      .      .      .      -2.55  -2.74  .      .      -2.38  -1.90  -2.14
D     -3.18  .      x      -3.35  .      -3.06  -3.43  .      .      .      .      -3.43  .      .      .      .      .      -2.17  .      -2.47  -3.01
E     -3.01  .      -3.41  x      .      -3.11  .      .      -3.63  .      .      .      .      -3.65  .      .      .      -2.42  .      .      -3.20
F     .      -2.23  .      .      x      .      .      -2.87  .      -3.23  .      .      .      .      .      -2.88  .      -2.79  .      -3.10  -2.85
G     -3.34  -3.00  -2.65  -2.91  .      x      .      .      .      .      .      .      .      .      -3.22  -3.57  .      -2.32  -1.65  .      -2.83
H     .      .      -2.27  .      .      .      x      .      .      -3.18  .      -3.14  -2.33  -3.24  -3.79  .      .      .      .      -3.04  -3.00
I     .      .      .      .      -2.64  .      .      x      -1.63  -3.17  -3.05  -2.01  .      .      -1.63  -2.56  -3.26  -5.00  .      .      -2.77
K     .      .      .      -3.20  .      .      .      -3.17  x      .      -2.93  -3.12  .      -3.28  -3.88  .      -3.07  .      .      .      -3.23
L     .      .      .      .      -3.67  .      -2.93  -3.46  .      x      -3.59  .      -2.45  -2.58  -2.12  -3.07  .      -3.61  -3.03  .      -3.05
M     .      .      .      .      .      .      .      -3.34  -2.97  -3.31  x      .      .      .      -2.35  .      -3.72  -4.40  .      .      -3.35
N     .      .      -3.10  .      .      .      -3.04  -2.90  -2.36  .      .      x      .      .      .      -4.35  -3.05  .      .      -2.66  -3.07
P     -3.72  .      .      .      .      .      -3.10  .      .      -3.54  .      .      x      -3.21  -3.04  -3.61  -2.79  .      .      .      -3.29
Q     .      .      .      -3.24  .      .      -3.74  .      -2.98  -2.94  .      .      -2.73  x      -4.00  .      .      .      .      .      -3.27
R     .      -2.77  .      .      .      -3.24  -2.90  -3.00  -4.94  -2.63  -2.53  .      -2.11  -2.89  x      -3.17  -3.02  .      -3.06  .      -3.02
S     -3.34  -3.31  .      .      -3.65  -3.93  .      -3.15  .      -3.03  .      -4.61  -3.25  .      -2.87  x      -3.41  .      -2.93  -3.30  -3.40
T     -4.38  .      .      .      .      .      .      -3.55  -3.07  .      -3.43  -2.67  -2.88  .      -3.18  -3.64  x      .      .      .      -3.35
V     -3.39  .      -2.29  -2.60  -2.45  -3.07  .      -4.66  .      -3.29  -3.71  .      .      .      .      .      .      x      .      .      -3.18
W     .      -1.69  .      .      .      -2.79  .      .      .      -2.38  .      .      .      .      -3.04  -2.36  .      .      x      .      -2.45
Y     .      -2.64  -2.53  .      -3.43  .      -3.28  .      .      .      .      -3.14  .      .      .      -1.62  .      .      .      x      -2.77
Mean  -3.48  -2.61  -2.64  -2.99  -2.79  -3.18  -3.21  -3.34  -3.08  -3.07  -3.20  -3.16  -2.66  -3.14  -2.97  -3.08  -3.27  -3.29  -2.61  -2.74  -3.05

Bazykin et al. Nature 429, 558, 2004.



Positive selection at sites of multiple amino acid replacements

since rat-mouse divergence



At ~1% of codons, 2 nonsynonymous substitutions occurred since

rat-mouse divergence









Expected:                      25%  50%  25%
Observed:                      33%  35%  32%
Within conservative segments:  40%  22%  38%
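The expected 25% / 50% / 25% split follows from a binomial argument. The sketch below assumes that the three columns correspond to "both substitutions in one lineage", "one substitution in each lineage", and "both in the other lineage", that the two substitutions are independent, and that each falls on either lineage with equal probability. These readings of the lost diagram are assumptions, and the function name is hypothetical.

```python
from math import comb

def two_substitution_patterns(p=0.5):
    """Probabilities that 0, 1 or 2 of two independent substitutions fall on lineage 2,
    where p is the probability that a given substitution falls on lineage 1."""
    return [comb(2, k) * (1 - p) ** k * p ** (2 - k) for k in range(3)]

expected = two_substitution_patterns()
observed = [0.33, 0.35, 0.32]      # from the slide
conservative = [0.40, 0.22, 0.38]  # from the slide, within conservative segments

for name, row in (("expected", expected), ("observed", observed),
                  ("conservative", conservative)):
    same_lineage = row[0] + row[2]
    print(f"{name:13s} both substitutions in the same lineage: {same_lineage:.2f}")
```

Under these assumptions, the fraction of codons with both substitutions in the same lineage rises from the expected 0.50 to 0.65 overall and to 0.78 within conservative segments.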

Clumps of non-synonymous substitutions cannot

be explained away by episodes of relaxed negative

selection









Red - negative selection on

Green - negative selection off

[Figure: fraction of 2-substitution codons occurring in different lineages, and fraction of substitutions, plotted against the waiting time for switching negative selection on (0.01 to 10); curves are labeled b=1 and 2, 4, 8, 16, 32.]









Simulation of switches of negative selection. The probabilities of

a codon accepting two substitutions since the rat-mouse

divergence (broken lines) and of pattern 1 at a 2-substitution

codon (solid lines) are presented as functions of the expected

waiting times for switches of negative selection (b - fraction of

time when negative selection is on.)

4 sequences



1) There is more than one independent path (namely, two)

on the phylogenetic tree.



2) We can detect heterogeneity of evolution rates at individual

sites (variance of variance), which opens new possibilities.



2: match vs. mismatch

3: 000 and 111, and six 2:1 configurations - still, only 2 levels of

similarity

4: 0000 and 1111; eight 3:1 configurations, AND six 2:2

configurations - two levels of heterogeneity
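The counts above can be checked by enumerating all two-state configurations of n aligned sequences and grouping them by their split. This is a small sketch with a hypothetical function name.

```python
from collections import Counter
from itertools import product

def split_counts(n):
    """Count two-state configurations of n aligned sequences by split (majority, minority)."""
    counts = Counter()
    for config in product("01", repeat=n):
        ones = config.count("1")
        counts[(max(ones, n - ones), min(ones, n - ones))] += 1
    return dict(counts)

for n in (2, 3, 4):
    print(n, split_counts(n))
```

With 4 sequences, the 3:1 and 2:2 splits are two distinct kinds of variable site, which is what makes heterogeneity of rates visible at a single site.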

Microheterogeneity of evolution rates, at the level of

individual sites, may be especially interesting for selectively

neutral sequences, where evolution rates equal mutation

rates.



How heterogeneous are mutation rates among sites? What

fraction of this heterogeneity is not due to immediate

contexts?









Is human-chimp divergence elevated at sites of

dog-cow divergence?

A possibility for a stringent test of classical neutralism.



Kimura's claim: those amino-acid replacements which occur in

evolution are mostly neutral.



We do know that the fraction of sites (in multiple alignments)

where Kn >= Ks is rather low. However, this may be consistent

with neutrality of the occurring replacements. Suppose, for

example, that at a site Ala and Val confer the same fitness, and

other amino acids are forbidden. Then, Kn at such a site would be

only ~0.15Ks, but the replacements which occur (Ala > Val and

Val > Ala) are still neutral.



However, the availability of two independent paths makes it

possible to create a more stringent test of neutrality of the

occurring replacements.

If the amino-acid replacements are neutral, then

at a site where rat and mouse carry different amino acids

(say, Ala and Val), the same difference between dog and human

must occur, with the probability equal to dog-human neutral

divergence of the corresponding nature (transversion) - which is

~0.2.



Preliminary data show that, in fact, this probability is several

times lower (~0.04), which argues against neutrality of a majority

of amino-acid replacements.
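The counting behind this test might look like the sketch below. It is one possible reading of the procedure, not the author's implementation: among aligned sites where rat and mouse differ, it reports the fraction at which dog and human carry the same two amino acids. The function name and the toy alignment columns are hypothetical.

```python
def shared_difference_fraction(columns):
    """columns: iterable of (rat, mouse, dog, human) amino acids at aligned sites.
    Among sites where rat and mouse differ, return the fraction at which dog and
    human carry the same two amino acids (in either order)."""
    informative = shared = 0
    for rat, mouse, dog, human in columns:
        if rat == mouse:
            continue  # only sites with a rat-mouse difference are informative
        informative += 1
        if {dog, human} == {rat, mouse}:
            shared += 1
    return shared / informative if informative else float("nan")

toy_columns = [  # made-up alignment columns, for illustration only
    ("A", "V", "A", "V"),
    ("A", "V", "A", "A"),
    ("L", "L", "L", "I"),
    ("S", "T", "T", "T"),
    ("K", "R", "R", "K"),
    ("D", "E", "D", "D"),
]
print(shared_difference_fraction(toy_columns))
```

On real data this fraction would be compared with the neutral dog-human divergence of the corresponding kind.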

5 sequences



Parallel evolution can be studied.



How rapid is divergence of independently evolving

orthologous sequences?

6 sequences



Convergent evolution can be studied.

Many sequences



Properties of individual sites can be estimated.



Thus, possibilities which first appear with 3 and 4 sequences

can be used better.







Bazykin and Kondrashov - "Distribution of nonsynonymous

substitutions in HIV phylogenies" (in progress)

Non-synonymous substitutions are relatively more common

towards the leaves of the tree, indicating that most of them

are under weak negative selection (Golding & Felsenstein,

J. Mol. Evol. 31, 511, 1990).

The role of positive selection, relative to that of random drift,

may be the highest at the most conservative sites - at least,

non-synonymous substitutions at such sites are clumped the

most.

1. Positive selection acting on proteins should often lead to 2

or 3 nonsynonymous replacements in a rapid succession

simply because a new optimal amino acid is often not a

one-substitution neighbor of the old one.



2. Clumping of nonsynonymous replacements is very

prominent at many overall conservative codons (this is not

really evident from ((rat,mouse)human) comparison).



3. The role of positive selection, relative to that of random drift,

may be the highest at the most conservative codons. If such a

codon evolves, this must be for a reason.



4. It is much more interesting to study evolution at sites which

usually do not evolve - changes at such sites are the most

important ones.
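Point 1 can be checked directly against the standard genetic code. The sketch below counts the ordered pairs of distinct amino acids that are connected by a single nucleotide substitution, ignoring stop codons; the function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def one_step_pairs():
    """Ordered pairs (source, destination) of distinct amino acids that are
    connected by one nucleotide substitution in at least one codon."""
    pairs = set()
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                other = CODE[codon[:pos] + base + codon[pos + 1:]]
                if other not in ("*", aa):
                    pairs.add((aa, other))
    return pairs

pairs = one_step_pairs()
total = 20 * 19  # all ordered pairs of distinct amino acids
print(len(pairs), "of", total, "ordered pairs are one-substitution neighbors")
print(f"fraction that needs 2 or 3 substitutions: {1 - len(pairs) / total:.2f}")
```

Only 150 of the 380 ordered pairs are one-substitution neighbors, so about 60% of possible amino-acid changes require 2 or 3 nucleotide substitutions.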

CONCLUSIONS:



3 - directionality, clumping



4 - independent paths, variability of variabilities



5 - parallel evolution



6 - convergent evolution



many - reliable characteristics of individual sites


