Population Genetics and the Coalescence

Document Sample
Population Genetics and the Coalescence Powered By Docstoc
					Population Genetics and the Coalescence
Claus Vogl June 16, 2009

Institut f¨r Tierzucht und Genetik, Veterin¨rmedizinische Universit¨t Wien, Veterin¨rplatz 1, u a a a A-1210 Vienna, Austria E-mail: claus.vogl@vu-wien.ac.at Preface: This manuscript was written for an “Universit¨tskurs Biometrie” in the spring a semester of 2009. Please indicate mistakes to me (e-mail with subject containing: “mistake”! Thanks!!!)




Population genetics deals with allelic variation in natural or artificial populations. Among population genetic forces, new variation is generated by mutations; drift (stochastic loss of alleles) generally reduces varation; selection may increase or decrease variation. Fisher (1930, and earlier) and explicitly Wright (1931) developed independently a simple population genetic model with discrete (non-overlapping) generations. Moran (1958) developed a continuous time model with overlapping generations. For the Moran model, exact results are sometimes available that hold approximately (actually very well for only moderately large population sizes) also for the Wright-Fisher model. Both models go forward in time, i.e., from one generation to the next in the case of the Wright-Fisher model, and from one birth/death event in the Moran model to the next. In this course, we will mainly use the Moran model.


The Moran model

The Moran model is supposed to model continuous time. But in its usual form, it actually moves from one step to the next. (Below, we will suggest a truly continuous model that differs also in the mutation aspect from the Moran model.) With the Moran model, in the haploid pure-drift case, at any timestep, i.e.,, from t to t + 1, a population of N haploid individuals is assumed. An individual is picked randomly to reproduce. Suppose the ith individual reproduces. Then another individual j is chosen to die and is replaced by the ith individual. (We will call this a birth/death event.) The process repeats indefinitely. The lifespan of an individal is geometrically distributed with a mean of N . It may thus be useful to rescale time in units of N , i.e., to set τ = t/N , because this is the average lifetime of an individual. This process of drift, will lead to a stochastic loss of variation. Since no new mutations replenish the variation, a monomorphic population will eventually result. With a biallelic model, a transition from one step to the next involves just three possibilities, either the number of the focal allele x increases by one, it remains the same, or it decreases by one: State x+1 x x−1 Probability with Drift only 1/N 2 x(N − x) 1 − 2/N 2 x(N − x) 1/N 2 x(N − x)

Denote the current state with p0 = x/N and consider the heterozygosity 2pt (1 − pt ), i.e.,, the probability of obtaining two different alleles in a sample of two at generation t. The expectation of p1 is E(p1 ) = p0 . The expectation of p2 is E(p2 ) = p2 + 2/N 2 p0 (1 − p0 ) (calculated directly 0 1 1 from Table 2). Hence the expected heterozygosity is: E(2p1 (1 − p1)) = 2(E(p1 ) − E(p2 )) 1 = 2(p0 − p2 − 2/N 2 po (1 − p0 ) 0 = 2p0 (1 − p0 )(1 − 2/N ) . For t steps, this equation needs to be iterated, such that we obtain: E(2pt (1 − pt )) = 2p0 (1 − p0 )(1 − 2/N 2 )t ≈ 2p0 (1 − p0 )e−t/N = 2p0 (1 − p0 )e 2



(2) .

−τ /N

Thus heterozygosity decreases at the rate of 1/N 2 per step or 1/N per “generation”.


The Moran Model with Mutation

With the Moran model, mutation is assumed to occur at every birth/death event, i.e.,, the replacing gamete is assumed to mutate at a rate µ. Many different mutation models have been used; below we will use the K site model. Each site is assumed to toggle between two states (0 and 1). A little R-program for the K-site model: # # # # # # coal_1.R Moran model simulation K-site model Ne haploid individuals

Ne=100 #number of individuals K=10 #number of sites inds=matrix(floor(runif(Ne*K,min=0,max=2)),nrow=Ne,ncol=K) maxTime=Ne*Ne*10 #maximum time mu=0.001 #mutation rate first=ceiling(runif(maxTime,min=0,max=Ne)) second=ceiling(runif(maxTime,min=0,max=Ne)) for(t in 1:maxTime){ #birth/death step inds[second[t],]=inds[first[t],] #mutate mutvec= runif(K,min=0,max=1)<mu for(k in 1:K){ if(mutvec[k]==TRUE){ if(inds[second[t],k]==0){ inds[second[t],k]=1 } else{ inds[second[t],k]=0 } } } }


Truly Continuous Time (TCT) Model

Instead of tying the mutation to birth/death events (as in the Moran model), we can independently choose an individual to mutate at a rate µ. It then makes sense to assume an exponentially distributed lifespan of an individual, such that the model becomes truly continuous.


Remark: We note that the exponential distribution has the following property: the waiting time to the first event of K independent exponential distributions with rates β1 , . . . , βK is again distributed exponentially with rate β = βk . The particular event can then be chosen with probability βk /β. Hence, sampling from independent exponential distributions and then choosing the fastest is equivalent to first sample from a single exponential and then choosing among the events proportianal to their rates. The total rate of events, either birth/death or mutation is then N (1 + µ), because each individual may either die or mutate with rate 1 and µ respectively. Then we pick one individual randomly and let it mutate with probability µ/(1 + µ), and die otherwise. As with the Moran model, we can thus proceed from step to step and add the time only subsequently, assuming exponentially distributed times. Hence, for time independence of model parameters (N and µ), the model again proceeds from step to step. 2.2.1 Equilibrium Distribution

If we set K = 1, i.e.,, model just a single biallelic locus, we can obtain some approximate analytical results with the Moran model. We will, however, obtain exact results with the TCT model and leave it to the reader to show that the results hold approximately for the Moran model. Assume a mutation rate of µ1 from allele 2 to allele 1 and µ2 for the reverse direction and set θ1 = N µ1 and θ2 = N µ2 and θ = θ1 + θ2 . Consider a table as in the table in Section 2; let the first column represent the effect of only drift, the second column of only mutation, and the third column of both : State x+1 x x−1 Drift only 1/N 2 x(N − x) 1 − 2/N 2 x(N − x) 1/N 2 x(N − x) Mutation only 1/N 2 θ1 (N − x) 1 − 1/N 2 (θ1 (N − x) + θ2 x) 1/N 2 θ2 x both Drift and Mut. 1/N 2 (x(N − x) + θ1 (N − x)) 1 − 1/N 2 (2x(N − x) + θ1 (N − x) + θ2 x) 1/N 2 (x(N − x) + θ2 x) ,

(since the birth/death and mutation events are mutually exclusive, their probabilities can be added). The equilibrium distribution is a beta-binomial distribution with α = θ1 and β = θ2 : Pr(x|θ, N ) = N! Γ(θ) Γ(x + θ1 )Γ(N − x − θ2 ) . x!(N − x)! Γ(θ1 )Γ(θ2 ) Γ(N + θ) (3)

It is actually easier to show that under these assumptions detailed balance holds, i.e.,, that the flow from x to x + 1 is the same as that from x + 1 to x. The transition probability from t to t + 1 is Pr(x + 1 | x, θ, N ) = 1/N 2 (N − x)(x + θ1 ); that from t + 1 to t is Pr(x | x + 1, θ, N ) = 1/N 2 (x + 1)(N − x − 1 + θ2 ) (compare to Table 2.2.1), for any x < N . Under detailed balance, we thus have: Pr(x + 1 | x, θ, N ) Pr(x | θ, N ) = Pr(x | x + 1, θ, N ) Pr(x + 1 | θ, N ) Γ(x + 1 + θ1 )Γ(N − x − 1 + theta2 ) Γ(x + θ1 )Γ(N − x + θ2 ) = (x + 1)(N − x − 1 + θ2 ) (N − x)(x + θ1 ) x!(N − x)! (x + 1)!(N − x − 1)! Γ(x + 1 + θ1 )Γ(N − x + θ2 ) Γ(x + 1 + θ1 )Γ(N − x + θ2 ) = . x!(N − x − 1)! x!(N − x − 1)! (4) Note that detailed balance is stronger than the existence of an equilibrium distribution, i.e.,, if detailed balance holds, an equilibrium distribution exists, but if an equilibrium distribution 4

exists, detailed balance may not hold (the flow in equilibrium may go in a circle from A to B to C and back to A). It follows therefore that the equilibrium distribution is a beta-binomial distribution with α = θ1 and β = θ2 .


Extension to parent independent mutation model

Assume K alleles and parent-independent mutation: from any allele k, all K alleles are reached with a mutation rate of µ1 , . . . , µK . The mutation rate from each allele is µ = µk . For ease of notation set θk = N µk and θ = θk . Generalize the equilibrium distribution from two to K alleles and show that detailed balance holds also for this model.


The Coalescence under Neutrality (Introduction)

Generally our emphasis will be on the analysis of real data. This was simplified by the introduction of the coalescence (Kingman, 1982; Hudson, 1983b,a; Tajima, 1983), and the subsequent theoretical work. In the coalescence, we are given a sample from extant time and look backward to the ancestors of this sample.


No mutation, TCT and Moran models

Let us consider the TCT (or Moran) model without mutation and consider the extant sample (of size N ). Looking backward in time, at each event, a single individual is chosen (with probability 1/N ) to die and replaced by another. The number of ancestors in the previous event is thus reduced by one with probability (N − 1)/N ; only if the individual is chosen to replace itself (with probability 1/N the number of ancestors stays the same. We will term the reduction in numbers of ancestors by drift coalescence. Let us generally consider a sample of size n, with 2 ≤ n ≤ N ; the number of ancestors of this sample in the previous event is reduced by one if any of the n is chosen and if the individual with which it unites is also from the sample, but not the individual itself. Thus the probability of a reduction of ancestors per event in the TCT and Moran models is n(n − 1)/N 2 . With the Moran model this makes for a geometric distribution of coalescence probability with rate n(n − 1)/N 2 . With the TCT model, we have an exponential distribution with the same rate. With both models, we could rescale the time.


No mutation, Wright-Fisher model

We need to assume that n is much smaller than N . Then only the probability of two individuals uniting in the previous generation is appreciably, whereas the probabilities of three individuals or two pairs of individuals uniting is negligible. The probability of two individuals coalescing in the previous generation is equal to the number of possible pairs in the sample n(n − 1)/2 times the probability of the two uniting in the previous generation 1/N . Remark: With the Wright-Fisher model, the rate of coalescence per generation is higher by a factor of two in the Moran model. This factor of two corresponds to that in the reduction of heterozygosity, if generation times are scaled appropriately. The difference is explained by the fact that only one individual is chosen to die in the Moran model but two individuals need to unite in the Wright-Fisher model.



The coalescence with mutation

Irrespective of the exact mutation model and the genotype of the individuals , we can model a coalescence with mutation at constant rate. Suppose, as before, a sample of size n with scaled mutation rate θ = N µ. Then, looking backward, we choose with proportion n(n − 1) a coalescence (we thus assume the Moran or TCT model; otherwise, we would divide by 2) and with proportion nθ/2 a mutation. We simulate backward until we have reached a sample size of 1. (Mutations or coalescences occurring before this time do not influence the variability in the sample.) Thus, for each n, the number of mutations occuring while n haplotypes (or individuals) are present is geometrically distributed (with p = (n − 1)/(n − 1 + θ)). The total number of mutations until a sample size of n = 1 is reached is therefore a sum of n − 1 geometrically distributed variables.


Reversal of the Coalescence

As long as the parameters N (and later θ) are time-independent, we can reverse the process of the coalescence. We then start with two individuals, i.e.,, i = 1 and work “upwards”, such that we run a geometric distribution for mutations with p = i/(i + θ), then add another individual, and iterate until we have reached n. The advantage is that we can now simulate conditional on the ancestral state Ethier and Griffiths (compare 1987); Donnelly and Kurtz (compare 1996). Simulation of a biallelic coalescence The following little R-program simulates a biallelic sample of size N with mutation rate θ in both directions. # # # # # # coal_2.R coalescence model simulation biallelic model N sample size

theta=1 #scaled mutation rate N=20 inds=rep(-1,N+1) #initialize to impossible value! size= N+1 and discard last inds[1]=inds[2]=0 #set first two inds to possible value = zero or one for(s in 2:N){ #get number of mutations from the geometric distribution nmut=rgeom(1,s/(s+theta)) samples=sample(size=(nmut+1),x=seq(1,s),replace=TRUE) if(nmut>0){#to protect against counting down in for loop for(m in 1:nmut){#R is too smart if(inds[samples[m]]==0){ inds[samples[m]]=1 } else{ inds[samples[m]]=0 } } } 6

inds[s+1]=inds[samples[nmut+1]] #add individual } inds[1:N] #sample size only N


Watterson’s θ

Suppose we could detect every mutation that segregates in the sample (i.e., that is polymorphic). The mean and the variance of this number could then be calculated, since the expectation of the geometric distribution is (1 − p)/p, the expected potential number of segregating mutations in the sample m is in total:
1 1

E(m | n, θ) =

θ/i = θ



and the variance:
1 1 1

Var(m | n, θ) =

θ(i + θ)/i2 = θ

1/i + θ 2



We could then estimate θ using θw = m/ 1 i=n−1 1/i. This estimator of θ was introduced by Watterson (1975) in the context of DNA sequence polymorphism. We will discuss this in depth later. With the TCT model, we obtain the above results exactly, irrespective of N (as long as n ≤ N , which is trivial), with the Moran and Wright-Fisher model in the limit of N → ∞ and N µ → θ (up to a factor of two or four).


The Coalescence with Parent Independent Mutation

In the following subsections we will introduce different mutation models. Let us now consider parent independent mutation where the scaled mutation rate to each of K alleles is θk and that from each allele is θ = θk . (This model is also called the K-allele model.) With parent independent mutation, we gain no information on the parental type from the state of the offspring. Thus, not only with a coalescence do we lose an individual, but also with a mutation. Hence, the coalescence with mutations finishes in n − 1 steps. Instead of a geometric distribution of mutations at each n, now there is just a Bernoulli. With probability θ/(i + θ) a mutation occurs; otherwise a coalescence. The mean number of mutations in the sample is then

E(m | n, θ) =

θ/(i + θ)


(or this value plus one, if we count the original type as mutation too). For n → ∞, one can show that E(m | n, θ) ≈ θ log n) (where we counted the plus one). The variance is:
1 1

Var(m | n, θ) =

θ · i/(i + θ)2 = θ

i/(i + θ)2 ;


for large n, Var(m | n, θ) ≈ θ log n as i/(i + θ) then goes to 1. 7


The Coalescence with Infinitely Many Alleles

The coalescence with the infinite allele model (IAM) follows from the parent independent mutation model with θ1 = · · · = θK as K goes to infinity. Then every new mutation can be recognized, but gives no information about its parent. Biologically, it can be motivated by the fact that with moderate mutation rates nearly any mutation will lead to a new mutation as, on average, just 0.001 or less sites are polymorphic in most equkaryotes and mutation rates are multiple hits by mutation are thus unlikely. If this situation is modeled with the IAM, information on parents is lost, i.e.,, information is wasted. With the IAM, it can be shown that the number of different alleles in the sample m is a sufficient statistic of θ. The distribution of m is: Pr(m | θ, n) = S(n, m)θ m /(n + θ)(n − 1 + θ) · · · (θ) (9)

where the constant S(n, m) is the absolute value of the Stirling number of the first kind. Conditional on m, the distribution of individuals in the m classes is (no derivation): Pr(n1 , . . . , nK | m) = 1/S(n, k) n! K!n1 · · · nk (10)

where the alleles are numbered arbitrarily from 1 to m and nk is the number of individuals of allele k (Tavar´ and Ewens, 1997). This distributiin is independent from θ from which it follows e that K is a sufficient statistic. The mean and variance of K are given in equations 7 and 8. This implies a rather slow convergence to the true value for the estimator K/ log n for θ with increasing n!


Limit of High Recombination Rates; One Population, Constant θ

In the limit of high recombination rates (i.e., assuming linkage equilibrium), coalescence of neighboring sites is independent. We can then consider each polymorphic site separately and obtain the distribution for the whole data sample by multiplication over sites. Furthermore, since per basepair mutation rates are low (about 10−9 ) and the regions considered are short, the relevant parameter region is the limit of small θ. If the genomic DNA of a sample of many individuals of a population is sequenced using deep sequencing, the information on linkage is almost completely lost, such that assuming complete linkage equilibrium makes very efficient use of the data. In the following, we will only consider the case where the ancestral state is known, i.e., where a suitable outgroup is available, and we will assume that θ is constant over time. This assumption can be dropped later. Consider a sample of size n from one locus; the data are then completely characterized by the number of derived sites (or haplotypes) y, which must be between 0 and n. The case of n derived sites is not only influenced by the population in focus, but also by the timespan to the outgroup. Hence we will condition on 1 ≤ y < n. The n − 1 coalescences provide slots for mutations to occur. Introduce the variable s that indicates the slot in which the mutation occurs, e.g., s = 0 for no mutation, s = n − 1, for a mutation before the first coalescence, and so on until s = 1 is reached. (The order is reversed since we are looking backwards in time). Given θ, the probability of the mutation to occur in the sth slot is: θ Pr(s | θ) = s+θ
n−1 i=1

i . i+θ



In the limit of small θ, the probability of m mutations is approximately proportional to:
n−1 m

Pr(m | θ) ∝




In the limit of small θ, the probability of two or more mutations is practically zero and the n−1 probability of a single locus (or SNP) to be monomorphic is thus 1/(1 + θ i=1 1/i) and that n−1 n−1 of it being polymorphic is (θ i=1 1/i)/(1 + θ i=1 1/i). Given m = 1 and in the limit of small θ, we have for 1 ≤ s ≤ n − 1: Pr(s | m = 1) ∝ 1 . s (13)

We see that the distribution of s conditional on m = 1 is independent of θ as long as θ is small. Thus, no information can be gained from knowing when a mutation has occurred; the only information is whether or not one has occurred. Distribution of y in the limit of small θ. The allele distribution y is independent from θ conditional on m = 1 and s. Therefore y also has to be independent from θ for small θ. For inference of θ in the limit of small θ under neutrality the distribution of y is therefore noninformative. In this paragraph, we nevertheless give this distribution. The probability to obtain the spectrum y given s is given by the Binomial-Beta compound distribution (or Polyaurn distribution). Adjusted to our variables, this is: Pr(y | n, s) = (y − 1)!(n − 2 − (y − 1))! (n − 1 − s) s! . (s − 1)!0! (n − 1)! y−1 (14)

Combining formulas (13) and (14), we get:

Pr(y | θ) =

Pr(y | s) Pr(s | θ) ∝ 1/y .


This result and the independence of the distribution of the allelic spectrum y from θ, in the limit of small θ, has been obtained numerous times before (e.g., Sawyer and Hartl, 1992). Let us now proceed with the inference of θ from data in the limit of small θ. Assume we have altogether L loci. The distribution of polymorphic loci will then follow a binomial distribution with p polymorphic and L − p non-polymorphic sites. (We note again that p is a sufficient statistic for small θ; nothing can be gained from the allelic spectra of the loci). n−1 By our assumptions, θ( i=1 1/i) ≪ 1 such that the expectation of p is much less than L − p n−1 and we can approximate the binomial by a Poisson distribution with mean Lθ( i=1 1/i)/(1 + n−1 n−1 θ( i=1 1/i)) ≈ Lθ( i=1 1/i), i.e.,:

Pr(p | θ, L, N ) ∝ θ p exp(−θ(

1/i) L) .


n−1 The ML estimate of θ is p/(L · i=1 1/i). This is the same result as that given by Sawyer and Hartl (1992). Obviously, the conjugate prior for θ is a scaled gamma distribution with some α and β

Pr(θ | α, β) ∝ θ α−1 exp(−θ β) L) , 9


such that the posterior is again a scaled gamma distribution:

Pr(θ | p, L, n, α, β) ∝ θ


exp(−θ(β + (

1/i) L)) .


The argument can be generalized to loci with variable n(l), where l indexes loci (1 ≤ l ≤ L); the posterior distribution of θ will remain a scaled gamma:
L n(l)−1

Pr(θ | p, n(l), α, β) ∝ θ


exp(−θ(β + (
l=1 i=1

1/i))) .



Coalescence with the K-sites model

Modify the program for simulating from the coalescence to a K-sites mutation model. 3.9.2 The Polya-urn distribution

Prove formula 14 using induction on n. 3.9.3 Continuous Time Sampling

Extend the discrete coalescence sampling in section 3.4 to continuous time. The continuous version has been introduced and used by Stephens and Donnelly (2000).


The Infinite Sites Models without recombination
The K-SitesModel

In this subsection, we will study the K-sites and the extension to infinite sites models in more detail. These are obviously models for DNA sequence evolution, more specifically for single basepair polymorphisms (SNPs). Generally, mutation is assumed to be independent in each of the K sites and toggles between zero and one with a rate of µ per site. With K sites, we thus have 2K possible haplotypes. One can number them in different ways. Obviously, a mutation enables transitions only between haplotypes that are different at exactly one base, such that mutationonal transitions between many haplotypes require intermediates. Hence, it can be shown immediately that detailed balance cannot hold: consider a transition of a certain haplotype i between xi = 0 and xi = 1 that is offset by a change of in abundance of haplotype j from xj + 1 to xj and suppose that these two haplotypes are separated by more than one mutation. For detailed balance, we require that the flow between these states is equal. Yet drift may get rid of the one xi but cannot restore it. Hence the flow cannot be balanced. Nevertheless, as we already showed in our previous examples, sampling is easily possible with the K-sites model with the Moran and the CTC forward model. It is also easy with the coalescence model, both backward in time (i.e., a “real” coalescence) or forward in time, i.e., some form of urn model. With real datasets, we often have an “outgroup” available, i.e., a sequence from a closely related species that allows us to polarize our polymorphisms in ancestral and derived.



The Infinite SitesModel

The infinite sites (IS) model is by far the most important mutation model. In the previous section, we already dealt with a special case, the high recombination rate limit (which corresponds to the low mutation rate limit). In this section, we will devote our attention to the other limit of no or negligible recombination rate. This is not motivated by genetics , but by convenience. Effective recombination rates are of the same order or higher than effective mutation rates. But intermediate recombination rates (same order of magnitude as mutation rates) are tough to handle. Usually, the per-site scaled mutation rate θ = N µ (up to a factor of two or four) is on the order of 10−2 to 10−6 and sequences are quite long (500 bps or longer). Thus, the per site mutation rate is so low that multiple hits of the same site in the sample are virtually impossible and we can detect every mutation as polymorphism in the sample. This leads to the infinite site (IS) model. As with most other mutation models, it is easier to simulate with the coalescence from the bottom up, conditional on a founder haplotype than vice versa. Again, coalescence (urn) simulation is approximate for the Wright-Fisher and Moran model and exact for the TCT model.

4.3 4.4

Inference with the Infinite Sites Model ˆ Watterson’s θw , once more

With the infinite site (IS) model, the number of polymorphic sites corresponds to the number of mutations in the sample. We can thus count the number of polymorphic sites and use ˆ Watterson’s θw as estimator of θ. With low θ, this makes very efficient use of the data (as we realized when we considered the high recombination rate limit). With high θ, we gain information on θ from the distribution of mutations. If, in a sample of n, all mutations happened when only two haplotypes were present, θ will be on average smaller than when all mutations happened when already all n haplotypes were present. This is because when only two or a few haplotypes are present the variability in the length of the branches is highest. If θ is high and these variable branches are by chance short, the number of polymorphic sites in the sample may be low. The converse is true, the other way round. To see how much difference this makes, we will assume that we know that all mutations in a sample of n were before and after all coalescences and calculate the likelihood with the following program: # coal_3.R # # The inefficiency of Watterson’s theta # n=20 #sample size S=25 #polymorphic sites vtheta=seq(0,50,0.1) xlow=rep(-999,length(vtheta)) xhigh=xlow vi=seq(1,(n-1),1) for(i in 1:length(vtheta)){ theta=vtheta[i] dconst=sum(log(vi/(vi+theta))) #the log of product of coalescences


xlow[i]=dconst+S*log(theta/(1+theta)) #the log of product of mutations at n=2 xhigh[i]=dconst+S*log(theta/(n+theta)) #the log of product of mutations at n=sample size } thetaw= S/sum(1/vi) plot(vtheta,xlow,type=’l’) lines(vtheta,xhigh,lty=2) abline(v=thetaw) ˆ Oviously, θw is striking a good compromise, but loses all information on the timing of mutations relative to coalescences. We see that we could improve inference if we knew exactly when (in relation to the coalescences) the mutations occurred. But even with the IS model, this is not possible. Nevertheless, polymorphisms that occur in a single sampled individual are likely to have occurred relatively recently (looking backwards in time), whereas if the sample is divided by a single deep split into two groups separated by many mutations, mutations must have occurred relatively far backwards. In the next part, we will make complete use of the information in the IS model.



Given a sample of n haploid individuals that evolved according to the IS model, we realize that the individuals can be organized as “boxes within boxes”, such that we can construct the “unrooted gene tree”, i.e., we can build a hierarchical tree of relatedness among individuals on which we can distribute the mutations. With an outgroup, we can root the tree to obtain the “rooted gene tree”. In the following, we will consider only a rooted tree; the case of the unrooted tree follows by summation over all possible roots weighted by their probabilities. Consider a data set with n haploid individuals (or haplotypes) containing m mutations evolving under the IS. Assume an outgroup, and determine the rooted gene tree. Obviously, there are altogether S = n + m steps to the root of the tree. There are many possibilities (call these coalescence trees) to reach this root, by rearranging the sequences of mutations and coalescences, even though the unrooted gene tree is given by the data. A naive enumeration of all these coalescence trees and then to sum over their respective probabilities is tedious and computationally impossible, except in the most trivial data sets. We will thus proceed by using the Markov property of the coalescence and sum relatively efficiently over the coalescence trees. 4.5.1 “Summing out” the coalescence trees using the forward algorithm

Consider each individual separately. Obviously, even before observing anything, each individual has just two possibilities in the previous step: it may either mutate or coalesce with the respective probabilities of θ/(n(n − 1) + nθ) and (n − 1)/(n(n − 1) + nθ). Given the rooted gene tree, only zero or one possibilities remain: it may either coalesce, if there are altogether k individuals of its type with transition probability k(k − 1)/(n(n − 1)+ nθ), or it may mutate, if it is separated by a mutation from the next junction in the rooted tree with transition probability θ/(n(n − 1) + nθ), or it remains without a change if there are no other individuals of its type and it is not separated from the next junction by a mutation. If we go through all n individuals, i.e., at the end of the first (and before the second step), we get get quite a number of possible configurations indexed by k, ck (s = 1) and their conditional probabilities fk (s = 1). In following steps, we proceed as in the first step and go through all individuals of each possible configuration. We can now get the same configuration by different sequences of elementary


events. Since they are mututally exclusive, we sum their probabilities. After S stelps we will eventually end at the root and obtain the probability of the rooted gene tree given the data. In detail, the algorithm can be summarized as follows: • Initialization (s = 0): Set the configuration at s = 0 to the data set and its probability to unity, i.e., f1 (0) = 1; • Recursion (s = 1, . . . , S − 1): Generate all configurations compatible with the data that are one step below the current configurations (i.e., have one fewer haplotype or one fewer mutation, but not both), index them by K, with 1 ≤ k ≤ K, and calculate their probabilities using: fk (s + 1) = l fl (s)plk , where plk = Pr(cl (s) | ck (s − 1), θ) are the transition probabilities. • Termination: The likelihood is Pr(y | θ) = f1 (S). This algorithm is a variant of the forward algorithm employed in the theory of hidden Markov processes. The number of possible configurations for intermediate s can be quite large. Nevertheless, summation from one step to the next is vastly more efficient than evaluating all possible paths until the final coalescence. We note that fk (s) cannot be interpreted as the probability of the configuration at the step. Only the final f1 (S) corresponds to the probability of the sum of all coalescence trees, i.e., the likelihood Pr(y | θ). 4.5.2 Example

To illustrate the method, consider a toy data set with only one segregating site: ID Sequence # 1 1 3 2 0 2 where 1 denotes the derived state. Then we have the following sequence of possible configurations ck (s) of the two haplotypes (numbers of haplotypes in parentheses). Step 0 1 1 2 2 3 3 4 5 ID c1 (0) c1 (1) c2 (1) c1 (2) c2 (2) c1 (3) c2 (3) c1 (4) c1 (5) haplotype frequencies (2,3) (1,3) (2,2) (1,2) (2,1) (1,1) (3,0) (2,0) (1,0)

Denote a tenth type as the one that is incompatible with the data. We see that we can move, e.g., in the first step, from c1 (0) to either c1 (1), with probability proportional to 2 · 1/2 = 1, or f2 (1), with probability proportional to 3·2/2 = 3. The proportionality constant is the probability of moving to any type, i.e., in both cases 5 · 4/2 + 5θ = 10 + 5θ. Hence, the probability of moving to the incompatible state is (6 + 5θ/2)/(10 + 5θ/2) The transition matrix for the configurations


(c1 (0), . . . , c1 (5)) to configurations (c1 (0), . . . , c1 (5)) then is: 0     
0 0 0 0 0 0 0 0 0 1/(10 + 5θ/2) 0 0 0 0 0 0 0 0 0 3/(10 + 5θ/2) 0 0 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 0 0 1/(3 + 3θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 (θ/2)/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 0 (θ/2)/(1 + 2θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 0 1/(1 + 2θ/2) 1 0

(6 + 5θ/2)/(10 + 5θ (5 + 4θ/2)/(6 + 4θ (4 + 4θ/2)/(6 + 4θ (2 + 3θ/2)/(3 + 3θ (2 + 2θ/2)/(3 + 3θ (1 + θ/2)/(1 + 2θ/ (2 + 3θ/2)/(3 + 3θ (2θ/2)/(1 + 2θ/2 0 1

(20) Since the likelihood of the type incompatible with the data is 0, we can simply eliminate the tenth row and column from the transition matrix to obtain: 0    
0 0 0 0 0 0 0 0 1/(10 + 5θ/2) 0 0 0 0 0 0 0 0 3/(10 + 5θ/2) 0 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 0 1/(3 + 3θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 (θ/2)/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 (θ/2)/(1 + 2θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 1/(1 + 2θ/2)) 1



  . 


The Inefficiency of Watterson’s θ

Even with the infinite site model, we rarely know for sure when, in relation to the coalescences, mutations occurred. We, however, have much more information available than with a parentindependent mutation model. Consider a situation, where we have the same number m of polymorphisms present in the data. If the data contain exclusively singletons, i.e., each mutation is only present in a single haplotype, the mutations likely occurred recently. If, on the other hand, two groups of haplotypes are separated by polymorphisms shared within a group, the mutations likely occurred far back in time. As above, we reason that if, in a sample of n, all mutations happened when only two or very few haplotypes were present, θ will be on average smaller than when all mutations happened when already all n haplotypes were present. This is because when only two or a few haplotypes are present the variability in the length of the branches is highest. If θ is high and these variable branches are by chance short, the number of polymorphic sites in the sample may be low. The converse is true, the other way round. The following constructed example shows that this may make a difference:












10 Theta



Here, the two solid lines are the posterior distributions from the mock examples from above. The black solid line represents all mutation events at the tips, the black stippled line all mutation events at the root, the red solid line all two differentiated groups, and the black stippled line all singletons. We see that, in this extreme constructed case, we gain some information with the likelihood method over Watterson’s estimate of θ. Nevertheless, the gain will be much less in real data sets.


Number of Roots

Caclulate the number of rooted trees given an unrooted coalescence tree with mutations.



Reversal of Time, Probabilities of ck (s)

Find an algorithm that traverses from the root of the rooted gene tree up to the extant data set, again step by step, denoting the quantities corresponding to fk (s) with bk (s). Call this the backward algorithm. Find the probability of ck (s) using fk (s), bk (s), and the likelihood f1 (S).


Population Demography

Generally, populations do not conform to the model of large and constant population sizes over appreciable times. Rather, population sizes grow and shrink. This makes the scaled mutation rate θ a function of time, i.e., θ(t). Other than that, the coalescence proceeds as before. The states are still Markov, i.e., the next coalescence/mutation depends only on the previous state and not on what has gone on before. But this Markov property of states does not imply a Markov property of times.


One Change in θ

Let us consider a very simple model: one change from θc to θa (usually by a change in effective population size N ) at time tc , which we assume conveniently to be scaled by Nc . If we consider a coalescence model (backwards in time), the times of the first coalescence or mutation are distributed exponentially with rate (n(n − 1) + nθc ) until tc and with (n(n − 1) + nθa ) thereafter. If we prefer a discrete time model, the exponential distribution would need to be replaced by a geometric one. After the first coalescence, the same process continues until tc for the second, and so on. We thus see that we need to explicitely reinstate time for the transition probabilities between the coalescence states. In practice, this means one more loop in the programs, e.g., for the infinite site model with free recombination: TwoNe=1000 TMAX=10*TwoNe N=10 #samplesize nu=1/TwoNe # Generate the y-Matrix y=matrix(rep(0,(TMAX+1)*N),nrow=TMAX+1,ncol=N) y[1,]=c(1,rep(0,N-1)) for(t in 2:TMAX){ y[t,1]=(1-choose(N,2)*nu)*y[t-1,1] for(i in 2:(N-1)){ y[t,i]=(1-choose(N-i+1,2)*nu)*y[t-1,i]+choose(N-i+2,2)*nu*y[t-1,i-1] } y[t,N]=y[t-1,N]+choose(2,2)*nu*y[t-1,N-1] } y[TMAX+1,]=c(rep(0,N-1),1) plot(y[,1],ylab=’Probabilities’,type=’l’,xlim=c(0,TMAX+1)) for(i in 2:N){ lines(y[,i]) }


#Get Polya probs pPolya=matrix(rep(0,(N-1)*(N-1)),nrow=N-1) for(s in (N-1):1){ for(x in 1:(N-s)){ pPolya[N-s,x]=choose(N-1-s,x-1)/beta(s,1)*beta(N-x,x) } } j=seq(0,TMAX/10,1) pmat=matrix(rep(0,(N-1)*length(j)),ncol=(N-1)) for(k in j){ tchange=k*10 #tchange #Multiply with the theta ratio and the number of mutation possibilities theta_ratio=0.1 pS=rep(0,N) for(i in 1:N-1){ pS[i]=(N-i+1)*(sum(theta_ratio*y[1:tchange,i]) +sum((y[(tchange+1):length(y[,i]),i]))) } pS=pS/sum(pS) #Get posterior for(s in (N-1):1){ for(x in 1:(N-s)){ pmat[k+1,x]=pmat[k+1,x]+pS[N-s]*pPolya[N-s,x] } } pmat[k+1,]=pmat[k+1,]/sum(pmat[k+1,]) } # pmat is the result of all these calculations pmat


Transition Matrix

Modify the transition matrix in formula 21 to the “one change in θ” model. 5.2.2 Exponential Growth

In the literature, exponential growth models (forward in time) from some Na to Nc within a time of t are popular. How many parameters are needed for this model, if exponential growth continues to the presence, and how many are needed if exponential growth occurs for an interval that finishes before the present time? Modify the above program of stepwise increase/decrease in population size to one with exponential increase, where the exponential increase/decrease has not yet finished.



Structured Populations

A species may be mainly homogenous, i.e., panmictic, or subdivided into demes. The earlier analyses apply strictly only in the former case. The latter case of population subdivision will be considered in this chapter. The most influential model of population subdivision is Wright’s island model.


The Infinite Island Model Without Mutation

A simple case is the infinite island model, where it is assumed that infinitely many demes persist indefinitely (this assures that allele frequencies do not change with time in the absence of mutation), each with the same population size N (or more general and in accord with the literature: 2Ne ), exchange m migrants (or 2m gametes if individuals are assumed diploid) per generation in such a way that each population contributes to a common migrant pool out of which the migrants are drawn (i.e., there is no effect of neighborhood). Since all demes are equivalent, we can pick a particular one. For simplicity we assume a biallelic locus, where the allele frequency of one allele is p for all times (absence of mutation and drift in the general population, because of the infinite number of demes); extension to multiallelic loci is straightforward (generalization from the binomial-beta to the multinomial-Dirichlet distribution). 6.1.1 One population, p given

Again, we can either model with a continuous time model, or with discrete generations. For simplicity, we choose the continuous time model. As before, in the case of the biallelic mutation model, individuals die and are replaced by other individuals at rate of 1/N 2 . At each migration step, a random individual is chosen out of the N individuals within the deme for being replaced by a migrant. Set γ = N m. (Attention: in the context of subdivided populations θ is used differently than in the context of mutation!). We construct a table as above (Table 2.2.1)— birth/death and migration events are mutually exclusive such that their probabilities can be added: State x+1 x x−1 both Drift and Mut. − x) + γp(N − x)) 1 − 1/N 2 (2x(N − x) + γp(N − x) + γ(1 − p)x) 1/N 2 (x(N − x) + γp(N − x)) , 1/N 2 (x(N

The equilibrium distribution is again a beta-binomial distribution with α = γp and β = γ(1 − p): Pr(x|γ, N ) = Γ(γ) Γ(x + γp)Γ(N − x − γ(1 − p)) N! . x!(N − x)! Γ(γp)Γ(γ(1 − p)) Γ(N + γ) (22)

As above, it can be shown that detailed balance holds, i.e.,, that the flow from x to x + 1 is the same as that from x + 1 to x. Thus the biallelic infinite island or migration-drift model is formally equivalent to the biallelic mutation-drift model. We are, in this case, mainly interested in γ. The distribution of γ is non-standard. We note that the Beta-binomial distribution can be broken up into two distributions: sample π, the proportion of the focal allele in the population, from a beta(γp, γ(1 − p)) and then sample x from a binomial(N, π). The beta-binomial can be obtained from the two distributions by “integrating out” π. 18


Fst or θ

Usually, instead of using γ one parametrizes using Fst = θ = 1/(γ + 1). (Again a warning of the different use of θ in the context of migration versus mutation!) The coefficient of coancestry Fst or θ is defined as the probability of coalescence of a pair of random alleles within the population. (The alternative is that one of the pair migrates before coalescence.) How we get there should be clear by now. We note that γ = (1 − θ)/θ. # coal_5.R # # theta=1/gamma+1); gamma=N*m, where m migration coefficient # dbetabin= function(x,N,alpha,beta,log=FALSE){ if(log==TRUE){ lchoose(N,x)-lbeta(alpha,beta)+lbeta(x+alpha,N-x+beta) } else{ choose(N,x)/beta(alpha,beta)*beta(x+alpha,N-x+beta) } } N=10 #sample size y=1 #sample size of first allele p=0.4 theta=seq(0.01,0.99,0.01) plot((1-theta)/theta,dbetabin(y,N,(1-theta)/theta*p,(1-theta)/theta*(1-p)),type=’l’) We note that with more independent loci, estimation of γ improves. The earlier assumption that all demes (or subpopulations) need to have the same effective migration rate γ or equivalently coefficient of coancestry θ is not necessary; each population may have its own. We now need to watch out with the indexing; generally l would be taken to index the loci and i the populations. (If multiple alleles are assumed, a may index the alleles.) 6.1.3 Other Definitions of Fst

Many definitions of Fst are available around. The above definition, Fst = θ = 1/(γ + 1), is equal to the probability of two random gametes coalescing within the population before reaching the pool of migrants. Often the following definition for Fst is given: Fst = 1 − observedheterozygosity , expectedheterozygosity (23)

But this is problematic. Rather “observed heterozygosity” should be replaced by expected heterozygosity in the deme or subpopulation. We then have: Fst = 1 − 2 E(π(1 − π) , 2p(1 − p) (24)

where π is defined as the proportion of the focal allele in the subpopulation. We note that the distribution of π is beta with α = γp and β = γ(1 − p), such that E(π) = p and Var(π) = 19

p(1 − p)/(γ + 1). It follows that E(π2) = p2 + p(1 − p)/(γ + 1). Substituting into equation (25) gives: 1 2 E(π) − E(π 2 )) =1−1+ . (25) Fst = 1 − 2p(1 − p) γ+1 This result is the same as above. Another definition for biallelic loci is: Fst = Var(π) , p(1 − p) (26)

where π is defined as the proportion of the focal allele in the subpopulation. With the mainland/island model (or equivalent), we have the variance of the Beta distribution: αβ/((α + β)2 (α + β + 1)), such that we get: Fst = 1 γ 2 p(1 − p)/(γ 2 (γ + 1)) = . p(1 − p γ+1 (27)

We see that all three definitions are equivalent.


p to be estimated, two or more populations

We generally do not know p and thus need to estimate it from the data. We then need at least two populations, because otherwise the estimation of γ and p is confounded. We will again just consider a particular locus and thus leave away indexing for the locus but will introduce the index i for the population. We are usually not interested in p, but only in γi . The natural strategy, for the case of p unknown, would need to be to supply a prior for p and “integrate it out”. A natural prior for p would be beta with 0 < α = β ≤ 1. We could use equation (22), subdivide the interval 0 < p < 1 into small enough parts and sum out p. This would be rather straightforward and should work relatively well, since the probability mass of p is usually spread out considerably, such that numerical problems need not be feared. I will, however, introduce a Markov chain Monte Carlo (MCMC) sampling scheme (Vogl et al., 2003) that converges to the posterior distribution. Generally, given the number of alleles of both types, i.e., m and M − m, the likelihood is binomial, such that p is beta distributed after multiplication with the beta prior: Pr(p | x, N, α, β) ∝ pα+m−1 (1 − p)β+M −m−1 (28)

The problem with the distribution of p is that m and M are actually random numbers, because we must only count those alleles that make it to the general population, i.e., that do not coalesce within the populations. Suppose that we have y1 out of N1 alleles of the first type in the first population and y2 out of N2 alleles of the first type in the second population. Conditional on γi and p, the probability of coalescence versus migration is independent for each allele and population i. Hence, we have the probability of migration if there are s individuals of the first allele: Pr(mi | γ, p, s) = γi p ; s − 1 + γi p (29)

the last individual has to migrate; and analogously for the second allele. Iterating over s, starting form s = m, we can draw a sample from mi and similarly for the other allele to obtain M .


Conditional on p, yi and Ni , we can update γ. Unfortunately, the distribution of γ is nonstandard. Furthermore, at least to me, an equal prior for θ seems more natural than one for γ, Hence, I would rather reparametrize and then use a Metropolis step to obtain a new θi for each population. Altogether, we then have: # coal_6.R # # theta=1/(gamma+1); gamma=N*m, where m migration coefficient # dbetabin= function(x,N,alpha,beta,log=FALSE){ if(log==TRUE){ lchoose(N,x)-lbeta(alpha,beta)+lbeta(x+alpha,N-x+beta) } else{ choose(N,x)/beta(alpha,beta)*beta(x+alpha,N-x+beta) } } #the parent independent mutation or migration distribution #returns the number of migrants/mutants rparentind= function(n=1,N=1,agamma=1){ if(N<1){ y=rep(0,n) } else if(N==1){ y=rep(1,n) } else{ y=rep(1,n) for(i in (N-1):1){ y=y+rbinom(n,1,agamma/(i+agamma)) } } y } allfreqs=matrix(c(3,8,10,10,8,2,10,10),nrow=2) #y_1,N_1,y_2,N_2 maxit=1000 #maximum number of iterations vresults=matrix(rep(0,maxit*3),ncol=3) theta=rbeta(1,1,1) #sample initial theta from prior p=rbeta(length(allfreqs[,1]),0.1,0.1) #sample initial p from prior for(it in 1:maxit){ ll=0; for(i in 1:length(allfreqs[,1])){ y1=c(rparentind(1,allfreqs[i,1],(1-theta)/theta*p[i]), rparentind(1,allfreqs[i,2]-allfreqs[i,1],(1-theta)/theta*(1-p[i]))) 21

y2=c(rparentind(1,allfreqs[i,3],(1-theta)/theta*p[i]), rparentind(1,allfreqs[i,4]-allfreqs[i,3],(1-theta)/theta*(1-p[i]))) p[i]=rbeta(1,0.1+y1[1]+y2[1],0.1+y1[2]+y2[2]) loglike=dbetabin(allfreqs[i,1],allfreqs[i,2],(1-theta)/theta*p[i], (1-theta)/theta*(1-p[i]),log=TRUE)+dbetabin(allfreqs[i,3], allfreqs[i,4],(1-theta)/theta*p[i],(1-theta)/theta*(1-p[i]),log=TRUE) thetaStar=theta+(0.05-runif(1)*0.1) if((0<thetaStar) && (1 > thetaStar)){ loglikeStar=dbetabin(allfreqs[i,1],allfreqs[i,2],(1-thetaStar)/thetaStar*p[i], (1-thetaStar)/thetaStar*(1-p[i]),log=TRUE)+dbetabin(allfreqs[i,3], allfreqs[i,4],(1-theta)/theta*p[i],(1-thetaStar)/thetaStar*(1-p[i]),log=TRUE) if(loglikeStar>loglike){ theta=thetaStar loglike=loglikeStar } if(runif(1)>exp(loglikeStar-loglike)){ theta=thetaStar loglike=loglikeStar } } ll=ll+loglike; } vresults[it,1]=ll vresults[it,2]=theta }

layout(mat=matrix(1:2,2,1,byrow=FALSE)) plot(vresults[,1],type=’l’,xlab="Iterations",ylab="log likelihood") plot(vresults[,2],type=’l’,xlab="Iterations",ylab="theta") layout(mat=matrix(1:1))


Exam Question

The basic situation is: a population of infinite size and without mutation where one allele of a biallelic locus has a proportion of p (for all times). A smaller population of constant size N splits from this population at time t in the past. The populations do not exchange any migrants after the split (before they were one panmictic population with free exchange of individuals or gametes). Part 1: Choose a definition of Fst and calculate Fst (N, t) = f (N, t) (choose either continuous or discrete time). Given N find the time t1/2 such that Fst (N, t1/2 ) = 0.5. (Note that only if you choose continuous time, this will in general be possible exactly; if you choose discrete time, take the closest value.) Part 2: Given N , the t1/2 from the previous part, and a p between 0.2 and 0.8, write a program with “R” to sample n haploid individuals (n about 10 or 20) from the subpopulation at time t = 0. Let the random variable y be the number of individuals with allele 1 in the 22

sample. (Remark: there are many possibilities to do this. Any approach covered in this course is allowed; do not use the diffusion approach!)


Donnelly, P. and Kurtz, T. (1996). A countable representation of the Fleming-Viot measurevalued diffusion. Annals of Probability, 24, 698–742. Ethier, S. and Griffiths, R. (1987). The infinitely many sites model as a measure-valued diffusion. Annals of Probability, 15, 515–545. Fisher, R. (1930). The Genetical Theory of Natural Selection. Clarendon Press, Oxford. Hudson, R. (1983a). Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology. Hudson, R. (1983b). Testing the constant-rate neutral allele model with protein sequence data. Evolution, 37, 203–207. Kingman, J. (1982). On the genealogy of large populations. Journal of Applied Probability, 19A, 27–43. Moran, P. (1958). Random processes in genetics. Proc. Camb. Phil. Soc., 54, 60–71. Sawyer, S. and Hartl, D. (1992). Population genetics of polymorphism and divergence. Genetics, 132, 1161–1176. Stephens, M. and Donnelly, P. (2000). Inference in Molecular Population Genetics. Journal of the Royal Statistical Society, Series B, 62, 605–655. Tajima, F. (1983). Evolutionary relationship of DNA sequences in finite populations. Genetics, 105, 437–460. Tavar´, S. and Ewens, W. (1997). Multivariate discrete distributions, chapter The Ewens same pling formula. Wiley. Vogl, C., Das, A., Beaumont, M., Mohanty, S., and Stephan, W. (2003). Population Subdivision and Molecular Sequence Variation: Theory and Analysis of Drosophila ananassae Data. Genetics, 165, 1385–1395. Watterson, G. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7, 256–276. Wright, S. (1931). Evolution in Mendelian populations. Genetics, 16, 97–159.