Document Sample

Population Genetics and the Coalescence Claus Vogl June 16, 2009 Institut f¨r Tierzucht und Genetik, Veterin¨rmedizinische Universit¨t Wien, Veterin¨rplatz 1, u a a a A-1210 Vienna, Austria E-mail: claus.vogl@vu-wien.ac.at Preface: This manuscript was written for an “Universit¨tskurs Biometrie” in the spring a semester of 2009. Please indicate mistakes to me (e-mail with subject containing: “mistake”! Thanks!!!) 1 1 Introduction Population genetics deals with allelic variation in natural or artiﬁcial populations. Among population genetic forces, new variation is generated by mutations; drift (stochastic loss of alleles) generally reduces varation; selection may increase or decrease variation. Fisher (1930, and earlier) and explicitly Wright (1931) developed independently a simple population genetic model with discrete (non-overlapping) generations. Moran (1958) developed a continuous time model with overlapping generations. For the Moran model, exact results are sometimes available that hold approximately (actually very well for only moderately large population sizes) also for the Wright-Fisher model. Both models go forward in time, i.e., from one generation to the next in the case of the Wright-Fisher model, and from one birth/death event in the Moran model to the next. In this course, we will mainly use the Moran model. 2 The Moran model The Moran model is supposed to model continuous time. But in its usual form, it actually moves from one step to the next. (Below, we will suggest a truly continuous model that diﬀers also in the mutation aspect from the Moran model.) With the Moran model, in the haploid pure-drift case, at any timestep, i.e.,, from t to t + 1, a population of N haploid individuals is assumed. An individual is picked randomly to reproduce. Suppose the ith individual reproduces. Then another individual j is chosen to die and is replaced by the ith individual. (We will call this a birth/death event.) The process repeats indeﬁnitely. The lifespan of an individal is geometrically distributed with a mean of N . It may thus be useful to rescale time in units of N , i.e., to set τ = t/N , because this is the average lifetime of an individual. This process of drift, will lead to a stochastic loss of variation. Since no new mutations replenish the variation, a monomorphic population will eventually result. With a biallelic model, a transition from one step to the next involves just three possibilities, either the number of the focal allele x increases by one, it remains the same, or it decreases by one: State x+1 x x−1 Probability with Drift only 1/N 2 x(N − x) 1 − 2/N 2 x(N − x) 1/N 2 x(N − x) Denote the current state with p0 = x/N and consider the heterozygosity 2pt (1 − pt ), i.e.,, the probability of obtaining two diﬀerent alleles in a sample of two at generation t. The expectation of p1 is E(p1 ) = p0 . The expectation of p2 is E(p2 ) = p2 + 2/N 2 p0 (1 − p0 ) (calculated directly 0 1 1 from Table 2). Hence the expected heterozygosity is: E(2p1 (1 − p1)) = 2(E(p1 ) − E(p2 )) 1 = 2(p0 − p2 − 2/N 2 po (1 − p0 ) 0 = 2p0 (1 − p0 )(1 − 2/N ) . For t steps, this equation needs to be iterated, such that we obtain: E(2pt (1 − pt )) = 2p0 (1 − p0 )(1 − 2/N 2 )t ≈ 2p0 (1 − p0 )e−t/N = 2p0 (1 − p0 )e 2 2 (1) 2 (2) . −τ /N Thus heterozygosity decreases at the rate of 1/N 2 per step or 1/N per “generation”. 2.1 The Moran Model with Mutation With the Moran model, mutation is assumed to occur at every birth/death event, i.e.,, the replacing gamete is assumed to mutate at a rate µ. Many diﬀerent mutation models have been used; below we will use the K site model. Each site is assumed to toggle between two states (0 and 1). A little R-program for the K-site model: # # # # # # coal_1.R Moran model simulation K-site model Ne haploid individuals Ne=100 #number of individuals K=10 #number of sites inds=matrix(floor(runif(Ne*K,min=0,max=2)),nrow=Ne,ncol=K) maxTime=Ne*Ne*10 #maximum time mu=0.001 #mutation rate first=ceiling(runif(maxTime,min=0,max=Ne)) second=ceiling(runif(maxTime,min=0,max=Ne)) for(t in 1:maxTime){ #birth/death step inds[second[t],]=inds[first[t],] #mutate mutvec= runif(K,min=0,max=1)<mu for(k in 1:K){ if(mutvec[k]==TRUE){ if(inds[second[t],k]==0){ inds[second[t],k]=1 } else{ inds[second[t],k]=0 } } } } 2.2 Truly Continuous Time (TCT) Model Instead of tying the mutation to birth/death events (as in the Moran model), we can independently choose an individual to mutate at a rate µ. It then makes sense to assume an exponentially distributed lifespan of an individual, such that the model becomes truly continuous. 3 Remark: We note that the exponential distribution has the following property: the waiting time to the ﬁrst event of K independent exponential distributions with rates β1 , . . . , βK is again distributed exponentially with rate β = βk . The particular event can then be chosen with probability βk /β. Hence, sampling from independent exponential distributions and then choosing the fastest is equivalent to ﬁrst sample from a single exponential and then choosing among the events proportianal to their rates. The total rate of events, either birth/death or mutation is then N (1 + µ), because each individual may either die or mutate with rate 1 and µ respectively. Then we pick one individual randomly and let it mutate with probability µ/(1 + µ), and die otherwise. As with the Moran model, we can thus proceed from step to step and add the time only subsequently, assuming exponentially distributed times. Hence, for time independence of model parameters (N and µ), the model again proceeds from step to step. 2.2.1 Equilibrium Distribution If we set K = 1, i.e.,, model just a single biallelic locus, we can obtain some approximate analytical results with the Moran model. We will, however, obtain exact results with the TCT model and leave it to the reader to show that the results hold approximately for the Moran model. Assume a mutation rate of µ1 from allele 2 to allele 1 and µ2 for the reverse direction and set θ1 = N µ1 and θ2 = N µ2 and θ = θ1 + θ2 . Consider a table as in the table in Section 2; let the ﬁrst column represent the eﬀect of only drift, the second column of only mutation, and the third column of both : State x+1 x x−1 Drift only 1/N 2 x(N − x) 1 − 2/N 2 x(N − x) 1/N 2 x(N − x) Mutation only 1/N 2 θ1 (N − x) 1 − 1/N 2 (θ1 (N − x) + θ2 x) 1/N 2 θ2 x both Drift and Mut. 1/N 2 (x(N − x) + θ1 (N − x)) 1 − 1/N 2 (2x(N − x) + θ1 (N − x) + θ2 x) 1/N 2 (x(N − x) + θ2 x) , (since the birth/death and mutation events are mutually exclusive, their probabilities can be added). The equilibrium distribution is a beta-binomial distribution with α = θ1 and β = θ2 : Pr(x|θ, N ) = N! Γ(θ) Γ(x + θ1 )Γ(N − x − θ2 ) . x!(N − x)! Γ(θ1 )Γ(θ2 ) Γ(N + θ) (3) It is actually easier to show that under these assumptions detailed balance holds, i.e.,, that the ﬂow from x to x + 1 is the same as that from x + 1 to x. The transition probability from t to t + 1 is Pr(x + 1 | x, θ, N ) = 1/N 2 (N − x)(x + θ1 ); that from t + 1 to t is Pr(x | x + 1, θ, N ) = 1/N 2 (x + 1)(N − x − 1 + θ2 ) (compare to Table 2.2.1), for any x < N . Under detailed balance, we thus have: Pr(x + 1 | x, θ, N ) Pr(x | θ, N ) = Pr(x | x + 1, θ, N ) Pr(x + 1 | θ, N ) Γ(x + 1 + θ1 )Γ(N − x − 1 + theta2 ) Γ(x + θ1 )Γ(N − x + θ2 ) = (x + 1)(N − x − 1 + θ2 ) (N − x)(x + θ1 ) x!(N − x)! (x + 1)!(N − x − 1)! Γ(x + 1 + θ1 )Γ(N − x + θ2 ) Γ(x + 1 + θ1 )Γ(N − x + θ2 ) = . x!(N − x − 1)! x!(N − x − 1)! (4) Note that detailed balance is stronger than the existence of an equilibrium distribution, i.e.,, if detailed balance holds, an equilibrium distribution exists, but if an equilibrium distribution 4 exists, detailed balance may not hold (the ﬂow in equilibrium may go in a circle from A to B to C and back to A). It follows therefore that the equilibrium distribution is a beta-binomial distribution with α = θ1 and β = θ2 . 2.3 2.3.1 Exercises Extension to parent independent mutation model Assume K alleles and parent-independent mutation: from any allele k, all K alleles are reached with a mutation rate of µ1 , . . . , µK . The mutation rate from each allele is µ = µk . For ease of notation set θk = N µk and θ = θk . Generalize the equilibrium distribution from two to K alleles and show that detailed balance holds also for this model. 3 The Coalescence under Neutrality (Introduction) Generally our emphasis will be on the analysis of real data. This was simpliﬁed by the introduction of the coalescence (Kingman, 1982; Hudson, 1983b,a; Tajima, 1983), and the subsequent theoretical work. In the coalescence, we are given a sample from extant time and look backward to the ancestors of this sample. 3.1 No mutation, TCT and Moran models Let us consider the TCT (or Moran) model without mutation and consider the extant sample (of size N ). Looking backward in time, at each event, a single individual is chosen (with probability 1/N ) to die and replaced by another. The number of ancestors in the previous event is thus reduced by one with probability (N − 1)/N ; only if the individual is chosen to replace itself (with probability 1/N the number of ancestors stays the same. We will term the reduction in numbers of ancestors by drift coalescence. Let us generally consider a sample of size n, with 2 ≤ n ≤ N ; the number of ancestors of this sample in the previous event is reduced by one if any of the n is chosen and if the individual with which it unites is also from the sample, but not the individual itself. Thus the probability of a reduction of ancestors per event in the TCT and Moran models is n(n − 1)/N 2 . With the Moran model this makes for a geometric distribution of coalescence probability with rate n(n − 1)/N 2 . With the TCT model, we have an exponential distribution with the same rate. With both models, we could rescale the time. 3.2 No mutation, Wright-Fisher model We need to assume that n is much smaller than N . Then only the probability of two individuals uniting in the previous generation is appreciably, whereas the probabilities of three individuals or two pairs of individuals uniting is negligible. The probability of two individuals coalescing in the previous generation is equal to the number of possible pairs in the sample n(n − 1)/2 times the probability of the two uniting in the previous generation 1/N . Remark: With the Wright-Fisher model, the rate of coalescence per generation is higher by a factor of two in the Moran model. This factor of two corresponds to that in the reduction of heterozygosity, if generation times are scaled appropriately. The diﬀerence is explained by the fact that only one individual is chosen to die in the Moran model but two individuals need to unite in the Wright-Fisher model. 5 3.3 The coalescence with mutation Irrespective of the exact mutation model and the genotype of the individuals , we can model a coalescence with mutation at constant rate. Suppose, as before, a sample of size n with scaled mutation rate θ = N µ. Then, looking backward, we choose with proportion n(n − 1) a coalescence (we thus assume the Moran or TCT model; otherwise, we would divide by 2) and with proportion nθ/2 a mutation. We simulate backward until we have reached a sample size of 1. (Mutations or coalescences occurring before this time do not inﬂuence the variability in the sample.) Thus, for each n, the number of mutations occuring while n haplotypes (or individuals) are present is geometrically distributed (with p = (n − 1)/(n − 1 + θ)). The total number of mutations until a sample size of n = 1 is reached is therefore a sum of n − 1 geometrically distributed variables. 3.4 Reversal of the Coalescence As long as the parameters N (and later θ) are time-independent, we can reverse the process of the coalescence. We then start with two individuals, i.e.,, i = 1 and work “upwards”, such that we run a geometric distribution for mutations with p = i/(i + θ), then add another individual, and iterate until we have reached n. The advantage is that we can now simulate conditional on the ancestral state Ethier and Griﬃths (compare 1987); Donnelly and Kurtz (compare 1996). Simulation of a biallelic coalescence The following little R-program simulates a biallelic sample of size N with mutation rate θ in both directions. # # # # # # coal_2.R coalescence model simulation biallelic model N sample size theta=1 #scaled mutation rate N=20 inds=rep(-1,N+1) #initialize to impossible value! size= N+1 and discard last inds[1]=inds[2]=0 #set first two inds to possible value = zero or one for(s in 2:N){ #get number of mutations from the geometric distribution nmut=rgeom(1,s/(s+theta)) samples=sample(size=(nmut+1),x=seq(1,s),replace=TRUE) if(nmut>0){#to protect against counting down in for loop for(m in 1:nmut){#R is too smart if(inds[samples[m]]==0){ inds[samples[m]]=1 } else{ inds[samples[m]]=0 } } } 6 inds[s+1]=inds[samples[nmut+1]] #add individual } inds[1:N] #sample size only N 3.5 Watterson’s θ Suppose we could detect every mutation that segregates in the sample (i.e., that is polymorphic). The mean and the variance of this number could then be calculated, since the expectation of the geometric distribution is (1 − p)/p, the expected potential number of segregating mutations in the sample m is in total: 1 1 E(m | n, θ) = i=n−1 θ/i = θ i=n−1 1/i (5) and the variance: 1 1 1 Var(m | n, θ) = i=n−1 θ(i + θ)/i2 = θ i=n−1 1/i + θ 2 i=n−1 1/i2 (6) We could then estimate θ using θw = m/ 1 i=n−1 1/i. This estimator of θ was introduced by Watterson (1975) in the context of DNA sequence polymorphism. We will discuss this in depth later. With the TCT model, we obtain the above results exactly, irrespective of N (as long as n ≤ N , which is trivial), with the Moran and Wright-Fisher model in the limit of N → ∞ and N µ → θ (up to a factor of two or four). 3.6 The Coalescence with Parent Independent Mutation In the following subsections we will introduce diﬀerent mutation models. Let us now consider parent independent mutation where the scaled mutation rate to each of K alleles is θk and that from each allele is θ = θk . (This model is also called the K-allele model.) With parent independent mutation, we gain no information on the parental type from the state of the oﬀspring. Thus, not only with a coalescence do we lose an individual, but also with a mutation. Hence, the coalescence with mutations ﬁnishes in n − 1 steps. Instead of a geometric distribution of mutations at each n, now there is just a Bernoulli. With probability θ/(i + θ) a mutation occurs; otherwise a coalescence. The mean number of mutations in the sample is then 1 E(m | n, θ) = i=n−1 θ/(i + θ) (7) (or this value plus one, if we count the original type as mutation too). For n → ∞, one can show that E(m | n, θ) ≈ θ log n) (where we counted the plus one). The variance is: 1 1 Var(m | n, θ) = i=n−1 θ · i/(i + θ)2 = θ i=n−1 i/(i + θ)2 ; (8) for large n, Var(m | n, θ) ≈ θ log n as i/(i + θ) then goes to 1. 7 3.7 The Coalescence with Inﬁnitely Many Alleles The coalescence with the inﬁnite allele model (IAM) follows from the parent independent mutation model with θ1 = · · · = θK as K goes to inﬁnity. Then every new mutation can be recognized, but gives no information about its parent. Biologically, it can be motivated by the fact that with moderate mutation rates nearly any mutation will lead to a new mutation as, on average, just 0.001 or less sites are polymorphic in most equkaryotes and mutation rates are multiple hits by mutation are thus unlikely. If this situation is modeled with the IAM, information on parents is lost, i.e.,, information is wasted. With the IAM, it can be shown that the number of diﬀerent alleles in the sample m is a suﬃcient statistic of θ. The distribution of m is: Pr(m | θ, n) = S(n, m)θ m /(n + θ)(n − 1 + θ) · · · (θ) (9) where the constant S(n, m) is the absolute value of the Stirling number of the ﬁrst kind. Conditional on m, the distribution of individuals in the m classes is (no derivation): Pr(n1 , . . . , nK | m) = 1/S(n, k) n! K!n1 · · · nk (10) where the alleles are numbered arbitrarily from 1 to m and nk is the number of individuals of allele k (Tavar´ and Ewens, 1997). This distributiin is independent from θ from which it follows e that K is a suﬃcient statistic. The mean and variance of K are given in equations 7 and 8. This implies a rather slow convergence to the true value for the estimator K/ log n for θ with increasing n! 3.8 Limit of High Recombination Rates; One Population, Constant θ In the limit of high recombination rates (i.e., assuming linkage equilibrium), coalescence of neighboring sites is independent. We can then consider each polymorphic site separately and obtain the distribution for the whole data sample by multiplication over sites. Furthermore, since per basepair mutation rates are low (about 10−9 ) and the regions considered are short, the relevant parameter region is the limit of small θ. If the genomic DNA of a sample of many individuals of a population is sequenced using deep sequencing, the information on linkage is almost completely lost, such that assuming complete linkage equilibrium makes very eﬃcient use of the data. In the following, we will only consider the case where the ancestral state is known, i.e., where a suitable outgroup is available, and we will assume that θ is constant over time. This assumption can be dropped later. Consider a sample of size n from one locus; the data are then completely characterized by the number of derived sites (or haplotypes) y, which must be between 0 and n. The case of n derived sites is not only inﬂuenced by the population in focus, but also by the timespan to the outgroup. Hence we will condition on 1 ≤ y < n. The n − 1 coalescences provide slots for mutations to occur. Introduce the variable s that indicates the slot in which the mutation occurs, e.g., s = 0 for no mutation, s = n − 1, for a mutation before the ﬁrst coalescence, and so on until s = 1 is reached. (The order is reversed since we are looking backwards in time). Given θ, the probability of the mutation to occur in the sth slot is: θ Pr(s | θ) = s+θ n−1 i=1 i . i+θ (11) 8 In the limit of small θ, the probability of m mutations is approximately proportional to: n−1 m Pr(m | θ) ∝ θ i=1 1/i (12) In the limit of small θ, the probability of two or more mutations is practically zero and the n−1 probability of a single locus (or SNP) to be monomorphic is thus 1/(1 + θ i=1 1/i) and that n−1 n−1 of it being polymorphic is (θ i=1 1/i)/(1 + θ i=1 1/i). Given m = 1 and in the limit of small θ, we have for 1 ≤ s ≤ n − 1: Pr(s | m = 1) ∝ 1 . s (13) We see that the distribution of s conditional on m = 1 is independent of θ as long as θ is small. Thus, no information can be gained from knowing when a mutation has occurred; the only information is whether or not one has occurred. Distribution of y in the limit of small θ. The allele distribution y is independent from θ conditional on m = 1 and s. Therefore y also has to be independent from θ for small θ. For inference of θ in the limit of small θ under neutrality the distribution of y is therefore noninformative. In this paragraph, we nevertheless give this distribution. The probability to obtain the spectrum y given s is given by the Binomial-Beta compound distribution (or Polyaurn distribution). Adjusted to our variables, this is: Pr(y | n, s) = (y − 1)!(n − 2 − (y − 1))! (n − 1 − s) s! . (s − 1)!0! (n − 1)! y−1 (14) Combining formulas (13) and (14), we get: n−1 Pr(y | θ) = s=1 Pr(y | s) Pr(s | θ) ∝ 1/y . (15) This result and the independence of the distribution of the allelic spectrum y from θ, in the limit of small θ, has been obtained numerous times before (e.g., Sawyer and Hartl, 1992). Let us now proceed with the inference of θ from data in the limit of small θ. Assume we have altogether L loci. The distribution of polymorphic loci will then follow a binomial distribution with p polymorphic and L − p non-polymorphic sites. (We note again that p is a suﬃcient statistic for small θ; nothing can be gained from the allelic spectra of the loci). n−1 By our assumptions, θ( i=1 1/i) ≪ 1 such that the expectation of p is much less than L − p n−1 and we can approximate the binomial by a Poisson distribution with mean Lθ( i=1 1/i)/(1 + n−1 n−1 θ( i=1 1/i)) ≈ Lθ( i=1 1/i), i.e.,: n−1 Pr(p | θ, L, N ) ∝ θ p exp(−θ( i=1 1/i) L) . (16) n−1 The ML estimate of θ is p/(L · i=1 1/i). This is the same result as that given by Sawyer and Hartl (1992). Obviously, the conjugate prior for θ is a scaled gamma distribution with some α and β Pr(θ | α, β) ∝ θ α−1 exp(−θ β) L) , 9 (17) such that the posterior is again a scaled gamma distribution: n−1 Pr(θ | p, L, n, α, β) ∝ θ p+α−1 exp(−θ(β + ( i=1 1/i) L)) . (18) The argument can be generalized to loci with variable n(l), where l indexes loci (1 ≤ l ≤ L); the posterior distribution of θ will remain a scaled gamma: L n(l)−1 Pr(θ | p, n(l), α, β) ∝ θ p+α−1 exp(−θ(β + ( l=1 i=1 1/i))) . (19) 3.9 3.9.1 Exercises Coalescence with the K-sites model Modify the program for simulating from the coalescence to a K-sites mutation model. 3.9.2 The Polya-urn distribution Prove formula 14 using induction on n. 3.9.3 Continuous Time Sampling Extend the discrete coalescence sampling in section 3.4 to continuous time. The continuous version has been introduced and used by Stephens and Donnelly (2000). 4 4.1 The Inﬁnite Sites Models without recombination The K-SitesModel In this subsection, we will study the K-sites and the extension to inﬁnite sites models in more detail. These are obviously models for DNA sequence evolution, more speciﬁcally for single basepair polymorphisms (SNPs). Generally, mutation is assumed to be independent in each of the K sites and toggles between zero and one with a rate of µ per site. With K sites, we thus have 2K possible haplotypes. One can number them in diﬀerent ways. Obviously, a mutation enables transitions only between haplotypes that are diﬀerent at exactly one base, such that mutationonal transitions between many haplotypes require intermediates. Hence, it can be shown immediately that detailed balance cannot hold: consider a transition of a certain haplotype i between xi = 0 and xi = 1 that is oﬀset by a change of in abundance of haplotype j from xj + 1 to xj and suppose that these two haplotypes are separated by more than one mutation. For detailed balance, we require that the ﬂow between these states is equal. Yet drift may get rid of the one xi but cannot restore it. Hence the ﬂow cannot be balanced. Nevertheless, as we already showed in our previous examples, sampling is easily possible with the K-sites model with the Moran and the CTC forward model. It is also easy with the coalescence model, both backward in time (i.e., a “real” coalescence) or forward in time, i.e., some form of urn model. With real datasets, we often have an “outgroup” available, i.e., a sequence from a closely related species that allows us to polarize our polymorphisms in ancestral and derived. 10 4.2 The Inﬁnite SitesModel The inﬁnite sites (IS) model is by far the most important mutation model. In the previous section, we already dealt with a special case, the high recombination rate limit (which corresponds to the low mutation rate limit). In this section, we will devote our attention to the other limit of no or negligible recombination rate. This is not motivated by genetics , but by convenience. Eﬀective recombination rates are of the same order or higher than eﬀective mutation rates. But intermediate recombination rates (same order of magnitude as mutation rates) are tough to handle. Usually, the per-site scaled mutation rate θ = N µ (up to a factor of two or four) is on the order of 10−2 to 10−6 and sequences are quite long (500 bps or longer). Thus, the per site mutation rate is so low that multiple hits of the same site in the sample are virtually impossible and we can detect every mutation as polymorphism in the sample. This leads to the inﬁnite site (IS) model. As with most other mutation models, it is easier to simulate with the coalescence from the bottom up, conditional on a founder haplotype than vice versa. Again, coalescence (urn) simulation is approximate for the Wright-Fisher and Moran model and exact for the TCT model. 4.3 4.4 Inference with the Inﬁnite Sites Model ˆ Watterson’s θw , once more With the inﬁnite site (IS) model, the number of polymorphic sites corresponds to the number of mutations in the sample. We can thus count the number of polymorphic sites and use ˆ Watterson’s θw as estimator of θ. With low θ, this makes very eﬃcient use of the data (as we realized when we considered the high recombination rate limit). With high θ, we gain information on θ from the distribution of mutations. If, in a sample of n, all mutations happened when only two haplotypes were present, θ will be on average smaller than when all mutations happened when already all n haplotypes were present. This is because when only two or a few haplotypes are present the variability in the length of the branches is highest. If θ is high and these variable branches are by chance short, the number of polymorphic sites in the sample may be low. The converse is true, the other way round. To see how much diﬀerence this makes, we will assume that we know that all mutations in a sample of n were before and after all coalescences and calculate the likelihood with the following program: # coal_3.R # # The inefficiency of Watterson’s theta # n=20 #sample size S=25 #polymorphic sites vtheta=seq(0,50,0.1) xlow=rep(-999,length(vtheta)) xhigh=xlow vi=seq(1,(n-1),1) for(i in 1:length(vtheta)){ theta=vtheta[i] dconst=sum(log(vi/(vi+theta))) #the log of product of coalescences 11 xlow[i]=dconst+S*log(theta/(1+theta)) #the log of product of mutations at n=2 xhigh[i]=dconst+S*log(theta/(n+theta)) #the log of product of mutations at n=sample size } thetaw= S/sum(1/vi) plot(vtheta,xlow,type=’l’) lines(vtheta,xhigh,lty=2) abline(v=thetaw) ˆ Oviously, θw is striking a good compromise, but loses all information on the timing of mutations relative to coalescences. We see that we could improve inference if we knew exactly when (in relation to the coalescences) the mutations occurred. But even with the IS model, this is not possible. Nevertheless, polymorphisms that occur in a single sampled individual are likely to have occurred relatively recently (looking backwards in time), whereas if the sample is divided by a single deep split into two groups separated by many mutations, mutations must have occurred relatively far backwards. In the next part, we will make complete use of the information in the IS model. 4.5 Likelihood Given a sample of n haploid individuals that evolved according to the IS model, we realize that the individuals can be organized as “boxes within boxes”, such that we can construct the “unrooted gene tree”, i.e., we can build a hierarchical tree of relatedness among individuals on which we can distribute the mutations. With an outgroup, we can root the tree to obtain the “rooted gene tree”. In the following, we will consider only a rooted tree; the case of the unrooted tree follows by summation over all possible roots weighted by their probabilities. Consider a data set with n haploid individuals (or haplotypes) containing m mutations evolving under the IS. Assume an outgroup, and determine the rooted gene tree. Obviously, there are altogether S = n + m steps to the root of the tree. There are many possibilities (call these coalescence trees) to reach this root, by rearranging the sequences of mutations and coalescences, even though the unrooted gene tree is given by the data. A naive enumeration of all these coalescence trees and then to sum over their respective probabilities is tedious and computationally impossible, except in the most trivial data sets. We will thus proceed by using the Markov property of the coalescence and sum relatively eﬃciently over the coalescence trees. 4.5.1 “Summing out” the coalescence trees using the forward algorithm Consider each individual separately. Obviously, even before observing anything, each individual has just two possibilities in the previous step: it may either mutate or coalesce with the respective probabilities of θ/(n(n − 1) + nθ) and (n − 1)/(n(n − 1) + nθ). Given the rooted gene tree, only zero or one possibilities remain: it may either coalesce, if there are altogether k individuals of its type with transition probability k(k − 1)/(n(n − 1)+ nθ), or it may mutate, if it is separated by a mutation from the next junction in the rooted tree with transition probability θ/(n(n − 1) + nθ), or it remains without a change if there are no other individuals of its type and it is not separated from the next junction by a mutation. If we go through all n individuals, i.e., at the end of the ﬁrst (and before the second step), we get get quite a number of possible conﬁgurations indexed by k, ck (s = 1) and their conditional probabilities fk (s = 1). In following steps, we proceed as in the ﬁrst step and go through all individuals of each possible conﬁguration. We can now get the same conﬁguration by diﬀerent sequences of elementary 12 events. Since they are mututally exclusive, we sum their probabilities. After S stelps we will eventually end at the root and obtain the probability of the rooted gene tree given the data. In detail, the algorithm can be summarized as follows: • Initialization (s = 0): Set the conﬁguration at s = 0 to the data set and its probability to unity, i.e., f1 (0) = 1; • Recursion (s = 1, . . . , S − 1): Generate all conﬁgurations compatible with the data that are one step below the current conﬁgurations (i.e., have one fewer haplotype or one fewer mutation, but not both), index them by K, with 1 ≤ k ≤ K, and calculate their probabilities using: fk (s + 1) = l fl (s)plk , where plk = Pr(cl (s) | ck (s − 1), θ) are the transition probabilities. • Termination: The likelihood is Pr(y | θ) = f1 (S). This algorithm is a variant of the forward algorithm employed in the theory of hidden Markov processes. The number of possible conﬁgurations for intermediate s can be quite large. Nevertheless, summation from one step to the next is vastly more eﬃcient than evaluating all possible paths until the ﬁnal coalescence. We note that fk (s) cannot be interpreted as the probability of the conﬁguration at the step. Only the ﬁnal f1 (S) corresponds to the probability of the sum of all coalescence trees, i.e., the likelihood Pr(y | θ). 4.5.2 Example To illustrate the method, consider a toy data set with only one segregating site: ID Sequence # 1 1 3 2 0 2 where 1 denotes the derived state. Then we have the following sequence of possible conﬁgurations ck (s) of the two haplotypes (numbers of haplotypes in parentheses). Step 0 1 1 2 2 3 3 4 5 ID c1 (0) c1 (1) c2 (1) c1 (2) c2 (2) c1 (3) c2 (3) c1 (4) c1 (5) haplotype frequencies (2,3) (1,3) (2,2) (1,2) (2,1) (1,1) (3,0) (2,0) (1,0) Denote a tenth type as the one that is incompatible with the data. We see that we can move, e.g., in the ﬁrst step, from c1 (0) to either c1 (1), with probability proportional to 2 · 1/2 = 1, or f2 (1), with probability proportional to 3·2/2 = 3. The proportionality constant is the probability of moving to any type, i.e., in both cases 5 · 4/2 + 5θ = 10 + 5θ. Hence, the probability of moving to the incompatible state is (6 + 5θ/2)/(10 + 5θ/2) The transition matrix for the conﬁgurations 13 (c1 (0), . . . , c1 (5)) to conﬁgurations (c1 (0), . . . , c1 (5)) then is: 0 0 0 0 0 0 0 0 0 0 1/(10 + 5θ/2) 0 0 0 0 0 0 0 0 0 3/(10 + 5θ/2) 0 0 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 0 0 1/(3 + 3θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 (θ/2)/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 0 (θ/2)/(1 + 2θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 0 1/(1 + 2θ/2) 1 0 (6 + 5θ/2)/(10 + 5θ (5 + 4θ/2)/(6 + 4θ (4 + 4θ/2)/(6 + 4θ (2 + 3θ/2)/(3 + 3θ (2 + 2θ/2)/(3 + 3θ (1 + θ/2)/(1 + 2θ/ (2 + 3θ/2)/(3 + 3θ (2θ/2)/(1 + 2θ/2 0 1 (20) Since the likelihood of the type incompatible with the data is 0, we can simply eliminate the tenth row and column from the transition matrix to obtain: 0 0 0 0 0 0 0 0 0 1/(10 + 5θ/2) 0 0 0 0 0 0 0 0 3/(10 + 5θ/2) 0 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 1/(6 + 4θ/2) 0 0 0 0 0 0 0 0 0 1/(3 + 3θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 (θ/2)/(3 + 3θ/2) 0 0 0 0 0 0 0 0 0 (θ/2)/(1 + 2θ/2) 1/(3 + 3θ/2) 0 0 0 0 0 0 0 0 1/(1 + 2θ/2)) 1 (21) . 4.6 The Ineﬃciency of Watterson’s θ Even with the inﬁnite site model, we rarely know for sure when, in relation to the coalescences, mutations occurred. We, however, have much more information available than with a parentindependent mutation model. Consider a situation, where we have the same number m of polymorphisms present in the data. If the data contain exclusively singletons, i.e., each mutation is only present in a single haplotype, the mutations likely occurred recently. If, on the other hand, two groups of haplotypes are separated by polymorphisms shared within a group, the mutations likely occurred far back in time. As above, we reason that if, in a sample of n, all mutations happened when only two or very few haplotypes were present, θ will be on average smaller than when all mutations happened when already all n haplotypes were present. This is because when only two or a few haplotypes are present the variability in the length of the branches is highest. If θ is high and these variable branches are by chance short, the number of polymorphic sites in the sample may be low. The converse is true, the other way round. The following constructed example shows that this may make a diﬀerence: 14 Posterior 0.000 0.002 0.004 0.006 0.008 0.010 0.012 0 5 10 Theta 15 20 Here, the two solid lines are the posterior distributions from the mock examples from above. The black solid line represents all mutation events at the tips, the black stippled line all mutation events at the root, the red solid line all two diﬀerentiated groups, and the black stippled line all singletons. We see that, in this extreme constructed case, we gain some information with the likelihood method over Watterson’s estimate of θ. Nevertheless, the gain will be much less in real data sets. 4.7 4.7.1 Exercises Number of Roots Caclulate the number of rooted trees given an unrooted coalescence tree with mutations. 15 4.7.2 Reversal of Time, Probabilities of ck (s) Find an algorithm that traverses from the root of the rooted gene tree up to the extant data set, again step by step, denoting the quantities corresponding to fk (s) with bk (s). Call this the backward algorithm. Find the probability of ck (s) using fk (s), bk (s), and the likelihood f1 (S). 5 Population Demography Generally, populations do not conform to the model of large and constant population sizes over appreciable times. Rather, population sizes grow and shrink. This makes the scaled mutation rate θ a function of time, i.e., θ(t). Other than that, the coalescence proceeds as before. The states are still Markov, i.e., the next coalescence/mutation depends only on the previous state and not on what has gone on before. But this Markov property of states does not imply a Markov property of times. 5.1 One Change in θ Let us consider a very simple model: one change from θc to θa (usually by a change in eﬀective population size N ) at time tc , which we assume conveniently to be scaled by Nc . If we consider a coalescence model (backwards in time), the times of the ﬁrst coalescence or mutation are distributed exponentially with rate (n(n − 1) + nθc ) until tc and with (n(n − 1) + nθa ) thereafter. If we prefer a discrete time model, the exponential distribution would need to be replaced by a geometric one. After the ﬁrst coalescence, the same process continues until tc for the second, and so on. We thus see that we need to explicitely reinstate time for the transition probabilities between the coalescence states. In practice, this means one more loop in the programs, e.g., for the inﬁnite site model with free recombination: TwoNe=1000 TMAX=10*TwoNe N=10 #samplesize nu=1/TwoNe # Generate the y-Matrix y=matrix(rep(0,(TMAX+1)*N),nrow=TMAX+1,ncol=N) y[1,]=c(1,rep(0,N-1)) for(t in 2:TMAX){ y[t,1]=(1-choose(N,2)*nu)*y[t-1,1] for(i in 2:(N-1)){ y[t,i]=(1-choose(N-i+1,2)*nu)*y[t-1,i]+choose(N-i+2,2)*nu*y[t-1,i-1] } y[t,N]=y[t-1,N]+choose(2,2)*nu*y[t-1,N-1] } y[TMAX+1,]=c(rep(0,N-1),1) plot(y[,1],ylab=’Probabilities’,type=’l’,xlim=c(0,TMAX+1)) for(i in 2:N){ lines(y[,i]) } 16 #Get Polya probs pPolya=matrix(rep(0,(N-1)*(N-1)),nrow=N-1) for(s in (N-1):1){ for(x in 1:(N-s)){ pPolya[N-s,x]=choose(N-1-s,x-1)/beta(s,1)*beta(N-x,x) } } j=seq(0,TMAX/10,1) pmat=matrix(rep(0,(N-1)*length(j)),ncol=(N-1)) for(k in j){ tchange=k*10 #tchange #Multiply with the theta ratio and the number of mutation possibilities theta_ratio=0.1 pS=rep(0,N) for(i in 1:N-1){ pS[i]=(N-i+1)*(sum(theta_ratio*y[1:tchange,i]) +sum((y[(tchange+1):length(y[,i]),i]))) } pS=pS/sum(pS) #Get posterior for(s in (N-1):1){ for(x in 1:(N-s)){ pmat[k+1,x]=pmat[k+1,x]+pS[N-s]*pPolya[N-s,x] } } pmat[k+1,]=pmat[k+1,]/sum(pmat[k+1,]) } # pmat is the result of all these calculations pmat 5.2 5.2.1 Exercises Transition Matrix Modify the transition matrix in formula 21 to the “one change in θ” model. 5.2.2 Exponential Growth In the literature, exponential growth models (forward in time) from some Na to Nc within a time of t are popular. How many parameters are needed for this model, if exponential growth continues to the presence, and how many are needed if exponential growth occurs for an interval that ﬁnishes before the present time? Modify the above program of stepwise increase/decrease in population size to one with exponential increase, where the exponential increase/decrease has not yet ﬁnished. 17 6 Structured Populations A species may be mainly homogenous, i.e., panmictic, or subdivided into demes. The earlier analyses apply strictly only in the former case. The latter case of population subdivision will be considered in this chapter. The most inﬂuential model of population subdivision is Wright’s island model. 6.1 The Inﬁnite Island Model Without Mutation A simple case is the inﬁnite island model, where it is assumed that inﬁnitely many demes persist indeﬁnitely (this assures that allele frequencies do not change with time in the absence of mutation), each with the same population size N (or more general and in accord with the literature: 2Ne ), exchange m migrants (or 2m gametes if individuals are assumed diploid) per generation in such a way that each population contributes to a common migrant pool out of which the migrants are drawn (i.e., there is no eﬀect of neighborhood). Since all demes are equivalent, we can pick a particular one. For simplicity we assume a biallelic locus, where the allele frequency of one allele is p for all times (absence of mutation and drift in the general population, because of the inﬁnite number of demes); extension to multiallelic loci is straightforward (generalization from the binomial-beta to the multinomial-Dirichlet distribution). 6.1.1 One population, p given Again, we can either model with a continuous time model, or with discrete generations. For simplicity, we choose the continuous time model. As before, in the case of the biallelic mutation model, individuals die and are replaced by other individuals at rate of 1/N 2 . At each migration step, a random individual is chosen out of the N individuals within the deme for being replaced by a migrant. Set γ = N m. (Attention: in the context of subdivided populations θ is used diﬀerently than in the context of mutation!). We construct a table as above (Table 2.2.1)— birth/death and migration events are mutually exclusive such that their probabilities can be added: State x+1 x x−1 both Drift and Mut. − x) + γp(N − x)) 1 − 1/N 2 (2x(N − x) + γp(N − x) + γ(1 − p)x) 1/N 2 (x(N − x) + γp(N − x)) , 1/N 2 (x(N The equilibrium distribution is again a beta-binomial distribution with α = γp and β = γ(1 − p): Pr(x|γ, N ) = Γ(γ) Γ(x + γp)Γ(N − x − γ(1 − p)) N! . x!(N − x)! Γ(γp)Γ(γ(1 − p)) Γ(N + γ) (22) As above, it can be shown that detailed balance holds, i.e.,, that the ﬂow from x to x + 1 is the same as that from x + 1 to x. Thus the biallelic inﬁnite island or migration-drift model is formally equivalent to the biallelic mutation-drift model. We are, in this case, mainly interested in γ. The distribution of γ is non-standard. We note that the Beta-binomial distribution can be broken up into two distributions: sample π, the proportion of the focal allele in the population, from a beta(γp, γ(1 − p)) and then sample x from a binomial(N, π). The beta-binomial can be obtained from the two distributions by “integrating out” π. 18 6.1.2 Fst or θ Usually, instead of using γ one parametrizes using Fst = θ = 1/(γ + 1). (Again a warning of the diﬀerent use of θ in the context of migration versus mutation!) The coeﬃcient of coancestry Fst or θ is deﬁned as the probability of coalescence of a pair of random alleles within the population. (The alternative is that one of the pair migrates before coalescence.) How we get there should be clear by now. We note that γ = (1 − θ)/θ. # coal_5.R # # theta=1/gamma+1); gamma=N*m, where m migration coefficient # dbetabin= function(x,N,alpha,beta,log=FALSE){ if(log==TRUE){ lchoose(N,x)-lbeta(alpha,beta)+lbeta(x+alpha,N-x+beta) } else{ choose(N,x)/beta(alpha,beta)*beta(x+alpha,N-x+beta) } } N=10 #sample size y=1 #sample size of first allele p=0.4 theta=seq(0.01,0.99,0.01) plot((1-theta)/theta,dbetabin(y,N,(1-theta)/theta*p,(1-theta)/theta*(1-p)),type=’l’) We note that with more independent loci, estimation of γ improves. The earlier assumption that all demes (or subpopulations) need to have the same eﬀective migration rate γ or equivalently coeﬃcient of coancestry θ is not necessary; each population may have its own. We now need to watch out with the indexing; generally l would be taken to index the loci and i the populations. (If multiple alleles are assumed, a may index the alleles.) 6.1.3 Other Deﬁnitions of Fst Many deﬁnitions of Fst are available around. The above deﬁnition, Fst = θ = 1/(γ + 1), is equal to the probability of two random gametes coalescing within the population before reaching the pool of migrants. Often the following deﬁnition for Fst is given: Fst = 1 − observedheterozygosity , expectedheterozygosity (23) But this is problematic. Rather “observed heterozygosity” should be replaced by expected heterozygosity in the deme or subpopulation. We then have: Fst = 1 − 2 E(π(1 − π) , 2p(1 − p) (24) where π is deﬁned as the proportion of the focal allele in the subpopulation. We note that the distribution of π is beta with α = γp and β = γ(1 − p), such that E(π) = p and Var(π) = 19 p(1 − p)/(γ + 1). It follows that E(π2) = p2 + p(1 − p)/(γ + 1). Substituting into equation (25) gives: 1 2 E(π) − E(π 2 )) =1−1+ . (25) Fst = 1 − 2p(1 − p) γ+1 This result is the same as above. Another deﬁnition for biallelic loci is: Fst = Var(π) , p(1 − p) (26) where π is deﬁned as the proportion of the focal allele in the subpopulation. With the mainland/island model (or equivalent), we have the variance of the Beta distribution: αβ/((α + β)2 (α + β + 1)), such that we get: Fst = 1 γ 2 p(1 − p)/(γ 2 (γ + 1)) = . p(1 − p γ+1 (27) We see that all three deﬁnitions are equivalent. 6.2 p to be estimated, two or more populations We generally do not know p and thus need to estimate it from the data. We then need at least two populations, because otherwise the estimation of γ and p is confounded. We will again just consider a particular locus and thus leave away indexing for the locus but will introduce the index i for the population. We are usually not interested in p, but only in γi . The natural strategy, for the case of p unknown, would need to be to supply a prior for p and “integrate it out”. A natural prior for p would be beta with 0 < α = β ≤ 1. We could use equation (22), subdivide the interval 0 < p < 1 into small enough parts and sum out p. This would be rather straightforward and should work relatively well, since the probability mass of p is usually spread out considerably, such that numerical problems need not be feared. I will, however, introduce a Markov chain Monte Carlo (MCMC) sampling scheme (Vogl et al., 2003) that converges to the posterior distribution. Generally, given the number of alleles of both types, i.e., m and M − m, the likelihood is binomial, such that p is beta distributed after multiplication with the beta prior: Pr(p | x, N, α, β) ∝ pα+m−1 (1 − p)β+M −m−1 (28) The problem with the distribution of p is that m and M are actually random numbers, because we must only count those alleles that make it to the general population, i.e., that do not coalesce within the populations. Suppose that we have y1 out of N1 alleles of the ﬁrst type in the ﬁrst population and y2 out of N2 alleles of the ﬁrst type in the second population. Conditional on γi and p, the probability of coalescence versus migration is independent for each allele and population i. Hence, we have the probability of migration if there are s individuals of the ﬁrst allele: Pr(mi | γ, p, s) = γi p ; s − 1 + γi p (29) the last individual has to migrate; and analogously for the second allele. Iterating over s, starting form s = m, we can draw a sample from mi and similarly for the other allele to obtain M . 20 Conditional on p, yi and Ni , we can update γ. Unfortunately, the distribution of γ is nonstandard. Furthermore, at least to me, an equal prior for θ seems more natural than one for γ, Hence, I would rather reparametrize and then use a Metropolis step to obtain a new θi for each population. Altogether, we then have: # coal_6.R # # theta=1/(gamma+1); gamma=N*m, where m migration coefficient # dbetabin= function(x,N,alpha,beta,log=FALSE){ if(log==TRUE){ lchoose(N,x)-lbeta(alpha,beta)+lbeta(x+alpha,N-x+beta) } else{ choose(N,x)/beta(alpha,beta)*beta(x+alpha,N-x+beta) } } #the parent independent mutation or migration distribution #returns the number of migrants/mutants rparentind= function(n=1,N=1,agamma=1){ if(N<1){ y=rep(0,n) } else if(N==1){ y=rep(1,n) } else{ y=rep(1,n) for(i in (N-1):1){ y=y+rbinom(n,1,agamma/(i+agamma)) } } y } allfreqs=matrix(c(3,8,10,10,8,2,10,10),nrow=2) #y_1,N_1,y_2,N_2 maxit=1000 #maximum number of iterations vresults=matrix(rep(0,maxit*3),ncol=3) theta=rbeta(1,1,1) #sample initial theta from prior p=rbeta(length(allfreqs[,1]),0.1,0.1) #sample initial p from prior for(it in 1:maxit){ ll=0; for(i in 1:length(allfreqs[,1])){ y1=c(rparentind(1,allfreqs[i,1],(1-theta)/theta*p[i]), rparentind(1,allfreqs[i,2]-allfreqs[i,1],(1-theta)/theta*(1-p[i]))) 21 y2=c(rparentind(1,allfreqs[i,3],(1-theta)/theta*p[i]), rparentind(1,allfreqs[i,4]-allfreqs[i,3],(1-theta)/theta*(1-p[i]))) p[i]=rbeta(1,0.1+y1[1]+y2[1],0.1+y1[2]+y2[2]) loglike=dbetabin(allfreqs[i,1],allfreqs[i,2],(1-theta)/theta*p[i], (1-theta)/theta*(1-p[i]),log=TRUE)+dbetabin(allfreqs[i,3], allfreqs[i,4],(1-theta)/theta*p[i],(1-theta)/theta*(1-p[i]),log=TRUE) thetaStar=theta+(0.05-runif(1)*0.1) if((0<thetaStar) && (1 > thetaStar)){ loglikeStar=dbetabin(allfreqs[i,1],allfreqs[i,2],(1-thetaStar)/thetaStar*p[i], (1-thetaStar)/thetaStar*(1-p[i]),log=TRUE)+dbetabin(allfreqs[i,3], allfreqs[i,4],(1-theta)/theta*p[i],(1-thetaStar)/thetaStar*(1-p[i]),log=TRUE) if(loglikeStar>loglike){ theta=thetaStar loglike=loglikeStar } if(runif(1)>exp(loglikeStar-loglike)){ theta=thetaStar loglike=loglikeStar } } ll=ll+loglike; } vresults[it,1]=ll vresults[it,2]=theta } layout(mat=matrix(1:2,2,1,byrow=FALSE)) plot(vresults[,1],type=’l’,xlab="Iterations",ylab="log likelihood") plot(vresults[,2],type=’l’,xlab="Iterations",ylab="theta") layout(mat=matrix(1:1)) 7 Exam Question The basic situation is: a population of inﬁnite size and without mutation where one allele of a biallelic locus has a proportion of p (for all times). A smaller population of constant size N splits from this population at time t in the past. The populations do not exchange any migrants after the split (before they were one panmictic population with free exchange of individuals or gametes). Part 1: Choose a deﬁnition of Fst and calculate Fst (N, t) = f (N, t) (choose either continuous or discrete time). Given N ﬁnd the time t1/2 such that Fst (N, t1/2 ) = 0.5. (Note that only if you choose continuous time, this will in general be possible exactly; if you choose discrete time, take the closest value.) Part 2: Given N , the t1/2 from the previous part, and a p between 0.2 and 0.8, write a program with “R” to sample n haploid individuals (n about 10 or 20) from the subpopulation at time t = 0. Let the random variable y be the number of individuals with allele 1 in the 22 sample. (Remark: there are many possibilities to do this. Any approach covered in this course is allowed; do not use the diﬀusion approach!) 23 References Donnelly, P. and Kurtz, T. (1996). A countable representation of the Fleming-Viot measurevalued diﬀusion. Annals of Probability, 24, 698–742. Ethier, S. and Griﬃths, R. (1987). The inﬁnitely many sites model as a measure-valued diﬀusion. Annals of Probability, 15, 515–545. Fisher, R. (1930). The Genetical Theory of Natural Selection. Clarendon Press, Oxford. Hudson, R. (1983a). Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology. Hudson, R. (1983b). Testing the constant-rate neutral allele model with protein sequence data. Evolution, 37, 203–207. Kingman, J. (1982). On the genealogy of large populations. Journal of Applied Probability, 19A, 27–43. Moran, P. (1958). Random processes in genetics. Proc. Camb. Phil. Soc., 54, 60–71. Sawyer, S. and Hartl, D. (1992). Population genetics of polymorphism and divergence. Genetics, 132, 1161–1176. Stephens, M. and Donnelly, P. (2000). Inference in Molecular Population Genetics. Journal of the Royal Statistical Society, Series B, 62, 605–655. Tajima, F. (1983). Evolutionary relationship of DNA sequences in ﬁnite populations. Genetics, 105, 437–460. Tavar´, S. and Ewens, W. (1997). Multivariate discrete distributions, chapter The Ewens same pling formula. Wiley. Vogl, C., Das, A., Beaumont, M., Mohanty, S., and Stephan, W. (2003). Population Subdivision and Molecular Sequence Variation: Theory and Analysis of Drosophila ananassae Data. Genetics, 165, 1385–1395. Watterson, G. (1975). On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7, 256–276. Wright, S. (1931). Evolution in Mendelian populations. Genetics, 16, 97–159. 24

DOCUMENT INFO

Shared By:

Categories:

Tags:
population genetics, population size, coalescent theory, genetic drift, natural selection, gene flow, mutation rate, genetic variation, neutral theory, gene trees, molecular evolution, common ancestor, effective population size, species trees, migration rate

Stats:

views: | 17 |

posted: | 12/27/2009 |

language: | German |

pages: | 24 |

OTHER DOCS BY cometjunkie50

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.