Introduction to Bayesian Phylogenetics
Simon Ho

Part 1: Introduction to Phylogenetic Analysis

What is phylogenetic analysis?
• The process of inferring the phylogeny of a set of taxa
• The phylogeny refers to the true evolutionary relationships underlying a set of taxa
• We can never know the phylogeny* but we can estimate it
• The phylogeny can be inferred from various types of data, including morphological and molecular
*With some exceptions

Why do phylogenetic analysis?
• Two fundamental results:
  – Estimate of evolutionary relationships
  – Estimate of evolutionary rates and time-scales
• These provide information for:
  – Phylogeography
  – Conservation genetics
  – Population genetics
  – Medicine and epidemiology
  – and more …

Fundamental assumptions
• Phylogenetic methods make several fundamental assumptions:
  – Each aligned site represents a set of orthologous characters
  – Sites in an alignment evolve independently
  – The relationships among the sequences can be represented by a bifurcating (binary) tree

Popular phylogenetic methods
• Distance-based methods
  – UPGMA
  – Neighbour-joining
• Maximum parsimony
• Maximum likelihood
• Bayesian inference

Distance-based methods
1. Calculate the distance between each pair of sequences; this distance can be corrected according to a chosen model of substitution
2. Put all of these pairwise distances into a matrix
3. Use an algorithm to construct a tree from this matrix

Maximum parsimony
1. Select a bifurcating tree topology
2. Count the evolutionary steps needed to explain the data
3. Repeat this for all possible bifurcating trees
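Step 2 of the parsimony procedure, counting the evolutionary steps that a given tree requires, is classically done with Fitch's algorithm. The slides give no code, so the following is a minimal illustrative sketch for a single alignment site on a hard-coded four-taxon tree (the function and variable names are my own, not from the slides):

```python
def fitch_site_score(tree, states):
    """Minimum number of substitutions needed at one aligned site,
    given a rooted bifurcating tree (Fitch parsimony).

    tree   -- nested tuples of taxon names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping each taxon name to its base at this site
    """
    score = 0

    def state_set(node):
        nonlocal score
        if isinstance(node, str):          # leaf: the observed base
            return {states[node]}
        left = state_set(node[0])
        right = state_set(node[1])
        if left & right:                   # sets overlap: no change needed here
            return left & right
        score += 1                         # disjoint sets: one substitution required
        return left | right

    state_set(tree)
    return score


# One site of an alignment for four taxa
site = {"A": "T", "B": "T", "C": "G", "D": "G"}

# Scoring the same site on two of the possible four-taxon topologies
print(fitch_site_score((("A", "B"), ("C", "D")), site))  # 1 step
print(fitch_site_score((("A", "C"), ("B", "D")), site))  # 2 steps
```

Summing this score over all sites gives the tree's parsimony score; the maximum-parsimony tree is the topology that minimises this total.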
• The tree that requires the fewest steps to explain the data is the ‘maximum-parsimony tree’

Maximum likelihood
• Likelihood = Pr(data | tree, parameters)
• The likelihood value is calculated for each site
• These values are multiplied across sites to obtain the overall likelihood
• The likelihood is calculated for different tree topologies, branch lengths, and model parameter values
• The tree yielding the highest likelihood is the maximum-likelihood tree
• The parameter values yielding the highest likelihood are the maximum-likelihood estimates

Classifying phylogenetic methods
• Various features can be used to classify phylogenetic methods:
  – How they find the ‘best’ tree
    • Algorithm: follow a series of steps to construct a tree
    • Optimality criterion: calculate a score for each possible tree and find the tree with the ‘optimal’ score
  – Use of DNA or amino-acid substitution models
• By search strategy: distance-based methods (e.g., N-J) are algorithm-based; maximum parsimony and maximum likelihood use an optimality criterion; Bayesian inference falls in neither category
• By use of substitution models: maximum parsimony is not substitution model-based; distance-based methods, maximum likelihood, and Bayesian inference are model-based

Problems affecting these methods
• Distance-based methods
  – Don’t use all of the information in the alignment
  – Can’t implement sophisticated evolutionary models
• Maximum parsimony
  – Affected by long-branch attraction (doesn’t handle homoplasy well because it is not substitution model-based)
  – Can’t estimate rates or dates
• Maximum likelihood
  – Unable to implement highly parameterised models
  – Difficult to obtain a confidence interval for the ML tree

Computational intractability
• Problem: as the number of sequences grows, the number of possible trees grows hyper-exponentially
• There are too many trees for an exhaustive search
• Solution: heuristic search methods
  – Don’t look at all possible trees
  – Use an algorithm to limit the search to ‘good’ trees
  – Start the search from different starting points

Part 2: Introduction to Bayesian Phylogenetic Analysis

Bayesian inference
• First applied to phylogenetics in 1997
• Based on Bayes’s theorem
• Major Bayesian phylogenetic software includes:
  – MrBayes (trees)
  – BEAST (trees, rates, and dates)
  – multidivtime (rates and dates)

The Bayesian paradigm
• Parameters have distributions
• Before the data are observed, each parameter has a prior distribution
• The likelihood of the data is computed
• The prior distribution is combined with the likelihood to yield the posterior distribution

Bayesian inference
• Based on Bayes’s theorem:
  Pr(tree, parameters | data) = [ Pr(tree, parameters) × Pr(data | tree, parameters) ] ÷ Pr(data)
  – Pr(tree, parameters | data): the posterior
  – Pr(tree, parameters): the prior
  – Pr(data | tree, parameters): the likelihood
  – Pr(data): the marginal probability of the data, summed over all possible parameter values and tree topologies

Bayesian inference
• Posterior ∝ Prior × Likelihood
  – Posterior: what we want to estimate
  – Prior: specified by the user, independent of the data
  – Likelihood: calculated from the data

Priors
• Reflect our prior expectations (and uncertainty) about the values of parameters, without knowledge of the data
• Priors are chosen in the form of probability distributions
• Examples:
  – Ratio of transitions to transversions: somewhere between 0 and 100 → Uniform(0, 100)
  – Substitution rate: probably around 3.2 × 10⁻⁸ → Normal(3.2 × 10⁻⁸, σ)

Priors
• But what about a prior for the tree? This can be handled in three ways:
  1. Flat prior on topologies and branch lengths
  2. Flat prior on topologies, but with an arbitrary prior on branch lengths (MrBayes), e.g., branch lengths follow Exponential(10)
  3. Prior on tree topology and branch lengths (BEAST), provided by a stochastic branching process (more about this in the next talk)

Priors
• Priors can be specified on the following bases:
  1. Use of a biologically realistic model
  2. Past observations
  3. Subjective beliefs
• What if these are not available?
  – Use uninformative/diffuse/vague priors
  – Give the parameters of the priors their own priors (hierarchical Bayesian analysis)

Advantages over likelihood
• Able to implement highly parameterised models
• Estimating tree uncertainty is straightforward
  – This can only be done indirectly in maximum likelihood (by bootstrapping)
• Posterior probabilities have an easy interpretation
  – The posterior probability of a clade is the probability that the clade is correct, given the data and model
• Can easily integrate over ‘nuisance’ parameters (i.e., those that are not of immediate interest)
• Can incorporate independent information (in the prior)

Problems in Bayesian analyses
• Sensitivity of the posterior to the prior
  – This problem can arise if the data are uninformative (Posterior ∝ Prior × Likelihood)
• Overparameterisation
  – Simple example: trying to estimate both the substitution rate and the divergence time from a single pairwise genetic distance
  – This problem is not always obvious in the analysis
• High clade posterior probabilities
  – Typically higher than bootstrap support values
  – This problem needs further investigation

Summary
• Maximum likelihood: the probability of the data, given the tree and parameters
• Bayesian inference: the probability of the tree and parameters, given the data

Part 3: Markov Chain Monte Carlo Sampling

Estimating the posterior
• Remember this? Posterior ∝ Prior × Likelihood
  – Posterior: what we want to estimate
  – Prior: specified by the user, independent of the data
  – Likelihood: calculated from the data

How to estimate the posterior?
• It is impossible to obtain the posterior directly
• Instead, the posterior can be estimated using Markov chain Monte Carlo (MCMC) simulation
• This is usually done using the Metropolis-Hastings algorithm

Metropolis-Hastings algorithm
1. Choose a starting tree and parameter values
2. Calculate (prior × likelihood) at the current location
3. Propose a change to one or more parameters and/or the tree (i.e., a change of location)
4. Two situations:
   – If the proposed location is better, move to the new location
   – If the proposed location is worse, move to the new location with probability equal to the ratio of the new location to the old location
5. Record the tree and parameter values at each step

Proposing moves
[Diagram: a proposed move to a location better than the current one is always accepted; a move to a worse location is accepted with probability equal to the ratio of the new location to the old (e.g., a ratio of 1/3 gives an acceptance probability of 1/3)]

Posterior distribution
[Trace plot: the sampled values climb through an initial burn-in phase before settling into a stationary phase]
• Take samples every n steps (e.g., every 100 steps)
• Discard the first x% of steps as ‘burn-in’
• Plot a histogram from the remaining samples
• This provides an estimate of the posterior distribution!

Metropolis-coupling
• Use more than one chain in the analysis
• Additional ‘heated’ chains:
  – Are more willing to go downhill in the landscape
  – Act as ‘scouts’
• If one of the additional chains finds a better location, it swaps places with the ‘cold’ chain
• This results in quicker convergence and better mixing
• It reduces the chance of being trapped in a local optimum

Output from a Bayesian analysis
• A list of the parameter values visited by the Markov chain
  – .p file in MrBayes
  – .log file in BEAST
• A list of the trees visited by the Markov chain
  – .t file in MrBayes
  – .trees file in BEAST

Summarising the parameters
• Take the mean of the sampled values
  – This is the mean posterior estimate
• Take the interval containing 95% of the sampled values
  – This is the 95% credibility interval (for unimodal distributions) or the 95% highest posterior density interval (for multimodal distributions)

Summarising the trees
• For each node in the tree, calculate the proportion of sampled trees in which the node is present
• For each node, this proportion is the ‘posterior probability’ of the node
• Alternative ways of summarising the trees:
  – The sampled tree with the highest posterior probability → maximum a posteriori (MAP) tree
  – The sampled tree with the highest product of nodal posterior probabilities → maximum clade credibility (MCC) tree

Example results
[Figure: example results]

Key references
• Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates.
• Yang Z (2005) Bayesian Inference. In: Mathematics of Evolution and Phylogeny (ed. Gascuel O). Oxford University Press.
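The Metropolis-Hastings procedure described in Part 3 can be sketched in a few lines of code. This is not from the slides: for brevity it samples a single positive parameter from a toy posterior (an Exponential(1) prior combined with the likelihood of one exponentially distributed observation) rather than searching tree space, and all names are illustrative.

```python
import math
import random

def log_unnormalised_posterior(rate):
    """Toy stand-in for log[Pr(parameters) x Pr(data | parameters)]."""
    if rate <= 0:
        return float("-inf")                 # outside the prior's support
    log_prior = -rate                        # Exponential(1) prior, up to a constant
    log_likelihood = math.log(rate) - rate * 2.0  # one Exponential(rate) observation, x = 2.0
    return log_prior + log_likelihood

def metropolis_hastings(n_steps, start=1.0, window=0.5, seed=42):
    random.seed(seed)
    current = start                          # 1. choose a starting location
    current_lp = log_unnormalised_posterior(current)  # 2. (prior x likelihood) there
    samples = []
    for _ in range(n_steps):
        # 3. propose a change of location (symmetric sliding window)
        proposal = current + random.uniform(-window, window)
        proposal_lp = log_unnormalised_posterior(proposal)
        # 4. accept if better; if worse, accept with probability new/old
        if proposal_lp >= current_lp or random.random() < math.exp(proposal_lp - current_lp):
            current, current_lp = proposal, proposal_lp
        samples.append(current)              # 5. record the value at each step
    return samples

samples = metropolis_hastings(20000)
burn_in = len(samples) // 10                 # discard the first 10% as burn-in
thinned = samples[burn_in::100]              # take samples every 100 steps
mean_estimate = sum(thinned) / len(thinned)  # mean posterior estimate
# For this toy model the true posterior is Gamma(2, 3), so mean_estimate
# should land near 2/3
```

In a real phylogenetic analysis the "location" also includes the tree topology and branch lengths, and the recorded samples are summarised exactly as described above: burn-in, thinning, posterior means, and credibility intervals.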