# Introduction to Bayesian Phylogenetics

Simon Ho

## Part 1: Introduction to Phylogenetic Analysis

### What is phylogenetic analysis?

- The process of inferring the phylogeny of a set of taxa
- The phylogeny refers to the true evolutionary relationships underlying a set of taxa
- We can never know the phylogeny* but we can estimate it
- The phylogeny can be inferred from various types of data, including morphological and molecular

*With some exceptions

### Why do phylogenetic analysis?

- Two fundamental results:
  - Estimate of evolutionary relationships
  - Estimate of evolutionary rates and time-scales
- These provide information for:
  - Phylogeography
  - Conservation genetics
  - Population genetics
  - Medicine and epidemiology
  - and more …

### Fundamental assumptions

- Phylogenetic methods make several fundamental assumptions:
  - Each aligned site represents a set of orthologous characters
  - Sites in an alignment evolve independently
  - The relationships among the sequences can be represented by a bifurcating (binary) tree

### Popular phylogenetic methods

- Distance-based methods
  - UPGMA
  - Neighbour-joining
- Maximum parsimony
- Maximum likelihood
- Bayesian inference

### Distance-based methods

1. Calculate the distance between each pair of sequences. This distance can be corrected according to a chosen model of substitution
2. Put all of these pairwise distances into a matrix
3. Use an algorithm to construct a tree from this matrix
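Steps 1 and 2 can be sketched in Python, assuming the Jukes-Cantor correction as the chosen substitution model (the short sequences are made up for illustration):

```python
import math

def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    diffs = sum(x != y for x, y in zip(a, b))
    return diffs / len(a)

def jc_distance(a, b):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(a, b)
    return -0.75 * math.log(1 - 4 * p / 3)

# toy alignment (made up for illustration)
seqs = {"A": "ACGTACGTAC", "B": "ACGTACGAAC", "C": "ACTTACGAAT"}
names = sorted(seqs)
matrix = {(i, j): jc_distance(seqs[i], seqs[j])
          for i in names for j in names if i < j}
for pair, d in matrix.items():
    print(pair, round(d, 4))
```

Note that the corrected distance is always at least the raw proportion of differences, because it accounts for multiple substitutions at the same site.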

### Maximum parsimony

1. Select a bifurcating tree topology
2. Count the evolutionary steps needed to explain the data
3. Repeat this for all possible bifurcating trees. The tree that requires the fewest steps to explain the data is the ‘maximum-parsimony tree’
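Step 2, counting the steps a topology requires, is classically done with Fitch's algorithm. A minimal sketch for a single site on two candidate four-taxon topologies (toy data):

```python
def fitch(tree, states):
    """Return (state set, step count) for one site by Fitch's algorithm.
    `tree` is a nested tuple of taxon names; `states` maps taxon -> base."""
    if isinstance(tree, str):                      # leaf node
        return {states[tree]}, 0
    (ls, lc), (rs, rc) = (fitch(child, states) for child in tree)
    inter = ls & rs
    if inter:                                      # children agree: no extra step
        return inter, lc + rc
    return ls | rs, lc + rc + 1                    # children disagree: one step

site = {"A": "G", "B": "G", "C": "T", "D": "T"}    # one aligned site
tree1 = (("A", "B"), ("C", "D"))   # groups G with G and T with T
tree2 = (("A", "C"), ("B", "D"))   # mixes the states
print(fitch(tree1, site)[1], fitch(tree2, site)[1])
```

Here tree1 explains the site in fewer steps than tree2, so parsimony prefers it; over a full alignment the site counts are summed per tree.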

### Maximum likelihood

- Likelihood = Pr(data | tree, parameters)

### Maximum likelihood

- The likelihood value is calculated for each site
- These are multiplied across sites to obtain the overall likelihood
- The likelihood is calculated for different tree topologies, branch lengths, and model parameter values
- The tree yielding the highest likelihood is the maximum-likelihood tree
- The parameter values yielding the highest likelihood are the maximum-likelihood estimates
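A minimal illustration of these points for the simplest possible case, assuming the Jukes-Cantor model and just two sequences: site log-likelihoods are summed (equivalent to multiplying the site likelihoods), and a grid search over the branch length finds the maximum-likelihood estimate.

```python
import math

def site_log_likelihood(x, y, d):
    """Log-likelihood of one aligned site under Jukes-Cantor, distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # stationary 1/4 times Pr(same state)
    p_diff = 0.25 * (0.25 - 0.25 * e)   # stationary 1/4 times Pr(specific change)
    return math.log(p_same if x == y else p_diff)

def log_likelihood(a, b, d):
    # Sites are assumed independent, so site log-likelihoods are summed
    # (equivalent to multiplying the site likelihoods).
    return sum(site_log_likelihood(x, y, d) for x, y in zip(a, b))

a, b = "ACGTACGTAC", "ACGTACGAAC"   # toy pair differing at 1 of 10 sites
grid = [i / 1000 for i in range(1, 500)]
mle = max(grid, key=lambda d: log_likelihood(a, b, d))
print(round(mle, 3))   # the maximum-likelihood branch length
```

For this two-taxon case the maximum-likelihood branch length coincides with the Jukes-Cantor corrected distance, which connects the likelihood and distance-based views.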

### Classifying phylogenetic methods

- Various features can be used to classify phylogenetic methods:
  - How they find the ‘best’ tree
    - Algorithm: follow a series of steps to construct a tree
    - Optimality criterion: calculate a score for each possible tree and find the tree with the ‘optimal’ score
  - Use of DNA or amino-acid substitution models

### Classifying phylogenetic methods

|                              | Algorithm-based                    | Optimality criterion | Other              |
|------------------------------|------------------------------------|----------------------|--------------------|
| Not substitution model-based |                                    | Maximum parsimony    |                    |
| Model-based                  | Distance-based methods (e.g., N-J) | Maximum likelihood   | Bayesian inference |

### Problems affecting these methods

- Distance methods
  - Don’t use all the information in the alignment
  - Can’t implement sophisticated evolutionary models
- Maximum parsimony
  - Affected by long-branch attraction (doesn’t handle homoplasy well because it is not substitution model-based)
  - Can’t estimate rates or dates
- Maximum likelihood
  - Unable to implement highly parameterised models
  - Difficult to obtain a confidence interval for the ML tree

### Computational intractability

- Problem: as the number of sequences grows, the number of possible trees grows hyper-exponentially
- Too many trees for an exhaustive search
- Solution: heuristic search methods
  - Don’t look at all possible trees
  - Use an algorithm to limit the search to ‘good’ trees
  - Start the search from different starting points
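The hyper-exponential growth is easy to verify: the number of unrooted bifurcating trees for n taxa is the double factorial (2n − 5)!!.

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5)!! for n >= 3."""
    count = 1
    for k in range(3, n + 1):
        count *= 2 * k - 5   # each added taxon can attach to any existing branch
    return count

for n in (5, 10, 20, 50):
    print(n, num_unrooted_trees(n))
```

Already at 50 taxa the count dwarfs the number of atoms in the observable universe, which is why heuristic searches are unavoidable.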

## Part 2: Introduction to Bayesian Phylogenetic Analysis

### Bayesian inference

- First applied to phylogenetics in 1997
- Based on Bayes’s theorem
- Major Bayesian phylogenetic software includes:
  - MrBayes (trees)
  - BEAST (trees, rates, and dates)
  - multidivtime (rates and dates)

### Bayesian inference

- Parameters have distributions
- Before the data are observed, each parameter has a prior distribution
- The likelihood of the data is computed
- The prior distribution is combined with the likelihood to yield the posterior distribution

### Bayesian inference

- Based on Bayes’s theorem:

  Pr(tree, parameters | data) = [ Pr(tree, parameters) × Pr(data | tree, parameters) ] ÷ Pr(data)

  - Posterior: Pr(tree, parameters | data)
  - Prior: Pr(tree, parameters)
  - Likelihood: Pr(data | tree, parameters)
  - Pr(data): the marginal probability of the data, summed over all possible parameter values and tree topologies
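A toy numerical version of the theorem, using two candidate trees with made-up prior and likelihood values:

```python
# Discrete Bayes' theorem over two candidate trees (values made up for illustration)
prior = {"tree1": 0.5, "tree2": 0.5}              # Pr(tree)
likelihood = {"tree1": 1e-10, "tree2": 3e-10}     # Pr(data | tree)

# Pr(data): marginal probability, summed over all candidate trees
marginal = sum(prior[t] * likelihood[t] for t in prior)

# Pr(tree | data) = Pr(tree) * Pr(data | tree) / Pr(data)
posterior = {t: prior[t] * likelihood[t] / marginal for t in prior}
print(posterior)   # tree2 fits the data 3x better, so it gets 3/4 of the posterior
```

Dividing by the marginal is what makes the posterior probabilities sum to one; it is this sum over all trees and parameter values that is intractable in real analyses.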

### Bayesian inference

Posterior ∝ Prior × Likelihood

- Posterior: what we want to estimate
- Prior: specified by the user, independent of the data
- Likelihood: calculated from the data

### Priors

- Reflect our prior expectations (and uncertainty) about the values of parameters (without knowledge of the data)
- Priors are chosen in the form of probability distributions
- Examples:
  - Ratio of transitions to transversions
    - Somewhere between 0 and 100 → Uniform(0, 100)
  - Substitution rate
    - Probably around 3.2×10⁻⁸ → Normal(3.2×10⁻⁸, σ)

### Priors

- But what about a prior for the tree?
- This can be handled in three ways:
  1. Flat prior on topologies and branch lengths
  2. Flat prior on topologies, but with an arbitrary prior on branch lengths (MrBayes)
     - e.g., branch lengths follow Exponential(10)
  3. Prior on tree topology and branch lengths (BEAST)
     - Provided by a stochastic branching process

### Priors

- Priors can be specified on the following bases:
  1. Use of a biologically realistic model
  2. Past observations
  3. Subjective beliefs
- What if these are not available?
  - Use uninformative/diffuse/vague priors
  - Give the parameters of priors their own priors (hierarchical Bayesian analysis)

### Advantages of Bayesian analyses

- Able to implement highly parameterised models
- Estimating tree uncertainty is straightforward
  - This can only be done indirectly in maximum likelihood (bootstrapping)
- Posterior probabilities have an easy interpretation
  - The posterior probability of a clade is the probability that the clade is correct, given the data and model
- Can easily integrate over ‘nuisance’ parameters (i.e., those that are not of immediate interest)
- Can incorporate independent information (in the prior)

### Problems in Bayesian analyses

- Sensitivity of the posterior to the prior
  - This problem can arise if the data are uninformative: since Posterior ∝ Prior × Likelihood, a weak likelihood lets the prior dominate

### Problems in Bayesian analyses

- Overparameterisation
  - Simple example: trying to estimate the substitution rate and the divergence time from a pairwise genetic distance
  - This problem is not always obvious in the analysis
- High posterior probabilities for clades
  - Typically higher than bootstrap support values
  - This problem needs further investigation

### Summary

- Maximum likelihood: the probability of the data, given the tree and parameters
- Bayesian inference: the probability of the tree and parameters, given the data

## Part 3: Markov Chain Monte Carlo Sampling

### Estimating the posterior

- Remember this?

Posterior ∝ Prior × Likelihood

- Posterior: what we want to estimate
- Prior: specified by the user, independent of the data
- Likelihood: calculated from the data

### How to estimate the posterior?

- Impossible to obtain the posterior directly
- Instead, the posterior can be estimated using Markov chain Monte Carlo (MCMC) simulation
- This is usually done using the Metropolis-Hastings algorithm

### Metropolis-Hastings algorithm

1. Choose a starting tree and parameter values
2. Calculate (prior × likelihood) at the current location
3. Propose a change to one or more parameters/the tree (i.e., a change of location)
4. Two situations:
   1. If the proposed location is better, move to the new location
   2. If the proposed location is worse, move to the new location with probability equal to the ratio of the new to the old location
5. Record the tree and parameter values at each step
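The five steps can be sketched for a single continuous parameter; a toy target distribution stands in for (prior × likelihood) over trees, and the symmetric normal proposal is an arbitrary choice for the example:

```python
import math, random

def metropolis_hastings(log_target, start, n_steps, step_size=0.5, seed=1):
    """Metropolis-Hastings for one continuous parameter, symmetric proposal."""
    rng = random.Random(seed)
    x, samples = start, []                                # step 1: starting value
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)          # step 3: propose a move
        log_ratio = log_target(proposal) - log_target(x)  # step 4: compare locations
        if math.log(rng.random()) < log_ratio:            # accept better moves always,
            x = proposal                                  # worse ones with prob = ratio
        samples.append(x)                                 # step 5: record each step
    return samples

# toy target: log of an unnormalised Normal(2, 1) density
log_post = lambda x: -0.5 * (x - 2.0) ** 2
samples = metropolis_hastings(log_post, start=0.0, n_steps=20000)
mean = sum(samples[2000:]) / len(samples[2000:])   # discard burn-in
print(round(mean, 2))   # should settle near 2.0
```

Note that only the ratio of target values is ever needed, which is why the intractable Pr(data) normalising constant cancels out and never has to be computed.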

### Proposing moves

- If the new location is better than the old location: accept the proposed move
- If the new location is worse, e.g., the ratio of the new location to the old location is 1/3: accept the proposed move with probability 1/3

### Metropolis-Hastings algorithm

[Figure: the chain moving between locations]
### Posterior distribution

[Trace plot of sampled values against MCMC steps, showing an initial burn-in phase followed by a stationary phase]

### Posterior distribution

- Take samples every n steps (e.g., every 100 steps)
- Discard the first x% of steps as ‘burn-in’
- Plot a histogram from the remaining samples
- This provides an estimate of the posterior distribution!

### Metropolis-coupling

- Use more than one chain in the analysis
  - The additional (‘heated’) chains are more willing to go downhill in the landscape
  - They act as ‘scouts’
- If one of the additional chains finds a better location, it swaps places with the ‘cold’ chain
- Results in quicker convergence and better mixing
- Reduces the chance of being trapped in a local optimum

### Output from a Bayesian analysis

- A list of the parameter values visited by the Markov chain
  - .p file in MrBayes
  - .log file in BEAST
- A list of the trees visited by the Markov chain
  - .t file in MrBayes
  - .trees file in BEAST

### Summarising the parameters

- Take the mean of the sampled values
  - This is the mean posterior estimate
- Take the central 95% of the sampled values
  - This is the 95% credibility interval (suitable for unimodal posteriors)
  - Alternatively, the narrowest interval containing 95% of the samples is the 95% highest posterior density (HPD) interval (also suitable for multimodal posteriors)
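A sketch of these summaries, computing the posterior mean and a central 95% credibility interval from simulated samples (the samples are made up for illustration; an HPD interval would instead search for the narrowest 95% interval):

```python
import random

def summarise(samples, burn_in_frac=0.1):
    """Posterior mean and central 95% credibility interval from MCMC output."""
    kept = sorted(samples[int(len(samples) * burn_in_frac):])  # discard burn-in
    mean = sum(kept) / len(kept)
    lo = kept[int(0.025 * len(kept))]        # 2.5% quantile
    hi = kept[int(0.975 * len(kept)) - 1]    # 97.5% quantile
    return mean, (lo, hi)

rng = random.Random(0)
samples = [rng.gauss(5.0, 1.0) for _ in range(10000)]   # stand-in MCMC samples
mean, ci = summarise(samples)
print(round(mean, 2), (round(ci[0], 2), round(ci[1], 2)))
```

For a symmetric unimodal posterior like this one, the central interval and the HPD interval nearly coincide; they differ for skewed or multimodal posteriors.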

### Summarising the trees

- For each node in the tree, calculate the proportion of sampled trees in which the node is present
- For each node, this proportion is the ‘posterior probability’ of the node
- Alternative ways of summarising the trees:
  - Sampled tree with the highest posterior probability → maximum a posteriori (MAP) tree
  - Sampled tree with the highest product of nodal posterior probabilities → maximum clade credibility (MCC) tree
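The clade-proportion calculation can be sketched by representing each sampled tree as its set of clades (the three sampled trees are toy data for illustration):

```python
from collections import Counter

def clade_support(sampled_trees):
    """Proportion of sampled trees containing each clade.
    Each tree is represented as a set of clades (frozensets of taxon names)."""
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    n = len(sampled_trees)
    return {clade: counts[clade] / n for clade in counts}

# toy posterior sample of three trees over taxa A-D
t1 = {frozenset("AB"), frozenset("CD")}
t2 = {frozenset("AB"), frozenset("CD")}
t3 = {frozenset("AC"), frozenset("BD")}
support = clade_support([t1, t2, t3])
print(support[frozenset("AB")])   # 2 of 3 sampled trees contain clade {A, B}
```

These proportions are the nodal posterior probabilities; multiplying them across the clades of each sampled tree is what the MCC summary maximises.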

### Example results

### Key references

- Felsenstein J (2004) *Inferring Phylogenies*. Sinauer Associates.
- Yang Z (2005) Bayesian inference. In: *Mathematics of Evolution and Phylogeny* (ed. Gascuel O). Oxford University Press.
