Stuart M. Brown
New York University School of Medicine
• A. Molecular Evolution
• B. Calculating Distances
• C. Clustering Algorithms
• D. Cladistic Methods
• E. Computer Software
• The theory of evolution is the
foundation upon which all of
modern biology is built.
• From anatomy to behavior to genomics, the
scientific method requires an appreciation of
changes in organisms over time.
• It is impossible to evaluate relationships among
gene sequences without taking into consideration
the way these sequences have been modified over time.
Nothing in biology makes sense except in
the light of evolution.
– Theodosius Dobzhansky, 1973
Similarity searches and multiple alignments of
sequences naturally lead to the question:
“How are these sequences related?”
and more generally:
“How are the organisms from which
these sequences come related?”
The purpose of a phylogenetic tree
is to illustrate how a group of
objects (usually genes or organisms)
are related to one another
• The study of the relationships between groups of
organisms is called taxonomy, an ancient and
venerable branch of classical biology.
• Taxonomy is the art of classifying
things into groups — a quintessential
human behavior — established as a
mainstream scientific field by
Carolus Linnaeus (1707-1778).
• Evolutionary theory states that groups of similar
organisms are descended from a common ancestor.
• Phylogenetic systematics (cladistics) is a method
of taxonomic classification based on evolutionary history.
• It was developed by Willi Hennig,
a German entomologist, in 1950.
Cladistics and Phenetics
• Cladistic approach: Trees are drawn based
on the conserved characters
• Phenetic approach: Trees are based on
some measure of distance between the taxa
• Molecular phylogenies are inferred from
molecular (usually sequence) data
– either cladistic (e.g. gene order) or phenetic
• Evolutionary relationships are documented by
creating a branching structure, termed a phylogeny
or tree, that illustrates the relationships between the
• Cladistic methods construct a tree (cladogram) by
considering the various possible pathways of
evolution and choosing the best from among these.
• A phylogram is a tree with branches that are
proportional to evolutionary distances.
Classes of algorithm used to infer
phylogeny from sequence
• Distance methods
• Probabilistic methods
• Phylogenetics often makes use of numerical data
(numerical taxonomy). The data can be scores for
various “character states”, such as the size of a
visible structure, or they can be DNA sequences.
• Similarities and differences between organisms can
be coded as a set of characters, each with two or
more alternative character states.
• In an alignment of DNA sequences, each position
is a separate character, with four possible character
states, the four nucleotides.
DNA is a good tool for taxonomy
DNA sequences have many advantages
over classical types of taxonomic characters:
– Character states can be scored unambiguously
– Large numbers of characters can be scored for
– Information on both the extent and the nature of
divergence between sequences is available
(nucleotide substitutions, insertion/deletions, or
A aat tcg ctt cta gga atc tgc cta atc ctg
B ... ..a ..g ..a .t. ... ... t.. ... ..a
C ... ..a ..c ..c ... ..t ... ... ... t.a
D ... ..a ..a ..g ..g ..t ... t.t ..t t..
Each nucleotide difference is a character
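The dot notation in the alignment above can be expanded and scored in a few lines. This is a minimal sketch, assuming that "." means "same base as sequence A at this position"; the helper names (expand, variable_sites, count_differences) are illustrative and not part of any package mentioned in these slides.

```python
# Alignment copied from the slide; "." is assumed to mean
# "identical to sequence A at this position".
raw = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}


def expand(reference, row):
    """Replace each '.' with the reference base at the same position."""
    return "".join(r if c == "." else c for r, c in zip(reference, row))


reference = raw["A"].replace(" ", "")
aligned = {name: expand(reference, row.replace(" ", ""))
           for name, row in raw.items()}


def variable_sites(seqs):
    """Return the 0-based columns where at least two sequences differ."""
    length = len(next(iter(seqs.values())))
    return [i for i in range(length)
            if len({s[i] for s in seqs.values()}) > 1]


def count_differences(a, b):
    """Count the aligned positions at which two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


sites = variable_sites(aligned)
print("variable columns:", sites)
for name in "BCD":
    print("A vs", name, count_differences(aligned["A"], aligned[name]))
```

Each column listed in sites is one character in the sense used above, and the bases observed in that column are its character states.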
Sequences Reflect Relationships
• After working with sequences for a while, one develops an
intuitive understanding that for a given gene, closely related
organisms have similar sequences and more distantly related
organisms have more dissimilar sequences. These
differences can be quantified.
• Given a set of gene sequences, it should be possible to
reconstruct the evolutionary relationships among genes and
organisms.
What Sequences to Study?
• Different sequences accumulate changes at
different rates - choose a level of variation that is
appropriate to the group of organisms being studied.
– Proteins (or protein coding DNAs) are constrained by
natural selection - better for very distant relationships
– Some sequences are highly variable (rRNA spacer
regions, immunoglobulin genes), while others are
highly conserved (actin, rRNA coding regions)
– Different regions within a single gene can evolve at
different rates (conserved vs. variable domains)
Orthologs vs. Paralogs
• When comparing gene sequences, it is important
to distinguish between identical vs. merely similar
genes in different organisms.
• Orthologs are homologous genes in different
species that diverged through speciation; they
usually retain analogous functions.
• Paralogs are similar genes that are the result of a
gene duplication.
– A phylogeny that includes both orthologs and paralogs
is likely to be incorrect.
– Sometimes phylogenetic analysis is the best way to
determine if a new gene is an ortholog or paralog to
other known genes.
[Figure: an ancestral globin gene gives rise to gene A (hemoglobin) and gene B (myoglobin); the tips of the tree are A1, B1, A2, and B2.]
Before describing any theoretical or practical
aspects of phylogenetics, it is necessary to give
some disclaimers. This area of computational
biology is an intellectual minefield!
Neither the theory nor the practical applications of
any algorithms are universally accepted throughout
the scientific community.
The application of different software packages to a
data set is very likely to give different answers;
minor changes to a data set are also likely to
profoundly change the result.
A modern revision of the seals and sea lions
Genes vs. Species
• Relationships calculated from sequence data represent
the relationships between genes; these are not necessarily
the same as the relationships between species.
• Your sequence data may not have the same
phylogenetic history as the species from which they come.
• Different genes evolve at different speeds, and there is
always the possibility of horizontal gene transfer
(hybridization, vector mediated DNA movement, or
direct uptake of DNA).
Cladistic vs. Phenetic
Within the field of taxonomy there are two
different methods and philosophies of building
phylogenetic trees: cladistic and phenetic
– Phenetic methods construct trees (phenograms) by
considering the current states of characters without
regard to the evolutionary history that brought the
species to their current phenotypes.
– Cladistic methods rely on assumptions about
ancestral relationships as well as on current data.
Darwin was a Cladist
“The natural system based on descent
with modification … the characters
that naturalists consider as showing true
affinity are those which have been inherited
from a common parent, and in so far as all
true classification is genealogical; that
community of descent is the common bond
that naturalists have been seeking.”
- Charles Darwin, Origin of Species, 1859
• Computer algorithms based on the phenetic model rely on
Distance Methods to build trees from sequence data.
• Phenetic methods count each base of sequence difference
equally, so a single event that creates a large change in
sequence (insertion/deletion or recombination) will move two
sequences far apart on the final tree.
• Phenetic approaches generally lead to faster algorithms and
they often have nicer statistical properties for molecular data.
• The phenetic approach is popular with molecular
evolutionists because it relies heavily on objective character
data (such as sequences) and it requires relatively few
assumptions.
• It is often useful to measure the genetic distance between
two species, between two populations, or even between
two individuals.
• The entire concept of numerical taxonomy is based on
computing phylogenies from a table of distances.
• In the case of sequence data, pairwise distances must be
calculated between all sequences that will be used to build
the tree - thus creating a distance matrix.
• Distance methods give a single measurement of the
amount of evolutionary change between two sequences
since divergence from a common ancestor.
Calculate the distance CORRECTING FOR MULTIPLE HITS
The Distance Matrix
          Rat     Mouse   Rabbit  Human   Opossum Chicken Frog
Rat       0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
Mouse     0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
Rabbit    0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
Human     0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
Opossum   0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
Chicken   0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
Frog      0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000
Computing a Distance Matrix
Reading sequences...
gtr1_human: 548 total, 548 read
gtr2_human: 548 total, 548 read
gtr3_human: 548 total, 548 read
gtr4_human: 548 total, 548 read
gtr5_human: 548 total, 548 read
Computing distances using Kimura method...
1 x 2: 48.61    1 x 3: 45.50
1 x 4: 65.74    1 x 5: 107.70
2 x 3: 61.53    2 x 4: 74.57
2 x 5: 113.82   3 x 4: 68.93
3 x 5: 104.43   4 x 5: 110.86

Matrix 1
           1       2       3       4       5
      ________________________________________
| 1 |   0.00   48.61   45.50   65.74  107.70
| 2 |           0.00   61.53   74.57  113.82
| 3 |                   0.00   68.93  104.43
| 4 |                           0.00  110.86
| 5 |                                   0.00
• Distances between pairs of DNA sequences are relatively
simple to compute as the sum of all base pair differences
between the two sequences.
– this type of algorithm can only work for pairs of sequences that are
similar enough to be aligned
• Generally all base changes are considered equal
• Insertion/deletions are generally given a larger weight than
replacements (gap penalties).
• It is also possible to correct for multiple substitutions at a
single site, which is common in distant relationships and
for rapidly evolving sites.
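The simplest distance described above, the proportion of differing sites (often called the p-distance), can be sketched as follows. This is an illustration only: it assumes the sequences are already aligned, it simply skips columns that contain a gap ("-") instead of applying a gap penalty, and the toy sequences and function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Build a symmetric matrix (dict of dicts) of pairwise p-distances."""
    names = list(seqs)
    dist = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist[n][m] = dist[m][n] = p_distance(seqs[n], seqs[m])
    return dist


# Hypothetical toy alignment; "-" marks a gap.
toy = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTTC",
    "s3": "ACGAAC-TTC",
}
matrix = distance_matrix(toy)
for name, row in matrix.items():
    print(name, " ".join(f"{row[m]:.3f}" for m in toy))
```

Real programs differ in how they weight gaps and unequal base changes, so their numbers will not match this uncorrected count exactly.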
Correction for multiple hits
• Only differences can be observed directly – not distances
• All distance methods rely (crucially) on this
• A great many models are used for nucleotide sequences (e.g.
JC, K2P, HKY, Rev, Maximum Likelihood)
• Amino acid (aa) sequences are far more complicated!
• Can take account of different rates of evolution at sites
(e.g. gamma distribution)
• Accuracy falls off drastically for highly divergent
sequences
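Two of the corrections named above have simple closed forms. The sketch below uses the standard Jukes-Cantor (JC) formula d = -3/4 ln(1 - 4p/3) and the Kimura two-parameter (K2P) formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where p is the observed proportion of differing sites and P and Q are the proportions of transitions and transversions. The function names are illustrative, and sites with gaps or ambiguity codes are skipped as a simplifying assumption.

```python
import math

PURINES = {"A", "G"}


def site_proportions(a, b):
    """Return (P, Q): proportions of transition and transversion sites."""
    n = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        n += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    return transitions / n, transversions / n


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differences."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition/transversion proportions."""
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


P, Q = site_proportions("AAAAAAAAAA", "GAAAAAAAAC")
print("observed p:", P + Q)
print("JC distance:", round(jukes_cantor(P + Q), 4))
print("K2P distance:", round(kimura_2p(P, Q), 4))
```

Both corrected distances are larger than the observed proportion of differences, and both formulas break down as the sequences approach saturation, which is one way to see why accuracy falls off for highly divergent sequences.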
Amino Acid Distances
• Distances between amino acid sequences are a bit more
complicated to calculate.
• Some amino acids can replace one another with relatively little
effect on the structure and function of the final protein while
other replacements can be functionally devastating.
• From the standpoint of the genetic code, some amino acid
changes can be made by a single DNA mutation while others
require two or even three changes in the DNA sequence.
• In practice, what has been done is to calculate tables of
frequencies of all amino acid replacements within families of
related protein sequences in the databanks: i.e. the PAM and
BLOSUM matrices.
The PAM 250 scoring matrix
A R N D C Q E G H I L K M F P S T W Y V
A 2
R -2 6
N 0 0 2
D 0 -1 2 4
C -2 -4 -4 -5 12
Q 0 1 1 2 -5 4
E 0 -1 1 3 -5 2 4
G 1 -3 0 1 -3 -1 0 5
H -1 2 2 1 -3 3 1 -2 6
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6
F -4 -4 -4 -6 -4 -5 -5 -5 -2 1 2 -5 0 9
P 1 0 -1 -1 -3 0 -1 -1 0 -2 -3 -1 -2 -5 6
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 3
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -2 0 1 3
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein
Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation,
Silver Spring, MD.
Clustering algorithms use distances to calculate
phylogenetic trees. These trees are based solely on
the relative numbers of similarities and differences
between a set of sequences.
– Start with a matrix of pairwise distances
– Cluster methods construct a tree by linking the least
distant pairs of taxa, followed by successively more
distant taxa.
• The total length of all branches in the tree
should be a minimum
• It has been shown that the minimum
evolution tree is expected to be the true tree
provided branch lengths corrected for
• The simplest of the distance methods is the UPGMA
(Unweighted Pair Group Method using Arithmetic averages)
• The PHYLIP programs DNADIST and PROTDIST
calculate absolute pairwise distances between a group of
sequences. Then the GCG program GROWTREE uses
UPGMA to build a tree.
• Many multiple alignment programs such as PILEUP use a
variant of UPGMA to create a dendrogram of DNA
sequences which is then used to guide the multiple alignment
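The UPGMA procedure can be sketched in a few lines. This is a minimal illustration and not the GROWTREE or PILEUP implementation: it returns only the topology (no branch lengths) as a Newick-style string, and the five distances are copied from the rat, mouse, rabbit, human, and frog entries of the distance matrix shown earlier.

```python
from itertools import combinations


def upgma(names, pair_distances):
    """Cluster taxa by UPGMA; return the topology as a Newick-style string."""
    d = {frozenset(pair): value for pair, value in pair_distances.items()}
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    while len(sizes) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(sorted(sizes), 2),
                   key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted arithmetic mean.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


taxa = ["Rat", "Mouse", "Rabbit", "Human", "Frog"]
distances = {
    ("Rat", "Mouse"): 0.0646, ("Rat", "Rabbit"): 0.1434,
    ("Rat", "Human"): 0.1456, ("Rat", "Frog"): 0.7018,
    ("Mouse", "Rabbit"): 0.1716, ("Mouse", "Human"): 0.1743,
    ("Mouse", "Frog"): 0.7673, ("Rabbit", "Human"): 0.0649,
    ("Rabbit", "Frog"): 0.7522, ("Human", "Frog"): 0.7116,
}
tree = upgma(taxa, distances)
print(tree)
```

With these values the closest pairs (rat with mouse, rabbit with human) are joined first and frog is attached last. UPGMA implicitly assumes the same rate of change on every branch.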
• The Neighbor Joining method is the most popular
way to build trees from distance measurements
(Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
– Neighbor Joining corrects the UPGMA method for its (frequently
invalid) assumption that the same rate of evolution applies to each
branch of a tree.
– The distance matrix is adjusted for differences in the rate of
evolution of each taxon (branch).
– Neighbor Joining has given the best results in simulation studies
and it is the most computationally efficient of the distance
algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989)).
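The rate adjustment that distinguishes Neighbor Joining from UPGMA can be sketched as follows. This is a simplified illustration of the standard algorithm, not the code of any program named in these slides: at each step it picks the pair minimizing Q(i,j) = (n-2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i to all other nodes. The five-taxon matrix is a hypothetical additive example.

```python
from itertools import combinations


def neighbor_joining(names, pair_distances):
    """Return an unrooted NJ tree with branch lengths as a Newick-style string."""
    d = {frozenset(pair): value for pair, value in pair_distances.items()}

    def get(a, b):
        return d[frozenset((a, b))]

    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the total distance from a to every other current node.
        r = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        # Choose the pair with the smallest rate-adjusted distance Q.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        dij = get(i, j)
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))  # branch from i to new node
        lj = dij - li                                 # branch from j to new node
        u = f"({i}:{li:.4f},{j}:{lj:.4f})"
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:
            d[frozenset((u, k))] = (get(i, k) + get(j, k) - dij) / 2
        nodes.append(u)
    a, b = nodes
    return f"({a}:{get(a, b):.4f},{b});"


# Hypothetical additive distances for five taxa.
toy = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
nj_tree = neighbor_joining("abcde", toy)
print(nj_tree)
```

Taxa a and b are joined first even though d and e have the smallest raw distance, because the Q criterion subtracts each taxon's overall divergence. The two branches leading to a joined pair are allowed to differ in length, so no constant rate is assumed.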
• For character data about the physical traits of
organisms (such as morphology of organs etc.)
and for deeper levels of taxonomy, the cladistic
approach is almost certainly superior.
• Cladistic methods are often difficult to
implement with molecular data because all of
the assumptions are generally not satisfied.
• Cladistic methods are based on the assumption that a
set of sequences evolved from a common ancestor by
a process of mutation and selection without mixing
(hybridization or other horizontal gene transfers).
• These methods work best if a specific tree, or at least
an ancestral sequence, is already known so that
comparisons can be made between a finite number of
alternate trees rather than calculating all possible trees
for a given set of sequences.
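To see why evaluating all possible trees quickly becomes impractical, the number of distinct bifurcating trees can be counted directly. The sketch below uses the standard results that n taxa have (2n-5)!! unrooted and (2n-3)!! rooted binary trees; the function names are illustrative.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n_taxa):
    """Number of unrooted, fully bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa):
    """Number of rooted, fully bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Four taxa have only 3 unrooted trees, but ten taxa already have more than two million, which is why practical programs rely on heuristic searches or on a restricted set of candidate trees.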
• Parsimony is the most popular method for
reconstructing ancestral relationships.
• Parsimony allows the use of all known evolutionary
information in building a tree
– In contrast, distance methods compress all of the
differences between pairs of sequences into a single
measurement
Building Trees with Parsimony
• Parsimony involves evaluating all possible trees
and giving each a score based on the number of
evolutionary changes that are needed to explain
the observed data.
• The best tree is the one that requires the fewest
base changes for all sequences to derive from a
common ancestor.
• Check each topology
• Count the minimum number of changes required
to explain the data
• Choose the tree with the smallest number of changes
• Usually performs well with closely related
sequences – but often performs badly with very
distantly related sequences
• With distantly related sequences homoplasy
becomes a major problem
• Consider four sequences: ATCG, TTCG,
ATCC, and TCCG
• Imagine a tree that branches at the first
position, grouping ATCG and ATCC on
one branch, TTCG and TCCG on the other
• Then each branch splits, for a total of 3
nodes on the tree (Tree #1)
Compare Tree #1 with one that first divides ATCC on its own
branch, then splits off ATCG, and finally divides TTCG from
TCCG (Tree #2).
Trees #1 and #2 both have three nodes, but when all of the
distances back to the root (# of nodes crossed) are summed,
the total is equal to 8 for Tree #1 and 9 for Tree #2.
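A common way to count the minimum number of changes on a given tree is Fitch's algorithm. It is not named in these slides, so it is introduced here as an assumption. The sketch applies it to the four example sequences, with rooted trees written as nested pairs. Tree #1 and Tree #2 follow the descriptions above; the third tree is a hypothetical alternative added for contrast.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):          # a leaf is the sequence itself
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:                         # children agree: no extra change
        return common, left_cost + right_cost
    # children disagree: one change is needed on this part of the tree
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, n_sites):
    """Sum the minimum number of changes over all alignment columns."""
    return sum(fitch_site(tree, i)[1] for i in range(n_sites))


tree_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
tree_2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))
tree_3 = (("ATCG", "TTCG"), ("ATCC", "TCCG"))   # hypothetical alternative

for label, tree in (("Tree #1", tree_1), ("Tree #2", tree_2),
                    ("alternative", tree_3)):
    print(label, parsimony_score(tree, 4))
```

Under this count Tree #1 and Tree #2 tie at 3 changes, because they differ only in where the root is placed, while the alternative grouping needs 4. The node-depth totals quoted above are a different measure from the number of base changes.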
[Figure: Tree #1 and Tree #2]
• Require a model of evolution
• Each substitution has an associated
likelihood given a branch of a certain length
• A function is derived to represent the
likelihood of the data given the tree,
branch-lengths and additional parameters
• The function is maximized
• The method of Maximum Likelihood attempts to
reconstruct a phylogeny using an explicit model of
evolution.
• This method works best when it is used to test (or
improve) an existing tree.
• Even with simple models of evolutionary change,
the computational task is enormous, making this the
slowest of all phylogenetic methods.
Models can be made more parameter rich
to increase their realism
• The most common additional parameters are:
– A correction to allow different substitution rates
for each type of nucleotide change
– A correction for the proportion of sites which are
unable to change
– A correction for variable site rates at those sites
which can change
• The values of the additional parameters will be
estimated in the process
• Maximum likelihood predicts ancestral sequences
– at branch points in the tree (nodes)
• can provide information about the timing of the
acquiring of a novel trait or mutation
• PAML (Phylogenetic Analysis using Maximum
Likelihood)
– Confidence intervals provided
– Selection can be inferred
Assumptions for Maximum Likelihood
• The frequencies of DNA transitions (C<->T,A<->G) and
transversions (C or T<->A or G).
• The assumptions for protein sequence changes are taken
from the PAM matrix - and are quite likely to be violated in
• Since each nucleotide site is assumed to evolve independently,
the likelihood of the tree is calculated separately for each site.
The product of the likelihoods for each site provides the overall
likelihood of the observed data.
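The product-over-sites idea can be shown on the smallest possible "tree": two sequences joined by one branch. The sketch assumes the Jukes-Cantor model, in which a base stays the same over a branch of length d with probability 1/4 + 3/4 exp(-4d/3) and changes to any one specific other base with probability 1/4 - 1/4 exp(-4d/3). The branch length is found by a simple grid search; the sequences and function names are hypothetical.

```python
import math


def jc69_site_likelihood(x, y, d):
    """Likelihood of observing bases x and y separated by branch length d."""
    e = math.exp(-4.0 * d / 3.0)
    change_prob = 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e
    return 0.25 * change_prob          # 0.25 = equal base frequencies


def log_likelihood(a, b, d):
    """Log of the product of per-site likelihoods (sites are independent)."""
    return sum(math.log(jc69_site_likelihood(x, y, d)) for x, y in zip(a, b))


def ml_distance(a, b, step=0.001, max_d=3.0):
    """Branch length on a grid that maximizes the log-likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(a, b, d))


seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "CCGTAGGTACTTACGAACGT"   # differs from seq_a at 4 of 20 sites
best = ml_distance(seq_a, seq_b)
print("ML branch length:", best)
print("JC formula:", round(-0.75 * math.log(1 - 4 * 0.2 / 3), 4))
```

Summing logarithms is equivalent to multiplying the per-site likelihoods and avoids numerical underflow. For two sequences the maximum should land near the closed-form Jukes-Cantor distance; for a full tree the same product is maximized over every branch length and over topologies, which is why the method is slow.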
The Molecular Clock
For a given protein the rate of sequence evolution is
approximately constant across lineages
Zuckerkandl and Pauling (1965)
This would allow speciation and duplication events to be dated
accurately based on molecular data
Local and approximate molecular clocks are more reasonable
Rooting the Tree
• In an unrooted tree the direction of
evolution is unknown
• The root is the hypothesized ancestor of the
sequences in the tree
• The root can either be placed on a branch or
at a node
• You should start by viewing an unrooted tree
Rooting Using an Outgroup
• The outgroup should be a sequence (or set of
sequences) known to be less closely related to the
rest of the sequences than they are to each other
• It should ideally be as closely related as possible
to the rest of the sequences while still satisfying
the condition above
• The root must be somewhere between the
outgroup and the rest (either on the node or in a
branch)
Are there Correct trees??
• Despite all of these caveats, it is actually quite simple to
use computer programs to calculate phylogenetic trees for
a set of sequences.
• Provided the data are clean, outgroups are correctly
specified, appropriate algorithms are chosen, no
assumptions are violated, etc., can the true, correct tree
be found and proven to be scientifically valid?
• Unfortunately, it is impossible to ever conclusively state
what is the "true" tree for a group of sequences (or a group
of organisms); taxonomy is constantly under revision as
new data is gathered.
Is my tree correct?
Bootstrapping is a statistical technique that can use
random re-sampling of data to determine sampling
error for tree topologies
• Leave-one-out methods
– (leave out a row, not a species)
• Agreement among the resulting trees is summarized
with a majority-rule consensus tree
• Each branch of the tree is labelled with the % of
bootstrap trees where it occurred.
• 80% is good, less than 50% is bad
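The resampling step can be sketched as follows. This illustration assumes the standard nonparametric bootstrap, in which alignment columns (sites) are drawn with replacement to build each pseudo-replicate. To stay short it records only which pair of sequences is closest in each replicate, as a stand-in for building and comparing full trees. The toy alignment and function names are hypothetical.

```python
import random
from itertools import combinations


def resample_columns(seqs, rng):
    """Draw alignment columns with replacement to build one replicate."""
    length = len(next(iter(seqs.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(s[c] for c in cols) for name, s in seqs.items()}


def closest_pair(seqs):
    """Return the pair of names with the fewest differences (first on ties)."""
    def n_diff(pair):
        a, b = pair
        return sum(x != y for x, y in zip(seqs[a], seqs[b]))
    return min(combinations(sorted(seqs), 2), key=n_diff)


def bootstrap_support(seqs, replicates=200, seed=0):
    """Percentage of replicates in which each pair is the closest pair."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        pair = closest_pair(resample_columns(seqs, rng))
        counts[pair] = counts.get(pair, 0) + 1
    return {pair: 100.0 * n / replicates for pair, n in counts.items()}


toy = {
    "w": "ACGTACGTACGT",
    "x": "ACGTACGTACGA",
    "y": "TCGAACTTACGA",
    "z": "TCGAACTTGCGA",
}
support = bootstrap_support(toy)
for pair, percent in sorted(support.items()):
    print(pair, f"{percent:.1f}%")
```

In a real analysis a tree is built from every replicate, and each branch of the consensus tree is labelled with the percentage of replicate trees that contain it.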
Computer Software for Phylogenetics
Due to the lack of consensus among evolutionary
biologists about basic principles for phylogenetic
analysis, it is not surprising that there is a wide
array of computer software available for this
purpose.
– PHYLIP is a free package that includes 30 programs
that compute various phylogenetic algorithms on
different kinds of data. Hard to use.
– CLUSTALX is a multiple alignment program that
includes the ability to create trees based on Neighbor
Joining. Very easy to use, but NJ may not always be the best
method to handle your data.
GCG Evolution programs
• Distances - simple distance matrix
• GrowTree - makes a graphic from a
Distance matrix - UPGMA or NJ
• PAUP - sophisticated, but fairly easy to use
– Includes NJ, Parsimony, and Max. Likelihood
– Also does bootstrapping
– Text and PostScript output
Other Web Resources
• Joseph Felsenstein (author of PHYLIP) maintains a
comprehensive list of Phylogeny programs at:
• Introduction to Phylogenetic Systematics,
Peter H. Weston & Michael D. Crisp, Society of Australian
• University of California, Berkeley Museum of
• There are a variety of programs for Macs and PCs,
but you can easily tie up your machine for many
hours with even moderately sized data sets (i.e.
fifty 300 bp sequences)
• Moving sequences into different programs can be
a major hassle due to incompatible file formats.
• Just because a program can perform a given
computation on a set of data does not mean that
it is the appropriate algorithm for that type of
data.
Given the huge variety of methods for computing
phylogenies, how can the biologist determine what
is the best method for analyzing a given data set?
– Published papers that address phylogenetic issues generally
make use of several different algorithms and data sets in order
to support their conclusions.
– In some cases different methods of analysis can work
• Neighbor Joining methods generally produce just one tree, which can
help to validate a tree built with the parsimony or maximum likelihood
– Using several alternate methods can give an indication of the
robustness of a given conclusion.