AN ALTERNATING LEAST SQUARES APPROACH TO INFERRING PHYLOGENIES

Document Sample
AN ALTERNATING LEAST SQUARES APPROACH TO INFERRING PHYLOGENIES Powered By Docstoc
					Syst. Biol. 46(l):101-lll/ 1997



   AN ALTERNATING LEAST SQUARES APPROACH TO INFERRING
           PHYLOGENIES FROM PAIRWISE DISTANCES
                                             JOSEPH FELSENSTEIN
         Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, USA;
                                          E-mail: joe@genetics.zvashington.edu


        Abstract.—A computational method is presented for minimizing the weighted sum of squares of
        the differences between observed and expected pairwise distances between species, where the
        expectations are generated by an additive tree model. The criteria of Fitch and Margoliash (1967,
        Science 155:279-284) and Cavalli-Sforza and Edwards (1967, Evolution 21:550-570) are both
        weighted least squares, with different weights. The method presented iterates lengths of adjacent
        branches in the tree three at a time. The weighted sum of squares never increases during the
        process of iteration, and the iterates approach a stationary point on the surface of the sum of
        squares. This iterative approach makes it particularly easy to maintain the constraint that branch
        lengths never become negative, although negative branch lengths can also be allowed. The method
        is implemented in a computer program, FITCH, which has been distributed since 1982 as part of
        the PHYLIP package of programs for inferring phylogenies, and is also implemented in PAUP*.
        The present method is compared, using some simulated data sets, with an implementation of the
        method of De Soete (1983, Psychometrika 48:621-626); it is slower than De Soete's method but
        more effective at finding the least squares tree. The relationship of this method to the neighbor-
        joining method is also discussed. [Alternating least squares; distance methods; Fitch-Margoliash
        method; phylogenies.]


   Distance matrix methods have long been
used for inferring phylogenies, and their
popularity has increased in recent years as                (T being the set of all tips on the tree) so
potential users have become aware of long-                 that they were proposing a least squares fit
branch attraction problems that can afflict                of observed to expected distances. The sta-
parsimony methods. They have been par-                     tistical model implicit in this criterion is
ticularly widely used with molecular se-                   that the distances D,7 are distributed inde-
quences. There are a wide variety of dif-                  pendently, with mean dtj and variance pro-
ferent distance matrix methods, including                  portional to the reciprocal of w{j. Fitch and
neighbor joining (Saitou and Nei, 1987),                   Margoliash took the variance of the dis-
minimum evolution (Kidd and Sgaramel-                      tances to be proportional to their expecta-
la-Zonta, 1971; Rzhetsky and Nei, 1992),                   tions dtj and approximated this by choos-
and the least squares methods. The least                   ing as their weight the squared inverse of
squares methods are of particular interest                 the observed distance
because they use a single objective func-
tion to solve for branch lengths and to                                        wtj = 1/Djj,                   (2)
choose among tree topologies and can thus                  whereas Cavalli-Sforza and Edwards as-
be related to the least squares method of                  sumed homogeneity of variances and thus
statistical estimation. Some, although not                 chose w{j = 1. In both methods, we have
all, computer simulations (cf. Kuhner and                  an additive tree model: the expectations d(j
Felsenstein, 1994) have shown that least                   are the sums of the branch lengths along a
squares methods perform better than do                     path from species i to species ; in an un-
the other distance-based methods.                          rooted tree whose tips (terminal nodes)
   Fitch and Margoliash (1967) and Cavalli-                are the species. The branch lengths are to
Sforza and Edwards (1967) proposed dif-                    be estimated by minimizing the weighted
ferent but closely related criteria for fitting            sum of squares Q. I have reviewed the bi-
trees to distance matrices. Their criteria                 ological and statistical issues involved in
were both of the form                                      inferring phylogenies from distance matri-
                                                       101
102                                    SYSTEMATIC BIOLOGY                                  VOL. 46

ces (Felsenstein, 1984); references to other
distance methods will be found in that pa-
per.
   One of the difficulties with least squares
methods has been computing the branch
                                                     k 9
                                                  Vi
lengths. These can be computed for a giv-
en tree topology by solving a set of linear
equations (Cavalli-Sforza and Edwards,
1967), one for each branch. However, this
computation can be difficult, and it also
does not allow us to constrain the branch
lengths to be nonnegative.
   The purpose of this paper is to provide
details of an alternating least squares
method that attempts to minimize Q for a
given tree topology. It works for any
weighting system for which the weights do
not depend on the expected distances d{j
and lends itself particularly well to the
maintenance of the constraint that the          FIGURE 1. The tree used to illustrate the pattern of
branch lengths not become negative. The least squares computations.
iterations result in a sequence of values of
Q that are monotonic and nonincreasing
and converge to a stationary point on the                      THE METHOD
least squares surface. Combined with an                           Notation
algorithm for searching among tree topol-        Figure 1 illustrates the notation em-
ogies, this alternating least squares meth- ployed. Some interior node of the unrooted
od forms the basis of a computer program, tree is numbered 0. All other nodes are
FITCH, which has been distributed as part numbered in some order 1, . . . , 2n — 3,
of the PHYLIP package since early 1982. where n is the number of tips. It is usually
PHYLIP executables, source code, and doc- most convenient to number the tips 1
umentation are available on the World through n. The branches of the tree are
Wide Web at http://evolution.genetics. also numbered: each is assigned the num-
washington.edu/phylip.html. The algo- ber of the node at the end furthest from
rithm described in this paper also forms node 0. In Figure 1, tips i, j , and k and
the basis for the least squares distance ma- node / are shown. The length of branch i
trix method in the program PAUP* (Swof- is denoted vt.
ford, 1996).                                    The tree topology will be specified by
                                              constants xijk/ where x»k is 1 if branch k lies
   The alternating least squares method on the path from node i to node ; and 0
(e.g., Wold, 1966) depends on a transfor- otherwise. In practice, the array x need
mation that temporarily reduces the di- not be formed because its elementsijik be          can
mensionality of the problem. A method is recovered as needed from the data struc-
presented here for finding the least squares tures representing the tree. The D and the
                                                                                        tj
branch lengths for three branches of the wtj are assumed to be given at the begin-
tree at a time by "pruning" all but those ning of the computation for all pairs of tips
branches from the tree and solving exactly i and ;'.
for the remaining three branch lengths. By
repeating this operation for different parts                  Pruning the Tree
of the tree, one approaches asymptotically      Our immediate objective will be, start-
a least squares solution.                     ing with some arbitrarily specified branch
1997      FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES              103

lengths, to minimize the sum of squares Q respect to all branch lengths except i and
with respect to the lengths of the three j . Let S be the set of all remaining tips after
branches incident on an interior node (e.g., i and ;' have been removed and after the
node 0), holding all other branch lengths node at which they join, k, has been added
constant. We can reduce the size of the to the set of tips. If k were actually a tip it
problem by "pruning" the tree, i.e., remov- would have distances Dkl to all other tips
ing tips i and ;' and replacing the interior still on the tree. In this smaller tree, which
node k with a new tip, in such a way that lacks segments i and ; but has k, the sum
the sum of squares for this new tree as a of squares is
function of the lengths of its remaining
branches differs from the sum of squares
Q by a calculable constant. This constant
does not depend on the three branch the dm being the sums of branch lengths
lengths whose lengths we are to improve. between p and q in this pruned tree. We
By pruning the tree repeatedly we can fi- would like to find a constant K, a set of
nally reduce it to a three-species tree distances Dkl, and a set of weights wkl such
whose central node is node 0. The least that Q = R + K, where K does not depend
squares branch lengths for this tree can be on the lengths of any of the branches ex-
easily computed, and the value of Q for the cept i and ;'. If we can do this, then the
full tree differs from that of this three-spe- quadratic form R will be minimized at the
cies tree by a known constant. This means same values of the remaining branch
that we can pick any interior node, can lengths as will Q.
find for the three branches incident upon        We can find K, the Dkl, and the wkl by a
it the branch lengths that minimize Q, and process of equating coefficients in R to cor-
can then find the value of Q. We can move responding terms in Q. We consider Q and
around the tree, doing this successively for R as functions of the branch lengths vu. For
different interior nodes. The result is the all pairs of tips (p, q) in S, neither of which
required iterative method for minimizing can be equal to i, j , or k, the terms in R are
the sum of squares Q.                          identical to those in Q. This leaves us with
   Considered as a function of the un- only terms involving i or ; in Q or k in R.
known branch lengths vir the sum of Taking the difference Q — R, we can write
squares is                                     this as
                                                                                       - dny
                                           (3)
                                                           + wlj{Dij -   dify
where T is the set of all tips and B is the
set of all branches of the tree. This is a qua-            - 2 w«(DH - dkly.               (5)
dratic form in the v{. It could be minimized
by differentiating with respect to the vt and       The lengths of the branches other than i
equating the derivatives to zero to obtain        and ; remain in Equation 5 only in the
normal equations for the vif but this ap-         quantities dkl. Also,
proach will not be taken here. We want to
replace the problem of minimizing Q with
respect to the full set of the v{ by a smaller                       = dk                  (6)
problem. Consider the phylogeny in Figure
1. For the moment we revert to the simpler        Inserting the expressions in Equations 6
Equation 1, keeping in mind that the di] are      into Equation 5 and collecting the terms in
sums of branch lengths. Suppose that we           dkl and d\\, the coefficient of d£, in Q — R is
were to regard the lengths of the branches                       wkl - wa - wn
connected to two adjacent tips, i and ;, as
constants and consider minimizing Q with          and the coefficient of dkl is
104                                      SYSTEMATIC BIOLOGY                                   VOL. 46

       -2wklDk       2waDa     2wnDn                  0 in any of the three possible directions,
                                                      one finds a rooted bifurcating tree. In any
                                                      rooted bifurcating tree having two or more
If the difference between Q and R is to be            tips, there are always at least two tips
a constant independent of the lengths of              branching from the same interior node: in
the branches remaining on the pruned tree,            Figure 1 tips i and ;' are two of these, but
then the above coefficients must all be               three other such pairs are also visible.
zero. Equating them to zero, we can solve             Equations 7, 8, and 10 permit us to reduce
for the wkl and the Dkl:                              the size of the problem by removing two
                 wkl = wa +                     (7)   tips and creating one new one while not
                                                      affecting the values of the least squares es-
and                                                   timates of the lengths of the remaining
                                                      branches. We can continue this process, re-
                                           ••   (8)   peatedly applying these equations until
                                                      only three tips are left, all connected di-
   Because the terms in dkl have been elim-           rectly to the designated interior node
inated from Q — R, we have now elimi-                 (node 0). Designate these three nodes as a,
nated all terms that could contain any of             b, and c.
the branch lengths other than v( and vr The              The problem is now reduced to mini-
difference between Q and R must then be               mizing the sum of squares
a constant not containing the lengths of
any branches other than i and /'. We can                      Q = K + wab(Dab - v a - vby
find it by collecting all the terms in Equa-                      + wbc(Dbc - vb - vcf
tion 5 that do not have dkl in them:                              + zvac(Dac - va - Vc)2. (11)
                                                         As Farris (1972) pointed out, the mini-
                                                      mum of Equation 11 is achieved when the
                                                      rightmost three terms become zero when
                                                (9)   the branch lengths va, vb, and vc are chosen
                                                      to satisfy the equations
and using Equations 7 and 8, Equation 9
can be reduced after some algebra to                                               vb
                                                                      Dbc   = vb   vc
Q-R     = w,,-D? + 2 Wi/W//                                                                      (12)
                     us                                                            Vc,

           X[(Dfl - *,-) - (D;7 - *,.)]2              these being the normal equations that are
                                                      derived by differentiating Q with respect
           -s-(wfl + wn),               (10)          to the branch lengths and equating the deriv-
which can never be negative.                          atives to zero. The solution (Farris, 1972) is
   Equations 7, 8, and 10 give us the means                    v. = (Dab + Dac -  Dbc)l2
of reducing the size of the problem from n                     vb = (Dab + Dbc -  Dac)/2
tips to n — 1 tips, dropping two branch                        vc = (Dac + Dbc - DJ/2.            (13)
lengths from the minimization problem.
This process can be used to construct an                 Having used Equations 7 and 8 to com-
alternating least squares method for mini-            pute the quantities wab, wbc, wac, Dab, Dbc, and
mizing Q.                                             Dac, we have minimized the sum of
                                                      squares Q with respect to the three branch
         Iteration of Branch   Lengths                lengths va, vb, and vc, holding all the other
   Consider only the case of an unrooted              branch lengths constant. It follows that Q
bifurcating tree, which has each interior             is nonincreasing during each such step.
node connected to three neighbors. As can                If we follow the strategy of moving
be seen in Figure 1, if one proceeds out-             through the tree, taking each interior node
wards from an internal node such as node              of the tree in turn, pruning the tree to re-
1997      FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES             105

duce the problem to minimization of Q            These are connected into an unrooted tree
with respect to the three branch lengths in-     (only one topology is possible), and Equa-
cident on that node, and finding the opti-       tions 13 are used to solve for least squares
mum lengths for those three branches, the        branch lengths. The program then consid-
successive values of the sums of squares Q       ers where the fourth species can be added
cannot increase: they will form a monoton-       to the tree. There are three possible places
ic nonincreasing sequence bounded below          that a new internal node could be added,
by zero. The sequence of values of Q must        with the fourth species connected to it, and
thus converge; it seems a reasonable ex-         these are on the interiors of the three
pectation that the sequence of values of the     branches. Each of these positions is tried
branch lengths will also converge.               in turn, and the least squares branch
   The iteration at each stage sets the deriv-   lengths are found for each of the resulting
atives of Q with respect to three of the         topologies. The program accepts that
branch lengths to zero. When further iter-       placement of the species that results in the
ation through all interior nodes of the tree     smallest sum of squares.
produces no change, it follows that the de-         The algorithm continues in this fashion,
rivatives of Q with respect to all branch        adding each species to all possible places
lengths are zero, so that we have reached        in the tree and picking the placement that
a stationary point. We cannot guarantee          minimizes the sum of squares after itera-
from this that the stationary point is a min-    tive computation of the least squares
imum. However, a glance at Equation 1            branch lengths. It is convenient to start the
shows that if the wi] are nonnegative the        iteration for each topology by calculating
quadratic form Q cannot ever be negative;        the lengths of the branches incident upon
this fact is sufficient to guarantee that no     the newly introduced node, because that
saddle points or maxima can exist. We            provides starting values for their lengths.
have not ruled out that there could be di-       The iteration for each topology proceeds
rections in which the sum of squares might       by traversing the tree outwards from that
be unchanging. This is the case where the        node, optimizing branch lengths for each
quadratic form is positive semidefinite. In      internal node encountered. Although it
such a case the algorithm will reach one of      would perhaps be best to repeat the tra-
the tied points along the line (or plane) of     versal until the sum of squares stopped
equally good solutions. We can in any case       changing, for the data sets I have encoun-
guarantee that we have minimized the             tered four passes through the tree is quite
quadratic form of Equation 1.                    sufficient.
   An alternative scheme would of course            After each species is added to the tree,
be to set up and solve the linear equations      except the fourth species, a series of local
that are the normal equations for the least      rearrangements is carried out to see if the
squares problem. The present method              tree can be improved. A local rearrange-
amounts to an iterative approach to solv-        ment switches the order of adjacent
ing these equations, differing from conven-      branches in the tree. In the present version
tional iterative approaches such as Gauss-       of the program all local rearrangements
Seidel iteration by changing the variables       are tried after each species is added. If any
three at a time instead of one at a time. The    local rearrangements improve the tree, as
present method also avoids directly setting      judged by the sum of squares, the rear-
up the equations and takes advantage of          rangement process is continued until no
the sparseness of the coefficient matrix in      further improvement by local rearrange-
a natural way. It also enables us to main-       ment is possible.
tain the constraint that branch lengths not
be negative, if that is desired.                    The user has the option of specifying
                                                 that, after the last species is added to the
               The Algorithm                     tree, the last bout of rearrangements
  The strategy used in the program               should be global. In that case, each subtree
FITCH starts with the first three species.       is removed from the tree and reinserted in
106                                SYSTEMATIC BIOLOGY                             VOL. 46

all possible places, the best of these being      A = [wJP* - vb) + wac(Dac - vc)]
chosen and the process continued until no             +(wab + wac).                (14)
further improvement results. Swofford and
Olsen (1990) described this rearrangement     If the minimum occurs at a negative
strategy as subtree pruning and regraft-      branch length, then the nonnegative value
ing. At that point, we have a tree that can-  of that branch length that minimizes Equa-
not be improved upon by moving any sin-       tion 11 will be zero, because Equation 11
gle group. This does not guarantee that we    is a quadratic in each of its three variables
have found the best topology, but it gives    and has positive curvature in each. Thus,
us some reassurance that we have at least     when a branch length computed from
made a serious attempt to find it.            Equation 14 becomes negative, it is instead
                                              set to zero. The branch lengths v^ vb, and
   In PAUP*, the strategy for searching       vc are determined in this way in turn until
among tree topologies is different and will   there is no further change. Although this
not be described here.                        procedure is not guaranteed always to find
                                              the best nonnegative values of the branch
      Avoiding Negative Branch Lengths        lengths, it seems to do quite well in prac-
   I have argued elsewhere (Felsenstein, tice.
1984,1986) that it is appropriate to find the
                                                      Complexity of the Computation
least squares tree among all those having
no negative branch lengths. The present al-      In calculating how much effort is in-
gorithm does not avoid negative branch volved in the computation, we must con-
lengths; it is entirely possible for one or sider how many topologies of each given
more of the solutions to Equations 13 to be size are examined and how much effort is
negative. However, it is quite easy to alter involved in optimizing the branch lengths
the algorithm to avoid negative branch for each topology. We start with three spe-
lengths; this is what is done by default in cies, adding the fourth in each of three
FITCH because allowing negative branch possible places. When we add the nth spe-
lengths is an option.                         cies, there are In - 5 possible places to
   When finding the optimum values of the     add it. There are also, for n > 4, 2n — 6
three branch lengths around an interior local rearrangements that will be tried, as-
node, suppose we require that Equation 11 suming that none of them causes the tree
be minimized without allowing any of v^ to be altered, which in turn would require
                                              additional rounds of rearrangement. The
vb/ or vc to become negative. If the solution addition of the nth species will thus cause
to Equations 13 does not find any negative 4n - 11 evaluations of the least squares
branch length, then there is no problem. If branch lengths.
one or more of the branch lengths is neg-        The evaluation of branch lengths on a
ative, in principle we should examine all tree requires four passes through the tree.
seven possible patterns of zero branch Each of these prunes the tree n — 3 times
lengths in which one, two, or all three of if it possesses n tips, and each time it is
the branch lengths are zero.                  pruned the distances of the central node to
   It would be possible to minimize Equa- all other nodes are updated. When there
tion 11 for each and pick the best solution, are m nodes remaining on the tree as it is
but I have followed a simpler and less ex- being pruned, this updating requires on
act strategy. Any of the branch lengths that the order of 2m — 3 operations.
have become negative are set to zero, and        Putting all of this together, one can show
the resulting values are taken as a starting that estimating a tree and its branch
point. Each of the three branch lengths is lengths requires on the order of n4 opera-
then considered in turn. Fixing the values tions. This means that the algorithm will
of the other two, the least squares solution be fairly slow for large numbers of tips.
for vn is                                     For 20 species, FITCH required 46.9 sec to
1997        FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES                              107
  TABLE 1. The immunological distance data set of Sarich (1969) symmetrized. Dog = Canis familiaris; bear
= Ursus americanus; raccoon = Procyon lotor; weasel = Mustela vison; seal = Phoca vitulina richardii; sea lion =
Zalophus californicus; cat = Felis domestica; monkey = Aotus trivirgatus. These data were used to construct the
tree given in Figure 2.

                Dog          Bear       Raccoon      Weasel         Seal      Sea lion       Cat       Monkey
Dog                0          32           48            51          50          43           98         148
Bear              32           0           26            34          29          33           84         136
Raccoon           48          26            0            42          44          44           92         152
Weasel            51          34           42             0          44          38           86         142
Seal              50          29           44            44           0          24           89         142
Sea lion          48          33           44            38          24           0           90         142
Cat               98          84           92            86          89          90            0         148
Monkey           148         136          152           142         142         142          148           0


execute on my DECstation 5000/125, a ma-                  Consider the data set of Sarich (1969),
chine whose SPECfp92 rating is approxi-                   which is reproduced in Table 1. Running
mately 25 (so that it is approximately as                 FITCH on this data set without allowing
fast as a 486DX / 66). For 40 species (the 20             negative branch lengths, using the Fitch-
species were a random sample from these                   Margoliash criterion, gives the tree shown
40), it took 730 sec, which is 15.565 times               in Figure 2, which is rooted using the
longer, very close to the expected multi-                 monkey as an outgroup. The weighted
plier of 24 = 16.                                         sum of squares for this tree is 0.06996. Fig-
                                                          ure 3 shows the progress of branch length
            A NUMERICAL EXAMPLE                           iteration when we take this tree topology
  We can get some feel for the progress of                and initially set all branch lengths equal to
the iteration by watching it in an example.

               7.53       seal sea lion                       100   -
                                                  387
              weasel




                                                                                    Iteration
                                                             FIGURE 3. Change in branch lengths over eight
                             monkey                       passes through the tree in Figure 2, starting with ar-
                                                          bitrary initial values. The lengths essentially reach
  FIGURE 2. The least squares estimate for the Sarich     their final values after four passes. Because each iter-
(1969) data set, using the Fitch-Margoliash criterion     ation consists of successive changes around each of
and the algorithm described in this paper. Horizontal     the six interior nodes, each time interval is divided
distance is proportional to branch length. The tree has   into sixths and each change is shown at the point that
been rooted with monkey as the outgroup.                  it occurred.
108                                 SYSTEMATIC BIOLOGY                                        VOL. 46

1. Eight passes through the tree are shown;                            Xenopus
the branch lengths essentially cease chang-                            Latimeria
ing after the first four.                                                Sebastolobus
                                                                    '— Squalus
   As expected, the iteration succeeds rap-
                                                                        Petromyzon
idly. For moderately clean data, the itera-                                                   • Myxine
tion computes the internal branch lengths                            —    Branchiostoma
using Equations 13, in effect using weight-                            •     Styela
ed averages of the distances between                                   '          Herdmania
members of the three different subtrees                                Saccoglossus'
defined by an internal node. Once the tree                               Ophiopholis
approaches its final branch lengths, if there                        Strongylocentrotus
is not serious internal conflict in the data                  Placopecten
these weighted averages depend rather lit-                         Limicolaria
tle on the details of the branch lengths                         Eurypelmia
within the subtrees, and hence the branch                       - Tenebrio
lengths rapidly stabilize. It is the very in-     FIGURE 4. The tree produced by FITCH on the 16S
dependence of evolution in the different        rRNA data set of Turbeville et al. (1994), when dis-
subtrees, the very "treeness" of the data,      tances are computed by the Kimura (1980) two-pa-
                                                rameter method with a transition/transversion ratio
that ensures the success of this alternating    of 2.0. The alignment and selection of sites used was
least squares approach.                         that recommended by Turbeville et al.
   AN EXAMPLE WITH SEQUENCE DATA
   To give a better feel for how the present
algorithm copes with real data, I analyzed      lengths to be nonnegative, is the innovative
the metazoan ribosomal RNA (rRNA) data          method of De Soete (1983), which starts
set of Turbeville et al. (1994), which in-      with the observed distances and then
cludes 16 species, many of them chordates       gradually brings them closer to additive
or their near relatives. These data can be      tree distances while searching for the least
retrieved from the alignments section of        squares fit. This method is quite different
the EMBL database as alignment DS16914.         from the present approach, which searches
Both their analysis and this one used only      in the space of additive trees. De Soete's
a subset of better aligned sites. FITCH         method approaches that space from out-
took 17.6 sec to analyze these distances on     side. Therefore, it may carry out a far more
a 486DX / 33 running the Linux operating        effective search for the tree topology that
system. On a Digital Alphastation 400           leads to the minimum sum of squares.
4/233, it took 1.46 sec. The tree (Fig. 4) is      The most widely used implementation
identical to the tree produced by Turbe-        of De Soete's method is probably Isadt, a C
ville et al. using the same program. In the     program by Michael Maciukenas distrib-
study by Turbeville et al., the trees pro-      uted in Stephen Smith's Genetic Data En-
duced by FITCH were similar to those pro-       vironment (GDE) package of programs for
duced by neighbor joining and by likeli-        DNA analysis. To test whether the present
hood and to those produced by parsimony         method had any advantages over Isadt, I
analysis of a combined molecular and            simulated the evolution of 20 DNA se-
morphological data set, and these trees all     quences on a tree. The tree was generated
supported the existence of the Chordata.        by a branching process, with the rate of
These trees differed from the tree pro-         branching of a lineage being 1 per unit
duced by parsimony analysis of the molec-       time. Branching was continued until the
ular data alone. The utility of a comparison    21st lineage was just about to be produced,
based on a single case is, however, limited.    and then the process was stopped.
                                                   DNA sequences 300 bases in length were
 COMPARISON WITH DE SOETE'S METHOD              stochastically evolved along the resulting
   An approach to searching over additive       trees. Each branch had an expected rate of
trees, constraining the estimated branch        change per unit time of either 0.066667 or
1997     FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES                      109

0.2, these values being chosen with equal
probability. Thus the resulting trees were
not ultrametric when the branch lengths
relevant to base change are considered. All
sites changed with equal probabilities, and
a Kimura two-parameter model (Kimura,
1980) was used, with an instantaneous
transition/trans version ratio of 2.
   One hundred trees and data sets were
produced by this simulation. Of these, 30       J
had at least two identical DNA sequences          FIGURE 5. Tree form used in the computation of
on them. The resulting distances in those       the lengths of branches i and ; in the neighbor-joining
cases caused Isadt to terminate with an er-     method.
ror message. The remaining 70 distance
matrices were analyzed with both pro-
grams. The sum of squares of the fit from       pologies and/or better branch lengths than
the Isadt results is not provided by that       can De Soete's method, in spite of the con-
program; it was computed using the esti-        straint on its search algorithm to stay with-
mated trees that emerged from Isadt as          in the space of additive trees. It is not clear
user trees whose branch lengths were not        how other implementations of De Soete's
to be reestimated in a run of FITCH. In         method would perform in these compari-
analysis of the distance matrices by            sons.
FITCH, the unweighted least squares
method of Cavalli-Sforza and Edwards             RELATIONSHIP TO THE NEIGHBOR-JOINING
(1967) was used because Isadt was also us-                     METHOD
ing that unweighted criterion.                     The neighbor-joining method of Saitou
   In every case, Isadt was at least 10 times   and Nei (1987) has become popular be-
faster than FITCH. Of the 70 distance ma-       cause of its speed: its execution time is pro-
trices on which both methods were run, 4        portional to the cube of the number of spe-
had results that were identical to five dec-    cies. Simulation studies (Kuhner and
imal places. In all 66 of the matrices for      Felsenstein, 1994) have shown it to be near-
which the results were different, FITCH         ly as effective as the Fitch-Margoliash
gave a lower sum of squares. For many of        method in recovering the true phylogeny.
these, the differences were small, indicat-     It estimates the lengths of branches to two
ing that some accuracy may have been sac-       tips that are "neighbors" on the tree and
rificed for speed in Isadt. But for 25 of the   then removes these and replaces them
66 matrices, the Isadt sum of squares was       with a new tip. Distances are calculated
>1% greater, and for 5 matrices it was          from the new tip to all other tips currently
>10% greater. When the trees were exam-         on the tree. Saitou and Nei showed that the
ined more closely, for 24 matrices the tree     step that estimates the branch lengths of
topologies estimated by Isadt and FITCH         two neighbors makes a least squares esti-
differed. In none of these was the differ-      mate, by the unweighted criterion of Cav-
ence due to rearrangement of zero-length        alli-Sforza and Edwards (1967) of the
branches. For 11 of these 24 matrices, the      branch lengths v{ and vjf for a tree that has
sum of squares differed by >1%, and for         i and j as neighbors but has all the other
2 matrices it differed by >10%. Thus, for       tips that remain on the tree branching
14 matrices, two trees of the same topolo-      from a multifurcating node. Figure 5
gy differed in sum of squares by >1%, and       shows the tree topology to which this least
for 3 matrices the sum of squares differed      squares estimate applies. Neighbor joining
by >10%.                                        may thus be regarded as an approximation
   These results suggest that the present       to the least squares algorithm of Cavalli-
method can sometimes find better tree to-       Sforza and Edwards (1967). It differs from
110                                SYSTEMATIC BIOLOGY                                      VOL. 46

the present method in that having settled       the sum of the lengths of these branches.
on branch lengths vt and vf it never returns    The present algorithms can thus be used
to that part of the tree to reestimate them.    to do the branch length calculations in a
   Saitou and Nei's algorithm "prunes" the      minimum evolution method. It is thus not
tree. Its recalculation of distances from       hard to modify a program that infers phy-
node k to each remaining tip closely par-       logenies by least squares to make one that
allels the current method: it uses              infers them by minimum evolution. This
                                                option will be provided in future versions
         Du = (Du + Dn - DJ/2.          (15)
                                                of FITCH.
Because in neighbor joining, as in the cur-
                                                                    SUMMARY
rent method, for two neighbors on the tree
A; = vi + v)i w e c a n substitute this equa-      In this paper, I have not described a new
tion into Equation 15 and rearrange it into     distance matrix method. Instead, I have in-
the case of Equation 8 for which wa = w]X       troduced a new computational framework
= 1, as is true for the Cavalli-Sforza and      for the long-existing least squares family of
Edwards criterion.                              distance matrix methods. This framework,
   In the neighbor-joining method, the val-     by making effective use of the structure of
ues of vt and vf are determined when the        the tree, allows us to maintain a constraint
tree is as shown in Figure 5. Because the       of nonnegativity of branch lengths. The al-
values are identical to those one would get     gorithm "prunes" the sum of squares on
from Cavalli-Sforza and Edwards's un-           the tree in a natural way. This process may
weighted least squares criterion, they are      be helpful for other tree calculations that
also identical to the values that the present   use least squares. It can serve as the basis
algorithm would give to v{ and vf if itera-     for either least squares estimation of the
tion was done with the tree structure of        tree or minimum evolution estimation and
Figure 5. The relative success of the neigh-    has some relationship to the neighbor-join-
bor-joining algorithm in approximating          ing method. When used in connection
the least squares solution to a completely      with local rearrangement of the tree, it
resolved tree suggests that the estimate is     seems more effective, if slower, than De
not very sensitive to the details of the res-   Soete's innovative algorithm for searching
olution of the multifurcation in Figure 5.      the space of trees. It forms the basis of the
David Swofford (pers. comm.) has pointed        least squares distance matrix calculations
out that this may also be the reason for the    in the FITCH program of PHYLIP and in
rapidity with which the present algorithm       PAUP*. Having been in distribution in
converges, as shown in Figure 3. Because        PHYLIP since 1982, it has been used to
vt and Vj are not very sensitive to the other   compute most of the least squares trees
details of the structure of the tree, they      that have been published in the systemat-
reach reasonable values very rapidly.           ics and molecular evolution literature.
   We may thus regard the neighbor-join-                       ACKNOWLEDGMENTS
ing method as a quick and fairly accurate         I thank David Swofford for illuminating comments
approximation to the unweighted least           in discussion of these matters and John Huelsenbeck,
squares method.                                 Peter Beerli, and Mary Kuhner for corrections of the
                                                manuscript and helpful suggestions. This work was
      RELATIONSHIP TO THE MINIMUM               supported in part by task agreement number
           EVOLUTION METHOD                     DE-AT06-76EV71005 of contract number DE-AM06-
                                                76RLO2225 between the U.S. Department of Energy
  The minimum evolution distance matrix         and the University of Washington, Jby NSF grant num-
method (Kidd and Sgaramella-Zonta,              bers BSR-8614807, DEB-9207558, and BIR-9527687,
                                                and by NIH grants 2 R55 GM41716-04 and 1 R01
1971; Rzketsky and Nei, 1992) searches          GM51929-01.
among tree topologies. For each tree to-
pology, it evaluates branch lengths by least                      REFERENCES
squares fitting. The topology is evaluated      CAVALLI-SFORZA, L. L., AND A. W. F. EDWARDS. 1967.
not by the overall sum of squares but by          Phylogenetic analysis: methods and estimation pro-
1997        FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES                             111
  cedures. Evolution 21:550-570. [Am. J. Hum. Genet.       for estimating and testing minimum-evolution
  Suppl. 19:233-257.]                                      trees. Mol. Biol. Evol. 9:945-967.
DE SOETE, G. 1983. A least squares algorithm for fit-    SAITOU, N., AND M. NEI. 1987. The neighbor-joining
  ting additive trees to proximity data. Psychometrika     method: A new method for reconstructing phylo-
  48:621-626.                                              genetic trees. Mol. Biol. Evol. 4:406-425.
FARRIS, J. S. 1972. Estimating phylogenetic trees from   SARICH, V. M. 1969. Pinniped origins and the rate of
  distance matrices. Am. Nat. 106:645-668.                 evolution of carnivore albumins. Syst. Zool. 19:286-
FELSENSTEIN, J. 1984. Distance methods for inferring       295.
  phylogenies: A justification. Evolution 38:16-24.      SWOFFORD, D. L. 1996. PAUP*: Phylogenetic analysis
FELSENSTEIN, J. 1986. Distance methods: reply to Far-      using parsimony (*and other methods). Sinauer,
  ris. Cladistics 2:130-143.                               Sunderland, Massachusetts.
FITCH, W. M., AND E. MARGOLIASH. 1967. Construc-         SWOFFORD, D. L., AND G. J. OLSEN. 1990. Phylogeny
  tion of phylogenetic trees. Science 155:279-284.         reconstruction. Pages 411-501 in Molecular system-
KIDD, K. K., AND L. A. SGARAMELLA-ZONTA. 1971.             atics, 1st edition (D. M. Hillis and C. Moritz, eds.).
  Phylogenetic analysis: Concepts and methods. Am.         Sinauer, Sunderland, Massachusetts.
  J. Hum. Genet. 23:235-251.                         TURBEVILLE, J. M., J. R. SCHULZ, AND R. A. RAFF. 1994.
KIMURA, M. 1980. A simple model for estimating evo-    Deuterostome phylogeny and the sister group of the
  lutionary rates of base substitutions through com-   chordates: Evidence from molecules and morphol-
  parative studies of nucleotide sequences. J. Mol.    ogy. Mol. Biol. Evol. 11:648-655.
  Evol. 16:111-120.                                  WOLD, H. 1966. Nonlinear estimation by iterative
KUHNER, M. K., AND J. FELSENSTEIN. 1994. A simu-
                                                       least squares procedures. Pages 411—444 in Research
  lation comparison of phylogeny algorithms under      papers in statistics. Festschrift for J. Neyman (F. N.
  equal and unequal evolutionary rates. Mol. Biol.     David, ed.). John Wiley and Sons, London.
  Evol. 11:459-468 (erratum 12:525, 1995).               Received 6 March 1995; accepted 1 October 1996
RZHETSKY, A., AND M. NEI. 1992. A simple method          Associate Editor: Richard Olmstead