Document Sample

Syst. Biol. 46(l):101-lll/ 1997 AN ALTERNATING LEAST SQUARES APPROACH TO INFERRING PHYLOGENIES FROM PAIRWISE DISTANCES JOSEPH FELSENSTEIN Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, USA; E-mail: joe@genetics.zvashington.edu Abstract.—A computational method is presented for minimizing the weighted sum of squares of the differences between observed and expected pairwise distances between species, where the expectations are generated by an additive tree model. The criteria of Fitch and Margoliash (1967, Science 155:279-284) and Cavalli-Sforza and Edwards (1967, Evolution 21:550-570) are both weighted least squares, with different weights. The method presented iterates lengths of adjacent branches in the tree three at a time. The weighted sum of squares never increases during the process of iteration, and the iterates approach a stationary point on the surface of the sum of squares. This iterative approach makes it particularly easy to maintain the constraint that branch lengths never become negative, although negative branch lengths can also be allowed. The method is implemented in a computer program, FITCH, which has been distributed since 1982 as part of the PHYLIP package of programs for inferring phylogenies, and is also implemented in PAUP*. The present method is compared, using some simulated data sets, with an implementation of the method of De Soete (1983, Psychometrika 48:621-626); it is slower than De Soete's method but more effective at finding the least squares tree. The relationship of this method to the neighbor- joining method is also discussed. [Alternating least squares; distance methods; Fitch-Margoliash method; phylogenies.] Distance matrix methods have long been used for inferring phylogenies, and their popularity has increased in recent years as (T being the set of all tips on the tree) so potential users have become aware of long- that they were proposing a least squares fit branch attraction problems that can afflict of observed to expected distances. The sta- parsimony methods. They have been par- tistical model implicit in this criterion is ticularly widely used with molecular se- that the distances D,7 are distributed inde- quences. There are a wide variety of dif- pendently, with mean dtj and variance pro- ferent distance matrix methods, including portional to the reciprocal of w{j. Fitch and neighbor joining (Saitou and Nei, 1987), Margoliash took the variance of the dis- minimum evolution (Kidd and Sgaramel- tances to be proportional to their expecta- la-Zonta, 1971; Rzhetsky and Nei, 1992), tions dtj and approximated this by choos- and the least squares methods. The least ing as their weight the squared inverse of squares methods are of particular interest the observed distance because they use a single objective func- tion to solve for branch lengths and to wtj = 1/Djj, (2) choose among tree topologies and can thus whereas Cavalli-Sforza and Edwards as- be related to the least squares method of sumed homogeneity of variances and thus statistical estimation. Some, although not chose w{j = 1. In both methods, we have all, computer simulations (cf. Kuhner and an additive tree model: the expectations d(j Felsenstein, 1994) have shown that least are the sums of the branch lengths along a squares methods perform better than do path from species i to species ; in an un- the other distance-based methods. rooted tree whose tips (terminal nodes) Fitch and Margoliash (1967) and Cavalli- are the species. The branch lengths are to Sforza and Edwards (1967) proposed dif- be estimated by minimizing the weighted ferent but closely related criteria for fitting sum of squares Q. I have reviewed the bi- trees to distance matrices. Their criteria ological and statistical issues involved in were both of the form inferring phylogenies from distance matri- 101 102 SYSTEMATIC BIOLOGY VOL. 46 ces (Felsenstein, 1984); references to other distance methods will be found in that pa- per. One of the difficulties with least squares methods has been computing the branch k 9 Vi lengths. These can be computed for a giv- en tree topology by solving a set of linear equations (Cavalli-Sforza and Edwards, 1967), one for each branch. However, this computation can be difficult, and it also does not allow us to constrain the branch lengths to be nonnegative. The purpose of this paper is to provide details of an alternating least squares method that attempts to minimize Q for a given tree topology. It works for any weighting system for which the weights do not depend on the expected distances d{j and lends itself particularly well to the maintenance of the constraint that the FIGURE 1. The tree used to illustrate the pattern of branch lengths not become negative. The least squares computations. iterations result in a sequence of values of Q that are monotonic and nonincreasing and converge to a stationary point on the THE METHOD least squares surface. Combined with an Notation algorithm for searching among tree topol- Figure 1 illustrates the notation em- ogies, this alternating least squares meth- ployed. Some interior node of the unrooted od forms the basis of a computer program, tree is numbered 0. All other nodes are FITCH, which has been distributed as part numbered in some order 1, . . . , 2n — 3, of the PHYLIP package since early 1982. where n is the number of tips. It is usually PHYLIP executables, source code, and doc- most convenient to number the tips 1 umentation are available on the World through n. The branches of the tree are Wide Web at http://evolution.genetics. also numbered: each is assigned the num- washington.edu/phylip.html. The algo- ber of the node at the end furthest from rithm described in this paper also forms node 0. In Figure 1, tips i, j , and k and the basis for the least squares distance ma- node / are shown. The length of branch i trix method in the program PAUP* (Swof- is denoted vt. ford, 1996). The tree topology will be specified by constants xijk/ where x»k is 1 if branch k lies The alternating least squares method on the path from node i to node ; and 0 (e.g., Wold, 1966) depends on a transfor- otherwise. In practice, the array x need mation that temporarily reduces the di- not be formed because its elementsijik be can mensionality of the problem. A method is recovered as needed from the data struc- presented here for finding the least squares tures representing the tree. The D and the tj branch lengths for three branches of the wtj are assumed to be given at the begin- tree at a time by "pruning" all but those ning of the computation for all pairs of tips branches from the tree and solving exactly i and ;'. for the remaining three branch lengths. By repeating this operation for different parts Pruning the Tree of the tree, one approaches asymptotically Our immediate objective will be, start- a least squares solution. ing with some arbitrarily specified branch 1997 FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES 103 lengths, to minimize the sum of squares Q respect to all branch lengths except i and with respect to the lengths of the three j . Let S be the set of all remaining tips after branches incident on an interior node (e.g., i and ;' have been removed and after the node 0), holding all other branch lengths node at which they join, k, has been added constant. We can reduce the size of the to the set of tips. If k were actually a tip it problem by "pruning" the tree, i.e., remov- would have distances Dkl to all other tips ing tips i and ;' and replacing the interior still on the tree. In this smaller tree, which node k with a new tip, in such a way that lacks segments i and ; but has k, the sum the sum of squares for this new tree as a of squares is function of the lengths of its remaining branches differs from the sum of squares Q by a calculable constant. This constant does not depend on the three branch the dm being the sums of branch lengths lengths whose lengths we are to improve. between p and q in this pruned tree. We By pruning the tree repeatedly we can fi- would like to find a constant K, a set of nally reduce it to a three-species tree distances Dkl, and a set of weights wkl such whose central node is node 0. The least that Q = R + K, where K does not depend squares branch lengths for this tree can be on the lengths of any of the branches ex- easily computed, and the value of Q for the cept i and ;'. If we can do this, then the full tree differs from that of this three-spe- quadratic form R will be minimized at the cies tree by a known constant. This means same values of the remaining branch that we can pick any interior node, can lengths as will Q. find for the three branches incident upon We can find K, the Dkl, and the wkl by a it the branch lengths that minimize Q, and process of equating coefficients in R to cor- can then find the value of Q. We can move responding terms in Q. We consider Q and around the tree, doing this successively for R as functions of the branch lengths vu. For different interior nodes. The result is the all pairs of tips (p, q) in S, neither of which required iterative method for minimizing can be equal to i, j , or k, the terms in R are the sum of squares Q. identical to those in Q. This leaves us with Considered as a function of the un- only terms involving i or ; in Q or k in R. known branch lengths vir the sum of Taking the difference Q — R, we can write squares is this as - dny (3) + wlj{Dij - dify where T is the set of all tips and B is the set of all branches of the tree. This is a qua- - 2 w«(DH - dkly. (5) dratic form in the v{. It could be minimized by differentiating with respect to the vt and The lengths of the branches other than i equating the derivatives to zero to obtain and ; remain in Equation 5 only in the normal equations for the vif but this ap- quantities dkl. Also, proach will not be taken here. We want to replace the problem of minimizing Q with respect to the full set of the v{ by a smaller = dk (6) problem. Consider the phylogeny in Figure 1. For the moment we revert to the simpler Inserting the expressions in Equations 6 Equation 1, keeping in mind that the di] are into Equation 5 and collecting the terms in sums of branch lengths. Suppose that we dkl and d\\, the coefficient of d£, in Q — R is were to regard the lengths of the branches wkl - wa - wn connected to two adjacent tips, i and ;, as constants and consider minimizing Q with and the coefficient of dkl is 104 SYSTEMATIC BIOLOGY VOL. 46 -2wklDk 2waDa 2wnDn 0 in any of the three possible directions, one finds a rooted bifurcating tree. In any rooted bifurcating tree having two or more If the difference between Q and R is to be tips, there are always at least two tips a constant independent of the lengths of branching from the same interior node: in the branches remaining on the pruned tree, Figure 1 tips i and ;' are two of these, but then the above coefficients must all be three other such pairs are also visible. zero. Equating them to zero, we can solve Equations 7, 8, and 10 permit us to reduce for the wkl and the Dkl: the size of the problem by removing two wkl = wa + (7) tips and creating one new one while not affecting the values of the least squares es- and timates of the lengths of the remaining branches. We can continue this process, re- •• (8) peatedly applying these equations until only three tips are left, all connected di- Because the terms in dkl have been elim- rectly to the designated interior node inated from Q — R, we have now elimi- (node 0). Designate these three nodes as a, nated all terms that could contain any of b, and c. the branch lengths other than v( and vr The The problem is now reduced to mini- difference between Q and R must then be mizing the sum of squares a constant not containing the lengths of any branches other than i and /'. We can Q = K + wab(Dab - v a - vby find it by collecting all the terms in Equa- + wbc(Dbc - vb - vcf tion 5 that do not have dkl in them: + zvac(Dac - va - Vc)2. (11) As Farris (1972) pointed out, the mini- mum of Equation 11 is achieved when the rightmost three terms become zero when (9) the branch lengths va, vb, and vc are chosen to satisfy the equations and using Equations 7 and 8, Equation 9 can be reduced after some algebra to vb Dbc = vb vc Q-R = w,,-D? + 2 Wi/W// (12) us Vc, X[(Dfl - *,-) - (D;7 - *,.)]2 these being the normal equations that are derived by differentiating Q with respect -s-(wfl + wn), (10) to the branch lengths and equating the deriv- which can never be negative. atives to zero. The solution (Farris, 1972) is Equations 7, 8, and 10 give us the means v. = (Dab + Dac - Dbc)l2 of reducing the size of the problem from n vb = (Dab + Dbc - Dac)/2 tips to n — 1 tips, dropping two branch vc = (Dac + Dbc - DJ/2. (13) lengths from the minimization problem. This process can be used to construct an Having used Equations 7 and 8 to com- alternating least squares method for mini- pute the quantities wab, wbc, wac, Dab, Dbc, and mizing Q. Dac, we have minimized the sum of squares Q with respect to the three branch Iteration of Branch Lengths lengths va, vb, and vc, holding all the other Consider only the case of an unrooted branch lengths constant. It follows that Q bifurcating tree, which has each interior is nonincreasing during each such step. node connected to three neighbors. As can If we follow the strategy of moving be seen in Figure 1, if one proceeds out- through the tree, taking each interior node wards from an internal node such as node of the tree in turn, pruning the tree to re- 1997 FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES 105 duce the problem to minimization of Q These are connected into an unrooted tree with respect to the three branch lengths in- (only one topology is possible), and Equa- cident on that node, and finding the opti- tions 13 are used to solve for least squares mum lengths for those three branches, the branch lengths. The program then consid- successive values of the sums of squares Q ers where the fourth species can be added cannot increase: they will form a monoton- to the tree. There are three possible places ic nonincreasing sequence bounded below that a new internal node could be added, by zero. The sequence of values of Q must with the fourth species connected to it, and thus converge; it seems a reasonable ex- these are on the interiors of the three pectation that the sequence of values of the branches. Each of these positions is tried branch lengths will also converge. in turn, and the least squares branch The iteration at each stage sets the deriv- lengths are found for each of the resulting atives of Q with respect to three of the topologies. The program accepts that branch lengths to zero. When further iter- placement of the species that results in the ation through all interior nodes of the tree smallest sum of squares. produces no change, it follows that the de- The algorithm continues in this fashion, rivatives of Q with respect to all branch adding each species to all possible places lengths are zero, so that we have reached in the tree and picking the placement that a stationary point. We cannot guarantee minimizes the sum of squares after itera- from this that the stationary point is a min- tive computation of the least squares imum. However, a glance at Equation 1 branch lengths. It is convenient to start the shows that if the wi] are nonnegative the iteration for each topology by calculating quadratic form Q cannot ever be negative; the lengths of the branches incident upon this fact is sufficient to guarantee that no the newly introduced node, because that saddle points or maxima can exist. We provides starting values for their lengths. have not ruled out that there could be di- The iteration for each topology proceeds rections in which the sum of squares might by traversing the tree outwards from that be unchanging. This is the case where the node, optimizing branch lengths for each quadratic form is positive semidefinite. In internal node encountered. Although it such a case the algorithm will reach one of would perhaps be best to repeat the tra- the tied points along the line (or plane) of versal until the sum of squares stopped equally good solutions. We can in any case changing, for the data sets I have encoun- guarantee that we have minimized the tered four passes through the tree is quite quadratic form of Equation 1. sufficient. An alternative scheme would of course After each species is added to the tree, be to set up and solve the linear equations except the fourth species, a series of local that are the normal equations for the least rearrangements is carried out to see if the squares problem. The present method tree can be improved. A local rearrange- amounts to an iterative approach to solv- ment switches the order of adjacent ing these equations, differing from conven- branches in the tree. In the present version tional iterative approaches such as Gauss- of the program all local rearrangements Seidel iteration by changing the variables are tried after each species is added. If any three at a time instead of one at a time. The local rearrangements improve the tree, as present method also avoids directly setting judged by the sum of squares, the rear- up the equations and takes advantage of rangement process is continued until no the sparseness of the coefficient matrix in further improvement by local rearrange- a natural way. It also enables us to main- ment is possible. tain the constraint that branch lengths not be negative, if that is desired. The user has the option of specifying that, after the last species is added to the The Algorithm tree, the last bout of rearrangements The strategy used in the program should be global. In that case, each subtree FITCH starts with the first three species. is removed from the tree and reinserted in 106 SYSTEMATIC BIOLOGY VOL. 46 all possible places, the best of these being A = [wJP* - vb) + wac(Dac - vc)] chosen and the process continued until no +(wab + wac). (14) further improvement results. Swofford and Olsen (1990) described this rearrangement If the minimum occurs at a negative strategy as subtree pruning and regraft- branch length, then the nonnegative value ing. At that point, we have a tree that can- of that branch length that minimizes Equa- not be improved upon by moving any sin- tion 11 will be zero, because Equation 11 gle group. This does not guarantee that we is a quadratic in each of its three variables have found the best topology, but it gives and has positive curvature in each. Thus, us some reassurance that we have at least when a branch length computed from made a serious attempt to find it. Equation 14 becomes negative, it is instead set to zero. The branch lengths v^ vb, and In PAUP*, the strategy for searching vc are determined in this way in turn until among tree topologies is different and will there is no further change. Although this not be described here. procedure is not guaranteed always to find the best nonnegative values of the branch Avoiding Negative Branch Lengths lengths, it seems to do quite well in prac- I have argued elsewhere (Felsenstein, tice. 1984,1986) that it is appropriate to find the Complexity of the Computation least squares tree among all those having no negative branch lengths. The present al- In calculating how much effort is in- gorithm does not avoid negative branch volved in the computation, we must con- lengths; it is entirely possible for one or sider how many topologies of each given more of the solutions to Equations 13 to be size are examined and how much effort is negative. However, it is quite easy to alter involved in optimizing the branch lengths the algorithm to avoid negative branch for each topology. We start with three spe- lengths; this is what is done by default in cies, adding the fourth in each of three FITCH because allowing negative branch possible places. When we add the nth spe- lengths is an option. cies, there are In - 5 possible places to When finding the optimum values of the add it. There are also, for n > 4, 2n — 6 three branch lengths around an interior local rearrangements that will be tried, as- node, suppose we require that Equation 11 suming that none of them causes the tree be minimized without allowing any of v^ to be altered, which in turn would require additional rounds of rearrangement. The vb/ or vc to become negative. If the solution addition of the nth species will thus cause to Equations 13 does not find any negative 4n - 11 evaluations of the least squares branch length, then there is no problem. If branch lengths. one or more of the branch lengths is neg- The evaluation of branch lengths on a ative, in principle we should examine all tree requires four passes through the tree. seven possible patterns of zero branch Each of these prunes the tree n — 3 times lengths in which one, two, or all three of if it possesses n tips, and each time it is the branch lengths are zero. pruned the distances of the central node to It would be possible to minimize Equa- all other nodes are updated. When there tion 11 for each and pick the best solution, are m nodes remaining on the tree as it is but I have followed a simpler and less ex- being pruned, this updating requires on act strategy. Any of the branch lengths that the order of 2m — 3 operations. have become negative are set to zero, and Putting all of this together, one can show the resulting values are taken as a starting that estimating a tree and its branch point. Each of the three branch lengths is lengths requires on the order of n4 opera- then considered in turn. Fixing the values tions. This means that the algorithm will of the other two, the least squares solution be fairly slow for large numbers of tips. for vn is For 20 species, FITCH required 46.9 sec to 1997 FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES 107 TABLE 1. The immunological distance data set of Sarich (1969) symmetrized. Dog = Canis familiaris; bear = Ursus americanus; raccoon = Procyon lotor; weasel = Mustela vison; seal = Phoca vitulina richardii; sea lion = Zalophus californicus; cat = Felis domestica; monkey = Aotus trivirgatus. These data were used to construct the tree given in Figure 2. Dog Bear Raccoon Weasel Seal Sea lion Cat Monkey Dog 0 32 48 51 50 43 98 148 Bear 32 0 26 34 29 33 84 136 Raccoon 48 26 0 42 44 44 92 152 Weasel 51 34 42 0 44 38 86 142 Seal 50 29 44 44 0 24 89 142 Sea lion 48 33 44 38 24 0 90 142 Cat 98 84 92 86 89 90 0 148 Monkey 148 136 152 142 142 142 148 0 execute on my DECstation 5000/125, a ma- Consider the data set of Sarich (1969), chine whose SPECfp92 rating is approxi- which is reproduced in Table 1. Running mately 25 (so that it is approximately as FITCH on this data set without allowing fast as a 486DX / 66). For 40 species (the 20 negative branch lengths, using the Fitch- species were a random sample from these Margoliash criterion, gives the tree shown 40), it took 730 sec, which is 15.565 times in Figure 2, which is rooted using the longer, very close to the expected multi- monkey as an outgroup. The weighted plier of 24 = 16. sum of squares for this tree is 0.06996. Fig- ure 3 shows the progress of branch length A NUMERICAL EXAMPLE iteration when we take this tree topology We can get some feel for the progress of and initially set all branch lengths equal to the iteration by watching it in an example. 7.53 seal sea lion 100 - 387 weasel Iteration FIGURE 3. Change in branch lengths over eight monkey passes through the tree in Figure 2, starting with ar- bitrary initial values. The lengths essentially reach FIGURE 2. The least squares estimate for the Sarich their final values after four passes. Because each iter- (1969) data set, using the Fitch-Margoliash criterion ation consists of successive changes around each of and the algorithm described in this paper. Horizontal the six interior nodes, each time interval is divided distance is proportional to branch length. The tree has into sixths and each change is shown at the point that been rooted with monkey as the outgroup. it occurred. 108 SYSTEMATIC BIOLOGY VOL. 46 1. Eight passes through the tree are shown; Xenopus the branch lengths essentially cease chang- Latimeria ing after the first four. Sebastolobus '— Squalus As expected, the iteration succeeds rap- Petromyzon idly. For moderately clean data, the itera- • Myxine tion computes the internal branch lengths — Branchiostoma using Equations 13, in effect using weight- • Styela ed averages of the distances between ' Herdmania members of the three different subtrees Saccoglossus' defined by an internal node. Once the tree Ophiopholis approaches its final branch lengths, if there Strongylocentrotus is not serious internal conflict in the data Placopecten these weighted averages depend rather lit- Limicolaria tle on the details of the branch lengths Eurypelmia within the subtrees, and hence the branch - Tenebrio lengths rapidly stabilize. It is the very in- FIGURE 4. The tree produced by FITCH on the 16S dependence of evolution in the different rRNA data set of Turbeville et al. (1994), when dis- subtrees, the very "treeness" of the data, tances are computed by the Kimura (1980) two-pa- rameter method with a transition/transversion ratio that ensures the success of this alternating of 2.0. The alignment and selection of sites used was least squares approach. that recommended by Turbeville et al. AN EXAMPLE WITH SEQUENCE DATA To give a better feel for how the present algorithm copes with real data, I analyzed lengths to be nonnegative, is the innovative the metazoan ribosomal RNA (rRNA) data method of De Soete (1983), which starts set of Turbeville et al. (1994), which in- with the observed distances and then cludes 16 species, many of them chordates gradually brings them closer to additive or their near relatives. These data can be tree distances while searching for the least retrieved from the alignments section of squares fit. This method is quite different the EMBL database as alignment DS16914. from the present approach, which searches Both their analysis and this one used only in the space of additive trees. De Soete's a subset of better aligned sites. FITCH method approaches that space from out- took 17.6 sec to analyze these distances on side. Therefore, it may carry out a far more a 486DX / 33 running the Linux operating effective search for the tree topology that system. On a Digital Alphastation 400 leads to the minimum sum of squares. 4/233, it took 1.46 sec. The tree (Fig. 4) is The most widely used implementation identical to the tree produced by Turbe- of De Soete's method is probably Isadt, a C ville et al. using the same program. In the program by Michael Maciukenas distrib- study by Turbeville et al., the trees pro- uted in Stephen Smith's Genetic Data En- duced by FITCH were similar to those pro- vironment (GDE) package of programs for duced by neighbor joining and by likeli- DNA analysis. To test whether the present hood and to those produced by parsimony method had any advantages over Isadt, I analysis of a combined molecular and simulated the evolution of 20 DNA se- morphological data set, and these trees all quences on a tree. The tree was generated supported the existence of the Chordata. by a branching process, with the rate of These trees differed from the tree pro- branching of a lineage being 1 per unit duced by parsimony analysis of the molec- time. Branching was continued until the ular data alone. The utility of a comparison 21st lineage was just about to be produced, based on a single case is, however, limited. and then the process was stopped. DNA sequences 300 bases in length were COMPARISON WITH DE SOETE'S METHOD stochastically evolved along the resulting An approach to searching over additive trees. Each branch had an expected rate of trees, constraining the estimated branch change per unit time of either 0.066667 or 1997 FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES 109 0.2, these values being chosen with equal probability. Thus the resulting trees were not ultrametric when the branch lengths relevant to base change are considered. All sites changed with equal probabilities, and a Kimura two-parameter model (Kimura, 1980) was used, with an instantaneous transition/trans version ratio of 2. One hundred trees and data sets were produced by this simulation. Of these, 30 J had at least two identical DNA sequences FIGURE 5. Tree form used in the computation of on them. The resulting distances in those the lengths of branches i and ; in the neighbor-joining cases caused Isadt to terminate with an er- method. ror message. The remaining 70 distance matrices were analyzed with both pro- grams. The sum of squares of the fit from pologies and/or better branch lengths than the Isadt results is not provided by that can De Soete's method, in spite of the con- program; it was computed using the esti- straint on its search algorithm to stay with- mated trees that emerged from Isadt as in the space of additive trees. It is not clear user trees whose branch lengths were not how other implementations of De Soete's to be reestimated in a run of FITCH. In method would perform in these compari- analysis of the distance matrices by sons. FITCH, the unweighted least squares method of Cavalli-Sforza and Edwards RELATIONSHIP TO THE NEIGHBOR-JOINING (1967) was used because Isadt was also us- METHOD ing that unweighted criterion. The neighbor-joining method of Saitou In every case, Isadt was at least 10 times and Nei (1987) has become popular be- faster than FITCH. Of the 70 distance ma- cause of its speed: its execution time is pro- trices on which both methods were run, 4 portional to the cube of the number of spe- had results that were identical to five dec- cies. Simulation studies (Kuhner and imal places. In all 66 of the matrices for Felsenstein, 1994) have shown it to be near- which the results were different, FITCH ly as effective as the Fitch-Margoliash gave a lower sum of squares. For many of method in recovering the true phylogeny. these, the differences were small, indicat- It estimates the lengths of branches to two ing that some accuracy may have been sac- tips that are "neighbors" on the tree and rificed for speed in Isadt. But for 25 of the then removes these and replaces them 66 matrices, the Isadt sum of squares was with a new tip. Distances are calculated >1% greater, and for 5 matrices it was from the new tip to all other tips currently >10% greater. When the trees were exam- on the tree. Saitou and Nei showed that the ined more closely, for 24 matrices the tree step that estimates the branch lengths of topologies estimated by Isadt and FITCH two neighbors makes a least squares esti- differed. In none of these was the differ- mate, by the unweighted criterion of Cav- ence due to rearrangement of zero-length alli-Sforza and Edwards (1967) of the branches. For 11 of these 24 matrices, the branch lengths v{ and vjf for a tree that has sum of squares differed by >1%, and for i and j as neighbors but has all the other 2 matrices it differed by >10%. Thus, for tips that remain on the tree branching 14 matrices, two trees of the same topolo- from a multifurcating node. Figure 5 gy differed in sum of squares by >1%, and shows the tree topology to which this least for 3 matrices the sum of squares differed squares estimate applies. Neighbor joining by >10%. may thus be regarded as an approximation These results suggest that the present to the least squares algorithm of Cavalli- method can sometimes find better tree to- Sforza and Edwards (1967). It differs from 110 SYSTEMATIC BIOLOGY VOL. 46 the present method in that having settled the sum of the lengths of these branches. on branch lengths vt and vf it never returns The present algorithms can thus be used to that part of the tree to reestimate them. to do the branch length calculations in a Saitou and Nei's algorithm "prunes" the minimum evolution method. It is thus not tree. Its recalculation of distances from hard to modify a program that infers phy- node k to each remaining tip closely par- logenies by least squares to make one that allels the current method: it uses infers them by minimum evolution. This option will be provided in future versions Du = (Du + Dn - DJ/2. (15) of FITCH. Because in neighbor joining, as in the cur- SUMMARY rent method, for two neighbors on the tree A; = vi + v)i w e c a n substitute this equa- In this paper, I have not described a new tion into Equation 15 and rearrange it into distance matrix method. Instead, I have in- the case of Equation 8 for which wa = w]X troduced a new computational framework = 1, as is true for the Cavalli-Sforza and for the long-existing least squares family of Edwards criterion. distance matrix methods. This framework, In the neighbor-joining method, the val- by making effective use of the structure of ues of vt and vf are determined when the the tree, allows us to maintain a constraint tree is as shown in Figure 5. Because the of nonnegativity of branch lengths. The al- values are identical to those one would get gorithm "prunes" the sum of squares on from Cavalli-Sforza and Edwards's un- the tree in a natural way. This process may weighted least squares criterion, they are be helpful for other tree calculations that also identical to the values that the present use least squares. It can serve as the basis algorithm would give to v{ and vf if itera- for either least squares estimation of the tion was done with the tree structure of tree or minimum evolution estimation and Figure 5. The relative success of the neigh- has some relationship to the neighbor-join- bor-joining algorithm in approximating ing method. When used in connection the least squares solution to a completely with local rearrangement of the tree, it resolved tree suggests that the estimate is seems more effective, if slower, than De not very sensitive to the details of the res- Soete's innovative algorithm for searching olution of the multifurcation in Figure 5. the space of trees. It forms the basis of the David Swofford (pers. comm.) has pointed least squares distance matrix calculations out that this may also be the reason for the in the FITCH program of PHYLIP and in rapidity with which the present algorithm PAUP*. Having been in distribution in converges, as shown in Figure 3. Because PHYLIP since 1982, it has been used to vt and Vj are not very sensitive to the other compute most of the least squares trees details of the structure of the tree, they that have been published in the systemat- reach reasonable values very rapidly. ics and molecular evolution literature. We may thus regard the neighbor-join- ACKNOWLEDGMENTS ing method as a quick and fairly accurate I thank David Swofford for illuminating comments approximation to the unweighted least in discussion of these matters and John Huelsenbeck, squares method. Peter Beerli, and Mary Kuhner for corrections of the manuscript and helpful suggestions. This work was RELATIONSHIP TO THE MINIMUM supported in part by task agreement number EVOLUTION METHOD DE-AT06-76EV71005 of contract number DE-AM06- 76RLO2225 between the U.S. Department of Energy The minimum evolution distance matrix and the University of Washington, Jby NSF grant num- method (Kidd and Sgaramella-Zonta, bers BSR-8614807, DEB-9207558, and BIR-9527687, and by NIH grants 2 R55 GM41716-04 and 1 R01 1971; Rzketsky and Nei, 1992) searches GM51929-01. among tree topologies. For each tree to- pology, it evaluates branch lengths by least REFERENCES squares fitting. The topology is evaluated CAVALLI-SFORZA, L. L., AND A. W. F. EDWARDS. 1967. not by the overall sum of squares but by Phylogenetic analysis: methods and estimation pro- 1997 FELSENSTEIN—FITTING TREES TO DISTANCES BY ALTERNATING LEAST SQUARES 111 cedures. Evolution 21:550-570. [Am. J. Hum. Genet. for estimating and testing minimum-evolution Suppl. 19:233-257.] trees. Mol. Biol. Evol. 9:945-967. DE SOETE, G. 1983. A least squares algorithm for fit- SAITOU, N., AND M. NEI. 1987. The neighbor-joining ting additive trees to proximity data. Psychometrika method: A new method for reconstructing phylo- 48:621-626. genetic trees. Mol. Biol. Evol. 4:406-425. FARRIS, J. S. 1972. Estimating phylogenetic trees from SARICH, V. M. 1969. Pinniped origins and the rate of distance matrices. Am. Nat. 106:645-668. evolution of carnivore albumins. Syst. Zool. 19:286- FELSENSTEIN, J. 1984. Distance methods for inferring 295. phylogenies: A justification. Evolution 38:16-24. SWOFFORD, D. L. 1996. PAUP*: Phylogenetic analysis FELSENSTEIN, J. 1986. Distance methods: reply to Far- using parsimony (*and other methods). Sinauer, ris. Cladistics 2:130-143. Sunderland, Massachusetts. FITCH, W. M., AND E. MARGOLIASH. 1967. Construc- SWOFFORD, D. L., AND G. J. OLSEN. 1990. Phylogeny tion of phylogenetic trees. Science 155:279-284. reconstruction. Pages 411-501 in Molecular system- KIDD, K. K., AND L. A. SGARAMELLA-ZONTA. 1971. atics, 1st edition (D. M. Hillis and C. Moritz, eds.). Phylogenetic analysis: Concepts and methods. Am. Sinauer, Sunderland, Massachusetts. J. Hum. Genet. 23:235-251. TURBEVILLE, J. M., J. R. SCHULZ, AND R. A. RAFF. 1994. KIMURA, M. 1980. A simple model for estimating evo- Deuterostome phylogeny and the sister group of the lutionary rates of base substitutions through com- chordates: Evidence from molecules and morphol- parative studies of nucleotide sequences. J. Mol. ogy. Mol. Biol. Evol. 11:648-655. Evol. 16:111-120. WOLD, H. 1966. Nonlinear estimation by iterative KUHNER, M. K., AND J. FELSENSTEIN. 1994. A simu- least squares procedures. Pages 411—444 in Research lation comparison of phylogeny algorithms under papers in statistics. Festschrift for J. Neyman (F. N. equal and unequal evolutionary rates. Mol. Biol. David, ed.). John Wiley and Sons, London. Evol. 11:459-468 (erratum 12:525, 1995). Received 6 March 1995; accepted 1 October 1996 RZHETSKY, A., AND M. NEI. 1992. A simple method Associate Editor: Richard Olmstead

DOCUMENT INFO

Shared By:

Categories:

Tags:
phylogenetic trees, inferring phylogenies, distance matrix, minimum evolution, pairwise distances, branch lengths, least squares, ultrametric inequality, phylogenetic tree, data sets, data set, confidence sets, confidence set, tree metric, weighted least-squares

Stats:

views: | 26 |

posted: | 2/18/2010 |

language: | English |

pages: | 11 |

OTHER DOCS BY qok10781

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.