Points of View
Syst. Biol. 51(3):524–527, 2002
Type 1 Error Rates of the Parsimony Permutation Tail Probability Test
MARK WILKINSON,1 PEDRO R. PERES-NETO,2 PETER G. FOSTER,1
AND CLIVE B. MONCRIEFF1
1Department of Zoology, The Natural History Museum, London SW7 5BD, UK;
2Department of Zoology, University of Toronto, Toronto, Ontario M5S 3G5, Canada
Archie (1989) and Faith and Cranston (1991) independently developed a parsimony-based randomization test for assessing the quality of a phylogenetic data matrix. Matrix randomization tests have had a mixed reception from phylogeneticists (e.g., Källersjö et al., 1992; Alroy, 1994; Carpenter et al., 1998; Wilkinson, 1998; Siddall, 2001). In general, however, these are well-founded statistical techniques (Manly, 1991) that may be well-suited to phylogenetic contexts where models or assumptions underlying parametric statistical methods are either difficult to justify or to test. In a matrix randomization test, a test statistic (typically a measure of data “quality”) is calculated for the original data, and the result is contrasted against a null distribution of the test statistic determined by repeated randomization of the data. Randomization is by random permutation of the assignment of character states to taxa within each character. Essentially, each character in the data set is independently shuffled so that congruence between characters is reduced to the extent that would be expected by chance alone. The random permutation preserves some features of the data that are known to affect measures of data quality, such as the total number of characters and taxa and the numbers of taxa with each character state within each character (Archie, 1989; Sanderson and Donoghue, 1989; Faith and Cranston, 1991). Thus the null distribution represents a distribution that one would expect from comparable phylogenetically uninformative data. The simplest parsimony-based matrix randomization tests use the length of the most-parsimonious trees (MPTs) as the test statistic, comparing this for real and randomly permuted data. A corresponding simple test statistic for the null hypothesis that the data are indistinguishable from random is the parsimony permutation tail probability or parsimony PTP (Faith and Cranston, 1991). The parsimony PTP is the proportion of data sets (real and randomly permuted) that yield MPTs as short or shorter than the MPTs for the original data.

Slowinski and Crother (1998) used 40 real data sets in an empirical evaluation of the utility of the parsimony PTP. Specifically, they compared PTPs with the fraction of clades supported by bootstrap proportions exceeding 50%. In addition, they compared PTPs with the resolution of strict component consensus trees. They reported that data sets that appear to be poorly structured, based on bootstrap analyses or because they have a poorly resolved strict component consensus, tend to have significant PTPs, and they concluded that (p. 300) “the PTP test is too liberal” and is of limited utility. Peres-Neto and Marques (2000) expressed concern at the use of one statistical test (the bootstrap) to evaluate another (parsimony PTP) and presented simulation studies that attempted to address the performance of the PTP test more directly. Their simulation studies involved performing PTP tests on randomly generated data. Because data are generated randomly, the null hypothesis is true and the number of times that the null hypothesis is rejected
correctly estimates the Type 1 error rate of the PTP test, that is, the probability of wrongly rejecting the null hypothesis when it is true. On the basis of their simulations, Peres-Neto and Marques (2000) reported unacceptably high Type 1 error rates for the parsimony PTP (e.g., >0.4 for nominal α = 0.05). These results, if correct, would undermine the utility of this parsimony randomization test and led Peres-Neto and Marques (2000:423) to suggest, “Perhaps it is time to propose new tools for assessing character covariation in phylogenetic data.” However, we discovered a mistake in the code they used to generate the “random” data, which invalidates the results of their study. Here we report more accurate Type 1 error rates for the parsimony PTP test, estimated by the simulation method of Peres-Neto and Marques (2000), for the range of parameter combinations they originally considered. Our results indicate that the parsimony PTP is a conservative test of the null hypothesis, thus underlining the potential utility of the test in phylogenetics.

MATERIALS AND METHODS

Type 1 error rates were estimated by using the simulation approach of Peres-Neto and Marques (2000). For a given set of parameters, multiple data sets were generated randomly and tested, and the proportion of tests yielding results significant at α = 0.05 was determined. The range of parameters used in the simulations followed that of Peres-Neto and Marques (2000). In the first set of simulations, 200 random binary data sets were generated, with the two states (0 and 1) being equally likely, for each of two values of the number of characters (40 and 80) and six values of the number of terminal taxa (increments of 5 up to 30). An all-zero outgroup was added to each random matrix, and a parsimony PTP test was performed using 1,000 permutations. The outgroup was included in tree length estimation but not in the permutation. The second set of simulations differed only in the random data consisting of four equiprobable states (0, 1, 2, 3) that were treated as unordered. The third set of simulations explored unequal frequencies of character states, using a larger sample (1,000) of randomly generated binary data sets of 40 characters for two cases: equiprobability of the states, and probabilities of 0.65 and 0.35 for states 0 and 1, respectively.

Simulations were carried out in two ways. First they were run with a corrected version of the software employed by Peres-Neto and Marques (2000), which used Hennig86 (Farris, 1988) to perform parsimony analysis; for this, we used the exact (ie) algorithm. They were also run with independently developed software using PAUP* (Swofford, 1998) to perform the parsimony PTP tests with heuristic searches (10 random addition sequences and TBR branch swapping). All simulations were replicated by using both systems.

The addition of an all-zero outgroup to the randomly generated data sets was intended solely to emulate the original study; we do not consider this an essential part of the simulation process. Given that the “ingroup” data were randomly permuted, the unpermuted outgroup would always be random with respect to the ingroup. Simulations performed without the addition of an outgroup yield similar results (not shown).

RESULTS

TABLE 1. Type 1 error rates of the parsimony PTP measured as the proportions of trials of randomly generated data sets yielding PTPs ≤ 0.05 by using (A) binary characters and equiprobable character states, (B) four-state characters and equiprobable character states, and (C) 40 binary characters and differing probabilities of states 0 and 1. (A) and (B) used 400 trials; (C) used 2,000. These results correspond to those reported in Figures 1–3 of Peres-Neto and Marques (2000).

        (A) No. of characters   (B) No. of characters   (C) Probability of state 1
Taxa       40         80           40         80           0.5        0.35
  5       0.045      0.018        0.013      0.038        0.039      0.031
 10       0.033      0.050        0.025      0.038        0.037      0.037
 15       0.038      0.033        0.033      0.045        0.035      0.028
 20       0.028      0.038        0.035      0.038        0.030      0.036
 25       0.053      0.063        0.045      0.040        0.043      0.049
 30       0.030      0.045        0.073      0.048        0.033      0.033

Results from the parallel tests using different software were concordant and have been combined to increase sample size. Type 1 error rates of the parsimony PTP, the proportions of tests of randomly generated data yielding significant results (PTP ≤ 0.05), for each of the three sets of simulations are shown in Table 1. The Type 1 error rates are
generally low, in contrast to the high values reported previously. Indeed, it is striking that the rates are lower than the expectation of 5% in the large majority of the simulations. There are no obvious differences in the performance of the test across the sampled combinations of parameters.

DISCUSSION

The results previously reported by Peres-Neto and Marques (2000) appear to support a pessimistic or even nihilistic view of the parsimony PTP test. However, that view is illusory and results from an unfortunate error in the code used to generate random data in their study. Briefly, the error made the probability of assigning a particular character state contingent on the state previously generated. Character data were generated for each “species” in turn, creating sequences of states that were often more similar or more different between species than one would expect by pure chance. The effect of the error would be expected to increase with numbers of species, which explains the finding that rejection of the null hypotheses became easier as the number of species increased (Peres-Neto and Marques, 2000:Figs. 1 and 2). This also explains the reported increase in Type 1 error rates with an increase in the proportion of one of the character states, which increased the chance of generating similar sequences of character states.

In direct contrast to the original results, our estimates indicate that the parsimony PTP is a relatively conservative test statistic. Over a range of numbers of taxa, characters, character states, and relative proportions of character states, the Type 1 error rate is mostly <5%. In only 3 of the 36 simulations did it exceed this error rate, and in each case only marginally so. In none of the 12 simulations using 2,000 trials did the error rate exceed 5%, thus suggesting that the three outliers are attributable to sampling error.

Our results demonstrate that the parsimony PTP test cannot be considered too liberal because of any unacceptably high Type 1 error rate. The present results also make more intuitive sense than those reported in the original study. Indeed, the error in the original study was discovered because the current authors were unable to conceive of any obvious mechanism that would account for the reported results. We have no reason to expect truly random data to generate anything greater than the 5% Type 1 error rate when the statistical test is based on random permutation. This is because what constitutes the “real” data is simply a random choice from the set of all its possible permutations.

An interesting aspect of our results that demands explanation is the relatively low error rates. This deviation from expectations is explained by the discontinuous nature of the distribution of the test statistic. Because parsimony tree lengths are not continuous, there need not be a clear break between the shortest 5% of the tree lengths and the longest 95% of the tree lengths. Rather, some tree lengths may be clustered on the threshold such that <5% of the tree lengths will be shorter than the threshold. In such cases, the observed tree length would need to be among these (<5%) shortest tree lengths to be significant, the Type 1 error rate will be <5%, and the test will be conservative.

To assess this possibility we developed a revised version of the test that expressly accounts for this source of error. A test on discrete data is not fundamentally distinct from a test based on continuous data grouped into discrete bundles. Imagine that members of the bundle clustered on the threshold all have “true” test values (based on an underlying continuous distribution) that range uniformly from the value associated with a “conservative” test to that associated with a “liberal” test (i.e., including and excluding the entire bundle). The midpoint of this range now corresponds to the threshold, so that one half of the members of the bundle are considered to be among the 5% of values in the tail of the overall distribution. When we applied this method of calculating the test statistic, the previous “conservative” nature of the test results disappeared entirely, and the test statistics nominally set at α = 0.05 clustered very close to the nominal value. Indeed, the whole distribution of the test statistic approximated very well to the expected uniform distribution associated with P-values. Note that we would expect a matrix randomization test using test statistics better approximated by a continuous distribution (such as
log-likelihoods) to be almost unbiased (i.e., not conservative).

Matrix randomization tests such as the parsimony PTP seem to have two uses or interpretations. In the first, the emphasis is on the failure to reject the null hypothesis as a justification for discounting phylogenetic relationships based on the data. Passing the test is seen as a minimum requirement of data if one is to invest any confidence at all in the phylogenetic relationships inferred from them (e.g., Alroy, 1994; Wilkinson, 1997, 1998). In the second, the emphasis is on passing the test as justification for placing confidence in the results of whatever phylogenetic inferences are based on the data (e.g., Lee, 2001). The first approach is the more conservative and, we believe, more reasonable. It is not known how much phylogenetic signal is required of data for them to pass the parsimony PTP test or whether this level of signal is sufficient for accurate phylogenies to be expected. Type 2 error rates remain unexplored. However, Slowinski and Crother’s (1998) comparisons with bootstrapping suggest that passing the parsimony PTP cannot generally be assumed to guarantee well-supported phylogenetic hypotheses. Certainly, phylogenetic signals may not be uniformly distributed across a data matrix, and the fact that a given data matrix passes the test does not entail that subsets of it would similarly pass the test (Faith and Cranston, 1991; Fu and Murphy, 1999). In addition, many data sets with nonphylogenetic (but nonrandom) structure are likely to pass the test (Källersjö et al., 1992; Alroy, 1994). Thus we cannot reasonably infer that the data passing a PTP test support well-founded inferences or even are “phylogenetically well structured” (e.g., Lee, 2001). Other approaches should be used to investigate the strength of support for relationships inferred from data that have passed the parsimony PTP test. From a conservative perspective, where avoiding ill-founded hypotheses of relationships is deemed most important, the possibility that the parsimony PTP may be slightly conservative is not a problem. Given that the use of the PTP is to protect us from poorly founded inferences, the low Type 1 error rate simply means that a greater degree of protection than the nominal 5% is being provided, although this might also imply that the test has lower power. If the conservativeness of the test is considered problematic, then the revised version of the test we have described can be used.

ACKNOWLEDGMENTS

Thanks are due to Joe Thorley for enlightening discussion of the importance of tree length distributions being discontinuous and to Mike Sanderson and an anonymous reviewer for suggesting improvements to the manuscript. P. R. P.-N. was funded by a CNP-q fellowship.

REFERENCES

ALROY, J. 1994. Four permutation tests for the presence of phylogenetic structure. Syst. Biol. 43:430–437.
ARCHIE, J. W. 1989. A randomisation test for phylogenetic information in systematic data. Syst. Zool. 38:239–252.
CARPENTER, J. M., P. A. GOLOBOFF, AND J. S. FARRIS. 1998. PTP is meaningless, T-PTP is contradictory: A reply to Trueman. Cladistics 14:105–116.
FAITH, D. P., AND P. S. CRANSTON. 1991. Could a cladogram this short have arisen by chance alone? On permutation tests for cladistic structure. Cladistics
FARRIS, J. S. 1988. Hennig86, version 1.5. Distributed by the author, Port Jefferson Station, New York.
FU, J. Z., AND R. W. MURPHY. 1999. Discriminating and locating character covariance: An application of permutation tail probability (PTP) analysis. Syst. Biol. 48:380–395.
KÄLLERSJÖ, M., J. S. FARRIS, A. G. KLUGE, AND C. BULT. 1992. Skewness and permutation. Cladistics 8:275–287.
LEE, M. S. Y. 2001. Molecules, morphology and the monophyly of diapsid reptiles. Contrib. Zool. 70:1–22.
MANLY, B. F. J. 1991. Randomization and Monte Carlo methods in biology. Chapman and Hall, London.
PERES-NETO, P. R., AND F. MARQUES. 2000. When are random data not random, or is the PTP test useful?
SANDERSON, M. J., AND M. J. DONOGHUE. 1989. Patterns of variation in levels of homoplasy. Evolution 43:1781–1795.
SIDDALL, M. E. 2001. Computer-intensive randomization in systematics. Cladistics 17:S35–S52.
SLOWINSKI, J. B., AND B. I. CROTHER. 1998. Is the PTP test useful? Cladistics 14:297–302.
SWOFFORD, D. L. 1998. PAUP*. Phylogenetic analysis using parsimony (and other methods). Version 4.0. Sinauer Associates, Sunderland, Massachusetts.
WILKINSON, M. 1997. On phylogenetic relationships within Dendrotriton (Amphibia: Caudata: Plethodontidae): Is there sufficient evidence? Herpetol. J. 7:55–65.
WILKINSON, M. 1998. Split support and split conflict randomization tests in phylogenetic inference. Syst. Biol.

First submitted 19 July 2001; revision submitted 17 December 2001; final acceptance 17 December 2002
Associate Editor: Mike Sanderson
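
The parsimony PTP and the revised version of the test described in this paper can be illustrated with a short Python sketch. It is not the software used in the study. The function names (permute_characters, ptp, revised_ptp) are hypothetical, and the sketch assumes that MPT lengths for the original and permuted matrices have already been obtained from a parsimony program. The text gives no explicit formula for the revised test, so the half-weighting of tied tree lengths below is one plausible reading of "one half of the members of the bundle", not the authors' definition.

```python
import random


def permute_characters(matrix, rng):
    """Shuffle the states of each character (column) independently across
    taxa (rows). Within-character state counts are preserved; congruence
    between characters is reduced to chance levels."""
    n_taxa = len(matrix)
    n_chars = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_chars)]
    for column in columns:
        rng.shuffle(column)
    return [[column[i] for column in columns] for i in range(n_taxa)]


def ptp(observed_length, permuted_lengths):
    """Proportion of data sets (real and permuted) whose MPTs are as short
    as or shorter than those of the original data. The original data set
    is counted in both numerator and denominator."""
    as_short = sum(1 for x in permuted_lengths if x <= observed_length)
    return (as_short + 1) / (len(permuted_lengths) + 1)


def revised_ptp(observed_length, permuted_lengths):
    """Assumed form of the revised test: data sets tied with the observed
    length (including the original itself) count one half each."""
    shorter = sum(1 for x in permuted_lengths if x < observed_length)
    tied = sum(1 for x in permuted_lengths if x == observed_length) + 1
    return (shorter + 0.5 * tied) / (len(permuted_lengths) + 1)


# Made-up tree lengths for nine permuted matrices; observed length is 10.
lengths = [9, 10, 10, 11, 12, 12, 13, 14, 15]
print(ptp(10, lengths))          # 0.4
print(revised_ptp(10, lengths))  # 0.25
```

Because tree lengths are integers, ties with the observed length are common, and the revised value is never larger than the standard PTP. This matches the direction of the correction described in the Discussion.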