THE DEVELOPMENT OF A DARWINIAN SCHEME FOR
ESTIMATING RANKINGS UNDER A STOCHASTIC ORDERING
Rafe M. J. Donahue, GlaxoWellcome, Inc.
P.O. Box 13398, Research Triangle Park, NC 27709
Keywords: Rankings, Darwinian, Bayesian, Football, Sports

Introduction and the original question

The basis of this Darwinian ranking scheme comes essentially from a statistics question arising from the NCAA Division I basketball tournament. In 1984 and 1985, while I was still in college at the University of Dayton (UD), the school basketball team (the Flyers) lost both years in the NCAA tournament to the eventual champions: to Georgetown in 1984 and Villanova in 1985. Since the tournament is single-elimination, how is one to know that UD was not better than the five other teams that had also lost to the eventual champion each year? Even though the tournament is seeded based upon records heading into the tournament, is it possible for the Flyers to make a claim to being essentially tied for second place, seeing as their only loss in the tournament came to the champion? Would a different layout of seedings have allowed UD to advance all the way to the title game before falling?

The original solution

While these are obviously the musings of a wishful alum, the question, at least to me, remained not formally answered. To answer it, I reduced the question to that of ranking teams following a four-team single-elimination tournament, since a 64-team tournament is just a composition of four-team tournaments.

Suppose that the teams in this tournament are A, B, C, and D and that A beats B and C beats D in the first round and that A beats C in the championship game. Since B and C both lose to A on A’s run to the title, can they both lay claim to a share of second place?

Consider the 4! = 24 possible permutations that exist to rank the four teams and, for the time being, assume that team A being ranked “better” than team B implies that team A will beat team B. Of the 24 possible orderings of the teams, which ones are permissible given the game outcomes that are witnessed?

Table 1 shows the 24 permutations and their permissibility given the data. (In the permutations, “best” to “worst” flows left to right.) Note that only three permutations are consistent with the data; the other 21 permutations don’t hold up to the data that we see. That is, only the three permutations marked Yes in Table 1 (ABCD, ACBD, and ACDB) have A better than B, C better than D, and A better than C.

Table 1. Possible permutations of four teams and whether or not they can be supported by the data

Permutation  Supported?
ABCD         Yes
ABDC         No
ACBD         Yes
ACDB         Yes
ADBC         No
ADCB         No
BACD         No
BADC         No
BCAD         No
BCDA         No
BDAC         No
BDCA         No
CABD         No
CADB         No
CBAD         No
CBDA         No
CDAB         No
CDBA         No
DABC         No
DACB         No
DBAC         No
DBCA         No
DCAB         No
DCBA         No

How, then, do we estimate the true ordering? Consider each of the four teams and, under the three tenable orderings, what place each team holds. Team A is ranked first in all three of the orderings (and they win the tournament), so A ought to be ranked first. But what of the others? Team C is ranked second twice and third once; team B is ranked second, third, and fourth; team D is ranked third once and fourth twice. We can then average the ranks across the three potential orderings and compute average ranks of 1, 2 1/3, 3, and 3 2/3 for teams A, C, B, and D, respectively.

This, of course, looks just like a posterior Bayes estimate (PBE) of the true order. To construct the
formal statistical framework is rather simple.

Let O be the set of all possible orderings of the four teams. Thus, there are 24 elements of O. These are the 24 four-tuples listed in Table 1. Define a uniform prior g(·) on O such that g(o) = 1/24 for each of the 24 o’s in O. Then use Bayes’ Theorem to update the prior after each game’s datum is available. In this case Bayes’ Theorem takes the form

$$
\Pr(o \mid \text{datum}) = \frac{\Pr(\text{datum} \mid o)\, g(o)}{\sum_{o' \in O} \Pr(\text{datum} \mid o')\, g(o')}.
$$

The expression Pr(datum | o), “the probability of the datum given the ranking”, is just the probability of observing the game datum that we actually witness conditional on assuming the ordering o is true. Thus, we march across all 24 permutations and compute the value of Pr(o | datum) for each of the three games we witness in the tournament. That is, for o = ABCD, Pr(A beats B | o) = 1 since if the ordering o = ABCD is assumed to be true, then team A should beat team B. On the other hand, for o = CBDA, Pr(A beats B | o) = 0 since o = CBDA implies that team B is better than team A. (The obvious hitch here is that this assumption, that A being better than B implies that A will beat B when the two teams play, is suspect. The next section will deal with this issue.)

The construction of the posterior distribution is carried out in Table 2. We see that the three elements of O that are non-zero in the final posterior distribution are the three elements that were found to support the collective game data in Table 1. Taking expectations of the individual rankings relative to the posterior distribution yields the estimates computed above.

Table 2. Construction of the posterior distribution

Permutation  Prior g(o)  Posterior after  Posterior after  Posterior after
                         A beats B        C beats D        A beats C
ABCD         1/24        1/12             1/6              1/3
ABDC         1/24        1/12             0                0
ACBD         1/24        1/12             1/6              1/3
ACDB         1/24        1/12             1/6              1/3
ADBC         1/24        1/12             0                0
ADCB         1/24        1/12             0                0
BACD         1/24        0                0                0
BADC         1/24        0                0                0
BCAD         1/24        0                0                0
BCDA         1/24        0                0                0
BDAC         1/24        0                0                0
BDCA         1/24        0                0                0
CABD         1/24        1/12             1/6              0
CADB         1/24        1/12             1/6              0
CBAD         1/24        0                0                0
CBDA         1/24        0                0                0
CDAB         1/24        1/12             1/6              0
CDBA         1/24        0                0                0
DABC         1/24        1/12             0                0
DACB         1/24        1/12             0                0
DBAC         1/24        0                0                0
DBCA         1/24        0                0                0
DCAB         1/24        1/12             0                0
DCBA         1/24        0                0                0

So, extending the results to a 64-team tournament, the answer to the original question for faithful fans of the Flyers is “No, UD cannot lay a claim to a tie for second place.” (Rats.)

Learnings from the original solution

A few key points can be garnered from the development of the solution to that original question.

Taking a collection of possible rankings and “running” a collection of games through Bayes’ Theorem produces a posterior distribution that can be used to estimate the true order. Thus, this methodology could be used, with the data from games played over the course of an entire season, to estimate a true ordering of a large number of teams, say, an entire collection of college or professional sports teams.

A problem exists, however, with the implementation as it was done for the four-team tournament. This problem involves what might be viewed as non-transitivity. Earlier, it was stated that team A being better than team B implied that team A will beat team B. If this is the case, then how do we rank teams if, over the course of the season, A beats B, B beats C, but C beats A? This possibility of non-transitivity of game results requires an adjustment in how we view the concept of “better”. We need to view the consequence of team A being better than team B to be that A is more likely (but not certain) to beat team B than team B is to beat team A.

Thus, our values of Pr(datum | o) need to take on values other than zero or one. These probabilities will need to reflect the comparative differences that exist in teams ranked at different levels. For example, for the four-team tournament, a matrix of probabilities might be used to define how likely teams ranked in the four positions are to beat each other. Such a matrix, call it P, might take the form

$$
P = \begin{array}{c|cccc}
  & 1 & 2 & 3 & 4 \\ \hline
1 & - & .60 & .75 & .90 \\
2 & .40 & - & .60 & .75 \\
3 & .25 & .40 & - & .60 \\
4 & .10 & .25 & .40 & -
\end{array},
$$
implying that the probability that, for a given element o, the team ranked first beats the team ranked second is $p_{12} = 0.60$. Note that $p_{ij} = 1 - p_{ji}$. Using the examples examined previously, for o = ABCD, Pr(A beats B | o) = 0.60 since A is ranked first and B is ranked second; for o = CBDA, we have that Pr(A beats B | o) = 0.25 since B is ranked second but A is ranked fourth. The four-team tournament data that was presented earlier, but now evaluated against this matrix, is shown in Table 3. Using this matrix P, the PBEs of the ranks for teams A, C, B, and D are 1.70, 2.47, 2.89, and 2.95, respectively. One may also note that, for this matrix P, the maximum likelihood estimate (ACBD, marked with * in the table, with probability of 0.1125) also ranks the teams in the same order as the PBE.

Table 3. Construction of the posterior distribution for the example nondegenerate matrix P (MLE marked with *)

Permutation  Prior g(o)  Posterior after  Posterior after  Posterior after
                         A beats B        C beats D        A beats C
ABCD         .0417       .0500            .0600            .0900
ABDC         .0417       .0500            .0400            .0720
ACBD         .0417       .0625            .0938            .1125 *
ACDB         .0417       .0750            .0900            .1080
ADBC         .0417       .0625            .0313            .0563
ADCB         .0417       .0750            .0600            .0900
BACD         .0417       .0333            .0400            .0480
BADC         .0417       .0333            .0267            .0400
BCAD         .0417       .0208            .0313            .0250
BCDA         .0417       .0083            .0100            .0050
BDAC         .0417       .0208            .0104            .0125
BDCA         .0417       .0083            .0067            .0053
CABD         .0417       .0500            .0900            .0720
CADB         .0417       .0625            .0938            .0750
CBAD         .0417       .0333            .0600            .0300
CBDA         .0417       .0208            .0313            .0063
CDAB         .0417       .0500            .0600            .0300
CDBA         .0417       .0333            .0400            .0080
DABC         .0417       .0500            .0100            .0150
DACB         .0417       .0625            .0313            .0375
DBAC         .0417       .0333            .0067            .0080
DBCA         .0417       .0208            .0104            .0083
DCAB         .0417       .0500            .0400            .0320
DCBA         .0417       .0333            .0267            .0133

Note further that this particular matrix P possesses a certain degree of what we might call stationarity. That is, $p_{ij} = p_{i+k,\,j+k}$ for appropriate values of k. In this case, it is then possible to define these probabilities of victory only in terms of the differences between the ranks. Thus, define this probability of victory function as

$$
p(i) = \begin{cases}
.60, & \text{if } i = 1, \\
.75, & \text{if } i = 2, \\
.90, & \text{if } i = 3,
\end{cases}
$$

where i is the difference between the ranks.

Of course, this definition of the probability of victory function is open to debate and is dependent on the context being used. It should be noted, however, that any constructed p(·) ought to possess some essential properties. First, p(0) = 0.50. Thus, if two teams are evenly matched, the likelihood of victory should be the same for each team. Secondly, p(i) ≥ p(j) for i ≥ j and p(i) → 1 as i → ∞, implying that the probability of a victory gets larger when teams are ranked further apart and that if teams are ranked very far apart, the chance of victory for the better-ranked team should go to 1.

Furthermore, the function p(·) can be defined to be measured relative to point-spreads. That is, instead of computing the probability that team A simply beats team B given a particular element of O, we can compute the probability that the difference in score between team A and team B is d points given a particular element of O. The probability of victory function then becomes a probability of score difference function but the general mathematics stays the same.

The new goal

Based on the above developments, I sought to develop procedures for ranking NCAA Division I basketball teams. This proved impossible. There are currently over 300 teams competing in Division I. Thus, to carry out the computations would require updating the posterior distribution for each of more than 300! elements of the set O. Computationally, this was too intense. Even a small number of teams is difficult. Nine teams in the Atlantic Coast Conference (ACC) requires 9! = 362,880 elements of O, updated for each of 72 conference games. A league with twelve teams would require over 479,000,000 elements in O.

The idea of a random sample of elements of O being used to represent the whole set O hit me one day while mowing the lawn. Essentially, I figured I could take a workably-sized subset of the elements of O to which to assign non-zero prior probabilities and force the complementary elements to have zero prior probability. Since these elements with zero prior probability would never move off of zero in the computations, they could be skipped in updating the prior after each game. And since football season was then upon me, I sought to implement this scheme for the 1998 college football season.
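The restricted-prior idea can be sketched in a few lines of Python. The sketch is illustrative only and is not the author's implementation: the function names (prob_beats, update, posterior_mean_ranks) are made up for this example, and it assumes the example probabilities .60, .75, and .90 from the four-team matrix P. Weights are kept in a dictionary keyed by ordering, so any ordering left out of the dictionary implicitly has zero prior probability and is never touched during an update.

```python
from itertools import permutations

# Probability that the better-ranked team wins, keyed by rank difference.
# These are the example values from the four-team matrix P (an assumption
# for this sketch; any p(.) with the stated properties could be swapped in).
P_BY_DIFF = {1: 0.60, 2: 0.75, 3: 0.90}


def prob_beats(order, winner, loser):
    """Pr(winner beats loser | order), where order lists teams best to worst."""
    diff = order.index(loser) - order.index(winner)
    return P_BY_DIFF[diff] if diff > 0 else 1.0 - P_BY_DIFF[-diff]


def update(weights, winner, loser):
    """One Bayes update over the orderings that still carry mass."""
    new = {o: w * prob_beats(o, winner, loser)
           for o, w in weights.items() if w > 0}
    total = sum(new.values())
    return {o: w / total for o, w in new.items()}


def posterior_mean_ranks(weights):
    """Posterior expected rank (1 = best) of each team."""
    teams = sorted(next(iter(weights)))
    return {t: sum(w * (o.index(t) + 1) for o, w in weights.items())
            for t in teams}


# Uniform prior over all 24 orderings, then the three tournament games.
orderings = ["".join(p) for p in permutations("ABCD")]
weights = {o: 1 / len(orderings) for o in orderings}
for winner, loser in [("A", "B"), ("C", "D"), ("A", "C")]:
    weights = update(weights, winner, loser)

pbe = posterior_mean_ranks(weights)
best = max(weights, key=weights.get)
print(best, round(weights[best], 4), {t: round(r, 2) for t, r in pbe.items()})
```

Seeded with all 24 orderings, as here, the sketch should agree with the final column of Table 3. Seeding `weights` with only a sampled subset of O, with whatever prior weights the sampling scheme implies, gives the shortcut described above.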
Since this prior was no longer non-informative, I decided to be judicious in the random selection of elements to include in the sample; that is, elements of O were not equally likely to be selected. It made little sense to me to select elements of O that put, for example, Florida State towards the bottom and Fordham near the top. So, permutations were selected in a fashion that was weighted relative to some composite rankings available from major college football publications.

Obstacles and the Darwinian solution

This brilliant idea failed miserably.

In testing the methodology on the 1998 college football season, there were over 230 teams competing in Division I. In running the game results through the system with a sample of size 100,000 elements of O, essentially all of the individual elements were not good fits to the data. Any particular element that happened to have perhaps at least a grain of truth absorbed very quickly nearly all of the probability mass. Thus, after running a couple of weeks of data through the system, one element carried probability of 0.99999 while the other 99,999 collectively held 0.00001. This made the PBE essentially equal to the ranking held in that one element that had absorbed all the mass, and this element, while a slightly better estimate than the others in the sample, was still far from the truth.

An effort at solving this problem, namely putting a lower bound, say, 1 × 10^−10 or so, on the mass each element could carry, worked in keeping the mass from collecting on one best element, but the system very rapidly lost information that was contained in the game scores it had been fed earlier. That is, the system’s memory of who had beaten whom in prior weeks was lost very quickly when the lower bound was used. This, obviously, was not a suitable solution.

Instead of using a lower bound on the mass each element could carry, I chose to set to zero the mass for those elements whose mass dropped below some threshold. Therefore, elements that were shown to be very poor fits to the data were “killed off” and no longer confused the process by absorbing any mass. Once enough game data had been run through the system that the surviving population dropped below a threshold number of elements, the PBE based on the surviving elements was computed, and a new generation of elements similar to that PBE was created and added to the population of non-zero-weighted elements. This is the concept of the Darwinian scheme. The elements that survived the environment (game data) long enough and well enough got to pass on their information to the next generation of elements.

This concept worked quite well and was used to rank the Division I college football teams in 1998.

Adjustments and refinements

A number of issues appeared during the running of the 1998 college football season.

Most obvious was a problem with restricting the ranks to integers between 1 and n, where n is the number of teams under consideration. The problem showed up when teams near the top of the rankings played each other. For example, if the team ranked 2 played the team ranked 3 and team 2 won, then team 3 would take a big slide in the following week’s rankings since there was little room for team 2 to move up.

In 1999, to solve this problem, I moved from an absolute ranking system to one that uses ratings based on ranks from 1 to n. At the start of the season, teams were rated from 1 to n, relative to the collective prior information that I had available. In this new scheme, however, I removed the restriction of using only integers from 1 to n and allowed the ratings to slide “up” past 1 and “down” past n. This allows really good teams to distance themselves from the rest of the field and keeps highly ranked teams from paying too high a penalty for losing to another highly ranked team. I call this allowance for teams to move to negative ratings the floating origin adjustment.

In an attempt to minimize the effect of the prior, I also shrank the distance from the best to the worst teams for initial ratings from 1 to n down to 1 to kn, where k is some number between 0 and 1. If we set k to zero, then the prior is essentially non-informative and all teams have the same rating at the beginning of the season. If k is set to unity, then we get them spaced from 1 to n. I call the factor k the non-unitarian prior stretch factor.

Setting the non-unitarian prior stretch factor to 70% for the 1999 season, I generated weekly rankings (starting on October 3) for all 236 NCAA Division I football teams. The top five schools at the end of the season are presented in Table 4.

Note that Nebraska and Florida State are nearly in a statistical dead heat, as are Tennessee, Virginia Tech, and Wisconsin.

College football rankings using an implementation of the Darwinian ranking scheme for NCAA Division I schools 1998, 1999, and 2000 can be found on the web at home.earthlink.net/~rafedonahue/rankings.

Work still needing to be done

One substantial problem still remains: blowouts.
The system I use is based on the difference in points scored by the two teams. Noting that winning 42–7 is probably just as convincing as winning 61–3, when computing the probability of the score difference given the particular element of O under consideration, I put a lower bound on the height of the density in the tails of the distribution. The model that I currently use makes the difference in score Gaussian with mean and variance that depend on the difference between the two teams’ ratings. The mean increases as a horizontal parabola from 0 to 84 points as the distance between the ratings goes from 0 to 237, while the standard deviation increases linearly from 7 at 0 to 28 at 237. This implies that 95% of games between evenly-matched teams should have score differences within ±14 points and 95% of games between the best and worst teams in college football should yield score differences between 28 and 140 points. Any value of the density smaller than 0.0004 is given a value of 0.0004; that translates to effectively truncating the distribution at approximately 4.3 standard deviations from the mean. This works well to keep the good teams from completely obliterating the bad teams; however, it doesn’t work well when two similarly ranked teams play each other and one gets blown out. The problem with this situation has to do with the sample elements that are alive at the time of the blowout. Elements that are well-suited to the data survive. But problems arise when there are no elements available that are well-suited to the data, as is the case when closely rated teams play in a blowout. In this case, essentially all the elements in the population receive approximately the same adjustment in probability (the 0.0004 lower bound threshold discussed above) and there is no net effect on the overall ratings. Increasing the diversity of the population alive at any given time would allow blowouts to be more properly treated but also increases the variability of the estimates. Attempts at finding a compromise when dealing with blowouts will be part of the changes for the 2000 scheme.

Table 4. Top five teams for 1999 college football

Method            Neb.  FSU   Tenn.  Virginia Tech  Wisc.
Darwinian rating  −25   −24   −13    −13            −12
Darwinian rank    1     2     3      4              5
AP                3     1     9      2              4
Coaches’          2     1     9      3              4
Massey            2     1     8      3              6
Sagarin           2     1     9      3              8

Discussion and conclusions

Attempts to rank sports teams are problematic because no clear definition seems to exist for what is meant by the “best” team. Is it the team you would least like to play next week? Is it the team that is playing best right now? Is it the team that has played the best throughout the year? What do we do about losses (injury, suspension, arrest, etc.) of key players? Should this be included?

I like the idea of the best team being the one that, if you are a coach, you would least like to play next Saturday, and I think that one needs to view “better than” in terms of the probability of victory.

The Darwinian scheme carries with it several positive features. It allows for inclusion of all the game data in a season, yet it creates natural weightings that give more recent games more weight. The floating origin adjustment allows teams to separate themselves from the pack by playing opponents teams. Teams with weak schedules are exposed by not being able to distance themselves from the other teams in the league. The scheme also considers, in a roundabout way, every possible ordering of the teams, so if there is really a “true” ordering, this scheme actually has a chance to find it.

Of course, the downside of the system, as with any Bayesian statistics, is the choice of the prior and of the model for computing the probability of the data given the element of O, and the valid complaints that some may have concerning that selection. Furthermore, some might argue that the only statistic of interest is which team won: it doesn’t matter if a team wins 31–30 or 63–3, it is still a victory. These issues might be best settled on the call-in radio shows.

There is also another possible use for this methodology in the ranking context, however. A bevy of rating and ranking systems are used every year, each with its own strengths and weaknesses. It is possible, certainly in theory and most likely in practice, to use this Bayesian or Darwinian methodology to compare a number of systems under discussion. That is, which of the 50 to 100 systems used to rank college football best matches the data? One could assign a uniform prior across the systems and then run the game data over them. Selecting the system with the highest posterior probability would select a winner. On the other hand, the PBE of the ranking would be a weighted average of all these systems and certainly could be argued to be more accurate than the arbitrary system the Bowl Championship Series (BCS) is using today.
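The system comparison suggested in the final paragraph might be sketched as follows. Everything specific here is an assumption made for illustration: the three "systems" and their rankings are invented, the games are the four-team tournament results used earlier, and p_win is a hypothetical probability of victory function chosen only because it satisfies the properties listed for p(·), namely p(0) = 0.5, nondecreasing, and tending to 1. Log-likelihoods are summed and then normalized with the usual log-sum-exp shift so that a full season of games does not underflow.

```python
import math


def p_win(rank_gap, scale=3.0):
    """Hypothetical probability that the better-ranked team wins.

    Chosen only for illustration: p_win(0) = 0.5, it is nondecreasing in
    the gap, and it tends to 1 as the gap grows.
    """
    return 1.0 - 0.5 * math.exp(-rank_gap / scale)


def log_likelihood(ranking, games):
    """Sum of log Pr(result | ranking); ranking maps team -> rank (1 = best)."""
    total = 0.0
    for winner, loser in games:
        gap = ranking[loser] - ranking[winner]
        p = p_win(gap) if gap >= 0 else 1.0 - p_win(-gap)
        total += math.log(p)
    return total


def system_posterior(systems, games):
    """Posterior over systems under a uniform prior, computed in log space."""
    logs = {name: log_likelihood(r, games) for name, r in systems.items()}
    peak = max(logs.values())  # shift by the max so exp() cannot underflow to all zeros
    unnorm = {name: math.exp(v - peak) for name, v in logs.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}


def blended_ranks(systems, post):
    """Posterior-weighted average rank of each team across the systems."""
    teams = next(iter(systems.values()))
    return {t: sum(post[s] * systems[s][t] for s in systems) for t in teams}


# Invented rankings standing in for published systems.
systems = {
    "system_1": {"A": 1, "B": 2, "C": 3, "D": 4},
    "system_2": {"A": 1, "C": 2, "B": 3, "D": 4},
    "system_3": {"D": 1, "C": 2, "B": 3, "A": 4},
}
games = [("A", "B"), ("C", "D"), ("A", "C")]  # (winner, loser)

post = system_posterior(systems, games)
winner = max(post, key=post.get)
blend = blended_ranks(systems, post)
print(winner, {s: round(p, 3) for s, p in post.items()})
```

Here `winner` is the highest-posterior system and `blend` is the posterior-weighted average ranking. With real systems, tied ranks or teams missing from one system would need handling that this sketch omits.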