THE DEVELOPMENT OF A DARWINIAN SCHEME FOR

                                  Rafe M. J. Donahue, GlaxoWellcome, Inc.
                              P.O. Box 13398, Research Triangle Park, NC 27709

Keywords: Rankings, Darwinian, Bayesian, Football, Sports

Introduction and the original question
   The basis of this Darwinian ranking scheme comes essentially from a
statistics question arising from the NCAA Division I basketball
tournament. In 1984 and 1985, while I was still in college at the
University of Dayton (UD), the school basketball team (the Flyers) lost
both years in the NCAA tournament to the eventual champions: to
Georgetown in 1984 and Villanova in 1985. Since the tournament is
single-elimination, how is one to know that UD was not better than the
five other teams that had also lost to the eventual champion each year?
Even though the tournament is seeded based upon records heading into the
tournament, is it possible for the Flyers to make a claim to being
essentially tied for second place, seeing as their only loss in the
tournament came to the champion? Would a different layout of seedings
have allowed UD to advance all the way to the title game before falling?

The original solution
   While these are obviously the musings of a wishful alum, the question,
at least to me, remained not formally answered. To answer it, I reduced
the question to that of ranking teams following a four-team
single-elimination tournament, since a 64-team tournament is just a
composition of four-team tournaments.
   Suppose that the teams in this tournament are A, B, C, and D and that
A beats B and C beats D in the first round and that A beats C in the
championship game. Since B and C both lose to A on A’s run to the title,
can they both lay claim to a share of second place?
   Consider the 4! = 24 possible permutations that exist to rank the four
teams and, for the time being, assume that team A being ranked “better”
than team B implies that team A will beat team B. Of the 24 possible
orderings of the teams, which ones are permissible given the game
outcomes that are witnessed?
   Table 1 shows the 24 permutations and their permissibility given the
data. (In the permutations, “best” to “worst” flows left to right.) Note
that only three permutations are consistent with the data; the other 21
permutations don’t hold up to the data that we see. That is, only the
three permutations marked “Yes” (ABCD, ACBD, and ACDB) in Table 1 have A
better than B, C better than D, and A better than C. How, then, do we
estimate the true ordering?

Table 1. Possible permutations of four teams and whether or not they can
be supported by the data

Permutation   Supported?
ABCD          Yes
ABDC          No
ACBD          Yes
ACDB          Yes
ADBC          No
ADCB          No
BACD          No
BADC          No
BCAD          No
BCDA          No
BDAC          No
BDCA          No
CABD          No
CADB          No
CBAD          No
CBDA          No
CDAB          No
CDBA          No
DABC          No
DACB          No
DBAC          No
DBCA          No
DCAB          No
DCBA          No

   Consider each of the four teams and, under the three tenable
orderings, what place each team holds. Team A is ranked first in all
three of the orderings (and they win the tournament), so A ought to be
ranked first. But what of the others? Team C is ranked second twice and
third once; team B is ranked second, third, and fourth; team D is ranked
third once and fourth twice. We can then average the ranks across the
three potential orderings and compute average ranks of 1, 2 1/3, 3, and
3 2/3 for teams A, C, B, and D, respectively.
   This, of course, looks just like a posterior Bayes estimate (PBE) of
the true order. To construct the
formal statistical framework is rather simple.
   Let O be the set of all possible orderings of the four teams. Thus,
there are 24 elements of O. These are the 24 four-tuples listed in
Table 1. Define a uniform prior g(·) on O such that g(o) = 1/24 for each
of the 24 o’s in O. Then use Bayes’ Theorem to update the prior after
each game’s datum is available. In this case Bayes’ Theorem takes the
form

\[
\Pr(o \mid \text{datum}) =
  \frac{\Pr(\text{datum} \mid o)\, g(o)}
       {\sum_{o' \in O} \Pr(\text{datum} \mid o')\, g(o')} .
\]

The expression Pr(datum | o), “the probability of the datum given the
ranking”, is just the probability of observing the game datum that we
actually witness conditional on assuming the ordering o is true. Thus, we
march across all 24 permutations and compute the value of Pr(o | datum)
for each of the three games we witness in the tournament. That is, for
o = ABCD, Pr(A beats B | o) = 1 since if the ordering o = ABCD is assumed
to be true, then team A should beat team B. On the other hand, for
o = CBDA, Pr(A beats B | o) = 0 since o = CBDA implies that team B is
better than team A. (The obvious hitch here is that this assumption, that
A being better than B implies that A will beat B when the two teams play,
is suspect. The next section will deal with this issue.)
   The construction of the posterior distribution is carried out in
Table 2. We see that the three elements of O that are non-zero in the
final posterior distribution are the three elements that were found to
support the collective game data in Table 1. Taking expectations of the
individual rankings relative to the posterior distribution yields the
estimates computed above.

Table 2. Construction of the posterior distribution

Permutation   Prior   Posterior   Posterior   Posterior
              g(o)    after A     after C     after A
                      beats B     beats D     beats C
ABCD          1/24    1/12        1/6         1/3
ABDC          1/24    1/12        0           0
ACBD          1/24    1/12        1/6         1/3
ACDB          1/24    1/12        1/6         1/3
ADBC          1/24    1/12        0           0
ADCB          1/24    1/12        0           0
BACD          1/24    0           0           0
BADC          1/24    0           0           0
BCAD          1/24    0           0           0
BCDA          1/24    0           0           0
BDAC          1/24    0           0           0
BDCA          1/24    0           0           0
CABD          1/24    1/12        1/6         0
CADB          1/24    1/12        1/6         0
CBAD          1/24    0           0           0
CBDA          1/24    0           0           0
CDAB          1/24    1/12        1/6         0
CDBA          1/24    0           0           0
DABC          1/24    1/12        0           0
DACB          1/24    1/12        0           0
DBAC          1/24    0           0           0
DBCA          1/24    0           0           0
DCAB          1/24    1/12        0           0
DCBA          1/24    0           0           0

   So, extending the results to a 64-team tournament, the answer to the
original question for faithful fans of the Flyers is “No, UD cannot lay a
claim to a tie for second place.” (Rats.)

Learnings from the original solution
   A few key points can be garnered from the development of the solution
to that original question.
   Given a collection of possible rankings, using Bayes’ Theorem and
“running” a collection of games through it produces a posterior
distribution that can be used to estimate the true order. Thus, this
methodology could be used, with the data from games played over the
course of an entire season, to estimate a true ordering of a large number
of teams, say, an entire collection of college or professional sports
teams.
   A problem exists, however, with the implementation as it was done for
the four-team tournament. This problem involves what might be viewed as
non-transitivity. Earlier, it was stated that team A being better than
team B implied that team A will beat team B. If this is the case, then
how do we rank teams if, over the course of the season, A beats B, B
beats C, but C beats A? This possibility of non-transitivity of game
results requires an adjustment in how we view the concept of “better”. We
need to view the consequence of team A being better than team B to be
that A is more likely (but not certain) to beat team B than team B is to
beat team A. Thus, our values of Pr(datum | o) need to take on values
other than zero or one. These probabilities will need to reflect the
comparative differences that exist in teams ranked at different levels.
For example, for the four-team tournament, a matrix of probabilities
might be used to define how likely teams ranked in the four positions are
to beat each other. Such a matrix, call it P, might take the form

\[
P = \begin{array}{c|cccc}
      & 1   & 2   & 3   & 4   \\ \hline
    1 & -   & .60 & .75 & .90 \\
    2 & .40 & -   & .60 & .75 \\
    3 & .25 & .40 & -   & .60 \\
    4 & .10 & .25 & .40 & -
    \end{array} ,
\]
implying that the probability that, for a given element o, the team
ranked first beats the team ranked second is $p_{12} = 0.60$. Note that
$p_{ij} = 1 - p_{ji}$. Using the examples examined previously, for
o = ABCD, Pr(A beats B | o) = 0.60 since A is ranked first and B is
ranked second; for o = CBDA, we have that Pr(A beats B | o) = 0.25 since
B is ranked second but A is ranked fourth. The four-team tournament data
that were presented earlier, but now evaluated against this matrix, are
shown in Table 3. Using this matrix P, the PBEs of the ranks for teams A,
C, B, and D are 1.70, 2.47, 2.89, and 2.95, respectively. One may also
note that, for this matrix P, the maximum likelihood estimate (the one
marked with * in the table, with probability of 0.1125) also ranks the
teams in the same order as the PBE.

Table 3. Construction of the posterior distribution for the example
nondegenerate matrix P (MLE marked with *)

Permutation   Prior   Posterior   Posterior   Posterior
              g(o)    after A     after C     after A
                      beats B     beats D     beats C
ABCD          .0417   .0500       .0600       .0900
ABDC          .0417   .0500       .0400       .0720
ACBD *        .0417   .0625       .0938       .1125
ACDB          .0417   .0750       .0900       .1080
ADBC          .0417   .0625       .0313       .0563
ADCB          .0417   .0750       .0600       .0900
BACD          .0417   .0333       .0400       .0480
BADC          .0417   .0333       .0267       .0400
BCAD          .0417   .0208       .0313       .0250
BCDA          .0417   .0083       .0100       .0050
BDAC          .0417   .0208       .0104       .0125
BDCA          .0417   .0083       .0067       .0053
CABD          .0417   .0500       .0900       .0720
CADB          .0417   .0625       .0938       .0750
CBAD          .0417   .0333       .0600       .0300
CBDA          .0417   .0208       .0313       .0063
CDAB          .0417   .0500       .0600       .0300
CDBA          .0417   .0333       .0400       .0080
DABC          .0417   .0500       .0100       .0150
DACB          .0417   .0625       .0313       .0375
DBAC          .0417   .0333       .0067       .0080
DBCA          .0417   .0208       .0104       .0083
DCAB          .0417   .0500       .0400       .0320
DCBA          .0417   .0333       .0267       .0133

   Note further that this particular matrix P possesses a certain degree
of what we might call stationarity. That is, $p_{ij} = p_{i+k,\,j+k}$ for
appropriate values of k. In this case, it is then possible to define
these probabilities of victory only in terms of the differences between
the ranks. Thus, define this probability of victory function as

\[
p(i) = \begin{cases}
         .60, & \text{if } i = 1, \\
         .75, & \text{if } i = 2, \\
         .90, & \text{if } i = 3,
       \end{cases}
\]

where i is the difference between the ranks.
   Of course, this definition of the probability of victory function is
open to debate and is dependent on the context being used. It should be
noted, however, that any constructed p(·) ought to possess some essential
properties. First, p(0) = 0.50. Thus, if two teams are evenly matched,
the likelihood of victory should be the same for each team. Secondly,
p(i) ≥ p(j) for i ≥ j and p(i) → 1 as i → ∞, implying that the
probability of a victory gets larger when teams are ranked further apart
and that if teams are ranked very far apart, the chance of victory for
the better-ranked team should go to 1.
   Furthermore, the function p(·) can be defined to be measured relative
to point-spreads. That is, instead of computing the probability that team
A simply beats team B given a particular element of O, we can compute the
probability that the difference in score between team A and team B is d
points given a particular element of O. The probability of victory
function then becomes a probability of score difference function but the
general mathematics stays the same.

The new goal
   Based on the above developments, I sought to develop procedures for
ranking NCAA Division I basketball teams. This proved impossible. There
are currently over 300 teams competing in Division I. Thus, to carry out
the computations would require updating the posterior distribution for
each of the more than 300! elements of the set O. Computationally, this
was too intense. Even a small number of teams is difficult. Nine teams in
the Atlantic Coast Conference (ACC) require 9! = 362880 elements of O,
updated for each of 72 conference games. A league with twelve teams would
require over 479,000,000 elements in O.
   The idea of a random sample of elements of O being used to represent
the whole set O hit me one day while mowing the lawn. Essentially, I
figured I could take a workably-sized subset of the elements of O to
which to assign non-zero prior probabilities and force the complementary
elements to have zero prior probability. Since these elements with zero
prior probability would never move off of zero in the computations, they
could be skipped in updating the prior after each game. And since
football season was then upon me, I sought to implement this scheme for
the 1998 college football season.
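   The four-team posterior of Table 3 is small enough to compute by brute
force. The following is a minimal Python sketch, not the code used for the
paper: the function names (win_prob, posterior, expected_ranks) are
illustrative, and the only inputs taken from the text are the values
.60, .75, and .90 of p(i), the uniform prior, and the three game results.

```python
from itertools import permutations

# p(i) from the text: chance that the better-ranked team wins when the
# two teams are i ranks apart.
P_WIN = {1: 0.60, 2: 0.75, 3: 0.90}


def win_prob(order, winner, loser):
    """Pr(winner beats loser | order); `order` lists teams best to worst."""
    diff = order.index(loser) - order.index(winner)
    return P_WIN[diff] if diff > 0 else 1.0 - P_WIN[-diff]


def posterior(teams, games):
    """Start from a uniform prior on all orderings; update game by game."""
    orders = list(permutations(teams))
    post = {o: 1.0 / len(orders) for o in orders}
    for winner, loser in games:
        post = {o: m * win_prob(o, winner, loser) for o, m in post.items()}
        total = sum(post.values())  # denominator of Bayes' Theorem
        post = {o: m / total for o, m in post.items()}
    return post


def expected_ranks(post):
    """Posterior Bayes estimate: the mean rank of each team."""
    teams = next(iter(post))
    return {t: sum(m * (o.index(t) + 1) for o, m in post.items())
            for t in teams}


games = [("A", "B"), ("C", "D"), ("A", "C")]
post = posterior("ABCD", games)
pbe = expected_ranks(post)
best = max(post, key=post.get)  # ordering with the most posterior mass

print("".join(best), round(post[best], 4))
print({team: round(rank, 2) for team, rank in sorted(pbe.items())})
```

   Run as written, the sketch puts the most mass (0.1125) on ACBD and
gives mean ranks of about 1.70, 2.47, 2.89, and 2.95 for A, C, B, and D,
in agreement with Table 3. The same loop over all n! orderings is what
becomes infeasible for nine or twelve teams.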
   Since this prior was no longer non-informative, I decided to be
judicious in the random selection of elements to include in the sample,
that is, elements of O were not equally likely to be selected. It made
little sense to me to select elements of O that put, for example, Florida
State towards the bottom and Fordham near the top. So, permutations were
selected in a fashion that was weighted relative to some composite
rankings available from major college football publications.

Obstacles and the Darwinian solution
   This brilliant idea failed miserably.
   In testing the methodology on the 1998 college football season, there
were over 230 teams competing in Division I. In running the game results
through the system with a sample of size 100,000 elements of O,
essentially none of the individual elements were good fits to the data.
Any particular element that happened to have perhaps at least a grain of
truth absorbed very quickly nearly all of the probability mass. Thus,
after running a couple of weeks of data through the system, one element
carried probability of 0.99999 while the other 99,999 collectively held
0.00001. This made the PBE essentially equal to the ranking held in that
one element that had absorbed all the mass, and this element was only a
slightly better estimate than the others in the sample but still far from
the truth.
   An effort at solving this problem, namely putting a lower bound, say,
1 × 10^−10 or so, on the mass each element could carry, worked in keeping
the mass from collecting on one best element, but the system very rapidly
lost information that was contained in the game scores it had been fed
earlier. That is, the system’s memory of who had beaten whom in prior
weeks was lost very quickly when the lower bound was used. This,
obviously, was not a suitable solution.
   Instead of using a lower bound on the mass each element could carry, I
chose to set to zero the mass for those elements whose mass dropped below
some threshold. Therefore, elements that were shown to be very poor fits
to the data were “killed off” and no longer confused the process by
absorbing any mass. When enough game data were run through the system so
that the population of the sample dropped below a threshold number of
elements, the PBE based on the surviving elements was computed, and then
a new generation of elements similar to that PBE was created and added to
the population of non-zero-weighted elements. This is the concept of the
Darwinian scheme. The elements that survived the environment (game data)
long enough and well enough got to pass on their information to the next
generation of elements.
   This concept worked quite well and was used to rank the Division I
college football teams in 1998.

Adjustments and refinements
   A number of issues appeared during the running of the 1998 college
football season.
   Most obvious was a problem with restricting the ranks to integers
between 1 and n, where n is the number of teams under consideration. The
problem showed up when teams near the top of the rankings played each
other. For example, if the team ranked 2 played the team ranked 3 and
team 2 won, then team 3 would take a big slide in the following week’s
rankings since there was little room for team 2 to move up.
   In 1999, to solve this problem, I moved from an absolute ranking
system to one that uses ratings based on ranks from 1 to n. At the start
of the season, teams were rated from 1 to n, relative to the collective
prior information that I had available. In this new scheme, however, I
removed the restriction of using only integers from 1 to n and allowed
the ratings to slide “up” past 1 and “down” past n, allowing really good
teams to distance themselves from the rest of the field and allowing
highly ranked teams to not pay too high a penalty for losing to another
highly ranked team. I call this allowance for teams to move to negative
ratings the floating origin adjustment.
   In an attempt to minimize the effect of the prior, I also shrank the
spread of the initial ratings from 1 to n down to 1 to kn, where k is
some number between 0 and 1. If we set k to zero, then the prior is
essentially non-informative and all teams have the same rating at the
beginning of the season. If k is set to unity, then we get them spaced
from 1 to n. I call the factor k the non-unitarian prior stretch factor.
   Setting the non-unitarian prior stretch factor to 70% for the 1999
season, I generated weekly rankings (starting on October 3) for all 236
NCAA Division I football teams. The top five schools at the end of the
season are presented in Table 4.
   Note that Nebraska and Florida State are nearly in a statistical dead
heat, as are Tennessee, Virginia Tech, and Wisconsin.
   College football rankings using an implementation of the Darwinian
ranking scheme for NCAA Division I schools for 1998, 1999, and 2000 can
be found on the web at

Work still needing to be done
   One substantial problem still remains: blowouts.
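   As an aside on mechanics, the kill-off and regeneration cycle
described under “Obstacles and the Darwinian solution” can be sketched in
Python as follows. This is a minimal illustration under stated
assumptions rather than the implementation used for the rankings: the
elements here are plain permutations, the logistic win_prob is a
hypothetical choice of p(·), new elements are made by a few random
adjacent swaps of the PBE ordering, and each newcomer is given the mean
mass of the survivors. None of those details are specified in the text.

```python
import math
import random


def win_prob(order, winner, loser, k=0.4):
    """Hypothetical p(.): logistic in the rank gap, so p(0) = 0.5."""
    gap = order.index(loser) - order.index(winner)  # > 0: winner ranked better
    return 1.0 / (1.0 + math.exp(-k * gap))


def normalise(pop):
    total = sum(mass for _, mass in pop)
    return [(order, mass / total) for order, mass in pop]


def update(pop, winner, loser, kill_below=1e-6):
    """One Bayes update; elements whose mass drops below the threshold die."""
    pop = normalise([(o, m * win_prob(o, winner, loser)) for o, m in pop])
    return normalise([(o, m) for o, m in pop if m >= kill_below])


def pbe_order(pop):
    """Teams sorted by posterior mean rank over the surviving elements."""
    teams = pop[0][0]
    mean_rank = {t: sum(m * o.index(t) for o, m in pop) for t in teams}
    return tuple(sorted(teams, key=lambda t: mean_rank[t]))


def mutate(order, rng, swaps=2):
    """A neighbour of `order`: apply a few random adjacent swaps."""
    o = list(order)
    for _ in range(swaps):
        i = rng.randrange(len(o) - 1)
        o[i], o[i + 1] = o[i + 1], o[i]
    return tuple(o)


def regenerate(pop, size, rng):
    """Top the population back up with neighbours of the current PBE."""
    centre = pbe_order(pop)
    newborn_mass = sum(m for _, m in pop) / len(pop)  # assumption, see lead-in
    children = [(mutate(centre, rng), newborn_mass)
                for _ in range(size - len(pop))]
    return normalise(pop + children)


def darwinian_rank(teams, games, size=200, min_alive=50, seed=0):
    """Feed the games through a sampled population and return the PBE order."""
    teams = list(teams)
    rng = random.Random(seed)
    pop = [(tuple(rng.sample(teams, len(teams))), 1.0 / size)
           for _ in range(size)]
    for winner, loser in games:
        pop = update(pop, winner, loser)
        if len(pop) < min_alive:  # too few survivors: breed a new generation
            pop = regenerate(pop, size, rng)
    return pbe_order(pop)


season = [("A", "B"), ("A", "C"), ("B", "C")] * 5
print(darwinian_rank("ABC", season))
```

   Zeroing the small masses, as opposed to flooring them, is what lets
the surviving elements keep the memory of earlier games; the
regeneration step then searches only near orderings that have already
fit the data.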
Table 4. Top five teams for 1999 college football

Method             Neb.   FSU   Tenn.   Virginia Tech   Wisc.
Darwinian rating   −25    −24   −13     −13             −12
Darwinian rank       1      2     3       4               5
AP                   3      1     9       2               4
Coaches’             2      1     9       3               4
Massey               2      1     8       3               6
Sagarin              2      1     9       3               8

The system I use is based on the difference in points scored by the two
teams. Noting that winning 42–7 is probably just as convincing as winning
61–3, when computing the probability of the score difference given the
particular element of O under consideration, I put a lower bound on the
height of the density in the tails of the distribution. The distribution
that I currently use makes the difference in score Gaussian with mean and
variance that are dependent on the difference between the two teams’
ratings. The mean increases as a horizontal parabola from 0 to 84 points
as the distance between the ratings goes from 0 to 237 while the standard
deviation increases linearly from 7 at 0 to 28 at 237. This implies that
95% of games between evenly-matched teams should have score differences
within ±14 points and 95% of games between the best and worst teams in
college football should yield score differences between 28 and 140 points
(a mean of 84 points, plus or minus two standard deviations of 28). Any
value of the density smaller than 0.0004 is given a value of 0.0004; that
translates to effectively truncating the distribution at approximately
4.3 standard deviations from the mean. This works well to keep the good
teams from completely obliterating the bad teams; however, it doesn’t
work well when two similarly ranked teams play each other and one gets
blown out. The problem with this situation has to do with the sample
elements that are alive at the time of the blowout. Elements that are
well-suited to the data survive. But problems arise when there are no
elements available that are well-suited to the data, as is the case when
closely rated teams play in a blowout. In this case, essentially all the
elements in the population receive approximately the same adjustment in
probability (the 0.0004 lower bound threshold discussed above) and there
is no net effect on the overall ratings. Increasing the diversity of the
population alive at any given time would allow blowouts to be more
properly treated but also increases the variability of the estimates.
Attempts at finding a compromise when dealing with blowouts will be part
of the changes for the 2000 scheme.

Discussion and conclusions
   Attempts to rank sports teams are problematic because no clear
definition seems to exist for what is meant by the “best” team. Is it the
team you would least like to play next week? Is it the team that is
playing best right now? Is it the team that has played the best
throughout the year? What do we do about losses (injury, suspension,
arrest, etc.) of key players? Should this be included?
   I like the idea of the best team being the one that, if you are a
coach, you would least like to play next Saturday, and I think that one
needs to view “better than” in terms of the probability of victory.
   The Darwinian scheme carries with it several positive features. It
allows for inclusion of all the game data in a season, yet it creates
natural weightings that give more recent games more weight. The floating
origin adjustment allows teams to separate themselves from the pack by
playing opposing teams. Teams with weak schedules are exposed by not
being able to distance themselves from the other teams in the league. The
scheme also considers, in a roundabout way, every possible ordering of
the teams, so if there is really a “true” ordering, this scheme actually
has a chance to find it.
   Of course, the downside of the system, as with any Bayesian method, is
the choice of the prior and of the model for computing the probability of
the data given the element of O, and the valid complaints that some may
have concerning that selection. Furthermore, some might argue that the
only statistic of interest is which team won: it doesn’t matter if a team
wins 31–30 or 63–3, it is still a victory.
   These issues might be best settled on the call-in radio shows.
   There is also another possible use for this methodology in the ranking
context, however. A bevy of rating and ranking systems are used every
year, each with its own strengths and weaknesses. It is possible,
certainly in theory and most likely in practice, to use this Bayesian or
Darwinian methodology to compare a number of systems under discussion.
That is, which of the 50 to 100 systems used to rank college football
best matches the data? One could assign a uniform prior across the
systems and then run the game data over them. Selecting the system with
the highest posterior probability would select a winner. On the other
hand, the PBE of the ranking would be a weighted average of all these
systems and certainly could be argued to be more accurate than the
arbitrary system the Bowl Championship Series (BCS) is using today.
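
   The floored score-difference density described under “Work still
needing to be done”, and the reason a blowout between closely rated teams
has no net effect, can be illustrated with a short Python sketch. The
figures 84, 237, 7, 28, and 0.0004 come from the text; reading
“horizontal parabola” as a square-root curve is an assumption, as are the
function names, and `gap` is taken to be the non-negative distance
between the two ratings with `margin` measured from the better-rated
team’s side.

```python
import math

MAX_GAP = 237.0  # largest rating distance quoted in the text
FLOOR = 0.0004   # lower bound placed on the density


def mean_margin(gap):
    """Mean score difference: 0 at gap 0, rising to 84 at gap 237.

    Assumption: "horizontal parabola" is read as a square-root curve.
    """
    return 84.0 * math.sqrt(gap / MAX_GAP)


def sd_margin(gap):
    """Standard deviation rising linearly from 7 (gap 0) to 28 (gap 237)."""
    return 7.0 + 21.0 * gap / MAX_GAP


def floored_density(margin, gap):
    """Gaussian density of the score difference, never below FLOOR."""
    mu, sd = mean_margin(gap), sd_margin(gap)
    z = (margin - mu) / sd
    density = math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return max(density, FLOOR)


# A 50-point blowout: elements that rate the two teams 3 or 6 points
# apart both land on the floor, so the game cannot tell them apart.
print(floored_density(50, 3), floored_density(50, 6))

# A 10-point win does discriminate between the same two elements.
print(floored_density(10, 3), floored_density(10, 6))
```

   Because every surviving element that rates the two teams as close
receives the same floored likelihood, Bayes’ Theorem rescales them all
equally and the posterior is unchanged, which is the blowout problem
noted above.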
