An Intelligence Approach to Evaluation of Sports Teams
by Edward Kambour, Ph.D.




                     Agenda
I.     College Football
II.    Linear Model
III.   Generalized Linear Model
IV.    Intelligence (Bayesian) Approach
V.     Results
VI.    Other Sports
VII.   Future Work
          General Background
Goals
  Forecast winners of future games
    Beat the Bookie!
  Estimate the outcome of unscheduled games
    What’s the probability that Iowa would have beaten
     Ohio St?
  Generate reasonable rankings
        Major College Football
No playoff system
“Computer rankings” are an element of the
 BCS
114 teams
12 games for each in a season
                  Linear Model
Rothman (1970s), Harville (1977), Stefani
 (1977), …, Kambour (1991), …, Sagarin???
  Response, Y, is the net result (point-spread)
  Parameter, , is the vector of ratings
  For a game involving teams i and j,
    E[Y] = i - j
        Linear Model (cont.)
Let X be a row vector with
  $X_k = \begin{cases} 1 & \text{if } k = i \\ -1 & \text{if } k = j \\ 0 & \text{otherwise} \end{cases}$
$E[Y] = X\theta$
       Regression Model Notes
Least Squares: assumes Normality, Homogeneity (equal variance)
College Football
  Estimate 100 parameters
  Sample size for a full season is about 600
  Design Matrix is sparse and not full rank
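As a rough illustration of the regression setup above, the sketch below builds the design matrix for a handful of made-up games and fits ratings by least squares. The team indices, margins, and the helper name `design_matrix` are hypothetical. `numpy.linalg.lstsq` is used here because it returns the minimum-norm solution when the matrix is not full rank; this is one way to handle the rank deficiency, not necessarily the one used in the talk.

```python
import numpy as np

def design_matrix(games, n_teams):
    """One row per game: +1 for team i, -1 for team j, 0 elsewhere."""
    X = np.zeros((len(games), n_teams))
    for row, (i, j, _) in enumerate(games):
        X[row, i] = 1.0
        X[row, j] = -1.0
    return X

# Hypothetical results: (team i, team j, margin for i over j).
games = [(0, 1, 7), (1, 2, 3), (2, 3, 10), (0, 3, 21), (1, 3, 14)]
n_teams = 4

X = design_matrix(games, n_teams)
y = np.array([margin for _, _, margin in games], dtype=float)

# Every row sums to zero, so the columns are linearly dependent:
# the rank is at most n_teams - 1 and only rating differences are identified.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm least-squares solution,
# which for a connected schedule centers the ratings at zero.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(rank, np.round(theta, 2))
```

Because only differences are identified, any constant can be added to all ratings without changing the fitted margins.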
         Home-field Advantage
Generic Advantage (Stefani, 1980)
  Force i to be home team and j the visiting team
  Add an intercept term to X
  Adds one more parameter to estimate
  Assumes every team has the same advantage
    UAB = Alabama
    Rice = Texas A&M
Team Specific Advantage
  Doubles the number of parameters to estimate
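The two home-field options can be sketched as two layouts of the design matrix. The function name `home_design` and the three-game schedule are hypothetical, and neutral-site games are ignored for simplicity.

```python
import numpy as np

def home_design(games, n_teams, team_specific=False):
    """Rows for games given as (home, visitor) index pairs.

    Generic advantage: n_teams rating columns plus one intercept column.
    Team-specific advantage: n_teams rating columns plus n_teams home columns.
    """
    n_extra = n_teams if team_specific else 1
    X = np.zeros((len(games), n_teams + n_extra))
    for row, (home, visitor) in enumerate(games):
        X[row, home] = 1.0
        X[row, visitor] = -1.0
        # Home column: the home team's own column, or the shared intercept.
        X[row, n_teams + (home if team_specific else 0)] = 1.0
    return X

games = [(0, 1), (1, 2), (2, 0)]          # hypothetical schedule
X_generic = home_design(games, n_teams=3)
X_specific = home_design(games, n_teams=3, team_specific=True)
print(X_generic.shape, X_specific.shape)  # (3, 4) (3, 6)
```

The team-specific layout has 2L columns for L teams, which is where the doubling of the parameter count comes from.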
          Linear Model Issues
Normality
Homogeneity
Lots of parameters, with relatively small
 sample size
  Overfitting
  The bookie takes you to the cleaners!
     Linear Model Issues (cont.)
Should we model point differential?
  A and B play twice
    A by 34 in first, B by 14 in the second
    A by 10 each time
  Running up the score (or lack thereof)
  BCS: Thou shalt not use margin of victory in thy
   ratings!
           Logistic Regression
Rothman (1970s)
Linear Model
Use binary variable
  Winning is all that matters
  Avoid margin of victory
  Coin Flips
     Logistic Regression Issues
Still have sample size issues
Throw away a lot of information
Undefeated teams
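The undefeated-team issue can be seen in a small experiment. The sketch below fits win/loss ratings by plain gradient ascent on the logistic log-likelihood; the function name `fit_logistic_ratings`, the toy schedule, the step counts, and the learning rate are all illustrative assumptions, not the method from the talk.

```python
import numpy as np

def fit_logistic_ratings(games, n_teams, steps=2000, lr=0.1):
    """games: list of (winner, loser) index pairs; returns fitted ratings."""
    X = np.zeros((len(games), n_teams))
    for g, (winner, loser) in enumerate(games):
        X[g, winner], X[g, loser] = 1.0, -1.0
    theta = np.zeros(n_teams)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # P(listed winner wins)
        theta += lr * X.T @ (1.0 - p)          # gradient of the log-likelihood
        theta -= theta.mean()                  # pin the location (X is not full rank)
    return theta

# Team 0 beats everyone; teams 1 and 2 split their games.
games = [(0, 1), (0, 2), (1, 2), (2, 1)]
short_run = fit_logistic_ratings(games, 3, steps=200)
long_run = fit_logistic_ratings(games, 3, steps=20_000)
print(np.round(short_run, 2), np.round(long_run, 2))
```

The rating of the undefeated team keeps growing with more iterations because the likelihood has no finite maximum for it, while the two teams that split stay tied.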
                Transformations
Transform the differentials to normality
  Power transformations
  Rothman logistic transform
     Transforms points to probabilities for logistic
      regression
  “Diminishing returns” transforms
     Downweights runaway scores
            Power Transforms
Transform the point-spread
   $Y = \operatorname{sign}(Z)\,|Z|^a$
    a = 1: straight margin of victory
    a = 0: just win baby
    a = 0.5: Poisson or Gamma “ish”
  Maximum Likelihood Transform
1995-2002 seasons
    Power       -2ln(likelihood)
    0.1         52487
    0.3         41213
    0.5         35128
    0.67        32597
    0.8         31418
    1           31193

MLE = 0.98
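A minimal sketch of the signed power transform, applied to made-up margins, shows how the exponent moves between the two extremes. The function name `power_transform` and the sample margins are hypothetical.

```python
import numpy as np

def power_transform(z, a):
    """Signed power transform Y = sign(Z) * |Z|**a of a point-spread Z."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.abs(z) ** a

spreads = np.array([-21.0, -3.0, 0.0, 7.0, 35.0])   # hypothetical margins
straight = power_transform(spreads, 1.0)    # margins unchanged
win_only = power_transform(spreads, 0.0)    # only the sign (winner) survives
root = power_transform(spreads, 0.5)        # square-root scale damps blowouts
print(straight, win_only, root)
```

Intermediate exponents shrink large margins more than small ones, which is the "diminishing returns" idea.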
           Predicting the Score
Model point differential
   Y1 = Si – Sj
Additionally model the sum of the points
 scored
   Y2 = Si + Sj
  Fit a similar linear model (different parameter
    estimates)
Forecast home and visitors score
  H = (Y1 + Y2 )/2, V = (Y2 - Y1)/2
    Another Transformation Idea
Scores (touchdowns or field goals) are
 arrivals, maybe Poisson
  Final score = 7 times a Poisson + 3 times a
   Poisson + …
Transform the scores to homogeneity and
 normality first
  The differences (and sums) should follow suit
        Square Root Transform
Since the score is “similar” to a linear
 combination of Poissons, square root should
 work
Transformation
  T  S k
 Why k?
  For small Poisson arrival rates, get better
   performance (Anscombe, 1948)
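The effect of the offset can be checked by simulation. The sketch below compares the variance of $\sqrt{X}$ and $\sqrt{X + 3/8}$ for Poisson counts at several rates; 3/8 is Anscombe's offset for a pure Poisson count and is used here only as an illustration, not as the offset fitted to football scores. The rates and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}

for rate in (2.0, 5.0, 20.0):
    x = rng.poisson(rate, size=200_000)
    var_no_offset = np.sqrt(x).var()            # k = 0
    var_offset = np.sqrt(x + 0.375).var()       # k = 3/8
    results[rate] = (var_no_offset, var_offset)
    print(f"rate={rate:5.1f}  k=0: {var_no_offset:.3f}  k=3/8: {var_offset:.3f}")
```

A variance-stabilizing transform should give roughly the same variance (about 1/4 on this scale) at every rate; with the offset, that holds much better at the small rate.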
               Likelihood Test
LRT: No transformation vs. square root with
 fitted k
  Used College Football results from 1995-2002
  k = 21
  Transformation was significantly better
    p-value = 0.0023, chi-square = 9.26
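The reported p-value is consistent with a chi-square statistic on 1 degree of freedom; that degree-of-freedom count is an assumption here, since the slide does not state it. A quick check using only the standard library:

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # If Z is standard normal, Z**2 is chi-square(1), so P(Z**2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_sf_1df(9.26)
print(f"{p_value:.4f}")
```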
       Predicting the Score with
               Transform
Model point differential
   Y1  Si  21  S j  21
Additionally model the sum of the points
 scored
   Y2  Si  21  S j  21
Forecast home and visitors score
  H = ((Y1 + Y2 )/2)2 , V = ((Y2 - Y1)/2)2
Note the point differential is the product
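A short round trip makes the transform and its inverse concrete. The sketch subtracts k after squaring so that the original scores are recovered; the function names and the 31-17 score are hypothetical.

```python
import math

K = 21.0  # offset fitted in the slides

def to_transformed(s_home, s_visitor, k=K):
    """Return (Y1, Y2): difference and sum of the square-root scores."""
    t_home, t_vis = math.sqrt(s_home + k), math.sqrt(s_visitor + k)
    return t_home - t_vis, t_home + t_vis

def to_scores(y1, y2, k=K):
    """Invert the transform: square the half-sum and half-difference, then remove k."""
    home = ((y1 + y2) / 2.0) ** 2 - k
    visitor = ((y2 - y1) / 2.0) ** 2 - k
    return home, visitor

y1, y2 = to_transformed(31, 17)      # hypothetical final score
home, visitor = to_scores(y1, y2)
print(home, visitor, y1 * y2)
```

The product `y1 * y2` equals the point differential because the difference of two squares factors into the sum times the difference, and the offsets cancel.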
  Unresolved Linear Model Issues
Overfitting
History
  Going into the season, we have a good idea as
   to how teams will do
    The best teams tend to stay the best
    The worst teams tend to stay the worst
  Changes happen
    Kansas State
                  Intelligence Model
Concept
  The ratings and home-field advantages for year t are
   similar to those of year t-1. There is some drift from
   one year to the next.
Model
    t   t 1   t
    where
     t ~ N(0,  2 )
     Intelligence Model (Details)
Notation
   L teams
   M seasons of data
   $N_i$ games in the ith season
  $X_i$: the $N_i$ by 2L “X” matrix for season i
  $Y_i$: the $N_i$ vector of results for season i
  $\theta_i$: the 2L vector of ratings and home-field advantages for season i
                Details (cont.)
Data Distribution:
  For all i = 1, 2, …, M
     $Y_i \mid \theta_i, \sigma^2 \sim N(X_i \theta_i, \sigma^2 I)$ (independent)
                    Details (cont.)
Prior Distribution
                  I   0  2
   1                      
           2
               N  0, 
                  0 0.05I  
                        0.25I   0  2
    i    2
            N   i 1 ,                for i  2,..., M
                         0    0.01I  
     2   2,0.5 
       Details (finally, the end)
The Posterior Distribution of $\theta_M$ and $\sigma^{-2}$ is
 closed form and can be calculated by an
 iterative method
The Predictive Distribution for future results
 (transformed sum or difference) is straight-
 forward correlated normal (given the
 variance)
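One way to read the iterative method is as a season-by-season recursion: update the ratings with a season's games, then widen the uncertainty by the drift before the next season. The sketch below shows that recursion conditional on the variance, using standard normal conjugate algebra. It is an assumption about the general shape of the calculation, not the author's exact algorithm; it leaves out the update for the variance itself, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def season_update(m, V, X, y):
    """Posterior of theta given y ~ N(X theta, s2*I) and prior theta ~ N(m, s2*V).

    Returns the posterior mean and the posterior covariance scale (covariance / s2).
    """
    V_inv = np.linalg.inv(V)
    V_post = np.linalg.inv(V_inv + X.T @ X)
    m_post = V_post @ (V_inv @ m + X.T @ y)
    return m_post, V_post

def drift(m, V, W):
    """Prior for the next season: same mean, covariance scale inflated by the drift W."""
    return m, V + W

# Toy example with two parameters, each observed directly once (hypothetical numbers).
m0, V0 = np.zeros(2), np.eye(2)
X, y = np.eye(2), np.array([2.0, 4.0])

m1, V1 = season_update(m0, V0, X, y)                    # posterior after season 1
m2_prior, V2_prior = drift(m1, V1, 0.25 * np.eye(2))    # prior for season 2
print(m1, np.diag(V1), np.diag(V2_prior))
```

In the toy example the posterior mean lands halfway between the prior mean and the observation because the prior and the data carry equal weight.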
                   Forecasts
For Scores
  Simply untransform
    $E[Z^2] = \mathrm{Var}[Z] + E[Z]^2$
For the point-spread
  Product of two normals
    Simulate 10000 results
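Both forecasting steps can be sketched with a made-up predictive distribution. The means and covariance below are hypothetical, the variance is treated as known so the predictive distribution is normal, and the offset of 21 is the one fitted for college football.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 21.0

# Hypothetical predictive distribution of (Y1, Y2) on the square-root scale:
# Y1 is the transformed difference, Y2 the transformed sum.
mean = np.array([0.5, 13.5])
cov = np.array([[0.6, 0.0],
                [0.0, 0.6]])

draws = rng.multivariate_normal(mean, cov, size=10_000)
y1, y2 = draws[:, 0], draws[:, 1]

spread = y1 * y2                      # home minus visitor, in points
p_home_win = np.mean(spread > 0)

# Expected home score without simulation: T = (Y1 + Y2) / 2 is normal,
# so E[T^2] = Var[T] + E[T]^2, and the score is T^2 - K.
t_mean = mean.sum() / 2.0
t_var = cov.sum() / 4.0
expected_home = t_var + t_mean ** 2 - K

print(round(spread.mean(), 2), round(p_home_win, 3), round(expected_home, 2))
```

The simulated spreads can also be compared with a betting line, for example by taking the fraction of draws in which the spread exceeds the line.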
              Enhanced Model
Fit the prior parameters
  Hierarchical models
  Drifts and initial variances
  No closed form for posterior and predictive
   distributions (at least as far as I know)
    The complete conditionals are straight-forward, so
     Gibbs sampling will work (eventually)
                                 Results
                 (www.geocities.com/kambour/football.html)


2002 Final Rankings
    Team              Rating         Home
    Miami             72.23 (1.03)   0.21 (0.04)
    Kansas St         72.04 (1.04)   0.44 (0.03)
    USC               71.95 (1.03)   0.04 (0.03)
    Oklahoma          71.85 (1.02)   0.18 (0.03)
    Texas             71.57 (1.03)   0.36 (0.03)
    Georgia           71.49 (1.03)   0.02 (0.03)
    Alabama           71.45 (1.03)   -0.09 (0.03)
    Iowa              71.30 (1.03)   0.21 (0.04)
    Florida St        71.29 (1.02)   0.43 (0.03)
    Virginia Tech     71.25 (1.03)   0.12 (0.03)
    Ohio St           71.18 (1.03)   0.27 (0.03)
                  Bowl Predictions
Ohio St               17
Miami Fl (-13)        31     0.8255   0.5228
Washington St         21
Oklahoma (-6.5)       31     0.7347   0.5797
Iowa                  21
USC (-6)              30     0.7174   0.5721
NC State (E)          20
Notre Dame            17     0.5639   0.5639
Florida St (+4)       24
Georgia               27     0.5719   0.5320
             2002 Final Record
Picking Winners
  522 – 157         0.769
Against the Vegas lines
  367 – 307 – 5     0.544
Best Bets
  9 – 7             0.563
  In 2001, 11 - 4
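The percentages above can be reproduced with a one-line formula. The 0.544 against the line appears to count each of the 5 pushes as half a win; that reading is an assumption, since ignoring the pushes gives 367/674, which rounds to 0.545. The function name `win_pct` is hypothetical.

```python
def win_pct(wins, losses, pushes=0):
    """Winning percentage with each push counted as half a win (an assumption)."""
    return (wins + 0.5 * pushes) / (wins + losses + pushes)

picking_winners = win_pct(522, 157)
against_the_line = win_pct(367, 307, 5)
print(f"{picking_winners:.3f} {against_the_line:.3f}")
```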
           ESPN College Pick’em
           (http://games.espn.go.com/cpickem/leader)


1.   Barry Schultz                     5830
2.   Jim Dobbs                         5687
3.   Michael Reeves                    5651
4.   Fup Biz                           5594
5.   Joe *                             5587
6.   Rising Cream                      5562
7.   Intelligence Ratings              5559
     Ratings System Comparison
         (http://tbeck.freeshell.org/fb/awards2002.html)


Todd Beck
  Ph.D. Statistician
  Rush Institute
 Intelligence Ratings – Best Predictors
    College Football Conclusions
Can forecast the outcome of games
  Capture the random nature
     High variability
     Sparse design
Scientists should avoid BCS
  Statistical significance is impossible
  Problem Complexity
  Other issues
                       NFL
Similar to College Football
Square root transform is applicable
Drift is a little higher than College Football
Better design matrix
  Small sample size
Playoff
                              NFL Results
                   (www.geocities.com/kambour/NFL.html)


2002 Final Rankings (after the Super Bowl)
    Team              Rating       Home
    Tampa Bay         70.72        0.29
    Oakland           70.57        0.28
    Philadelphia      70.55        0.10
    New England       70.16        0.12
    Atlanta           70.13        0.20
    NY Jets           70.10        -0.01
    Pittsburgh        69.95        0.28
    Green Bay         69.92        0.28
    Kansas City       69.90        0.51
    Denver            69.89        0.50
    Miami             69.89        0.49
           2002 Final NFL Record
Picking Winners
  162 – 104 – 1           0.609
Against the Vegas lines
  135 – 128 – 4           0.513
Best Bets
  9 – 8                   0.529
                NFL Europe
Similar to College and NFL
Square root transform
Dramatic drift
Teams change dramatically in mid-season
Few teams
  Better design matrix
            College Basketball
Transform?
  Much more normal (Central Limit Theorem)
A lot more games
  Intersectional games
Less emphasis on programs than in College
 Football
  More drift
NCAA tournament
     NCAA Basketball
          Pre-tournament Ratings
Team            Rating    Home
Arizona         100.06    3.97
Kentucky        99.33     4.32
Kansas          95.89     3.85
Texas           93.42     4.44
Duke            92.90     4.66
Oklahoma        90.19     4.31
Florida         90.65     3.99
Wake Forest     88.70     3.65
Syracuse        88.50     3.49
Xavier          87.89     3.37
Louisville      87.88     4.16
                     NBA
Similar to College Basketball
  Normal – No transformation
A lot more games – fewer teams
Playoffs are completely different from
 regular season
  Regular season – very balanced, strong home
   court
  Post season – less balanced, home court
   lessened
                     Hockey
Transform
  Rare events = “Poissonish”
    Square root with k around 1
A lot more games
History matters
Playoffs seem similar to regular season
Balance
                     Soccer
Similar to hockey
Transform
  Square root with low k
Not a lot of games
Friendlies versus cup play
Home pitch is pronounced
  Varies widely
              Soccer Results
Correctly forecasted 2002 World Cup final
  Brazil over Germany
Correctly forecasted US run to quarter-finals
Won the PROS World Cup Soccer Pool
         Future Enhancements
Hierarchical Approaches
  Conferences
More complicated drift models
  Correlations
  Individual drifts
  Drift during the season
  Mean correcting drift
  More informative priors

								