Powerpoint template for scientific posters _Swarthmore College_.ppt

Modeling Soccer: A Hierarchical Model
Paul Goldsmith-Pinkham
Swarthmore College, Department of Mathematics & Statistics

My Assumptions about Soccer:

Assumption 1: Offense and defense are independent of each other.
In any particular game of soccer, two teams attempt to score on the opposing team's goal while defending their own goal from attack. Thus, we can roughly break a team down into an "offense" and a "defense", which act separately from each other.

Assumption 2: Individual goals are independent of previous goals.
In addition, for simplicity's sake, we assume that any particular goal is independent of previous goals; that is, the chance of scoring a goal when the score is 0-0 is the same as the chance of scoring a goal when the score is 2-0.

Assumption 3: Goal scoring follows a Poisson process.
This assumption follows from the previous one. Because soccer is low-scoring (i.e., the probability of scoring at any given moment is low) and goals are independent of previous goals, we can model a team's scores using a Poisson process.

Assumption 4: Each team's ability is not determined in a uniform fashion; rather, the team's offense and defense capabilities are themselves drawn from an underlying distribution.
This assumption is fundamental to our use of a hierarchical model. We must assume that ability throughout a particular soccer league is not uniformly distributed but is instead drawn from a shaped distribution (in this model, a Gamma distribution).

About The Data Set:
This poster's data was drawn from the results of the 2002 World Cup. The pool play data, in which each team played three games against teams in its pool, was used to train the model. Round of Sixteen, quarterfinal, semifinal, and final results were used qualitatively to assess the validity of the results.

The Model
Following these assumptions, I propose a hierarchical model with these characteristics:

$Y_i \mid \theta_i \sim \mathrm{Poisson}(n_i \theta_i), \quad i = 1, \ldots, k$
$\theta_i \mid \alpha, \beta \sim \mathrm{Gamma}(\alpha, \beta), \quad i = 1, \ldots, k$

where $y_i$ is the number of goals scored by a team's offense or allowed by a team's defense, $\theta_i$ represents the strength of the offense or defense of each individual team, $n_i$ is the total number of games played by each team, $\alpha$ is the shape parameter, and $\beta$ is the scale parameter. Also, in this model it is assumed that any observed score can be split into offensive 'scores' and defensive 'allowed scores'. In other words,

$Y_{\text{observed}} = Y_{\text{offense}} + Y_{\text{defense}}$

For simplicity's sake, in this model I assume $\alpha = 1$ and use a data set in which the number of games is the same for every team (pool play). Expanded forms of this model that relax these assumptions are explored further in my paper.

What is a Hierarchical Model, and why use it?
A hierarchical model is useful when a distribution's parameters are assumed to have been drawn from a common underlying distribution: in this case, we assume that the offenses and defenses of different teams are defined by a common underlying Gamma distribution. The parameters that define this underlying distribution are known as the hyperparameters of the model. In this model's case, the hyperparameters are $\alpha$ and $\beta$.
Having a team's strength come from an underlying distribution also makes sense, since all teams tend to draw from the same pool of players. While strength may vary between teams, there should be an underlying average skill level around which most players and teams fall. A Gamma distribution is chosen for two reasons: first, it is conjugate with the Poisson scoring process of soccer, and so makes the math involved dramatically easier; second, the Gamma distribution is easy to justify logically when compared to any alternative distribution, such as a normal curve.
Additionally, the hierarchical model is useful because it limits the number of parameters involved in any particular distribution, thereby preventing over-fitting to the data.

Generating the Parameters Using MCMC
Due to the structure of the model, it is not possible to directly calculate the full posterior distributions of $\theta$ and $\beta$. Instead, it is possible to calculate the conditional distributions:

$f(\beta \mid y_i, \theta)$
$f(\theta_i \mid \beta, y_i)$

Using these conditional distributions, it is possible to use a version of Markov Chain Monte Carlo called the Gibbs sampler. The Gibbs sampler relies on the fact that, given a set of parameters that are related by conditional distributions, alternating conditional sampling approaches the marginal distribution of the parameters and so yields estimates of them.
The sampler cycles through these steps, where $\hat{\theta}_O$ and $\hat{\theta}_D$ are the current offense and defense estimates:
- Separate the data into offense $y_i$ and defense $y_i$ using $\hat{\theta}$.
- Draw $\beta_O$ and $\beta_D$ using $f(\beta \mid y_i)$.
- Draw $\hat{\theta}_O$ and $\hat{\theta}_D$ using $f(\theta_i \mid y_i, \beta)$.

Verification
There are several measures that can be used to test the validity of the model. One is to check whether the cumulative distribution of scores generated using the estimated strength parameters matches the cumulative distribution function (CDF) of the actual data. The results of this heuristic follow:

CDF          Y = 0    Y <= 1   Y <= 2   Y <= 3
Data         0.281    0.646    0.844    0.958
Predicted    0.305    0.635    0.841    0.938

Figure 1: The CDF of the data (black) and generated values (blue)

Clearly, while the model is not perfect, it does a good job of matching the data. In fact, this lack of a perfect match is preferred, since the purpose of the hierarchical model is to prevent over-fitting to our data.
Another way to check the model's validity is to use the strength parameters to generate outcomes between teams playing in the Round of Sixteen, Quarterfinals, Semifinals, and Finals. Unfortunately, under this heuristic, the model did not perform as well, due in large part to the small amount of training data.

Conclusions
Using a hierarchical model, I have been able to take a small data set (World Cup pool play) and estimate strength parameters for the teams.
While the model does a good job of modeling the training data to which it is exposed, it is less successful at predicting the winners in the next four rounds of the World Cup. The model predicts only two out of eight of the winners from the Round of Sixteen, and two out of four of the winners from the Quarterfinals. It successfully predicts both of the winners in the Semifinals, but fails miserably in the Finals.
The lack of complete success can be linked to the small training set: with so little data, outliers, such as Germany's 8-0 rout of Saudi Arabia, skew the parameters higher than they should be. With more data, the estimated parameters will settle more. This is pursued further in my paper.

Citations
- Gelman et al., Bayesian Data Analysis, Chapman & Hall
- Maher, M.J. (1982), 'Modelling Association Football Scores', Statistica Neerlandica, 36:109-118

Acknowledgements
I would like to thank Phil Everson, without whom this project would not have been possible.
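
The conjugacy between the Gamma distribution and the Poisson scoring process, noted under "What is a Hierarchical Model, and why use it?", can be made concrete for the team-strength conditional. What follows is a sketch under stated assumptions, not necessarily the derivation used in the paper: the symbol names $\theta_i$ (strength), $\alpha$ (shape) and $\beta$ (scale) are assumed, the Gamma distribution is taken to be parameterized by shape and scale, and $y_i$ is taken to be a team's total over its $n_i$ games.

```latex
\begin{align*}
f(\theta_i \mid \beta, y_i)
  &\propto \underbrace{(n_i\theta_i)^{y_i}\, e^{-n_i\theta_i}}_{\text{Poisson likelihood}}
     \;\underbrace{\theta_i^{\alpha-1}\, e^{-\theta_i/\beta}}_{\text{Gamma prior}} \\
  &\propto \theta_i^{\alpha + y_i - 1}\, e^{-\theta_i\,(n_i + 1/\beta)},
\end{align*}
so that
\[
\theta_i \mid \beta, y_i \sim
  \mathrm{Gamma}\!\left(\alpha + y_i,\; \frac{\beta}{n_i\beta + 1}\right).
\]
```

Under these assumptions, with $\alpha = 1$ and $n_i = 3$, a team with 5 goals in pool play would have a conditional strength distribution of $\mathrm{Gamma}(6,\ \beta/(3\beta + 1))$, whose mean is $6\beta/(3\beta + 1)$.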

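
The alternating conditional sampling described under "Generating the Parameters Using MCMC" can be illustrated with a minimal Python sketch. It is not the sampler used for the poster and rests on several assumptions: the symbol names `theta` (strength), `alpha` (shape, fixed at 1) and `beta` (scale) are assumed; the prior on `1/beta`, a Gamma with hypothetical parameters `a0` and `b0`, is not stated anywhere in the poster and is chosen only because it keeps both conditionals in closed form; the step that separates scores into offense and defense is omitted; and the goal totals in the usage example are made up, not World Cup data.

```python
import numpy as np


def gibbs_poisson_gamma(y, n, alpha=1.0, a0=1.0, b0=1.0,
                        n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler sketch for
        y_i | theta_i  ~ Poisson(n * theta_i)
        theta_i | beta ~ Gamma(shape=alpha, scale=beta)

    Assumption (not from the poster): 1/beta has a Gamma(a0, rate=b0) prior.
    Returns the post-burn-in draws of theta (one column per team) and beta.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    k = y.size
    beta = 1.0  # arbitrary starting value
    theta_draws = np.empty((n_iter, k))
    beta_draws = np.empty(n_iter)

    for t in range(n_iter):
        # theta_i | beta, y_i ~ Gamma(shape=alpha + y_i, scale=beta / (n*beta + 1))
        theta = rng.gamma(shape=alpha + y, scale=beta / (n * beta + 1.0))
        # 1/beta | theta ~ Gamma(shape=a0 + k*alpha, rate=b0 + sum(theta))
        rate = rng.gamma(shape=a0 + k * alpha, scale=1.0 / (b0 + theta.sum()))
        beta = 1.0 / rate
        theta_draws[t] = theta
        beta_draws[t] = beta

    return theta_draws[burn_in:], beta_draws[burn_in:]


if __name__ == "__main__":
    # Made-up three-game goal totals for six hypothetical teams.
    goals = [2, 5, 11, 0, 4, 7]
    theta, beta = gibbs_poisson_gamma(goals, n=3)
    print("posterior mean strength per team:", theta.mean(axis=0).round(2))
    print("posterior mean of beta:", beta.mean().round(2))
```

Because each team's strength is drawn around a shared `beta`, the estimates for extreme totals (such as the 11 and the 0 above) are pulled toward the group, which is the shrinkage behavior that motivates the hierarchical model.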