Modeling Soccer: A Hierarchical Model
Swarthmore College, Department of Mathematics & Statistics
My Assumptions about Soccer

Assumption 1: Offense and defense are independent from each other.

In any particular game of soccer, two teams attempt to score goals on the opposing team’s goal while defending their own goal from attack. Thus, we can roughly break down a team into an “offense” and a “defense”, which act separately from each other.

Assumption 2: Individual goals are independent of each other.

In addition, for simplicity’s sake, we assume that any particular goal is independent of previous goals; that is, the chance of scoring a goal when the score is 0-0 is the same as the chance of scoring a goal when the score is 2-0.

Assumption 3: Goal scoring follows a Poisson process.

This assumption follows from our previous assumption. We assume that, due to soccer’s low scores (i.e. the low probability of scoring) and due to the independence of goals from previous goals, we can model a team’s scores using a Poisson process.

The Model

Following these assumptions, I propose a hierarchical model with these characteristics:

$Y_i \mid \lambda_i \sim \mathrm{Poisson}(n_i \lambda_i), \quad i = 1, \dots, k$
$\lambda_i \mid \alpha, \beta \sim \mathrm{Gamma}(\alpha, \beta), \quad i = 1, \dots, k$

where $y_i$ is the number of goals scored by a team’s offense or allowed by a team’s defense, $\lambda_i$ represents the strength of the offense or defense of each individual team, $n_i$ is the total number of games played by each team, $\alpha$ is the shape parameter, and $\beta$ is the scale parameter. Also, in this model it is assumed that any observed score can be split into offensive ‘scores’ and defensive ‘allowed scores’. In other words, $Y_{\text{observed}}$ is separated into $Y_{\text{offense}}$ and $Y_{\text{defense}}$.

For simplicity’s sake, in this model I assume $\beta = 1$ and I use a data-set in which the number of games for each team is equal (pool play). The expanded forms of this model that relax these assumptions are explored further in my paper.

Generating the Parameters Using MCMC

Due to the structure of the model, it is not possible to directly calculate the full posterior distributions of $\alpha$ and $\lambda$. Instead, it is possible to calculate the conditional distributions:

$f(\alpha \mid y_i, \lambda)$
$f(\lambda_i \mid \alpha, y_i)$

Using these conditional distributions, it is possible to use a version of Markov chain Monte Carlo (MCMC) called the Gibbs sampler. The idea behind the Gibbs sampler is that, given a set of parameters that are related by conditional distributions, alternating conditional sampling approaches the marginal distribution of the parameters and yields estimates of them.

[Flowchart: separate the data into offense $y_i$ and defense $y_i$; then alternate between drawing $\hat{\lambda}_O$ and $\hat{\lambda}_D$ using $f(\lambda_i \mid y_i, \hat{\alpha})$ and drawing $\hat{\alpha}_O$ and $\hat{\alpha}_D$ using $f(\alpha \mid y_i, \hat{\lambda})$.]

Conclusions

Using a hierarchical model, I have been able to take a small data-set (World Cup pool play) and estimate strength parameters for the teams.

While the model does a good job of modeling the training data to which it is exposed, it does less well at predicting the winners in the next four rounds of the World Cup. The model predicts only two out of eight of the winners from the Round of Sixteen, and two out of four of the winners from the Quarterfinals. It does predict both of the winners in the Semifinals, but fails miserably in the Finals.

The lack of complete success can be linked to the small training set: with so little data, outliers, such as Germany’s 8-0 beatdown of Saudi Arabia, skew the parameters higher than they should be. With more data, the estimated parameters will settle more. This is pursued further in my paper.
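The alternating conditional sampling described under "Generating the Parameters Using MCMC" can be illustrated with a minimal sketch. It is not the poster's actual implementation. It assumes the scale parameter is fixed at 1, so that $\lambda_i \mid \alpha, y_i \sim \mathrm{Gamma}(\alpha + y_i,\ \text{rate } 1 + n)$ by Poisson-Gamma conjugacy. It also assumes a flat prior for $\alpha$ on a bounded grid, since the poster does not specify a prior for the shape parameter. The function name `gibbs_gamma_poisson`, the grid bounds, and the goal counts are hypothetical.

```python
import math
import numpy as np

def gibbs_gamma_poisson(y, n, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for y_i ~ Poisson(n * lam_i), lam_i ~ Gamma(alpha, 1).

    The alpha step uses a discrete grid ("griddy Gibbs"); the flat prior on
    that grid is an assumption of this sketch, not something from the poster.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    k = len(y)
    alpha_grid = np.linspace(0.1, 20.0, 400)   # assumed support for alpha
    lgamma_grid = np.array([math.lgamma(a) for a in alpha_grid])
    alpha = 1.0                                # arbitrary starting value
    lam_draws, alpha_draws = [], []
    for t in range(n_iter):
        # Step 1: lam_i | alpha, y_i ~ Gamma(shape = alpha + y_i, rate = 1 + n)
        lam = rng.gamma(shape=alpha + y, scale=1.0 / (1.0 + n))
        lam = np.maximum(lam, 1e-300)          # guard against log(0)
        # Step 2: alpha | lam, with log density known up to a constant:
        # (alpha - 1) * sum(log lam_i) - k * log Gamma(alpha)
        log_p = (alpha_grid - 1.0) * np.log(lam).sum() - k * lgamma_grid
        p = np.exp(log_p - log_p.max())
        alpha = rng.choice(alpha_grid, p=p / p.sum())
        if t >= burn_in:
            lam_draws.append(lam)
            alpha_draws.append(alpha)
    return np.array(lam_draws), np.array(alpha_draws)

# Hypothetical goals scored by four teams over n = 3 games each.
lam_draws, alpha_draws = gibbs_gamma_poisson([5, 2, 0, 7], n=3)
print("posterior mean strength per team:", lam_draws.mean(axis=0))
print("posterior mean alpha:", alpha_draws.mean())
```

Under Assumption 1, the same sampler would be run once on goals scored (offense) and once on goals allowed (defense).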
Assumption 4: Each team’s ability is not determined in a uniform fashion; rather, the team offense and defense capabilities are themselves drawn from an underlying distribution.

This assumption is fundamental for our use of a hierarchical model. We must assume that ability throughout a particular soccer league is not uniformly distributed but is drawn instead from a shaped distribution (in this model, a Gamma distribution).

About The Data Set:

This poster’s data was drawn from results of the 2002 World Cup. The pool play data, in which each team played three games against teams in their pool, was used to train the model. Quarterfinal, semifinal and finals results were used in a qualitative fashion to determine the validity of the results.

What is a Hierarchical Model, and why use it?

A hierarchical model is a model that is useful when a distribution’s parameters are assumed to have been drawn from a common underlying distribution: in this case, we assume that different teams’ offenses and defenses are defined by a common underlying Gamma distribution. The parameters that define this underlying distribution are known as the hyperparameters for the model. In this model’s case, the hyperparameters are $\alpha$ and $\beta$.

Having a team’s strength come from an underlying distribution also makes sense, since all teams tend to draw from the same pool of players. While strength may vary between teams, there should be an underlying average skill level around which most players and teams fall. A Gamma distribution is chosen for two reasons: first, it is conjugate with the Poisson scoring process of soccer, and so makes the math involved dramatically easier, and second, the Gamma distribution is easy to justify logically when compared to any alternative distribution, such as a normal curve.

Additionally, the hierarchical model is useful because it minimizes the number of parameters involved in any particular distribution, thereby preventing over-fitting to the data.

Verification

In order to test the validity of the model, there are several measures that can be used. One way is to check whether the cumulative distribution of scores generated using the estimated strength parameters matches the cumulative distribution function (CDF) generated by the actual data. The results of this heuristic follow:

CDF         Y = 0    Y <= 1   Y <= 2   Y <= 3
Data        0.281    0.646    0.844    0.958
Predicted   0.305    0.635    0.841    0.938

Figure 1: The CDF of the data (black) and generated values (blue)

Clearly, while the model is not perfect, it does a good job of matching up with the data. In fact, this lack of a perfect match is preferred, since the purpose of the hierarchical model is to prevent over-fitting to our data.

Another way to check the model’s validity is to utilize the strength parameters to generate outcomes between teams playing in the Round of Sixteen, Quarterfinals, Semifinals, and Finals. Unfortunately, under this heuristic, the model did not perform as well, due in large part to the small amount of training data.

Citations

- Gelman et al., Bayesian Data Analysis, Chapman & Hall
- Maher, M.J., ‘Modeling Association Football Scores’, Statistica Neerlandica, 36:109-118

Acknowledgements

I would like to thank Phil Everson, for without him this project would not have been possible.
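As a rough illustration of the CDF check described under Verification, the sketch below compares an empirical CDF of per-game goal counts with the CDF implied by a set of estimated per-game scoring rates. It is an assumption of this sketch that the predicted CDF is the average of the Poisson CDFs over teams; the poster may instead have simulated scores from the fitted parameters. The function names and the toy numbers are hypothetical and are not the World Cup data.

```python
import math
import numpy as np

def empirical_cdf(goals, max_goals=3):
    """Fraction of observed per-game goal counts that are <= 0, 1, ..., max_goals."""
    goals = np.asarray(goals)
    return np.array([(goals <= g).mean() for g in range(max_goals + 1)])

def predicted_cdf(rates, max_goals=3):
    """Average Poisson CDF over a set of per-game scoring rates."""
    rates = np.asarray(rates, dtype=float)
    # Row g holds P(Y = g) for every rate; average across rates, then accumulate.
    pmf = np.array([np.exp(-rates) * rates**g / math.factorial(g)
                    for g in range(max_goals + 1)])
    return pmf.mean(axis=1).cumsum()

# Toy inputs for illustration only.
toy_goals = [0, 1, 1, 2, 4]
toy_rates = [0.8, 1.5, 2.1]
print("data     :", empirical_cdf(toy_goals))
print("predicted:", predicted_cdf(toy_rates))
```

Placing the two arrays side by side gives a table of the same shape as the one in the Verification section.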