Predicting the FIFA Women’s World Cup 2023
Using hybrid models to develop predictions for the whole 2023 women’s tournament.
Resources on the Project
| Resource | Date | Link |
|---|---|---|
| Shiny app with predictions for the FIFA Women’s World Cup 2023 | 2023-07-17 | Shiny App |
| Summary of the great talk from Andreas Groll at the Research Seminar in Vienna | 2023-07-24 | Summary |
Overview
In an inspiring and enlightening talk by Andreas Groll at our institute in June, I learned a lot about modeling match outcomes. While I had already read a fair amount of the literature on the topic, the talk provided a nice and detailed overview of the derivation and evolution of the hybrid random forest models by Groll, Ley, Schauberger, Eetvelde, and Zeileis (2019).
Since I had to summarize the talk for one of my PhD courses, and I had always wanted to do some soccer predictions but never quite found the time, I finally seized the opportunity to make my own predictions for the Women’s World Cup 2023 using the hybrid prediction ideas. The table above links to a write-up on the evolution of the hybrid random forest model and to a Shiny app with the predictions from the model described below.
Methodology
The hybrid random forest
The general idea of the hybrid random forest model of Groll, Ley, Schauberger, and Eetvelde (2019) and Groll et al. (2021) is to predict the number of goals scored by each of the two teams playing against each other. Random forests are a very flexible class of machine learning models that aggregate a large number of single decision trees in order to increase the prediction accuracy for the outcome variable using a large set of predictors. In its “simple” form, Schauberger and Groll (2018) use a number of relevant covariates derived purely from available data, such as bookmaker odds, FIFA rank, information on team structure, and economic factors of countries, in order to predict scores. In its hybrid form, however, the covariate space is enlarged by so-called hybrid covariates, which are strength estimates of teams based on separate machine learning models. For their predictions of the UEFA Euro 2020, Groll et al. (2021) use three hybrid predictors, derived from an ability estimation model based on the bivariate Poisson model (see e.g. Karlis and Ntzoufras (2003)), a bookmaker consensus model (see e.g. Leitner, Zeileis, and Hornik (2010)), and plus-minus player ratings (see e.g. Hvattum and Gelade (2021)). Finally, in order to use the random forest’s goal estimates for predicting match results or simulating tournaments, the predicted value from the hybrid model is used as an estimate of the event rate \(\lambda\) of a Poisson distribution. That is, the goals scored by the two teams playing each other are assumed to be drawn from two independent Poisson distributions (conditional on the covariates).
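To make the last step concrete, here is a minimal Python sketch (the project itself was done in R) of how two predicted goal rates can be turned into win, draw, and loss probabilities under the independent Poisson assumption. The function names, the truncation at 15 goals, and the example rates are my own illustrative assumptions, not values from the model.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_a, lam_b, max_goals=15):
    """Win/draw/loss probabilities for team A against team B.

    Goals are assumed to follow two independent Poisson distributions
    with rates lam_a and lam_b. Scores above max_goals are ignored,
    which is a truncation chosen for illustration only.
    """
    p_win = p_draw = p_loss = 0.0
    for a in range(max_goals + 1):
        p_a = poisson_pmf(a, lam_a)
        for b in range(max_goals + 1):
            p = p_a * poisson_pmf(b, lam_b)
            if a > b:
                p_win += p
            elif a == b:
                p_draw += p
            else:
                p_loss += p
    return p_win, p_draw, p_loss


# Hypothetical predicted rates, e.g. as they might come out of a random forest.
p_win, p_draw, p_loss = outcome_probabilities(2.1, 0.8)
print(f"win={p_win:.3f} draw={p_draw:.3f} loss={p_loss:.3f}")
```

For a tournament simulation one would instead draw scores from the two Poisson distributions match by match, but the probabilities above are what a per-match forecast would report.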
In this project, I tried to adapt their hybrid approach to obtain predictions for the FIFA Women’s World Cup 2023. As time was limited for the project,¹ the data gathering step was reduced to the bare minimum. I collected historic match result data from Kaggle as well as from fbref.com going back to 1969. Furthermore, I collected covariates such as the FIFA rank, information on host country and confederation, and economic factors such as population and GDP growth of the countries participating in the World Cup. The main part of the work, however, was to derive some hybrid predictor variables.
From the collected data, the only reasonable approach for deriving hybrid team strength estimates was to use historic match data. This ruled out a bookmaker consensus model² and the plus-minus models. So first I took the natural approach and used a bivariate Poisson model, which is similar to the strength estimate in Groll, Ley, Schauberger, and Eetvelde (2019) and was also used by Groll, Ley, Schauberger, Eetvelde, and Zeileis (2019) in the context of modeling women’s soccer. However, by that time I had already heard an interesting talk by Marius Ötting on the modelling of women’s scores. In their paper, Michels, Ötting, and Karlis (2023) provide a novel approach to modeling football games that extends the early work of Dixon and Coles (1997). In essence, they derive a bivariate model for the number of goals scored by the competing teams in which the marginal distributions may be specified arbitrarily and the joint distribution is defined by the marginals and an adjustment term. This term allows for dependence between the two scores, and it can be shown that the model of Dixon and Coles (1997), which shifts probability weights between the scores 0-0, 0-1, 1-0, and 1-1, is a special case of the general framework. The rationale for adjusting the model of Dixon and Coles (1997) is the observation that women’s football scores behave fundamentally differently from men’s. Michels, Ötting, and Karlis (2023) derive a version of the general framework, which they term the alternative negative binomial Sarmanov (ANS) model, and show that it is able to capture idiosyncrasies of women’s football, such as negative correlation and overdispersion. Thus it naturally seemed interesting to apply this model to the data.
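As an illustration of what "shifting probability weights" between the four low scores means, the following Python sketch implements the classical Dixon and Coles (1997) adjustment on top of two independent Poisson marginals. It is not the ANS model of Michels, Ötting, and Karlis (2023), only the special case mentioned above; the function names and the example parameter values are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def dixon_coles_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment factor for the score x-y.

    lam and mu are the expected goals of the home and away team,
    rho is the dependence parameter. Only the scores 0-0, 0-1, 1-0
    and 1-1 are adjusted; all other scores keep a factor of 1.
    """
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def dixon_coles_pmf(x, y, lam, mu, rho):
    """Joint probability of the score x-y under the Dixon-Coles model."""
    return (dixon_coles_tau(x, y, lam, mu, rho)
            * poisson_pmf(x, lam) * poisson_pmf(y, mu))


# Hypothetical parameters: a negative rho moves mass towards 0-0 and 1-1.
lam, mu, rho = 1.4, 1.1, -0.1
total = sum(dixon_coles_pmf(x, y, lam, mu, rho)
            for x in range(30) for y in range(30))
print(f"P(1-1) = {dixon_coles_pmf(1, 1, lam, mu, rho):.4f}, total = {total:.6f}")
```

The four adjustments cancel exactly, so the joint distribution still sums to one while the marginal means are unchanged.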
Estimation of team abilities with the ANS model
The idea is to derive strength parameters for each team competing in the World Cup.³ Estimates from the bivariate Poisson model can be obtained following Groll, Ley, Schauberger, Eetvelde, and Zeileis (2019). This is easy in R: one only has to slightly adjust the functions from the bivpois package (Karlis and Ntzoufras (2005)) to allow for weighted likelihood estimation. However, in this project, I followed a slightly different approach in order to obtain attacking strength estimates as well as defending estimates.
The principle for deriving team estimates is always the same: using historic match outcome data of \(M\) matches, an attack parameter \(att_i\) and a defense parameter \(def_i\) are estimated for each team \(i\) by maximizing the weighted log-likelihood for match results \[\begin{align}
\label{eq-wll}
\ell = \sum_{m = 1}^M w_m \log P(Y_{im} = y_{im}, Y_{jm} = y_{jm} \mid \Theta).
\end{align}\] The estimates \(att_i\) and \(def_i\) are usually incorporated into the mean \(\theta_{im}\) of the goals scored by the home team \(i\) and the mean \(\theta_{jm}\) of the goals scored by the away team \(j\) in match \(m\) such that⁴ \[\begin{align}
\label{eq-means}
\log(\theta_{im}) = att_i+def_j, \\
\log(\theta_{jm}) = att_j+def_i.
\end{align}\] In the bivariate Poisson model, \(\theta_{im}\) and \(\theta_{jm}\) simply represent the expected numbers of goals \(\lambda_{im}\) and \(\lambda_{jm}\) for the respective teams. In the ANS model, this is a bit more tricky. We basically follow Michels, Ötting, and Karlis (2023) and write the probability mass function (pmf) of the negative binomial distribution in terms of its mean-shape parameterization \[\begin{align}
\label{eq-nb_pmf}
P(X = x;\mu,\phi) = \frac{\Gamma(x+\phi)}{\Gamma(\phi)\,x!} \left(\frac{\phi}{\phi+\mu}\right)^{\phi}\left(\frac{\mu}{\phi+\mu}\right)^{x}.
\end{align}\] From this expression, the pmf of the bivariate ANS model can be derived; it is given in the paper of Michels, Ötting, and Karlis (2023) (formula on page 13). The strength parameters are then incorporated into the mean parameters \(\mu_{im}\) and \(\mu_{jm}\) of the negative binomial distributions for the home and away team (i.e. \(\theta_{im} = \mu_{im}\) and \(\theta_{jm} = \mu_{jm}\) in Equation \(\eqref{eq-means}\)). The estimation can be performed by maximizing the weighted log-likelihood from Equation \(\eqref{eq-wll}\) with weights taken from Groll, Ley, Schauberger, Eetvelde, and Zeileis (2019). This maximization is done in R using classical non-linear optimization tools (the function nlm). Note that, as a byproduct, the shape parameters \(\phi_i\) and \(\phi_j\) for each team are estimated as well. One could think of adjusting these for team strengths too, instead of leaving them constant for each team and match.
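The building blocks above can be sketched in Python as follows (the project itself used R and nlm). The sketch evaluates the negative binomial pmf in its mean-shape parameterization and a weighted log-likelihood with the log-linear means from Equation \(\eqref{eq-means}\). As a loud simplification, it treats the two scores as independent negative binomials and leaves out the Sarmanov dependence term of the actual ANS model; the team labels, scores, weights, and parameter values are made up for illustration.

```python
import math


def nb_log_pmf(x, mu, phi):
    """log P(X = x) for a negative binomial with mean mu and shape phi."""
    return (math.lgamma(x + phi) - math.lgamma(phi) - math.lgamma(x + 1)
            + phi * math.log(phi / (phi + mu))
            + x * math.log(mu / (phi + mu)))


def weighted_log_likelihood(matches, weights, att, dfn, phi):
    """Weighted log-likelihood over matches (i, j, y_i, y_j).

    Assumption for this sketch: the two scores are independent negative
    binomials, i.e. the ANS dependence term is omitted.
    att, dfn and phi map team labels to attack, defense and shape values.
    """
    total = 0.0
    for (i, j, y_i, y_j), w in zip(matches, weights):
        mu_i = math.exp(att[i] + dfn[j])  # log(mu_im) = att_i + def_j
        mu_j = math.exp(att[j] + dfn[i])  # log(mu_jm) = att_j + def_i
        total += w * (nb_log_pmf(y_i, mu_i, phi[i])
                      + nb_log_pmf(y_j, mu_j, phi[j]))
    return total


# Made-up toy data: three teams, three matches, more recent matches weigh more.
matches = [("A", "B", 2, 0), ("B", "C", 1, 1), ("A", "C", 3, 1)]
weights = [0.5, 0.8, 1.0]
att = {"A": 0.4, "B": -0.1, "C": 0.0}
dfn = {"A": -0.3, "B": 0.1, "C": 0.2}
phi = {"A": 2.0, "B": 2.0, "C": 2.0}
print(weighted_log_likelihood(matches, weights, att, dfn, phi))
```

In an actual fit, the negative of this function would be handed to a numerical optimizer over all attack, defense, and shape parameters, usually with an identifiability constraint on the strength parameters.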
Shiny app and takeaways from the project
In the resources table above, you can find a Shiny app that I wrote for the World Cup. It contains predictions for all matches of the group stage as well as predictions for the first knock-out round (estimates were updated after the group stage). Furthermore, it provides each team’s probability of reaching each knock-out phase of the tournament, computed once before the tournament and once after the group stage. The clear favorite before the World Cup was the US national team, which has been the most dominant force in women’s soccer recently. The model probably overestimated the US winning probability a bit, but the US team led the FIFA ranking and outperformed the other teams in the ability estimates as well. Interestingly enough, the attacking and defensive strength estimates from the ANS model as well as the bivariate Poisson model (which can also be found in the app) showed that Spain (the team that later went on to win the World Cup) and England (the other finalist) were the two teams closest to the US in terms of their ability parameters. This is not reflected in the FIFA rank, for example, where Spain is only 6th and England only 4th. This suggests that the ANS model (as well as the bivariate Poisson model) may be more accurate in indicating team strength than the FIFA rank. There are further interesting patterns in the predictions of the hybrid random forest, for example that Spain and England followed the US in terms of winning probability before the tournament began. Interestingly, after the group stage, the probability of Spain winning drops substantially. However, this is explainable. For the model, the most probable path to the final for Spain would have included a semifinal match-up against the US, the major favorite for the title before the tournament.
Thus, although Spain’s probability of reaching the semifinals was quite high after the group stage (it was even higher than that of the US), its probability of reaching and winning the final nevertheless decreased due to the prospect of having to play against the US team. However, as we all know, football has its own rules, and so the US team fell short of expectations, losing an entertaining round-of-16 match against Sweden in which top star Megan Rapinoe missed her penalty and probably her last chance of winning another big title. Overall, the prediction model provided interesting insights into the tournament and, at least subjectively, performed quite well. A more objective evaluation could compare the individual match predictions with the actual outcomes in order to obtain a statistical measure of performance, but this is left open for the future (if I ever find the time to do so).
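One possible measure for such an evaluation, chosen here purely as an example and not something computed in this project, is the ranked probability score (RPS) for the ordered outcomes win, draw, and loss. The Python sketch below shows how it could be computed for a single match; the function name and the example forecast are hypothetical.

```python
def ranked_probability_score(probs, outcome_index):
    """Ranked probability score for one match.

    probs: forecast probabilities for the ordered outcomes,
           e.g. (win, draw, loss) from the first team's perspective.
    outcome_index: index of the outcome that actually occurred.
    Lower is better; 0 means a perfect forecast.
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    score = 0.0
    # Compare cumulative forecast and observed distributions
    # over the first r - 1 categories.
    for k in range(len(probs) - 1):
        cum_forecast += probs[k]
        cum_observed += 1.0 if k == outcome_index else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score / (len(probs) - 1)


# Hypothetical forecast (win, draw, loss) for a match that ended in a draw.
print(ranked_probability_score((0.6, 0.25, 0.15), 1))
```

Averaging this score over all matches of the tournament would give a single number that can be compared across models, for example against forecasts based on the FIFA rank alone.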
In conclusion, embarking on the project was quite fun, and it seems that match prediction and tournament simulation using hybrid random forest models is a promising approach. Of course, the model presented here may be enhanced by using more hybrid predictors, such as the ones used by Andreas Groll and colleagues. An interesting future adventure that I plan to look into is deriving a hybrid variable that extracts match information from media text published before each match. Using NLP techniques, one may extract sentiment about the favorite in each match-up, which could provide valuable insights for predicting matches, especially in tournaments. There are, however, many great approaches to modelling football games and many further directions to work on, and I hope to be able to continue predicting soccer matches.
References
Footnotes
¹ The World Cup started on the 17th of July and I started working on the project at the beginning of July.
² It was surprisingly difficult to find betting odds for past Women’s World Cups online.
³ Actually, in order to train the random forest model, we need ability estimates for all participants of the last 3 World Cups as well, but this is beside the point.
⁴ Note that usually a home effect intercept is incorporated into the model. However, for women’s football the home effect is not that pronounced and is possibly only relevant more recently, where there has been more media and fan attention to women’s football. Thus I omitted it here.