Vienna University of Economics and Business
Nov 25, 2025
A common scheme for player evaluation in sports analytics:
Expected value metrics \(\rightarrow\) compare observed and expected outcome:
Examples:
Problems:
Goals above expectation (GAX):
Over a time frame (e.g. a season), sum the differences between goal outcomes and xG over all shots of player \(i\):
\[ \operatorname{GAX}_i = \sum_{j=1}^{N_i} (Y_j - \hat h(Z_j))\]
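A minimal sketch of this computation in R (the `shots` data frame and its columns `player`, `goal`, `xg` are hypothetical names, not from the talk):

```r
# Sketch: GAX per player as the sum of (goal - xG) over their shots.
# `shots` has one row per shot: player (id), goal (0/1 outcome Y_j),
# xg (fitted h_hat(Z_j) from an xG model).
shots$resid <- shots$goal - shots$xg
gax <- tapply(shots$resid, shots$player, sum)
head(sort(gax, decreasing = TRUE))  # players ranked by GAX
```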
Recent criticism of GAX (Baron et al. 2024; Davis and Robberechts 2024):
Instability and limited replicability (over seasons)
Low (effective) sample size \(\rightarrow\) high uncertainty, lack of uncertainty quantification
Biases arising from data:
How can we identify outstanding shooters?
Setup:
Logistic regression model (Pollard and Reep 1997):
\(Y \mid X,Z \sim \operatorname{Ber}(\pi(X,Z)), \quad \pi(X,Z) = P(Y=1 \mid X,Z)\) and
\[ \begin{aligned} \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + Z^{\top}\gamma. \end{aligned} \]
Goal: Inference on \(\beta\)
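In this parametric setup, estimation and testing are routine; a minimal sketch with hypothetical column names (`goal` for \(Y\), `player_x` for the binary \(X\), `z1`, `z2` for \(Z\)):

```r
# Sketch: Wald and score tests for beta in the logistic regression.
fit0 <- glm(goal ~ z1 + z2, family = binomial(), data = shots)
fit1 <- glm(goal ~ z1 + z2 + player_x, family = binomial(), data = shots)
summary(fit1)                    # Wald test for beta (player_x coefficient)
anova(fit0, fit1, test = "Rao")  # score (Rao) test of H0: beta = 0
```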
Given i.i.d. data \((Y_i,X_i,Z_i)_{i = 1}^N\) from the logistic regression model
Score of \(\beta\) (target of score tests):
\[ \sum_{i = 1}^N\frac{\partial\log L(\beta,\gamma \mid Y_i,X_i,Z_i)}{\partial \beta}\]
Score test on \(\beta\) uses score under \(H_0: \beta = 0\):
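Spelling out the intermediate step for the logistic model above:
\[ \frac{\partial \log L}{\partial \beta} = \sum_{i=1}^N X_i\bigl(Y_i - \pi(X_i,Z_i)\bigr), \]
and under \(H_0 : \beta = 0\) we have \(\pi(X_i,Z_i) = h(Z_i)\), so the score evaluated at the restricted fit is \(\sum_{i=1}^N X_i\bigl(Y_i - \hat h(Z_i)\bigr)\).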
Recall:
\[ \operatorname{GAX}_i = \sum_{j=1}^{N_i} (Y_j - \hat h(Z_j))\]
Since \(X_j\) is binary (the indicator that player \(i\) took shot \(j\)) \(\rightarrow\) the score under \(H_0\) is exactly \(\operatorname{GAX}_i\) for that player
Conclusion: GAX relates to a classical score test in logistic regression model
Problems:
Linear model assumptions unrealistic
Traditional GAX via machine learning model:
How can we identify outstanding shooters?
Problem reformulation (\(Y\), \(X\), and \(Z\) as before): partially linear logistic regression model (PLLM), where
\[ \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + g(Z) \]
Under PLLM: Test for \(Y\) conditionally independent of \(X\) given \(Z\) (\(Y \perp\!\!\!\perp X \mid Z\)) \(\Leftrightarrow\) Test for \(H_0 : \beta = 0\)
Generalised Covariance Measure:
\[ \begin{aligned} \operatorname{GCM} &= \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] \\ &= \mathbb{E}[(Y - \mathbb{E}[Y \mid Z])(X - \mathbb{E}[X \mid Z])] \end{aligned} \]
Basis for GCM test:
\[Y \perp\!\!\!\perp X \mid Z \Rightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
GCM test in practice: obtain empirical version of GCM
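A minimal sketch of the empirical test (the standardised statistic in the style of Shah and Peters; the two logistic fits are placeholders for arbitrary ML regressions, chosen here because both \(Y\) and \(X\) are binary):

```r
# Sketch of the (univariate) GCM test: residualize Y on Z and X on Z,
# then test whether the mean of the residual products is zero; the
# standardized statistic is approximately N(0, 1) under H0.
gcm_test <- function(y, x, z) {
  eps <- y - fitted(glm(y ~ z, family = binomial()))
  xi  <- x - fitted(glm(x ~ z, family = binomial()))
  r   <- eps * xi
  tstat <- sqrt(length(r)) * mean(r) / sd(r)
  c(stat = tstat, pval = 2 * pnorm(abs(tstat), lower.tail = FALSE))
}
```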
Proposition
Consider a PLLM and let \(X \in \mathbb{R}^{d_X}\) with \(\operatorname{Var}(X \mid Z)\) a.s. positive definite. Then \[\beta = 0 \Leftrightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0.\]
Proposition
Consider a PLLM and let \(X \in \mathbb{R}\) with \(\operatorname{Var}(X \mid Z) > 0\) a.s. Then \[\operatorname{sign}(\beta) = \operatorname{sign}(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)]).\]
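A quick simulation check of the sign statement (a sketch with an arbitrary nonlinear \(g\) and made-up parameters, using the true conditional means to isolate the population claim):

```r
# Simulation sketch: in a PLLM the population GCM carries the sign of beta.
set.seed(1)
n <- 1e5
z <- runif(n)
px <- plogis(2 * z - 1); x <- rbinom(n, 1, px)   # X depends on Z
beta <- -0.8; g <- function(z) sin(2 * pi * z)   # negative effect, nonlinear g
y <- rbinom(n, 1, plogis(beta * x + g(z)))
ey <- px * plogis(beta + g(z)) + (1 - px) * plogis(g(z))  # E[Y | Z]
mean((y - ey) * (x - px))  # estimate of E[Cov(Y, X | Z)]: negative, as sign(beta)
```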
Takeaways:
Use rGAX instead of GAX for shooting skill evaluation:
rMetrics: Framework generalizable to any metric of the form
\[\sum_{j=1}^{N} (Y_j - \hat{h}(Z_j))X_j\]
rMetrics (and GCM test results) conveniently obtained via the comets R package (Kook and Lundborg 2024)
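For concreteness, a sketch of the raw computation behind rGAX; the comets package wraps this and the accompanying test, while the fitted-value columns `xg` for \(\hat h(Z_j)\) and `fx` for \(\hat f(Z_j) = \hat{\mathbb{E}}[X_j \mid Z_j]\) are hypothetical names:

```r
# Sketch: GAX vs. rGAX for one player with shot indicator x_i.
#   GAX_i  = sum_j (Y_j - xg_j) * x_ij            (only player i's shots count)
#   rGAX_i = sum_j (Y_j - xg_j) * (x_ij - fx_j)   (all shots contribute)
gax_i  <- sum((shots$goal - shots$xg) * shots$x_i)
rgax_i <- sum((shots$goal - shots$xg) * (shots$x_i - shots$fx))
```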
Freely available event stream data from Hudl-Statsbomb
Shot-specific features \(Z\):
xG model (\(\hat h\)):
Model for \(X\) regression (\(\hat f\)):
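A hedged sketch of how \(\hat h\) and \(\hat f\) might be fitted with random forests (the ranger package and the feature names are assumptions, not the talk's exact specification; cross-fitting is omitted for brevity):

```r
library(ranger)
# h_hat: probability forest for P(goal | shot features).
shots$goal <- factor(shots$goal)
h_fit <- ranger(goal ~ dist_to_goal + angle + body_part,
                data = shots, probability = TRUE)
shots$xg <- predict(h_fit, data = shots)$predictions[, "1"]
# f_hat: analogous regression of the player indicator X on Z.
shots$x_i <- factor(shots$x_i)
f_fit <- ranger(x_i ~ dist_to_goal + angle + body_part,
                data = shots, probability = TRUE)
shots$fx <- predict(f_fit, data = shots)$predictions[, "1"]
```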
Are rGAX really better than GAX?
Recall recent criticism of GAX (Baron et al. 2024; Davis and Robberechts 2024):
Instability and limited replicability (over seasons)
Low (effective) sample size and hence high uncertainty
Biases arising from data
How are GAX and rGAX affected by data selection?
Illustrative Example: Messi data \(\rightarrow\) Hudl-Statsbomb provides event stream data from all Messi matches at FC Barcelona.
Fit 3 different xG models:
Model trained on all data:
Model trained on 2015/16 data of top 5 European leagues:
Model trained on shots from players with fewer than 30 shots observed in the data
Survival setup:
Injuries above expectation (IAX): observed injuries minus a model-based expectation (a hedged sketch follows this list)
Residualized version (rIAX), constructed analogously to rGAX
Injury data from the injurytools package in R
Injury specific features \(Z\):
Models:
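The survival details are left implicit on the slide; one natural reading (an assumption on my part, not the authors' stated construction) computes IAX from martingale residuals of a Cox model, i.e. observed minus model-expected injuries:

```r
library(survival)
# Hedged sketch (assumption): IAX as summed martingale residuals.
# `injuries` is hypothetical: time, injury (0/1), age, workload, player.
fit <- coxph(Surv(time, injury) ~ age + workload, data = injuries)
m   <- residuals(fit, type = "martingale")  # D_i - H_hat(T_i | Z_i)
iax <- tapply(m, injuries$player, sum)      # injuries above expectation
```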
In a logistic regression model: GAX is directly related to a score test on a player's effect on the probability of a goal.
If you don’t believe the GLM setup: GAX using ML models does not allow valid inference! \(\Rightarrow\) Residualize \(X\) as well, i.e. use rGAX.
If you want interpretation: rGAX directly related to parameter in popular semi-parametric model!
Outlook:
General framework usable beyond player evaluation via GAX:
Thank you for your attention!
Earliest version of xG dates back to Pollard and Reep (1997):
Logistic regression model on binary shot outcome
Most important features: shot location and goal angle
Distinction between kicked and headed shots
Modern xG Models (Robberechts and Davis 2020; Anzer and Bauer 2021; Hewitt and Karakuş 2023):
Flexible machine learning methods \(\Rightarrow\) account for non-linearities and interactions:
Extreme gradient boosting machines (XGBoost)
Random forests
Neural networks
Broad set of shot-specific features:
Classical features: Distance to goal, angle, body part
Extended features from event and tracking data: distances to defenders and goalkeeper, shot type and technique, speed of and space for the shooter