Vienna University of Economics and Business
Sep 27, 2025
How do we commonly evaluate players in sports?
Sports analytics:
Expected value metrics \(\rightarrow\) compare observed and expected outcome:
Examples:
Goals above expectation (GAX):
over a time frame (e.g. a season), sum the differences between goals and xG over all shots of player \(i\)
\[ \operatorname{GAX}_i = \sum_{j=1}^{N_i} (Y_j - \hat h(Z_j))\]
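As a minimal sketch (illustrative only, not the talk's code), GAX for a single player reduces to one sum over that player's shots:

```python
def gax(goals, xg):
    """GAX for one player: sum of (observed goal - xG) over their shots.

    `goals` holds 0/1 outcomes, `xg` the xG predictions h(Z_j).
    """
    return sum(y - h for y, h in zip(goals, xg))

# Hypothetical player with 4 shots: one goal from a low-xG chance, three misses.
print(gax([1, 0, 0, 0], [0.08, 0.35, 0.12, 0.05]))  # ~0.40, slightly above expectation
```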
Recent criticism of GAX (Baron et al. 2024; Davis and Robberechts 2024):
Instability and limited replicability (over seasons)
Low (effective) sample size \(\rightarrow\) high uncertainty, lack of uncertainty quantification
Biases arising from data:
\(\Rightarrow\) GAX has been labeled a poor measure for evaluating shooting skills
Is GAX really a poor metric for player evaluation?
How can we identify outstanding shooters?
Setup:
Logistic regression model:
\(Y \mid X,Z \sim \operatorname{Ber}(\pi(X,Z)), \quad \pi(X,Z) = P(Y=1 \mid X,Z)\) and
\[ \begin{aligned} \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + Z^{\top}\gamma. \end{aligned} \]
Goal: Inference on \(\beta\)
Given i.i.d. data \((Y_i,X_i,Z_i)_{i = 1}^N\) from the logistic regression model
Score of \(\beta\) (target of score tests):
\[ \sum_{i = 1}^N\frac{\partial\log L(\beta,\gamma \mid Y_i,X_i,Z_i)}{\partial \beta}\]
Score test on \(\beta\) uses score under \(H_0: \beta = 0\):
Recall:
\[ \operatorname{GAX}_i = \sum_{j=1}^{N_i} (Y_j - \hat h(Z_j))\]
Since \(X_j\) is binary, the score under \(H_0\) is exactly the GAX of a player
Conclusion: GAX corresponds to a classical score test in the logistic regression model
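The step from the score to GAX can be made explicit (standard logistic-regression algebra): the per-observation score with respect to \(\beta\) is the response residual times \(X\), and under \(H_0:\beta = 0\) the fitted probability no longer depends on \(X\):

```latex
% Per-observation score w.r.t. beta in the logistic model
\[
\frac{\partial \log L(\beta,\gamma \mid Y_i,X_i,Z_i)}{\partial \beta}
  = \bigl(Y_i - \pi(X_i,Z_i)\bigr)\, X_i .
\]
% Under H_0: beta = 0, pi depends on Z only and is estimated by \hat h(Z), so
\[
\sum_{i=1}^{N} \bigl(Y_i - \hat h(Z_i)\bigr)\, X_i
  \;=\; \sum_{j \,:\, X_j = 1} \bigl(Y_j - \hat h(Z_j)\bigr),
\]
% which, with X the indicator of a given player, is that player's GAX.
```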
Problems:
Linear model assumptions unrealistic
Traditional GAX via a machine learning model:
Partially linear logistic regression model (PLLM)
\[ \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + g(Z) \]
Goal: Inference on \(\beta\) (testing \(H_0 : \beta = 0\))
GCM test uses empirical GCM:
GAX in parametric model:
\[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]
GAX via machine learning:
\[\sum_{j=1}^{N} (Y_j - \hat{h}(Z_j))X_j\]
rGAX:
\[\sum_{j=1}^{N} (Y_j-\hat h(Z_j))(X_j - \hat f(Z_j))\]
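Given fitted predictions \(\hat h(Z_j)\) and \(\hat f(Z_j)\) (here simulated; all names are illustrative), the statistics differ only in whether the player indicator \(X\) is residualized:

```python
import numpy as np

def ml_gax(y, h_hat, x):
    """GAX via machine learning: xG residuals summed over the player's shots."""
    return float(np.sum((y - h_hat) * x))

def rgax(y, h_hat, x, f_hat):
    """rGAX: additionally residualize the player indicator X on Z."""
    return float(np.sum((y - h_hat) * (x - f_hat)))

rng = np.random.default_rng(1)
n = 500
x = rng.integers(0, 2, n).astype(float)   # 1 = shot taken by the player of interest
h_hat = rng.uniform(0.05, 0.40, n)        # assumed xG predictions h(Z)
f_hat = np.full(n, x.mean())              # crude f(Z): overall shot share (illustration only)
y = rng.binomial(1, h_hat).astype(float)  # outcomes generated from the xG model

print(ml_gax(y, h_hat, x), rgax(y, h_hat, x, f_hat))
```

With a constant \(\hat f\), rGAX is just GAX shifted by \(\bar f \sum_j (Y_j - \hat h(Z_j))\); a genuine regression of \(X\) on \(Z\) is what removes the selection effects discussed above.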
Use rGAX instead of GAX for shooting skill evaluation:
rMetrics: Framework generalizable to any metric of the form
\[\sum_{j=1}^{N} (Y_j - \hat{h}(Z_j))X_j\]
rMetrics (and GCM test results) conveniently obtained via the comets R package (Kook 2025)
Freely available event stream data from Hudl-Statsbomb
Shot specific features \(Z\):
xG model (\(\hat h\)):
Model for \(X\) regression (\(\hat f\)):
Is rGAX really better than GAX?
Recall recent criticism of GAX (Baron et al. 2024; Davis and Robberechts 2024):
Instability and limited replicability (over seasons)
Low (effective) sample size and hence high uncertainty
Biases arising from data
How are GAX and rGAX affected by data selection?
Illustrative Example: Messi data \(\rightarrow\) Hudl-Statsbomb provides event stream data from all Messi matches at FC Barcelona.
Fit 3 different xG models:
Model trained on all data:
Model trained on 2015/16 data of top 5 European leagues:
Model trained on shots from players with fewer than 30 shots observed in the data
Quantified Shooter impact (qSI):
\[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]
Use residualized version (rqSI):
\[\sum_{j=1}^{N} (Y_j-\hat h(Z_j))(X_j - \hat f(Z_j))\]
Example:
What about the distinction between 2- and 3-pointers?
Approach 1:
Approach 2:
Completion (percentage) over expectation (CPOE):
\[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]
Use residualized version (rCPOE):
\[\sum_{j=1}^{N} (Y_j-\hat h(Z_j))(X_j - \hat f(Z_j))\]
Example:
Expected value metrics: compare an actual to an expected outcome
rMetrics: generalization using residualized scores
Advanced player evaluation: Added value by action
Examples:
Expected points added:
Instead of using the same EP model for \(\hat{Y}_j\) and \(\widehat{EP}\), fit a model to \(\hat Y\) \(\rightarrow\) residualized EPA (rEPA):
Thank you for your attention!
(Scan QR code for slides, preprint, and more information!)
Generalised Covariance Measure:
\[ \operatorname{GCM} = \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] =\mathbb{E}[(Y - \mathbb{E}[Y | Z])(X - \mathbb{E}[X | Z])]\]
Basis for GCM test:
\[Y \perp\!\!\!\perp X \mid Z \Rightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
GCM test in practice:
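In practice the talk obtains the test via the comets R package; as an illustrative re-implementation of the standard GCM statistic, one normalizes the sum of product residuals so it is asymptotically standard normal under \(H_0\):

```python
import math
import numpy as np

def gcm_test(y, x, h_hat, f_hat):
    """GCM test sketch: product residuals R_i = (Y_i - h(Z_i)) (X_i - f(Z_i)),
    statistic sqrt(N) * mean(R) / sd(R), two-sided normal p-value."""
    r = (np.asarray(y) - h_hat) * (np.asarray(x) - f_hat)
    stat = math.sqrt(r.size) * r.mean() / r.std()
    # standard normal survival via the error function
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(stat) / math.sqrt(2))))
    return stat, pval
```

A large \(|{\rm stat}|\) (small p-value) indicates \(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] \neq 0\), i.e. evidence against \(Y \perp\!\!\!\perp X \mid Z\).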
Takeaways:
Proposition
Consider a PLLM and let \(X\) be a binary variable with \(P(X = 1 | Z) > 0\). Then \[\beta = 0 \Leftrightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
Proposition
Consider a PLLM and let \(X\) be a binary variable with \(P(X = 1 | Z) > 0\). Then \[\operatorname{sign}(\beta) = \operatorname{sign}(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)])\]
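A one-line computation makes both propositions plausible: for binary \(X\) with \(p(Z) = P(X = 1 \mid Z)\) and PLLM success probabilities \(\pi_x(Z) = \operatorname{expit}(x\beta + g(Z))\),

```latex
\[
\operatorname{Cov}(Y, X \mid Z)
  = p(Z)\bigl(1 - p(Z)\bigr)\bigl(\pi_1(Z) - \pi_0(Z)\bigr).
\]
% expit is strictly increasing, so sign(pi_1 - pi_0) = sign(beta) pointwise;
% the weights p(Z)(1 - p(Z)) are positive whenever 0 < p(Z) < 1, hence
\[
\operatorname{sign}\bigl(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)]\bigr)
  = \operatorname{sign}(\beta).
\]
```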
Takeaways:
Survival setup:
Injuries above expectation (IAX):
Residualized version (rIAX):