Vienna University of Economics and Business
Jun 5, 2025
How can we measure team or player performance?
Problems:
Solution: expected goals (xG) models
How can we identify outstanding shooters?
Treat xG as performance measure for average player
Compare actual outcome of shot to expected outcome
Goals above expectation (GAX): over a time frame (e.g. a season), compute the difference between goals and xG over all shots of player \(i\)
\[ \operatorname{GAX}_i = \sum_{j=1}^{N_i} (Y_j - \hat h(Z_j))\] \(Y_j\) … actual outcome for shot \(j\)
\(\hat h(Z_j)\) … estimator for \(h(Z) = \mathbb{E}[Y|Z]\) (xG for shot \(j\))
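A minimal R sketch of this computation, assuming a shot-level data frame `shots` with hypothetical columns `goal` (the outcome \(Y_j\)) and `xg` (the fitted \(\hat h(Z_j)\)):

```r
# GAX for one player: sum of (actual outcome - xG) over their shots.
# `shots` and its column names are placeholders, not from the original.
gax <- function(shots) sum(shots$goal - shots$xg)

# Toy example: three shots, one goal, total xG 0.57
shots <- data.frame(goal = c(1, 0, 0), xg = c(0.12, 0.40, 0.05))
gax(shots)  # 0.43 goals above expectation
```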
GAX not optimal for evaluating shooting skills:
Low stability (Baron et al. 2024)
High variance and no (direct) uncertainty quantification (Davis and Robberechts 2024)
Biases in traditional xG models due to overrepresented players and team strengths (Davis and Robberechts 2024)
How can we identify outstanding shooters?
Logistic regression model:
\(Y \mid X,Z \sim \operatorname{Ber}(\pi(X,Z)), \quad \pi(X,Z) = P(Y=1 \mid X,Z)\) and
\[ \begin{aligned} \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + Z^{\top}\gamma. \end{aligned} \]
Goal: Inference on \(\beta\).
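A minimal sketch of this model in R, where `player` is a hypothetical 0/1 indicator playing the role of \(X\) and the remaining columns form \(Z\); all data are simulated:

```r
# Logistic regression with a player effect beta on the log-odds scale.
set.seed(42)
shots <- data.frame(
  distance = runif(200, 5, 30),
  angle    = runif(200, 0.1, 1.5),
  player   = rbinom(200, 1, 0.2)
)
shots$goal <- rbinom(200, 1, plogis(1 - 0.15 * shots$distance + shots$angle))

fit <- glm(goal ~ player + distance + angle, data = shots, family = binomial)
summary(fit)$coefficients["player", ]  # inference on beta (true beta = 0 here)
```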
Given i.i.d. data \((Y_i,X_i,Z_i)_{i = 1}^N\) from the logistic regression model, the score for \(\beta\) is
\[\sum_{i = 1}^N\frac{\partial\log L(\beta,\gamma \mid Y_i,X_i,Z_i)}{\partial \beta}\]
Score test on \(\beta\) uses score under \(H_0: \beta = 0\):
\[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]
Since \(X_j\) is binary (the indicator that shot \(j\) was taken by the player of interest), the score evaluated under \(H_0\) is exactly that player's GAX
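Making this explicit (a standard logistic-regression calculation): the per-shot log-likelihood and its derivative in \(\beta\) are
\[ \log L(\beta,\gamma \mid Y_j,X_j,Z_j) = Y_j \log \pi(X_j,Z_j) + (1-Y_j)\log\bigl(1-\pi(X_j,Z_j)\bigr), \qquad \frac{\partial \log L}{\partial \beta} = \bigl(Y_j - \pi(X_j,Z_j)\bigr)X_j. \]
Under \(H_0: \beta = 0\), \(\pi(X_j,Z_j) = \operatorname{expit}(Z_j^{\top}\gamma) = h(Z_j)\), so summing over shots and plugging in \(\hat h\) yields the GAX expression above.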
Conclusion: GAX relates to a classical score test in logistic regression model
Problem: traditional GAX plugs a machine-learning xG model into \(\hat h\) rather than the parametric logistic fit, so the score-test justification no longer applies.
How can we identify outstanding shooters?
Problem reformulation (\(Y\), \(X\), and \(Z\) as before): partially linear logistic regression model (PLLM), where
\[ \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + g(Z) \]
Under PLLM: Test for \(Y\) conditionally independent of \(X\) given \(Z\) (\(Y \perp\!\!\!\perp X \mid Z\)) \(\Leftrightarrow\) Test for \(H_0 : \beta = 0\)
Generalised Covariance Measure:
\[ \operatorname{GCM} = \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] =\mathbb{E}[(Y - \mathbb{E}[Y | Z])(X - \mathbb{E}[X | Z])]\]
Basis for GCM test:
\[Y \perp\!\!\!\perp X \mid Z \Rightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
GCM test in practice: estimate \(\mathbb{E}[Y \mid Z]\) and \(\mathbb{E}[X \mid Z]\) with (cross-fitted) machine learning, form the per-shot residual products, and normalise their sum; the resulting statistic is asymptotically standard normal under \(H_0\).
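A self-contained R sketch of this recipe, using the normalised residual-product statistic of Shah and Peters (2020); plain logistic fits stand in for the ML regressions (any learner could be swapped in), and the data and column names are made up:

```r
# GCM test: normalised mean of residual products, approx. N(0,1) under H0.
# glm() fits stand in for arbitrary ML estimates of E[Y|Z] and E[X|Z].
gcm_test <- function(y, x, Z) {
  h_hat <- fitted(glm(y ~ ., data = Z, family = binomial))  # E[Y | Z]
  f_hat <- fitted(glm(x ~ ., data = Z, family = binomial))  # E[X | Z]
  R <- (y - h_hat) * (x - f_hat)                # residual products
  stat <- sqrt(length(R)) * mean(R) / sd(R)     # normalised GCM statistic
  c(statistic = stat, p.value = 2 * pnorm(-abs(stat)))
}

# Simulated shots where the player has no extra effect (beta = 0)
set.seed(1)
Z <- data.frame(dist = runif(500, 5, 30))
x <- rbinom(500, 1, 0.3)                     # player indicator
y <- rbinom(500, 1, plogis(-0.15 * Z$dist))  # goal probability via Z only
gcm_test(y, x, Z)
```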
Takeaways:
Proposition
Consider a PLLM and let \(X\) be a binary variable with \(P(X = 1 \mid Z) > 0\). Then \[\beta = 0 \Leftrightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
Proposition
Consider a PLLM and let \(X\) be a binary variable with \(P(X = 1 | Z) > 0\). Then \[\operatorname{sign}(\beta) = \operatorname{sign}(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)])\]
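A sketch of why both propositions hold (assuming additionally \(P(X = 1 \mid Z) < 1\)): for binary \(X\), the PLLM gives the conditional covariance in closed form,
\[ \operatorname{Cov}(Y,X \mid Z) = P(X=1 \mid Z)\bigl(1 - P(X=1 \mid Z)\bigr)\bigl\{\operatorname{expit}(g(Z)+\beta) - \operatorname{expit}(g(Z))\bigr\}, \]
and since \(\operatorname{expit}\) is strictly increasing, the braced difference, and hence \(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)]\), shares the sign of \(\beta\) and vanishes exactly when \(\beta = 0\).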
GAX in parametric model:
\[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]
GAX via machine learning:
\[\sum_{j=1}^{N} (Y_j - \hat{h}(Z_j))X_j\]
RGAX: Use sample GCM as score
\[\sum_{j=1}^{N} (Y_j-\hat h(Z_j))(X_j - \hat f(Z_j))\]
\(\hat f(Z_j)\) … estimator for \(f(Z) = \mathbb{E}[X \mid Z]\)
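Continuing the earlier GAX sketch, RGAX needs only one extra ingredient, an estimate \(\hat f(Z_j)\) of \(\mathbb{E}[X \mid Z]\); the column names below are again placeholders:

```r
# RGAX for one player: residualise the player indicator as well.
# `shots` holds goal (0/1), player (0/1 indicator for the player of
# interest), xg = h-hat(Z), and xhat = f-hat(Z), an ML estimate of
# P(shot is taken by this player | Z); the sum runs over all shots.
rgax <- function(shots) {
  with(shots, sum((goal - xg) * (player - xhat)))
}
```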
Freely available event-stream data from StatsBomb.
Shot-specific features \(Z\):
xG model:
RGAX and GCM test results conveniently obtained via comets R package (Kook 2025)
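A hedged usage sketch, assuming a `gcm()` interface exported by comets (the exact arguments should be checked via `?comets::gcm`); `y`, `x`, and `Z` are as in the simulated example above:

```r
# install.packages("comets")
library(comets)
gcm(Y = y, X = x, Z = as.matrix(Z))  # assumed interface; see package docs
```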
In a logistic regression model: GAX is directly related to a score test on a player's effect on the probability of a goal.
If you don’t believe the GLM setup: GAX using ML models does not allow valid inference! \(\Rightarrow\) Residualize \(X\) as well, i.e. use RGAX.
If you want interpretation: RGAX is directly related to the parameter in a popular semi-parametric model!
Outlook:
General framework usable beyond player evaluation via GAX:
Thank you for your attention!
Earliest version of xG dates back to Pollard and Reep (1997):
Logistic regression model on binary shot outcome
Most important features: shot location and goal angle
Distinction between kicked and headed shots
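A sketch of such a model in R, with hypothetical column names for the features just listed and simulated data for self-containment:

```r
# Pollard-and-Reep-style xG: logistic regression of the binary shot outcome
# on distance, goal angle and a header indicator (all names hypothetical).
set.seed(7)
shots <- data.frame(
  distance = runif(300, 5, 30),
  angle    = runif(300, 0.1, 1.5),
  header   = rbinom(300, 1, 0.2)
)
shots$goal <- rbinom(300, 1,
                     plogis(0.5 - 0.12 * shots$distance +
                            0.8 * shots$angle - 0.7 * shots$header))

xg_fit <- glm(goal ~ distance + angle + header,
              data = shots, family = binomial)
shots$xg <- fitted(xg_fit)  # per-shot xG = predicted goal probability
```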
Modern xG Models (Robberechts and Davis 2020; Anzer and Bauer 2021; Hewitt and Karakuş 2023):
Flexible machine learning methods \(\Rightarrow\) account for non-linearities and interactions:
Extreme gradient boosting machines (XGBoost)
Random forests
Neural networks
Broad set of shot-specific features:
Classical features: distance to goal, angle, body part
Extended features from event and tracking data: distances to defenders and goalkeeper, shot type and technique, speed of and space for the shooter
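As a sketch of the modern variant, a gradient-boosted xG model via the xgboost R package; the feature names are placeholders standing in for the extended feature set above, with simulated data:

```r
library(xgboost)
# Placeholder numeric feature matrix for shot-level features.
set.seed(7)
n <- 500
X <- cbind(distance = runif(n, 5, 30), angle = runif(n, 0.1, 1.5),
           def_dist = runif(n, 0, 5),  gk_dist  = runif(n, 1, 15))
goal <- rbinom(n, 1, plogis(0.5 - 0.12 * X[, "distance"] + X[, "angle"]))

# Boosted trees capture non-linearities and interactions automatically.
bst <- xgboost(data = X, label = goal, nrounds = 50, max_depth = 3,
               objective = "binary:logistic", verbose = 0)
xg <- predict(bst, X)  # boosted xG predictions
```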