Expected Goals (xG) and Goals Above Expectation (GAX)

Motivation

How can we measure team or player performance?

  • Classical approaches: Count-based statistics such as goals, assists, shots, etc.

  • Problems:

    • Goals are rare (only about 10% of shots result in a goal)
    • Chances/shots are not created equal
    • Assessing shots by binary outcome (goal/no goal) is not adequate (loss of information)!
  • Solution: expected goals (xG) models

    • Use statistical models to assign a probability of scoring to each shot
    • Take into account shot-specific features
    • Evaluate players and teams based on aggregation of xG values

The essentials of xG models

  • Earliest version of xG dates back to Pollard and Reep (1997):

    • Logistic regression model on the binary shot outcome (see the sketch below)

    • Most important features: shot location and goal angle

    • Distinction between kicked and headed shots
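
To make this concrete, here is a minimal sketch of such a model in R. The data frame `shots` and its columns (`goal`, `dist_to_goal`, `goal_angle`, `header`) are hypothetical placeholders, not the variables of the original study.

```r
# Minimal sketch of a Pollard & Reep (1997)-style xG model.
# `shots` and its columns are hypothetical placeholders:
#   goal (0/1), dist_to_goal, goal_angle, header (0/1).
xg_glm <- glm(goal ~ dist_to_goal + goal_angle + header,
              family = binomial(link = "logit"), data = shots)

shots$xg <- predict(xg_glm, type = "response")  # per-shot scoring probability (xG)
```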

The essentials of xG models

  • Modern xG Models (Robberechts and Davis 2020; Anzer and Bauer 2021; Hewitt and Karakuş 2023):

    • Flexible machine learning methods \(\Rightarrow\) account for non-linearities and interactions:

      • Extreme gradient boosting machines (XGBoost)

      • Random forests

      • Neural networks

    • Broad set of shot-specific features:

      • Classical features: Distance to goal, angle, body part

      • Extended features from event and tracking data: distances to defenders and goalkeeper, shot type and technique, speed of and space for shooter

    • Trained on large amounts of data and properly tuned (see the sketch below)
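
As an illustration (not the exact pipeline of the cited models), a boosted xG model could be fit with the xgboost R package; the extended feature names are again hypothetical placeholders.

```r
# Minimal sketch of an ML-based xG model with XGBoost.
# Extended feature names below are hypothetical placeholders.
library(xgboost)

features <- c("dist_to_goal", "goal_angle", "header",
              "dist_nearest_defender", "dist_goalkeeper", "shot_speed")
X_feat <- model.matrix(~ . - 1, data = shots[, features])  # numeric design matrix

xg_xgb <- xgboost(data = X_feat, label = shots$goal,
                  objective = "binary:logistic",
                  nrounds = 500, max_depth = 4, eta = 0.05, verbose = 0)

shots$xg_ml <- predict(xg_xgb, newdata = X_feat)  # per-shot xG from the boosted model
```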

Shot-specific features

Advanced xG models

Team evaluation via xG

xG for player evaluation

How can we identify outstanding shooters?

  • Treat xG as performance measure for average player

  • Compare actual outcome of shot to expected outcome

  • Goals above expectation (GAX): over a time frame (e.g. a season), compute the difference between goals and xG over all shots of player \(i\) (see the aggregation sketch below)

\[ \operatorname{GAX}_i = \sum_{j=1}^{N_i} (Y_j - \hat h(Z_j))\]

\(Y_j\) … actual outcome for shot \(j\)

\(\hat h(Z_j)\) … estimator for \(h(Z) = \mathbb{E}[Y|Z]\) (xG for shot \(j\))
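
A minimal sketch of the GAX aggregation in R, assuming the hypothetical `shots` data frame from above with an additional `player` column and the fitted xG values stored in `xg`.

```r
# Minimal sketch of GAX per player, assuming a hypothetical `player` column
# and the per-shot xG values `xg` from the GLM sketch above.
shots$resid <- shots$goal - shots$xg        # Y_j - h_hat(Z_j) for every shot
gax <- aggregate(resid ~ player, data = shots, FUN = sum)
names(gax)[2] <- "GAX"
head(gax[order(-gax$GAX), ], 10)            # top-10 players by GAX
```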

GAX criticism

GAX is not optimal for evaluating shooting skills.

A semiparametric approach toward GAX

A parametric model

How can we identify outstanding shooters?

  • Logistic regression model:

    \(Y \mid X,Z \sim \operatorname{Ber}(\pi(X,Z)), \quad \pi(X,Z) = P(Y=1 \mid X,Z)\) and

    \[ \begin{aligned} \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X\beta + Z^{\top}\gamma. \end{aligned} \]

    • \(Y\) … binary outcome of a shot (goal/no goal)
    • \(X\) … binary indicator of a player’s involvement (shooter/not shooter)
    • \(Z\) … shot-specific variables
  • Goal: Inference on \(\beta\).

    • Wald test, LR test, score test (see the Wald-test sketch below)
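
A minimal sketch of this parametric approach in R for a single player (chosen purely for illustration), reusing the hypothetical `shots` data frame.

```r
# Minimal sketch of the parametric model for one illustrative player.
shots$X <- as.integer(shots$player == "Harry Kane")  # shooter indicator (illustrative choice)

fit <- glm(goal ~ X + dist_to_goal + goal_angle + header,
           family = binomial, data = shots)

summary(fit)$coefficients["X", ]  # Wald test for H0: beta = 0
```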

Score test and GAX

    Given i.i.d. data \((Y_i,X_i,Z_i)_{i = 1}^N\) from the logistic regression model:

    • Score (target of score tests):

    \[\sum_{i = 1}^N\frac{\partial\log L(\beta,\gamma \mid Y_i,X_i,Z_i)}{\partial \beta}\]

    • Score test on \(\beta\) uses score under \(H_0: \beta = 0\):

      \[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]

      • \(\hat h(Z_j) = \operatorname{expit}(Z_j^{\top}\hat \gamma)\), where \(\hat \gamma\) is the MLE of \(\gamma\) under \(H_0\).
    • Since \(X_j\) is binary \(\Rightarrow\) the score is exactly the GAX of a player (numerical check in the sketch below)

      • A high GAX (in absolute terms), correctly standardized, indicates a significant player effect
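
The equivalence can be verified numerically; the following sketch reuses the hypothetical `shots` data and the shooter indicator `X` from the previous sketch.

```r
# Numerical check: the score under H0 coincides with the player's GAX.
fit0  <- glm(goal ~ dist_to_goal + goal_angle + header,
             family = binomial, data = shots)       # null model (no player indicator)
h_hat <- fitted(fit0)                               # expit(Z' gamma_hat), i.e. xG per shot

score_beta <- sum((shots$goal - h_hat) * shots$X)   # score evaluated at beta = 0
gax_player <- sum((shots$goal - h_hat)[shots$X == 1])
all.equal(score_beta, gax_player)                   # TRUE: identical because X is binary
```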

GAX for player evaluation

  • Conclusion: GAX relates to a classical score test in a logistic regression model

    • Uncertainty quantification (and significance testing) via score test
    • Interpretation of player quality as effect on log-odds (probability) of scoring
  • Problems:

    • Linear model assumptions unrealistic
    • Biases arising from taking into account only shot-specific variables
    • High dimensionality when accounting for team, goalkeeper or position effects
  • Traditional GAX via machine learning model:

    • \(\sum_{j=1}^{N} (Y_j - \hat{h}(Z_j))X_j, \quad \hat h\) estimated via arbitrary ML algorithm
    • No (valid) uncertainty quantification

A semiparametric reformulation

How can we identify outstanding shooters?

  • Problem reformulation (\(Y\), \(X\), and \(Z\) as before): partially linear logistic regression model (PLLM)

    \[ Y := \mathbb{I}\left(X\beta + g(Z) > \varepsilon\right) \quad \mbox{and}\quad X := f(Z) + \eta, \]

    • \(\varepsilon \sim \operatorname{Logistic}(0,1)\)
    • \(\mathbb{E}[\varepsilon | X, Z] = 0\), \(\mathbb{E}[\eta | Z] = 0\)
    • \(g\), \(f\) arbitrary measurable functions
  • Under PLLM: Test for \(Y\) conditionally independent of \(X\) given \(Z\) (\(Y \perp\!\!\!\perp X \mid Z\)) \(\Leftrightarrow\) Test for \(H_0 : \beta = 0\)

    • Under \(H_0\): No modelling assumptions on the relationship between \(Y\) and \(Z\) or between \(X\) and \(Z\)

GCM test

  • Generalised Covariance Measure:

\[ \operatorname{GCM} = \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] =\mathbb{E}[(Y - \mathbb{E}[Y | Z])(X - \mathbb{E}[X | Z])]\]

  • Basis for GCM test:

\[Y \perp\!\!\!\perp X \mid Z \Rightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]

GCM test

  • Test for GCM = 0: given i.i.d. data, use arbitrary machine learning algorithms and

    • Regress \(Y\) on \(Z\) and obtain estimate \(\hat h\) for \(\mathbb{E}[Y | Z]\)
    • Regress \(X\) on \(Z\) and obtain estimate \(\hat f\) for \(\mathbb{E}[X | Z]\)
  • Sample version of GCM:

    \[\operatorname{\widehat{GCM}} : = \sum_{i = 1}^N (Y_i-\hat h(Z_i))(X_i - \hat f(Z_i))\]

  • Under mild rate conditions (similar to DML; Chernozhukov et al. 2018):

    \[\frac{1}{\sqrt{N}} \operatorname{\widehat{GCM}} \leadsto \mathcal{N}(0,\sigma^2) \]

    • \(\sigma^2\) is consistently estimated by the empirical variance of the summands \((Y_i-\hat h(Z_i))(X_i - \hat f(Z_i))\) (see the sketch below).
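
A minimal sketch of the GCM test in R, using random forests from the ranger package as the two ML regressions (any learners meeting the rate conditions would do). For brevity it fits and evaluates on the same data; in practice one would use sample splitting / cross-fitting.

```r
# Minimal sketch of the GCM test with random forests as the two regressions.
# For clarity, fitting and evaluating on the same data; in practice use
# sample splitting / cross-fitting as in DML.
library(ranger)

rf_Y <- ranger(goal ~ dist_to_goal + goal_angle + header, data = shots)
rf_X <- ranger(X    ~ dist_to_goal + goal_angle + header, data = shots)

res_Y <- shots$goal - predict(rf_Y, data = shots)$predictions  # Y - E_hat[Y | Z]
res_X <- shots$X    - predict(rf_X, data = shots)$predictions  # X - E_hat[X | Z]

R      <- res_Y * res_X                            # summands of the sample GCM
T_stat <- sqrt(length(R)) * mean(R) / sd(R)        # standardized GCM statistic
p_two  <- 2 * pnorm(-abs(T_stat))                  # two-sided p-value
```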

GCM test and PLLM

    • Interested in testing \(H_0 : \beta = 0\)

    Proposition

    Consider a PLLM and let \(X\) be a binary variable with \(P(X = 1 | Z) > 0\). Then \[\beta = 0 \Leftrightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]

    Proposition

    Consider a PLLM and let \(X\) be a binary variable with \(P(X = 1 | Z) > 0\). Then \[\operatorname{sign}(\beta) = \operatorname{sign}(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)])\]

    • Takeaways:

      • GCM test allows directional testing: alternatives of the form \(H_1 : \beta > 0\) (see the one-sided sketch below)
      • Interpretation: identify players with significant (positive) impact on probability of success from shot
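
With the standardized statistic from the GCM sketch above, the directional test is a one-sided normal comparison.

```r
# Directional GCM test for H1: beta > 0, reusing T_stat from the sketch above.
p_one_sided <- pnorm(T_stat, lower.tail = FALSE)
p_one_sided  # small values indicate a significantly positive player effect
```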

GAX, scores and GCM

  • GAX in parametric model:

    \[\sum_{j=1}^{N} (Y_j - \hat h(Z_j))X_j\]

    • Valid inference only if the linear model assumptions hold
  • GAX via machine learning:

    \[\sum_{j=1}^{N} (Y_j - \hat{h}(Z_j))X_j\]

    • \(\hat h\) learned via arbitrary ML algorithm
    • No valid inference
  • RGAX: Use the sample GCM as score

    \[\sum_{j=1}^{N} (Y_j-\hat h(Z_j))(X_j - \hat f(Z_j))\]

    • Doubly robust score (if the rate conditions are fulfilled) \(\Rightarrow\) valid inference
    • Interpretation of RGAX: the additional regression accounts for whether a player would take the shot under the circumstances described by \(Z_j\) (see the sketch below)
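
A minimal (and computationally naive) sketch of RGAX for all players, reusing the Y-residuals `res_Y` and the hypothetical columns from the GCM sketch; in practice one would estimate the propensities more efficiently.

```r
# Naive sketch of RGAX for every player: residualize the shooter indicator on Z
# for each player and combine with the Y-residuals (res_Y) from the GCM sketch.
players <- unique(shots$player)
rgax <- sapply(players, function(p) {
  dat <- shots
  dat$ind <- as.integer(dat$player == p)                      # shooter indicator for player p
  rf_ind <- ranger(ind ~ dist_to_goal + goal_angle + header, data = dat)
  sum(res_Y * (dat$ind - predict(rf_ind, data = dat)$predictions))
})
head(sort(rgax, decreasing = TRUE), 10)                       # top-10 players by RGAX
```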

Application

Quick data overview

  • Freely available event stream data from StatsBomb:

    • 2015/16 season of the big 5 European leagues
    • 45,198 shots (4,308 goals)
    • 1,047 relevant players
  • Shot-specific features \(Z\):

    • Preprocessed features (22 variables)
    • Goalkeeper information (which goalkeeper is in goal)
    • Team information
  • xG model:

    • Simple GLM
    • XGBoost model (properly tuned; see the tuning sketch below)
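
As an illustration of the tuning step (not the exact procedure used for these results), the number of boosting rounds can be chosen by cross-validation with xgb.cv; `X_feat` refers to the hypothetical design matrix from the earlier XGBoost sketch.

```r
# Illustrative tuning sketch: choose the number of boosting rounds by 5-fold CV.
# Other hyperparameters (max_depth, eta, ...) would be searched over a grid analogously.
library(xgboost)

dtrain <- xgb.DMatrix(data = X_feat, label = shots$goal)   # X_feat as in the earlier sketch
cv <- xgb.cv(params = list(objective = "binary:logistic", eval_metric = "logloss",
                           max_depth = 4, eta = 0.05),
             data = dtrain, nrounds = 1000, nfold = 5,
             early_stopping_rounds = 25, verbose = 0)
cv$best_iteration   # selected number of boosting rounds
```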

xG and GAX results

| player | xG (GLM) | xG (xgb) | GAX (GLM) | GAX (xgb) | goals | n (shots) |
|---|---|---|---|---|---|---|
| Cristiano Ronaldo dos Santos Aveiro | 33.45 | 31.69 | -4.45 | -2.69 | 29 | 224 |
| Luis Alberto Suárez Díaz | 29.93 | 29.76 | 7.07 | 7.24 | 37 | 134 |
| Gonzalo Gerardo Higuaín | 26.07 | 24.44 | 5.93 | 7.56 | 32 | 176 |
| Zlatan Ibrahimović | 24.55 | 24.50 | 6.45 | 6.50 | 31 | 142 |
| Robert Lewandowski | 24.46 | 24.39 | 3.54 | 3.61 | 28 | 147 |
| Pierre-Emerick Aubameyang | 23.45 | 24.70 | -1.45 | -2.70 | 22 | 112 |
| Karim Benzema | 23.38 | 23.22 | 0.62 | 0.78 | 24 | 100 |
| Neymar da Silva Santos Junior | 22.30 | 20.46 | -3.30 | -1.46 | 19 | 116 |
| Edinson Roberto Cavani Gómez | 22.18 | 20.00 | -3.18 | -1.00 | 19 | 85 |
| Lionel Andrés Messi Cuccittini | 21.09 | 19.39 | 1.91 | 3.61 | 23 | 151 |
| Harry Kane | 19.81 | 17.80 | 0.19 | 2.20 | 20 | 153 |
| Alexandre Lacazette | 18.93 | 18.89 | 0.07 | 0.11 | 19 | 92 |
| Aritz Aduriz Zubeldia | 17.88 | 16.10 | -0.88 | 0.90 | 17 | 91 |
| Romelu Lukaku Menama | 17.68 | 17.23 | -0.68 | -0.23 | 17 | 114 |
| Michy Batshuayi Tunga | 16.93 | 18.16 | -1.93 | -3.16 | 15 | 121 |

RGAX vs GAX

RGAX and GCM test results are conveniently obtained via the comets R package (Kook 2025); a minimal call sketch follows.
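
The `gcm()` interface shown here (response, variable of interest, covariates) is an assumption about the package's API and should be checked against the comets documentation.

```r
# Minimal sketch, assuming the gcm() interface of the comets package
# (response, variable of interest, covariates); see the comets documentation
# for the exact arguments and the available regression methods.
library(comets)

Y <- shots$goal
X <- as.integer(shots$player == "Harry Kane")              # illustrative player
Z <- as.matrix(shots[, c("dist_to_goal", "goal_angle", "header")])

gcm_test <- gcm(Y, X, Z)   # GCM test of Y independent of X given Z
gcm_test                   # prints test statistic and p-value
```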

Rankings GAX vs GCM GAX

| player | GAX (xgb) | RGAX | Rank GAX (xgb) | Rank RGAX | Rank difference |
|---|---|---|---|---|---|
| Cristiano Ronaldo dos Santos Aveiro | -2.69 | 1.13 | 1,015 | 185 | 830 |
| Neymar da Silva Santos Junior | -1.46 | 0.73 | 949 | 265 | 684 |
| Edinson Roberto Cavani Gómez | -1.00 | 0.50 | 885 | 330 | 555 |
| Gnégnéri Yaya Touré | -0.06 | 1.23 | 635 | 170 | 465 |
| Jonathan Walters | -0.58 | 0.34 | 794 | 369 | 425 |
| Francisco Alcácer García | -0.26 | 0.36 | 697 | 361 | 336 |
| Sloan Privat | 0.06 | 0.72 | 579 | 267 | 312 |
| Adam David Lallana | 0.34 | 1.08 | 466 | 190 | 276 |
| Lucas Vázquez Iglesias | 0.20 | 0.79 | 529 | 254 | 275 |
| Toby Alderweireld | 0.03 | 0.50 | 599 | 328 | 271 |
| Karim Benzema | 0.78 | 1.99 | 335 | 84 | 251 |
| Lorenzo Insigne | 0.69 | 1.74 | 358 | 109 | 249 |
| Salif Sané | 0.21 | 0.64 | 521 | 282 | 239 |
| Toni Kroos | 0.06 | 0.40 | 582 | 351 | 231 |
| Axel Ngando | -0.02 | 0.28 | 622 | 393 | 229 |

Conclusion

  • In a logistic regression model: GAX is directly related to a score test on a player’s effect on the probability of scoring.

  • If you don’t believe the GLM setup: GAX using ML models does not allow valid inference! \(\Rightarrow\) Residualize \(X\) as well, i.e. use RGAX.

  • If you want interpretation: the GCM provides it in the form of a popular semiparametric model!

Outlook:

  • General framework usable beyond player evaluation via GAX:

    • Player evaluation in basketball
    • Identifying drivers of injuries in survival setup
    • Analyzing impact of features for coverage prediction in NFL

Thank you for your attention!

References

Anzer, Gabriel, and Pascal Bauer. 2021. “A Goal Scoring Probability Model for Shots Based on Synchronized Positional and Event Data in Football (Soccer).” Frontiers in Sports and Active Living 3: 53. https://doi.org/10.3389/fspor.2021.624475.
Baron, Ethan, Nathan Sandholtz, Devin Pleuler, and Timothy C. Y. Chan. 2024. Journal of Quantitative Analysis in Sports 20 (1): 37–50. https://doi.org/10.1515/jqas-2022-0107.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. http://www.jstor.org/stable/45172267.
Davis, Jesse, and Pieter Robberechts. 2024. “Biases in Expected Goals Models Confound Finishing Ability.” https://arxiv.org/abs/2401.09940.
Hewitt, James H., and Oktay Karakuş. 2023. “A Machine Learning Approach for Player and Position Adjusted Expected Goals in Football (Soccer).” Franklin Open 4: 100034. https://doi.org/10.1016/j.fraope.2023.100034.
Kook, Lucas. 2025. comets: Covariance Measure Tests for Conditional Independence. https://doi.org/10.32614/CRAN.package.comets.
Pollard, Richard, and Charles Reep. 1997. “Measuring the Effectiveness of Playing Strategies at Soccer.” Journal of the Royal Statistical Society: Series D (The Statistician) 46 (4): 541–50. https://doi.org/10.1111/1467-9884.00108.
Robberechts, Pieter, and Jesse Davis. 2020. “How Data Availability Affects the Ability to Learn Good xG Models.” In Machine Learning and Data Mining for Sports Analytics, edited by Ulf Brefeld, Jesse Davis, Jan Van Haaren, and Albrecht Zimmermann, 17–27. Cham: Springer International Publishing.
Shah, Rajen D., and Jonas Peters. 2020. “The Hardness of Conditional Independence Testing and the Generalised Covariance Measure.” The Annals of Statistics 48 (3): 1514–38. https://doi.org/10.1214/19-AOS1857.