Machine learning based statistical inference in sports analytics

A general framework for valid statistical inference in an interpretable semi-parametric model.

Authors

Robert Bajons

Joint work with Lucas Kook

Published

February 1, 2025


Resource | Date | Link
Presentation given at the International Workshop on Statistical Modelling 2025 | 2025-07-16 | Presentation (IWSM 25)
Contribution to proceedings of the International Workshop on Statistical Modelling 2025 | 2024-07-15 | Short Paper (IWSM 25)
Talk presented at the Sports Analytics Workshop (SAW) 2025 at AUEB | 2025-05-06 | Presentation (SAW 25)

Note

This project is closely related to the project rGAX: Rethinking goals above expectation (GAX). It also has some connections to the project HMMotion: Predicting coverage schemes in the NFL.

Overview

Sports analytics, fueled by the recent availability of high-resolution tracking data, has experienced a surge in the use of advanced statistical and machine learning (ML) models. A key focus of these applications is identifying the factors that influence a game, for instance, identifying top players, predictors of injuries, or factors influencing the final score.

Commonly, the task of identifying influential factors is tackled by fitting machine learning models and analyzing traditional variable importance measures. Alternatively (or additionally), researchers may compare the predictive power of models with and without the factors of interest. However, both approaches have pitfalls that impede interpretation of the results. More importantly, they make uncertainty quantification and valid statistical inference challenging.

In this project, we use a well-established nonparametric conditional independence test, the Generalised Covariance Measure (GCM) test (Shah and Peters 2020), to obtain inference in a partially generalized linear model. This makes it possible to identify, in an interpretable way, features that may aid outcome prediction, without strong modeling assumptions and while maintaining valid statistical inference. The framework has various applications, ranging from identifying important factors for defensive coverage detection in the NFL to player evaluation in many sports by adapting existing popular metrics such as goals above expectation (GAX) and expected goals added (EGA) in soccer, expected points added (EPA) in American football, and shooter impact (SI) in basketball, among many others.
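
To make the idea behind the GCM test concrete, the following is a minimal sketch of its univariate form as described by Shah and Peters (2020): regress the outcome Y and the feature of interest X on the remaining covariates Z, multiply the two sets of residuals, and compare the normalized mean of the products with a standard normal distribution. It is an illustration under stated assumptions, not the implementation used in this project. The function names (`binned_mean_residuals`, `gcm_test`, `simulate`), the simulated data, and the crude binned-mean regression (a stand-in for whatever ML regression one would use in practice) are all hypothetical choices made for this example.

```python
import math
import random


def binned_mean_residuals(target, z, n_bins=40):
    """Residuals of `target` after a crude nonparametric regression on `z`.

    Stand-in for an arbitrary ML regression (assumption of this sketch):
    sort by z, cut into equal-count bins, predict each point by its bin mean.
    """
    n = len(z)
    order = sorted(range(n), key=lambda i: z[i])
    residuals = [0.0] * n
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue
        bin_mean = sum(target[i] for i in idx) / len(idx)
        for i in idx:
            residuals[i] = target[i] - bin_mean
    return residuals


def gcm_test(x, y, z, n_bins=40):
    """Univariate GCM test of H0: X is independent of Y given Z.

    Returns (test statistic, two-sided p-value from the N(0, 1) limit).
    """
    n = len(x)
    res_x = binned_mean_residuals(x, z, n_bins)
    res_y = binned_mean_residuals(y, z, n_bins)
    # Products of residuals; their mean is close to zero under H0.
    products = [rx * ry for rx, ry in zip(res_x, res_y)]
    mean_r = sum(products) / n
    var_r = sum(r * r for r in products) / n - mean_r ** 2
    stat = math.sqrt(n) * mean_r / math.sqrt(var_r)
    # Two-sided normal p-value: 2 * (1 - Phi(|stat|)).
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_value


def simulate(effect, n=2000, seed=1):
    """Toy data: Z drives both X and Y; `effect` is the direct X -> Y effect."""
    rng = random.Random(seed)
    z = [rng.uniform(-2, 2) for _ in range(n)]
    x = [math.sin(zi) + rng.gauss(0, 1) for zi in z]
    y = [zi ** 2 + effect * xi + rng.gauss(0, 1) for zi, xi in zip(z, x)]
    return x, y, z


stat_null, p_null = gcm_test(*simulate(effect=0.0))
stat_alt, p_alt = gcm_test(*simulate(effect=0.5))
print(f"no direct effect:  T = {stat_null:.2f}, p = {p_null:.3f}")
print(f"direct effect 0.5: T = {stat_alt:.2f}, p = {p_alt:.3g}")
```

In the simulated data, X and Y both depend on Z, so they are marginally associated even when X has no direct effect on Y. The test statistic should be large only in the second scenario, where X carries information about Y beyond what Z already explains. In practice, the binned-mean regression would be replaced by flexible learners suited to the data at hand.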

References

Shah, Rajen D., and Jonas Peters. 2020. “The hardness of conditional independence testing and the generalised covariance measure.” The Annals of Statistics 48 (3): 1514–38. https://doi.org/10.1214/19-AOS1857.