Vienna University of Economics and Business
May 6, 2025
How to identify important predictors of a sport-specific outcome?
Sports analytics focuses on identifying the factors that influence a game:
Big data era:
Parametric models (linear/logistic regression):
Machine learning models:
Do HMM features help in defensive coverage prediction in the NFL?
Goal: predict coverage type (1: man coverage, 0: zone coverage) for a defense on a play based on pre-snap features (derived from tracking data)
3 types of features:
Typical approach for answering the question:
Procedure:
metric | glmnet pre | glmnet post | glmnet HMM | xgboost pre | xgboost post | xgboost HMM |
|---|---|---|---|---|---|---|
Accuracy | 0.8345 | 0.8453 | 0.8501 | 0.8573 | 0.8837 | 0.8753 |
AUC | 0.8282 | 0.8661 | 0.8703 | 0.8419 | 0.8907 | 0.8839 |
Logloss | 0.4035 | 0.3647 | 0.3612 | 0.3834 | 0.3240 | 0.3335 |
Legend: worst models | medium models | best models
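The three metrics in the table can be computed from predicted probabilities as in the following minimal sketch. It uses only the Python standard library; the function names and the toy data are illustrative assumptions and are not taken from the talk, which fits its models with glmnet and xgboost.

```python
import math

def accuracy(y_true, p_hat, threshold=0.5):
    """Share of plays whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_hat))
    return hits / len(y_true)

def log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative Bernoulli log-likelihood (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc(y_true, p_hat):
    """Probability that a random man-coverage play gets a higher score
    than a random zone-coverage play (ties count one half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = man, 0 = zone) and probabilities, made up for illustration.
y = [1, 0, 1, 0, 0, 1]
p = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
metrics = {"accuracy": accuracy(y, p), "auc": auc(y, p), "logloss": log_loss(y, p)}
print(metrics)
```

Accuracy depends on the chosen threshold, whereas AUC and logloss use the predicted probabilities directly, which is one reason the three rows of the table need not rank the models identically.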
Problems:
Variable importance:
Predictive performance:
Answer to the question: not clear
How can we identify important predictors of an outcome?
Classical approach: logistic regression:
Partially linear logistic regression model (PLLM)
\[ \log\left(\frac{\pi(X,Z)}{1-\pi(X,Z)}\right) = X^{\top}\beta + g(Z) \]
where \(\pi(X,Z) = \mathbb{P}(Y = 1 \mid X, Z)\) and \(g\) is an unspecified function of \(Z\).
Under PLLM: Test for \(Y\) conditionally independent of \(X\) given \(Z\) (\(Y \perp\!\!\!\perp X \mid Z\)) \(\Leftrightarrow\) Test for \(H_0 : \beta = 0\)
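To make the PLLM concrete, the sketch below simulates data from such a model with a scalar \(X\) and \(Z\). The function name `simulate_pllm`, the choice \(g(z) = \sin(2z)\), and the way \(X\) depends on \(Z\) are all assumptions made for illustration.

```python
import math
import random

def simulate_pllm(n, beta, g, seed=1):
    """Draw (Y, X, Z) with logit P(Y = 1 | X, Z) = X * beta + g(Z)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x = 0.5 * z + rng.gauss(0, 1)    # X is allowed to depend on Z
        eta = x * beta + g(z)            # linear in X, arbitrary in Z
        pi = 1 / (1 + math.exp(-eta))    # inverse logit
        y = 1 if rng.random() < pi else 0
        data.append((y, x, z))
    return data

# Hypothetical parameter values; beta = 0 would correspond to H_0.
sample = simulate_pllm(n=1000, beta=0.8, g=lambda z: math.sin(2 * z))
print(sum(y for y, _, _ in sample) / len(sample))  # share of Y = 1
```

Setting `beta=0` produces data in which \(Y\) is conditionally independent of \(X\) given \(Z\), even though \(X\) and \(Y\) remain marginally dependent through \(Z\).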
Generalised Covariance Measure:
\[ \operatorname{GCM} = \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = \mathbb{E}[(Y - \mathbb{E}[Y \mid Z])(X - \mathbb{E}[X \mid Z])] \]
Basis for GCM test:
\[Y \perp\!\!\!\perp X \mid Z \Rightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
GCM test in practice:
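A minimal sketch of how a univariate GCM test could be carried out: regress \(Y\) on \(Z\) and \(X\) on \(Z\), multiply the residuals, and compare the normalised mean of the products with a standard normal distribution. The helpers `binned_mean_fit`, `gcm_test`, and `simulate` are hypothetical, and the crude binned-mean regression merely stands in for whatever flexible learner would be used in practice. This is not the comets API.

```python
import math
import random
from statistics import NormalDist

def binned_mean_fit(z, target, n_bins=20):
    """Crude estimate of E[target | Z]: mean of target within quantile bins of Z."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    fitted = [0.0] * len(z)
    size = math.ceil(len(z) / n_bins)
    for start in range(0, len(z), size):
        idx = order[start:start + size]
        m = sum(target[i] for i in idx) / len(idx)
        for i in idx:
            fitted[i] = m
    return fitted

def gcm_test(y, x, z):
    """Return (statistic, two-sided p-value) of the univariate GCM test."""
    ry = [a - b for a, b in zip(y, binned_mean_fit(z, y))]  # Y - E[Y | Z]
    rx = [a - b for a, b in zip(x, binned_mean_fit(z, x))]  # X - E[X | Z]
    r = [a * b for a, b in zip(ry, rx)]                     # residual products
    n = len(r)
    mean = sum(r) / n
    sd = math.sqrt(sum(v * v for v in r) / n - mean ** 2)
    stat = math.sqrt(n) * mean / sd
    pval = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, pval

def simulate(n, beta, seed):
    """Toy PLLM data with g(z) = z (an assumption for illustration)."""
    rng = random.Random(seed)
    y, x, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        xi = 0.5 * zi + rng.gauss(0, 1)
        pi = 1 / (1 + math.exp(-(beta * xi + zi)))
        y.append(1 if rng.random() < pi else 0)
        x.append(xi)
        z.append(zi)
    return y, x, z

stat_null, p_null = gcm_test(*simulate(2000, beta=0.0, seed=11))
stat_alt, p_alt = gcm_test(*simulate(2000, beta=1.0, seed=11))
print(f"beta = 0: stat = {stat_null:.2f}, p = {p_null:.3f}")
print(f"beta = 1: stat = {stat_alt:.2f}, p = {p_alt:.3g}")
```

The quality of the two regressions matters: if they fit poorly, the residual products carry leftover dependence on \(Z\) and the normal approximation can fail.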
Proposition
Consider a PLLM and let \(X \in \mathbb{R}^{d_X}\) with \(\operatorname{Cov}(X,X \mid Z)\) a.s. positive definite. Then \[\beta = 0 \Leftrightarrow \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = 0\]
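One direction of the proposition can be checked by hand. The following is a sketch under the PLLM, using that \(Y\) is binary so that \(\mathbb{E}[Y \mid X, Z] = \pi(X,Z)\). If \(\beta = 0\), then \(\pi(X,Z)\) depends on \(Z\) only, hence \(\mathbb{E}[Y \mid X, Z] = \mathbb{E}[Y \mid Z]\), and conditioning on \((X, Z)\) inside the expectation gives
\[ \begin{aligned} \mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] &= \mathbb{E}[(Y - \mathbb{E}[Y \mid Z])(X - \mathbb{E}[X \mid Z])] \\ &= \mathbb{E}[(\mathbb{E}[Y \mid X, Z] - \mathbb{E}[Y \mid Z])(X - \mathbb{E}[X \mid Z])] = 0. \end{aligned} \]
The converse direction is where the condition on \(\operatorname{Cov}(X,X \mid Z)\) is needed.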
Proposition
Consider a PLLM and let \(X \in \mathbb{R}\) with \(\operatorname{Var}(X \mid Z) > 0\) a.s. Then \[\operatorname{sign}(\beta) = \operatorname{sign}(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)])\]
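The sign statement can be illustrated by simulation. In the sketch below, \(X = 0.5 Z + \varepsilon\) by construction, so \(\mathbb{E}[X \mid Z]\) is known and \(\mathbb{E}[\operatorname{Cov}(Y,X \mid Z)] = \mathbb{E}[Y \varepsilon]\) can be estimated by a plain average. The function name, the choice \(g(z) = \cos(z)\), and the values of \(\beta\) are assumptions for illustration.

```python
import math
import random

def expected_conditional_cov(beta, n=5000, seed=7):
    """Monte Carlo estimate of E[Cov(Y, X | Z)] in a simulated PLLM.

    Here X = 0.5 * Z + eps, so E[X | Z] is known and
    E[Cov(Y, X | Z)] = E[Y * (X - E[X | Z])] = E[Y * eps].
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0, 1)
        eps = rng.gauss(0, 1)
        x = 0.5 * z + eps
        pi = 1 / (1 + math.exp(-(beta * x + math.cos(z))))
        y = 1 if rng.random() < pi else 0
        total += y * eps
    return total / n

for beta in (-1.5, 1.5):
    est = expected_conditional_cov(beta)
    same_sign = math.copysign(1, est) == math.copysign(1, beta)
    print(beta, round(est, 3), same_sign)
```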
Takeaways:
comets R package (Kook 2025)
In this talk: binary outcome \(\rightarrow\) result generalizable to other cases
Coverage prediction model:
Testing for predictive power of HMM features:
Popular measure for shooting skill of players: Goals above expectation (GAX)
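As a reminder of how GAX is commonly computed, the sketch below sums goal minus expected goals (xG) over each player's shots. The slide does not spell out the formula, so this common definition is an assumption here, and the function name and shot data are hypothetical.

```python
from collections import defaultdict

def goals_above_expectation(shots):
    """Sum of (goal - xG) over each player's shots.

    `shots` is an iterable of (player, goal, xg) with goal in {0, 1}
    and xg the model-based scoring probability of that shot.
    """
    gax = defaultdict(float)
    for player, goal, xg in shots:
        gax[player] += goal - xg
    return dict(gax)

# Hypothetical shots, invented purely for illustration.
shots = [
    ("Player A", 1, 0.25),
    ("Player A", 0, 0.50),
    ("Player B", 1, 0.75),
    ("Player B", 1, 0.125),
]
gax = goals_above_expectation(shots)
print(gax)
```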
Solution:
player | GAX | RGAX | Rank GAX | Rank RGAX | Absolute Difference | Relative Difference |
|---|---|---|---|---|---|---|
Cristiano Ronaldo | -2.02 | 2.68 | 288 | 34 | 254 | 7.47 |
Miralem Pjanić | 1.83 | 3.27 | 65 | 20 | 45 | 2.25 |
Lorenzo Insigne | 0.06 | 1.80 | 162 | 50 | 112 | 2.24 |
Edinson Cavani | -0.90 | 1.22 | 232 | 73 | 159 | 2.18 |
Alessandro Florenzi | 2.27 | 3.59 | 47 | 15 | 32 | 2.13 |
Antoine Griezmann | 5.84 | 6.42 | 3 | 1 | 2 | 2.00 |
Mohamed Salah | 3.46 | 4.29 | 19 | 7 | 12 | 1.71 |
Roberto Firmino | 1.92 | 3.00 | 58 | 23 | 35 | 1.52 |
Lionel Messi | 2.49 | 3.44 | 42 | 17 | 25 | 1.47 |
Ivan Rakitić | 1.00 | 2.13 | 103 | 44 | 59 | 1.34 |
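The last two columns of the table appear to follow from the two rank columns: the absolute difference is Rank GAX minus Rank RGAX, and the relative difference is that value divided by Rank RGAX. The sketch below checks this reading against the table rows; the helper `rank_changes` is a hypothetical name.

```python
# (player, rank_gax, rank_rgax, absolute_diff, relative_diff) copied from the table.
rows = [
    ("Cristiano Ronaldo", 288, 34, 254, 7.47),
    ("Miralem Pjanić", 65, 20, 45, 2.25),
    ("Lorenzo Insigne", 162, 50, 112, 2.24),
    ("Edinson Cavani", 232, 73, 159, 2.18),
    ("Alessandro Florenzi", 47, 15, 32, 2.13),
    ("Antoine Griezmann", 3, 1, 2, 2.00),
    ("Mohamed Salah", 19, 7, 12, 1.71),
    ("Roberto Firmino", 58, 23, 35, 1.52),
    ("Lionel Messi", 42, 17, 25, 1.47),
    ("Ivan Rakitić", 103, 44, 59, 1.34),
]

def rank_changes(rank_gax, rank_rgax):
    """Absolute and relative improvement in rank when moving from GAX to RGAX."""
    absolute = rank_gax - rank_rgax
    relative = absolute / rank_rgax
    return absolute, relative

# Compare the recomputed values with the table, allowing for two-decimal rounding.
consistent = all(
    rank_changes(rg, rr)[0] == ad and abs(rank_changes(rg, rr)[1] - rd) <= 0.005
    for _, rg, rr, ad, rd in rows
)
print(consistent)
```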
Identifying important factors for a sport-specific outcome is difficult:
Machine learning based statistical inference in a semiparametric model:
R package comets
Various use cases:
Thank you for your attention!