Vienna University of Economics and Business
Nov 22, 2024
How can we evaluate the shooting skills of a soccer player?
Count how many goals a player scored during a season.
Count xG generated from shots and compare to actual goals (in a season).
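The second approach can be sketched in a few lines. This is a minimal illustration only: the `Shot` record, the helper `goals_minus_xg`, and the toy numbers are hypothetical and not taken from any real data set.

```python
from dataclasses import dataclass

@dataclass
class Shot:
    player: str
    xg: float    # model-based scoring probability of the shot
    goal: bool   # observed outcome of the shot

def goals_minus_xg(shots):
    """Return {player: (goals, total xG, goals - total xG)}."""
    totals = {}
    for s in shots:
        goals, xg = totals.get(s.player, (0, 0.0))
        totals[s.player] = (goals + int(s.goal), xg + s.xg)
    return {p: (g, x, g - x) for p, (g, x) in totals.items()}

# Made-up toy data, for illustration only.
shots = [
    Shot("A", 0.10, True),
    Shot("A", 0.30, False),
    Shot("A", 0.60, True),
    Shot("B", 0.50, False),
    Shot("B", 0.25, False),
]

summary = goals_minus_xg(shots)
for player, (goals, xg, diff) in summary.items():
    print(f"{player}: goals={goals}, xG={xg:.2f}, goals - xG={diff:+.2f}")
```

A positive difference suggests over-performance relative to xG, but the raw difference alone does not say whether it is statistically distinguishable from chance.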
Idea: Test whether player significantly impacts outcome of a shot.
Event Stream Data:
Features for each shot:
Hewitt and Karakuş (2023)
Goal: Infer effect of a specific player on the outcome of a shot
Naive approach: Fit GLM (logistic regression) including all players and other features and use classical Wald tests.
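A minimal sketch of the naive approach on simulated data, using only the Python standard library. Everything here is an assumption for illustration: a single hypothetical player indicator, one standardized feature called `distance`, and made-up true coefficients. The logistic regression is fitted by Newton-Raphson and the Wald test uses the inverse observed information as covariance estimate.

```python
import math
import random

def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def score_and_information(rows, y, beta):
    """Gradient of the log-likelihood and observed information at beta."""
    k = len(beta)
    grad = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for x, yi in zip(rows, y):
        p = expit(sum(b * v for b, v in zip(beta, x)))
        w = p * (1.0 - p)
        for i in range(k):
            grad[i] += (yi - p) * x[i]
            for j in range(k):
                info[i][j] += w * x[i] * x[j]
    return grad, info

def fit_logit(rows, y, n_iter=30):
    """Newton-Raphson MLE; returns (beta, estimated covariance matrix)."""
    beta = [0.0] * len(rows[0])
    for _ in range(n_iter):
        grad, info = score_and_information(rows, y, beta)
        cov = invert(info)
        step = [sum(c * g for c, g in zip(cov_row, grad)) for cov_row in cov]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    _, info = score_and_information(rows, y, beta)
    return beta, invert(info)

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))

# Simulated shots (assumed setup): intercept, player indicator, distance.
rng = random.Random(0)
rows, y = [], []
for _ in range(4000):
    player = 1.0 if rng.random() < 0.2 else 0.0  # shot taken by the player of interest
    distance = rng.gauss(0, 1)                   # standardized shot distance
    eta = -1.0 + 1.0 * player - 0.8 * distance   # made-up true coefficients
    rows.append((1.0, player, distance))
    y.append(1 if rng.random() < expit(eta) else 0)

beta, cov = fit_logit(rows, y)
se_player = math.sqrt(cov[1][1])
p_player = wald_p_value(beta[1], se_player)
print(f"player effect: {beta[1]:.3f} (SE {se_player:.3f}), Wald p-value {p_player:.2e}")
```

With real event data the model would contain one indicator per player, so many players with few shots are tested at once; this toy example sidesteps that issue on purpose.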
What about fitting many small models with only one player:
Question: How does Bayesian model averaging perform?
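For reference, the standard BMA quantities can be written as follows. The notation is assumed here rather than taken from the slides: \(\mathcal{M}\) is the set of candidate models, \(D\) the data, \(K\) the number of candidate regressors, \(k_m\) the number of regressors in model \(m\), and \(\theta\) a common prior inclusion probability.

\[P(m) = \theta^{k_m} (1-\theta)^{K-k_m}, \qquad P(m \mid D) = \frac{p(D \mid m)\, P(m)}{\sum_{m' \in \mathcal{M}} p(D \mid m')\, P(m')},\]

\[\mathrm{E}[\beta_j \mid D] = \sum_{m \in \mathcal{M}} P(m \mid D)\, \mathrm{E}[\beta_j \mid D, m],\]

where \(\beta_j = 0\) in every model that excludes regressor \(j\).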
Side note: In a current project we use algorithm-agnostic (i.e. model-agnostic) conditional independence tests (COMETs) for this problem.
Logistic regression setup:
\[Y = I_{\{X + Z + L + \epsilon > 0\}}, \quad \epsilon \sim \text{Logistic}(0,1).\] \(X\), \(Z\), and \(L\) are independent, but all of them influence the outcome variable.
Consider fitting two GLMs (both misspecified):
\(m_1:\) \[\log(\frac{\pi(X)}{1-\pi(X)}) = \beta_1 X\]
\(m_2:\) \[\log(\frac{\pi(X,Z)}{1-\pi(X,Z)}) = \beta_1 X + \beta_2 Z\]
Estimates of \(\beta_1\) in 1000 simulations (\(m_3\), which includes \(X\), \(Z\), and \(L\), is the correct model):
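A single replication of this comparison can be sketched with the standard library alone. The slides do not state the distribution of the regressors, so standard normal \(X\), \(Z\), \(L\) are an assumption here, as are the sample size and the seed. The correct model is fitted with all three regressors and no intercept, matching the form of \(m_1\) and \(m_2\).

```python
import math
import random

def expit(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_logit(rows, y, n_iter=30):
    """Logistic regression MLE via Newton-Raphson, without intercept."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            p = expit(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    info[i][j] += w * x[i] * x[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return beta

# Simulate Y = I{X + Z + L + eps > 0} with standard logistic noise.
rng = random.Random(1)
n = 8000
data, y = [], []
for _ in range(n):
    x, z, l = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)  # assumed N(0,1)
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)
    eps = math.log(u / (1 - u))  # inverse-CDF draw from Logistic(0,1)
    data.append((x, z, l))
    y.append(1 if x + z + l + eps > 0 else 0)

b1_m1 = fit_logit([(x,) for x, _, _ in data], y)[0]      # X only
b1_m2 = fit_logit([(x, z) for x, z, _ in data], y)[0]    # X and Z
b1_m3 = fit_logit(data, y)[0]                            # X, Z and L (correct)
print(f"beta_1 estimates: m1={b1_m1:.3f}, m2={b1_m2:.3f}, m3={b1_m3:.3f}")
```

Under these assumptions the estimate of \(\beta_1\) should shrink toward zero as more relevant regressors are omitted, even though the omitted regressors are independent of \(X\).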
For comparison, the linear model: \(Y = X + Z + L - L_1 + \epsilon, \quad \epsilon \sim \text{N}(0,1).\)
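The linear case can be checked with a short simulation. This is a sketch assuming standard normal, mutually independent regressors (not stated on the slide): \(Y\) is regressed on \(X\) alone, without intercept, so \(Z\), \(L\), and \(L_1\) are all omitted.

```python
import random

rng = random.Random(2)
n = 5000
sxy = sxx = 0.0
for _ in range(n):
    # Assumed: all regressors are independent standard normal draws.
    x, z, l, l1 = (rng.gauss(0, 1) for _ in range(4))
    y = x + z + l - l1 + rng.gauss(0, 1)
    sxy += x * y
    sxx += x * x

beta1_hat = sxy / sxx  # OLS slope of Y on X alone, no intercept
print(f"OLS estimate of beta_1 with Z, L, L_1 omitted: {beta1_hat:.3f}")
```

Because the omitted regressors are independent of \(X\), they only inflate the residual variance here; the estimate should stay close to the true value of 1.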
Logistic regression setup:
\[Y = I_{\{X + Z + L - L_1 + \epsilon > 0\}}, \quad \epsilon \sim \text{Logistic}(0,1).\] \(X\), \(Z\), \(L\), and \(L_1\) are independent, but all of them influence the outcome variable. In addition, the data contain 11 irrelevant noise variables. Total: 15 regressors (4 of them relevant).
Compare two cases:
BMA with prior inclusion probability of 0.5 for all regressors (a priori, models with ~7-8 variables are preferred)
BMA with prior inclusion probability of 0.3 for all regressors (a priori, models with ~4-5 variables are preferred)
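The prior preference for a given model size follows from the inclusion probability. Assuming regressors enter independently with the same probability \(\theta\), the prior model size is Binomial(15, \(\theta\)). The helper names below are hypothetical.

```python
from math import comb

K = 15  # number of candidate regressors

def model_size_prior(theta, k_total=K):
    """Prior pmf of the model size under independent inclusion with probability theta."""
    return [comb(k_total, k) * theta**k * (1 - theta)**(k_total - k)
            for k in range(k_total + 1)]

def summarize(theta):
    """Return the prior mean model size and the most probable size(s)."""
    pmf = model_size_prior(theta)
    mean = sum(k * p for k, p in enumerate(pmf))
    top = max(pmf)
    modes = [k for k, p in enumerate(pmf) if abs(p - top) < 1e-12]
    return mean, modes

for theta in (0.5, 0.3):
    mean, modes = summarize(theta)
    print(f"theta={theta}: prior mean size={mean:.1f}, most probable size(s)={modes}")
```

Lowering \(\theta\) shifts prior mass toward smaller models, which is the only difference between the two cases compared here.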
Posterior means from 1000 simulations:
For comparison, the linear model: \(Y = X + Z + L - L_1 + \epsilon, \quad \epsilon \sim \text{N}(0,1).\)