Robert Bajons | Joint work with Kurt Hornik
Vienna University of Economics and Business
International Workshop on Statistical Modelling 2024
Jul 16, 2024
Applications in sports:
Masarotto and Varin (2012) use paired comparisons of sports teams (i.e. matches played) to rank them via the ranking lasso:
Use similar ideas for different problems:
\[\min_{\boldsymbol{\beta}} -\ell\left(Y;X,\boldsymbol{\beta}\right)+\lambda ||D\boldsymbol{\beta}||_1,\]
with \(\ell\left(Y;X,\boldsymbol{\beta}\right) = \sum_{i = 1}^{n} y_i \log(\frac{1}{1+\exp(-\boldsymbol{\beta}^{\top} x_i)})+(1-y_i)\log(\frac{1}{1+\exp(\boldsymbol{\beta}^{\top} x_i)})\).
\(D\): Matrix representing structural constraint on parameters:
Basic definition:
\[ \begin{align} \text{minimize } & a_0^{\top}x \\ \text{s.t. } & Ax+s = b, \quad s \in \mathcal{K}. \end{align} \]
\(\mathcal{K}\) is a composition of simple convex cones.
E.g.: Any standard linear program \(\min_x a_0^{\top}x\) s.t. \(Ax \le b\) is a conic problem: the slack \(s = b - Ax\) lies in a product of linear cones \(K_{{\rm lin}} = \{x \in \mathbb{R} \mid x \ge 0\}\).
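As a quick numeric sanity check (a Python sketch rather than the R/ROI calls used later; the problem data are made up for illustration), the inequality form and the slack/conic form of a linear program return the same optimum:

```python
# Check that the inequality form and the slack (conic) form of an LP agree.
import numpy as np
from scipy.optimize import linprog

a0 = np.array([-1.0, -2.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 4.0])

# Inequality form: min a0'x  s.t.  Ax <= b
res_ineq = linprog(a0, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Conic form: min a0'x  s.t.  Ax + s = b,  s in K_lin (i.e. s >= 0),
# written as an equality-constrained LP over the stacked variable (x, s).
c = np.concatenate([a0, np.zeros(3)])
A_eq = np.hstack([A, np.eye(3)])
res_cone = linprog(c, A_eq=A_eq, b_eq=b,
                   bounds=[(None, None)] * 2 + [(0, None)] * 3)

print(res_ineq.fun, res_cone.fun)  # both -8.0: optimum at x = (0, 4)
```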
Any convex problem can be reformulated into CP via its epigraph form (Boyd and Vandenberghe 2004).
Initial convex problem:
\[ \begin{align*} \text{minimize } & f_0(x) \\ \text{s.t. } & f_i(x) \le 0. \end{align*} \]
\[\Leftrightarrow\]
Epigraph form:
\[ \begin{align*} \text{minimize } & t \\ \text{s.t. } & f_0(x) \le t, \quad f_i(x) \le 0. \end{align*} \]
A few convex cones suffice to model a wide variety of problems:
Algorithmic advantages of CPs:
Proven to be robust, reliable and effective for similar types of problems (Schwendinger, Grün, and Hornik 2021).
Readily implemented in R via package ROI (Theußl, Schwendinger, and Hornik 2020).
\[ \begin{align} \min_{\boldsymbol{\beta}} \quad -\sum_{i = 1}^{n} \big(y_i \log(\frac{1}{1+\exp(-\boldsymbol{\beta}^{\top} x_i)})+(1-y_i)\log(\frac{1}{1+\exp(\boldsymbol{\beta}^{\top} x_i)})\big)+ \\ \lambda \sum_{j = 1}^m w_{j}|D_j| \end{align} \]
\(y_i\) binary outcome, \(x_i\) the \(i\)-th row of \(X \in \mathbb{R}^{n\times p}\), \(\boldsymbol{\beta} \in \mathbb{R}^p\).
\(D_j\): \(j\)-th component of \(D\boldsymbol{\beta}\), with \(D\in \mathbb{R}^{m \times p}\).
\(w_j\): adaptively control for degree of shrinkage for specific terms (Zou 2006).
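In code, the objective reads as follows (a Python sketch with simulated data; `agll_objective` and all numeric values are illustrative, not from the study):

```python
import numpy as np

def agll_objective(beta, X, y, D, w, lam):
    """Negative Bernoulli log-likelihood plus weighted generalized lasso penalty."""
    u = X @ beta
    # y*log(1/(1+exp(-u))) + (1-y)*log(1/(1+exp(u))), negated and stabilized:
    nll = np.sum(y * np.logaddexp(0.0, -u) + (1 - y) * np.logaddexp(0.0, u))
    return nll + lam * np.sum(w * np.abs(D @ beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, 1.0, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
D = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])  # fuse adjacent coefficients
val = agll_objective(beta_true, X, y, D, w=np.ones(2), lam=0.5)
```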
\[ \begin{align} \min_{(\boldsymbol{r},\boldsymbol{\beta},\boldsymbol{s},\boldsymbol{t},\boldsymbol{z}_1,\boldsymbol{z}_2)} & \sum_{i = 1}^n t_i +\lambda r \\ \text{s.t.} & \left(z_{i1}, 1, u_i-t_i\right) \in K_{\rm exp}, \\ & \left(z_{i2}, 1,-t_i\right) \in K_{\rm exp}, \\ & z_{i1}+z_{i2} \leq 1, & i = 1,\dots,n, \\ & -s_j \le D_j \le s_j, & j = 1,\dots,m, \\ & r - w_1s_1 - \dots - w_ms_m \ge 0. \end{align} \]
Conic program on an augmented set of variables \((\boldsymbol{r},\boldsymbol{\beta},\boldsymbol{s},\boldsymbol{t},\boldsymbol{z}_1,\boldsymbol{z}_2)\), where \(u_i = \pm \boldsymbol{\beta}^\top x_i\) (sign depending on \(y_i\)).
Composition of two classes of convex cones: exponential cones for the likelihood terms, linear cones for the penalty and auxiliary constraints.
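A small numeric check (Python sketch, illustrative values only) that for fixed \(u_i\) the tightest feasible \(t_i\) in this program is the logistic loss term \(\log(1+\exp(u_i))\):

```python
import numpy as np

# At t_i = log(1 + exp(u_i)), the natural choice z_i1 = exp(u_i - t_i),
# z_i2 = exp(-t_i) satisfies both cone constraints and z_i1 + z_i2 = 1.
def in_Kexp(x1, x2, x3, tol=1e-12):
    # (x1, x2, x3) in K_exp  <=>  x1 >= x2 * exp(x3 / x2), x2 > 0
    return x2 > 0 and x1 >= x2 * np.exp(x3 / x2) - tol

for u in [-3.0, 0.0, 2.5]:
    t = np.log1p(np.exp(u))              # candidate t_i
    z1, z2 = np.exp(u - t), np.exp(-t)
    assert in_Kexp(z1, 1.0, u - t)
    assert in_Kexp(z2, 1.0, -t)
    assert z1 + z2 <= 1 + 1e-12          # holds with equality at this point
```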
R: package ROI allows for the use of a variety of solvers.
Weights: \[w_{j}=\left|(D\tilde{\boldsymbol{\beta}}_{\epsilon})_j\right|^{-1}, \quad \tilde{\boldsymbol{\beta}}_{\epsilon}={\arg \min}_{\boldsymbol{\beta}} \left\{-\ell(\boldsymbol{\beta})+\epsilon \sum_{i}\beta_i^2\right\}.\]
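A sketch of the weight computation (Python; the data, the ridge parameter \(\epsilon = 10^{-4}\), and the plain lasso structure \(D = I\) are illustrative choices, not the study's settings):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik_ridge(beta, X, y, eps):
    # negative logistic log-likelihood plus a small ridge penalty
    u = X @ beta
    nll = np.sum(y * np.logaddexp(0.0, -u) + (1 - y) * np.logaddexp(0.0, u))
    return nll + eps * np.sum(beta ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([1.0, 1.0, -1.0, 0.0]))))
D = np.eye(4)  # plain lasso structure, for illustration only

beta_tilde = minimize(neg_loglik_ridge, np.zeros(4), args=(X, y, 1e-4)).x
w = 1.0 / np.abs(D @ beta_tilde)  # adaptive weights
```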
Comparison of approaches:
Conic approaches: ROI implementation with different solvers (MOSEK, Clarabel, ECOS).
Augmented Lagrangian methods: implementation with package alabama: auglag, constrOptim.nl.
Simulation studies with \(n = 5000\) (sample size) and \(p = 20, 40, 50, 80\) (number of covariates).
50 repetitions for each scenario.
Covariate structure: 4 groups of coefficients, 2/3 of the coefficients relevant ( \(X\) Gaussian).
Logistic regression model: the binary response \(Y\) is drawn from \[y_i \mid x_{i1},\dots,x_{im} \sim {\rm Ber}(p_i), \quad m = \lfloor \tfrac{2}{3}p \rfloor,\] where \(p_i = 1/\big(1+\exp(-\boldsymbol{\beta}^{\top} x_i)\big)\).
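The design can be sketched as follows (Python; the group values 2, 1, \(-1\) are illustrative, since the slides do not state the true coefficients):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 5000, 20
m = (2 * p) // 3                      # floor(2p/3) relevant covariates

# 4 groups of coefficients: 3 nonzero groups plus the zero (irrelevant) group
beta = np.zeros(p)
for block, val in zip(np.array_split(np.arange(m), 3), [2.0, 1.0, -1.0]):
    beta[block] = val

X = rng.normal(size=(n, p))           # Gaussian covariates
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
```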
Evaluation of simulation:
Average number of groups detected (in 50 runs of simulations, fixed \(\lambda\)):
Conic solver substantially faster than augmented Lagrangian methods:
ROI implementation leads to runtime overhead but allows for flexibility.
Conic solvers more accurate in terms of finding the optimal solution.
Conic solvers more accurately retrieve implicit grouping structure than alabama.
Conic solvers require no starting values!
Aim: rank players based on on-ice time during goals.
Data set of hockey goals (Gramacy, Jensen, and Taddy 2013):
Idea:
Set up matrix \(D \Rightarrow\) separate ranking lasso penalty for player columns and team columns
\[ \begin{align} D = \left(\begin{array}{cc} D_{\rm player} & \boldsymbol{0} \\ \boldsymbol{0} & D_{\rm team} \end{array}\right), \quad D_{\rm player} = \left(\begin{array}{ccc} A_1 & B_1 & C_1 \\ \vdots & \vdots & \vdots \\ A_{n_p-1} & B_{n_p-1} & C_{n_p-1} \end{array}\right). \end{align} \]
\(A_i\): zero matrix of dimension \((n_p-i) \times (i-1)\).
\(B_i\): vector of ones of length \(n_p-i\).
\(C_i\): negative identity matrix of dimension \((n_p-i)\).
Row \(k\) of block \(i\) thus encodes the pairwise difference \(\beta_i - \beta_{i+k}\).
Set up \(D_{\rm team}\) analogously.
Total dimension of \(D\): \(3030195 \times 2769\) (Simulations: \(p = 50\) \(\rightarrow\) \(1275 \times 50\)!).
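The block construction can be checked in a few lines (Python sketch). Note the quoted row counts exceed the pure pairwise count \(p(p-1)/2\) by exactly one row per column (e.g. \(1275 = 1225 + 50\)), which suggests an identity block (a plain lasso row per coefficient) is stacked on top; only the pairwise part is built here:

```python
import numpy as np

def build_D_pairwise(n_p):
    """Stack the blocks (A_i | B_i | C_i), i = 1, ..., n_p - 1."""
    blocks = []
    for i in range(1, n_p):
        rows = n_p - i
        A = np.zeros((rows, i - 1))    # A_i: zero matrix
        B = np.ones((rows, 1))         # B_i: column of ones
        C = -np.eye(rows)              # C_i: negative identity
        blocks.append(np.hstack([A, B, C]))
    return np.vstack(blocks)

D = build_D_pairwise(50)
assert D.shape == (50 * 49 // 2, 50)           # all pairs i < j
assert np.allclose(D @ np.ones(50), 0)         # rows encode beta_i - beta_j
```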
Fast, flexible and robust implementation of the adaptive generalized logistic lasso (AGLL) model using conic programming methods.
Application to ranking in sports \(\Rightarrow\) high-dimensional and sparse data setup.
Work in progress and extensions:
Implementation in R via package ROI \(\Rightarrow\) flexible but overhead in computation (work in progress).
Generalization to other GLM-type models and non-linear effects is possible.
Thank you for your attention!
Original Problem:
\[ \begin{align*} \text{minimize}_y & \quad \exp(2y_1+y_2) \\ \text{s.t. } & \quad 5y_1+2 y_2 \le 4. \end{align*} \]
Epigraph form:
\[ \begin{align*} \text{minimize}_{y,t} & \quad t \\ \text{s.t. } & \quad \exp(2y_1+y_2) \le t \\ & \quad \quad 5y_1+2 y_2 \le 4. \end{align*} \]
Two convex cones necessary to write in conic form:
Exponential cone \(K_{\exp} = \{x \in \mathbb{R}^3 | x_1 \ge x_2 \exp(x_3/x_2), x_1 > 0, x_2 > 0\}\):
\[\exp(2y_1+y_2) \le t \Leftrightarrow (t,1,2y_1+y_2) \in K_{\rm exp}\]
Linear cone:
\[5y_1+2 y_2 \le 4 \Leftrightarrow 4-(5y_1+2 y_2) \in K_{\text{lin}}\]
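Both memberships can be verified at a sample point (Python sketch with arbitrary values):

```python
import numpy as np

def in_Kexp(x1, x2, x3):
    # (x1, x2, x3) in K_exp  <=>  x1 >= x2 * exp(x3 / x2), x2 > 0
    return x2 > 0 and x1 >= x2 * np.exp(x3 / x2)

y1, y2 = -1.0, 0.5
t = np.exp(2 * y1 + y2) + 0.1          # any t above exp(2*y1 + y2) works
assert in_Kexp(t, 1.0, 2 * y1 + y2)    # exponential-cone constraint holds
assert 4 - (5 * y1 + 2 * y2) >= 0      # linear-cone slack is nonnegative
```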
Likelihood terms: \[ \begin{align} &- y_i \log(\frac{1}{1+\exp(-\boldsymbol{\beta}^{\top} x_i)})-(1-y_i)\log(\frac{1}{1+\exp(\boldsymbol{\beta}^{\top} x_i)}) \le t_i \\ \Leftrightarrow & \begin{cases}\log(1+\exp(-\beta^\top x_i)) \le t_i &, y_i = 1\\ \log(1+\exp(\beta^\top x_i)) \le t_i &, y_i = 0\end{cases} \end{align} \]
Constraint \[t \ge \log(1+\exp(u)) \Leftrightarrow \exp(u-t) + \exp(-t) \le 1\]
Exponential function can be modeled via exponential cone: \[ \begin{align} z_1 \ge \exp(u-t) &\Leftrightarrow (z_1,1,u-t) \in K_{\rm exp}. \\ z_2 \ge \exp(-t) &\Leftrightarrow (z_2,1,-t) \in K_{\rm exp} \end{align} \]
Additional linear cone is necessary for \(z_1 + z_2 \le 1\).
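A numeric check of the equivalence \(t \ge \log(1+\exp(u)) \Leftrightarrow \exp(u-t)+\exp(-t) \le 1\), testing just above the boundary (where both sides hold) and just below it (where both fail):

```python
import numpy as np

for u in np.linspace(-5.0, 5.0, 21):
    t_star = np.logaddexp(0.0, u)      # boundary value: log(1 + exp(u))
    above, below = t_star + 1e-6, t_star - 1e-6
    # t slightly above the boundary: the cone-side inequality holds ...
    assert np.exp(u - above) + np.exp(-above) <= 1.0
    # ... and slightly below it fails, so the two conditions coincide.
    assert np.exp(u - below) + np.exp(-below) > 1.0
```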
3 steps necessary:
Advantage of ROI: flexible access to conic solvers for R users.
Weights: Masarotto and Varin (2012) suggest weights inversely proportional to a modified MLE with a small ridge penalty on the parameters.
\[w_{j}=\left|(D\tilde{\boldsymbol{\beta}}_{\epsilon})_j\right|^{-1}, \quad \tilde{\boldsymbol{\beta}}_{\epsilon}={\arg \min}_{\boldsymbol{\beta}} \left\{-\ell(\boldsymbol{\beta})+\epsilon \sum_{i}\beta_i^2\right\}.\]
Compute AGLL solution for a range of values of \(\lambda\).
Selection of \(\lambda\) through minimization of an AIC-type criterion (Tibshirani and Taylor 2011): \[{\rm AIC}(\lambda) = -2\ell(\boldsymbol{\beta})+2{\rm enp}(\lambda),\] \({\rm enp}(\lambda) \dots\) effective number of parameters, estimated as the number of distinct groups for \(\lambda\).
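A sketch of the selection rule (Python; `enp` counts distinct rounded coefficient values, and all numbers below are toy inputs):

```python
import numpy as np

def enp(beta):
    # effective number of parameters: number of distinct coefficient groups
    return len(np.unique(beta.round(6)))

def select_lambda(lambdas, betas, logliks):
    aic = [-2.0 * ll + 2.0 * enp(b) for b, ll in zip(betas, logliks)]
    return lambdas[int(np.argmin(aic))]

lambdas = [0.1, 1.0, 10.0]
betas = [np.array([1.0, 0.9, -0.5]),   # little shrinkage: 3 groups
         np.array([0.8, 0.8, -0.4]),   # two coefficients fused: 2 groups
         np.array([0.0, 0.0, 0.0])]    # everything fused to zero: 1 group
logliks = [-50.0, -50.5, -80.0]        # fit degrades as lambda grows
print(select_lambda(lambdas, betas, logliks))  # -> 1.0
```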
Classical CV procedure: