class: center, middle, inverse, title-slide

.title[
# Curve Clustering Methods and their Applications to Sports Analytics
]
.author[
### Robert Bajons, joint work with Kurt Hornik
]
.institute[
### BBS 2023
]
.date[
### 2023-07-25
]

---
class: inverse, center, middle

# Motivation

---

# Motivation

<center>
<img src="play1.gif" width=800 height=300>
</center>

---

# Motivation

Routes in the NFL.

<img src="BBS_Bajons_files/figure-html/plot1-1.png" width="1250px" height="433px" style="display: block; margin: auto;" />

---

# Motivation

Routes in the NFL: focus on pass rush.

<img src="BBS_Bajons_files/figure-html/plot1.5-1.png" width="1250px" height="433px" style="display: block; margin: auto;" />

---

# Motivation

Pass rush in the NFL.

<img src="BBS_Bajons_files/figure-html/plot2-1.png" width="576" style="display: block; margin: auto;" />

---

# Motivation

Pass rush in the NFL: add pressure weights.

<img src="BBS_Bajons_files/figure-html/plot2.5-1.png" width="576" style="display: block; margin: auto;" />

---

# Motivation

.pull-left[
- Routes or curves are quite common in sports analysis.
- In many problems the trajectories come as weighted observations.
- Analyse weighted curves instead of curves alone.
- A clustering algorithm for weighted curves is needed.
]

--

.pull-right[
- Examples: pass rush in the NFL, possession in soccer.
- Derive pressure weight models.
- Weights are helpful to distinguish between *useful* routes and *worthless* routes.
- Possible advantage/strength: being able to analyze strong players/teams in more detail.
]

---
class: inverse, center, middle

# Methodology

---

# Formal Description of the Problem

- Observed data `\(\boldsymbol{Y} = \{\boldsymbol{y}_1,\dots,\boldsymbol{y}_n\}\)` of `\(n\)` curves.
- Each `\(\boldsymbol{y}_i\)` is an `\(m_i \times 3\)`-dimensional matrix of `\((x,y)\)`-coordinates and weights `\(w\)`.
- `\(m_i\)` is not fixed but varies for each observation (e.g. routes in the NFL depend on the length of the play).

--

**Goal:** Cluster the routes, i.e. the `\((x,y)\)`-coordinates, while accounting for the information from the weights `\(w\)`.

--

- Initial idea: use a `\(K\)`-means type of algorithm for clustering.
- Adjust the usual procedure by deriving a weighted `\(K\)`-means method.
- Data preprocessing is necessary before the `\(K\)`-means algorithm can be applied.

---

# Bézier Curves

.pull-left[
- Parametric curves defined on `\(t \in [0,1]\)` by control points `\(\boldsymbol{\theta} = (\theta_0,\dots,\theta_P)\)` using Bernstein polynomials `\(b_p^P(t) = \binom{P}{p}t^p(1-t)^{P-p}\)`, `\(p = 0,\dots,P\)`:
`$$B(t,\boldsymbol{\theta}) = \sum_{p = 0}^P \theta_p b_p^P(t), \quad t \in [0,1]$$`
- The control points form a polygon, and the Bézier curve defined by them lies in the convex hull of these points.
]

.pull-right[
<img src="BBS_Bajons_files/figure-html/plotBC-1.png" width="576" style="display: block; margin: auto;" />
]

---

# Preprocessing Data

- The trajectory of each observation `\(\boldsymbol{y}_i\)` is represented by a 2-dimensional Bézier curve:
`$$\boldsymbol{y}_i(t) = \sum_{p = 0}^{P} b_p^{P}(t)\,(x_p,y_p), \quad P = m_i - 1$$`
- The control vector `\(\boldsymbol{\theta}\)` is given by the `\((x,y)\)` points observed from `\(\boldsymbol{y}_i\)`.
- In theory any form of interpolation between the points could be used. Advantages of Bézier curves:
  - Natural starting point ( `\(t = 0\)` ) and end point ( `\(t = 1\)` ).
  - `\(\Rightarrow\)` Bézier curves allow for a very general smooth representation of the curve (helpful in the context of football routes).
  - Flexible usage for curves of different lengths.
- Usage for `\(K\)`-means clustering (see the sketch on the next slide):
  - Use a prespecified number of `\(M\)` equidistant points on the above curve to represent `\(\boldsymbol{y}_i\)`, such that each observation is of dimension `\(M \times 2\)`.
  - Aggregate the weights to obtain `\(M\)` new weights by considering the proximity and order of the points.
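---

# Preprocessing Data: Sketch

A minimal R sketch of the resampling step (illustrative only, not the implementation used for the analysis): the observed `\((x,y)\)` positions serve as control points and the resulting Bézier curve is evaluated at `\(M\)` equidistant values of `\(t\)`. The aggregation of the weights is omitted, and the function names and the toy route are made up.

```r
## Bernstein polynomial b_p^P(t)
bernstein <- function(t, p, P) choose(P, p) * t^p * (1 - t)^(P - p)

## Represent one route by the Bezier curve through its observed positions
## and resample it at M equidistant values of t.
bezier_resample <- function(xy, M = 20) {
  P <- nrow(xy) - 1                           # degree = number of points - 1
  t_grid <- seq(0, 1, length.out = M)
  B <- outer(t_grid, 0:P, bernstein, P = P)   # M x (P + 1) basis matrix
  B %*% xy                                    # M x 2 matrix of resampled points
}

## Toy route of 5 tracked positions (illustrative data only)
route <- cbind(x = c(0, 1, 2, 3, 5), y = c(0, 0.5, 0.2, -0.3, 0))
bezier_resample(route, M = 10)
```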
---

# From `\(K\)`-means to weighted `\(K\)`-means

.panelset[
.panel[.panel-name[K-means]
- Given a set of observations `\(\boldsymbol{Y}\)`, find a partition into `\(K\)` clusters by minimizing the within-cluster sum of squares
`$$S = \sum_{k = 1}^K \sum_{i:g_i = k} (x_i-p_k)^2$$`
- Prototypes (centers) `\(p_k\)` are given as cluster means:
`$$p_k = \frac{1}{N_k}\sum_{i:g_i = k} x_i$$`
- Basic implementation: use an iterative refinement procedure, which alternates between an assignment step and a cluster update step.
]
.panel[.panel-name[weighted K-means]
- Given a set of observations `\(\boldsymbol{Y}\)`, where each `\(y_i \in \boldsymbol{Y}\)` can be assigned a weight `\(w_i\)`, find a partition into `\(K\)` clusters by minimizing the weighted within-cluster sum of squares
`$$S_w = \sum_{k = 1}^K \sum_{i:g_i = k} w_i(x_i-p_k)^2$$`
- Prototypes (centers) `\(p_k\)` are given as weighted cluster means:
`$$p_k = \frac{\sum_{i:g_i = k} w_ix_i}{\sum_{i:g_i = k} w_i}$$`
- Adapt the iterative refinement procedure to the case of weighted observations.
]
]

---

# Weighted `\(K\)`-means for Curves

- Transform the observations from the preprocessing step: `\(\boldsymbol{y}_i \in \mathbb{R}^{M \times 3}\)` is split up into vectors `\(\boldsymbol{z}_i, \boldsymbol{w}_i \in \mathbb{R}^{2M}\)`:
`$$\boldsymbol{z}_i = (x_1,\dots,x_M,y_1,\dots,y_M)$$`
`$$\boldsymbol{w}_i = (w_1,\dots,w_M,w_1,\dots,w_M)$$`
- Find clusters and prototypes such that:
`$$\min_{(p_{k,j}),(g_i)}\sum_{k = 1}^K \sum_{i:g_i = k} \sum_{j = 1}^{2M} w_{i,j}(z_{i,j}-p_{k,j})^2.$$`
- Find the clusters by iteratively alternating between finding the optimal prototypes for a given cluster assignment and finding the optimal cluster assignments for given prototypes.

---

# Weighted `\(K\)`-means Algorithm

1. Initialize appropriate starting cluster assignments.
2. For the given cluster assignment find the optimal prototypes:
`$$p_{k,j} = \frac{\sum_{i:g_i = k} w_{i,j}z_{i,j}}{\sum_{i:g_i = k} w_{i,j}}$$`
3. Given the prototypes find the optimal cluster assignments by minimizing
`$$\sum_{j = 1}^{2M} w_{i,j}(z_{i,j}-p_{k,j})^2$$`
over `\(k\)`.
4. Repeat steps 2-3 until convergence, i.e. until the change in the objective function is below some tolerance level.

A small R sketch of this procedure is given on the next slide.
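---

# Weighted `\(K\)`-means Algorithm: Sketch

A minimal R sketch of the iteration above (illustrative, not the implementation used for the analysis). `Z` holds the stacked coordinates `\(\boldsymbol{z}_i\)` as rows and `W` the corresponding weights `\(\boldsymbol{w}_i\)`; the random initialization is a simplification, and empty clusters are not handled.

```r
weighted_kmeans <- function(Z, W, K, max_iter = 100, tol = 1e-8) {
  n <- nrow(Z)
  g <- sample(rep_len(1:K, n))             # step 1: random starting assignment
  obj_old <- Inf
  for (iter in seq_len(max_iter)) {
    ## step 2: weighted prototypes p_{k,j}
    P <- t(sapply(1:K, function(k) {
      idx <- which(g == k)
      colSums(W[idx, , drop = FALSE] * Z[idx, , drop = FALSE]) /
        colSums(W[idx, , drop = FALSE])
    }))
    ## step 3: assign each curve to the prototype with the smallest
    ## weighted squared distance
    D <- sapply(1:K, function(k) rowSums(W * sweep(Z, 2, P[k, ])^2))
    g <- max.col(-D, ties.method = "first")
    ## step 4: stop once the objective no longer decreases
    obj <- sum(D[cbind(1:n, g)])
    if (obj_old - obj < tol) break
    obj_old <- obj
  }
  list(cluster = g, prototypes = P, objective = obj)
}
```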
---

# Weighted `\(K\)`-means Recap

- The weighted `\(K\)`-means algorithm provides a clustering of the curves, i.e. the `\((x,y)\)`-coordinates, by accounting for the weights `\(w\)`.
- The spatial distance of curves is **less** important when the weights are small and **more** important when the weights are high.
- *Irrelevant* routes are likely to be clustered together, even if they are spatially very far from each other.

Disadvantages of weighted `\(K\)`-means:
- Heavy preprocessing is necessary in order to apply the algorithm to curves of different lengths.
- Hard clustering.

---
class: inverse, center, middle

# Application

---

# Data

- NFL tracking data:
  - Routes for every player on every passing play of the first half of the 2021/22 season (~ 8000 plays).
  - Route information from the start of the play until the relevant event (pass/fumble).
  - A number of player- and play-specific details.
- `\(\sim\)` 80000 routes of defensive players to cluster:
  - Initial preprocessing steps comprise standardizing each play to the line of scrimmage ( `\(x\)`-axis) and the football ( `\(y\)`-axis).
  - Routes are labeled into pass rushers and coverage players.

---

# Clustering Results

<img src="BBS_Bajons_files/figure-html/plot_clusters-1.png" width="1008" style="display: block; margin: auto;" />

[More comparisons in Appendix](#res_pr_only1)

---

# Clustering Results

<img src="BBS_Bajons_files/figure-html/plot_clusters2-1.png" width="1008" style="display: block; margin: auto;" />

---

# Analysing Players

<img src="BBS_Bajons_files/figure-html/plot_players-1.png" width="1008" style="display: block; margin: auto;" />

---

# Analysing Teams

.panelset[
.panel[.panel-name[Defense]
<img src="BBS_Bajons_files/figure-html/plot_teams_def-1.png" width="1008" style="display: block; margin: auto;" />
]
.panel[.panel-name[Offense]
<img src="BBS_Bajons_files/figure-html/plot_teams_off-1.png" width="1008" style="display: block; margin: auto;" />
]
]

---

# Play Database for Coaches

- Suppose a coach wants to prepare for a game:
  - Analyse a specific defensive player of the opponent team.
  - Analyse specific strengths of a team.
- The coach could go through all of the videotapes of the season to see how the opponent player/team handled situations.
- Instead, the coach could filter the database for specific cluster routes of interest and only analyse the relevant tapes.
- Example:
  - Analyse a strong pass rusher who generally plays on the left but is used flexibly and is also strong on the right side.
  - Filter for the player's strong right-side cluster plays and analyse only the relevant tapes (filter for pressure as well).

---
class: inverse, center, middle

# More Work To Do...

---

# Regression Mixture Models

- Following Chu, Reyers, Thomson et al. (2020) consider the following regression model for curves `\(\boldsymbol{y}_i\)`:
`\begin{equation} \boldsymbol{y}_i = \boldsymbol{T}_i\boldsymbol{\theta} + \boldsymbol{\epsilon}_i. \end{equation}`
- `\(\boldsymbol{T}_i\)` is an `\(m_i \times (P+1)\)`-matrix of Bernstein polynomials evaluated at each timepoint `\(t_j, j = 1,\dots,m_i\)`.
- `\(\boldsymbol{\epsilon}_i\)` is a Gaussian error term.
- In the simplest case this yields a probabilistic model for each curve, such that `\(\boldsymbol{y}_i\)` given `\(\boldsymbol{T}_i\)` follows `\(\mathcal{N}(\boldsymbol{T}_i\boldsymbol{\theta}, \sigma^2 I)\)`.
- The curves are assumed to be centered around a Bézier curve, where the control points `\(\boldsymbol{\theta}\)` describing the mean curve (as well as the variance `\(\sigma^2\)`) need to be estimated.

---

# Regression Mixture Models

- Incorporate the cluster structure by assuming a mixture model for each curve, i.e.:
`\begin{equation} P(\boldsymbol{y}_i|\boldsymbol{T}_i,\Theta) = \sum_{k=1}^K \alpha_k p_k(\boldsymbol{y}_i|\boldsymbol{T}_i\boldsymbol{\theta}_k,\sigma_k^2), \end{equation}`
where `\(p_k\)` is the normal density.
- Following Gaffney and Smyth (2004) an EM algorithm can be derived to estimate the cluster-specific parameters (a small sketch follows on the next slide).
- Idea: adapt the procedure using weighted likelihood methods (cf. Newton and Raftery (1994)).
- So far work in progress...
- Advantages:
  - No need for heavy preprocessing.
  - Soft clustering.
  - At the same time, weighting the observations in the likelihood has a similar effect: irrelevant routes have less influence on the cluster estimation.
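---

# Regression Mixture Models: Sketch

An illustrative sketch of one E- and M-step for the (unweighted) regression mixture, following the structure of Gaffney and Smyth (2004); it is not the final method. Curves are stored as a list of `\(m_i \times 2\)` response matrices `y_list` with matching Bernstein design matrices `T_list`, and `theta` is a list of `\((P+1) \times 2\)` control-point matrices. The updates of `\(\alpha_k\)` and `\(\sigma_k^2\)` as well as the weighted-likelihood adaptation are omitted, and all names are made up.

```r
## Bernstein design matrix T_i for observation times t in [0, 1]
bernstein_design <- function(t, P) {
  sapply(0:P, function(p) choose(P, p) * t^p * (1 - t)^(P - p))
}

## E-step: responsibilities r_{ik} proportional to
## alpha_k * N(y_i | T_i theta_k, sigma_k^2 I)
e_step <- function(y_list, T_list, theta, sigma2, alpha) {
  loglik <- sapply(seq_along(alpha), function(k)
    mapply(function(y, Tm)
      sum(dnorm(y, mean = Tm %*% theta[[k]], sd = sqrt(sigma2[k]), log = TRUE)),
      y_list, T_list) + log(alpha[k]))
  r <- exp(loglik - apply(loglik, 1, max))   # stabilize before normalizing
  r / rowSums(r)
}

## M-step for the control points: responsibility-weighted least squares
m_step_theta <- function(y_list, T_list, r) {
  lapply(seq_len(ncol(r)), function(k) {
    TT <- Reduce(`+`, Map(function(Tm, ri) ri * crossprod(Tm), T_list, r[, k]))
    Ty <- Reduce(`+`, Map(function(Tm, y, ri) ri * crossprod(Tm, y),
                          T_list, y_list, r[, k]))
    solve(TT, Ty)                            # (P + 1) x 2 control points
  })
}
```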
---
class: inverse, center, middle

# Thank You!

---
name: mylastslide

# References

Chu, D., M. Reyers, J. Thomson, et al. (2020). "Route identification in the National Football League". In: _Journal of Quantitative Analysis in Sports_ 16.2, pp. 121-132. DOI: [10.1515/jqas-2019-0047](https://doi.org/10.1515/jqas-2019-0047). URL: [https://doi.org/10.1515/jqas-2019-0047](https://doi.org/10.1515/jqas-2019-0047).

Gaffney, S. J. and P. Smyth (2004). "Probabilistic Curve-Aligned Clustering and Prediction with Regression Mixture Models". AAI3119737.

Hastie, T., R. Tibshirani, and J. Friedman (2009). _The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition_. Springer Series in Statistics. Springer New York. ISBN: 9780387848587. URL: [https://books.google.at/books?id=tVIjmNS3Ob8C](https://books.google.at/books?id=tVIjmNS3Ob8C).

James, G., D. Witten, T. Hastie, et al. (2021). _An Introduction to Statistical Learning: with Applications in R_. Springer Texts in Statistics. Springer US. ISBN: 9781071614174. URL: [https://books.google.at/books?id=g5gezgEACAAJ](https://books.google.at/books?id=g5gezgEACAAJ).

Miller, A. C. and L. Bornn (2017). "Possession Sketches: Mapping NBA Strategies". In: _Proceedings of the 2017 MIT Sloan Sports Analytics Conference_.

Newton, M. A. and A. E. Raftery (1994). "Approximate Bayesian Inference with the Weighted Likelihood Bootstrap". In: _Journal of the Royal Statistical Society. Series B (Methodological)_ 56.1, pp. 3-48. ISSN: 00359246. URL: [http://www.jstor.org/stable/2346025](http://www.jstor.org/stable/2346025) (visited on Jun. 15, 2023).

---
class: inverse, center, middle

# Appendix

---
name:res_pr_only1

# Pass Rush Routes only

<img src="BBS_Bajons_files/figure-html/plot_clusters_pr_only-1.png" width="1008" style="display: block; margin: auto;" />

---
name:res_pr_only2

# Pass Rush Routes only

- Analyse the predictive performance of the clusters in terms of correctly classifying pressure on a play.
- Do the clusters help in predicting pressure?
- The table below shows results based on different kinds of metrics (Brier score, Accuracy, Recall, Precision, F-score, Matthews correlation coefficient, Area under the ROC curve); a hedged sketch of how such metrics can be computed from predicted pressure probabilities is given below.
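An illustrative R sketch (made-up inputs, not the evaluation code used for the table): `pressure` holds observed 0/1 pressure outcomes, `p_hat` the predicted pressure probabilities, and the 0.5 cutoff for the class-based metrics is an assumption; the AUC is computed via the rank-based Mann-Whitney statistic.

```r
evaluate_pressure <- function(pressure, p_hat, cutoff = 0.5) {
  pred <- as.integer(p_hat >= cutoff)
  tp <- sum(pred == 1 & pressure == 1); tn <- sum(pred == 0 & pressure == 0)
  fp <- sum(pred == 1 & pressure == 0); fn <- sum(pred == 0 & pressure == 1)
  ## keep counts as doubles so the MCC products cannot overflow integers
  tp <- as.numeric(tp); tn <- as.numeric(tn)
  fp <- as.numeric(fp); fn <- as.numeric(fn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  n1 <- sum(pressure == 1); n0 <- sum(pressure == 0)
  c(brier     = mean((p_hat - pressure)^2),
    accuracy  = (tp + tn) / length(pressure),
    recall    = recall,
    precision = precision,
    f_score   = 2 * precision * recall / (precision + recall),
    mcc       = (tp * tn - fp * fn) /
                sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    ## AUC via the Mann-Whitney statistic (rank-based, handles ties)
    auc       = (sum(rank(p_hat)[pressure == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0))
}

## Example with made-up data
set.seed(1)
pressure <- rbinom(200, 1, 0.3)
p_hat    <- plogis(qlogis(0.3) + 2 * pressure + rnorm(200))
evaluate_pressure(pressure, p_hat)
```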