Curve Clustering Methods for Analyzing Pass Rush in the NFL

class: center, middle
background-image: url(prp_qr.png)
background-size: 10 %

<br>

# Curve Clustering Methods for Analyzing Pass Rush in the NFL

.large[Robert Bajons | Joint work with Kurt Hornik | 23 Sept. 2023]

---
# Personal Motivation

- In Europe, we mostly play a different kind of football (soccer).

- Austria (a small country in central Europe) is not very good at it.

- By European standards, Austria is quite good at American football:
  - Austria is playing for the European title (against Finland) this year.
  - Last season the Vienna Vikings won the European League of Football (the European counterpart of the NFL).
  - NFL: Bernhard Raimann (Indianapolis Colts), Bernhard Seikovits (Arizona Cardinals, practice squad).

--
  
- About me: sports enthusiast, interested in statistical modelling of various sports (mainly soccer).

- Kurt Hornik (my supervisor): "Why don't we model real football?"

---
# Motivation

Routes in the NFL: focus on **pass rush**.

---
# Motivation

Pass rush in the NFL.

---
name: prwslide
# Motivation

Pass rush in the NFL: add pressure weights. ([Details](#wm))

---
# Motivation

- Assign weights to pass rush routes via a machine learning model.

- Analyse weighted curves instead of regular curves.

- Weights help to distinguish routes according to their **worth**.

- **Clustering algorithm for weighted curves needed.**

- **Note:** Not only useful for pass rush but for any type of route that can be assigned weights.

---
class: inverse, center, middle

# Methodology

---
# Formal Description of the Problem

- Observed data `$\boldsymbol{Y} = \{\boldsymbol{y}_1,\dots,\boldsymbol{y}_n\}$` of `$n$` curves.

- Each `$\boldsymbol{y}_i$` is an `$m_i \times 3$` matrix holding the `$(x,y)$`-coordinates and the weights `$w$`.

- `$m_i$` is not fixed but varies across observations (e.g. the length of a route in the NFL depends on the length of the play).

**Goal:** Cluster routes, i.e. `$(x,y)$`-coordinates, while accounting for the information in the weights `$w$`.

- Initial idea: Use a `$K$`-means type of algorithm for clustering.

- Adjust usual procedure by deriving a weighted `$K$`-means method.

- Data preprocessing necessary to use `$K$`-means algorithm.

---
# Bézier Curves

.pull-left[

- Parametric curves defined on `$t \in [0,1]$` by control points `$\boldsymbol{\theta} = (\theta_0,\dots,\theta_P)$` using Bernstein polynomials `$b_p^P(t) = \binom{P}{p}t^p(1-t)^{P-p}$`, `$p = 0,\dots,P$`.

`$$B(t,\boldsymbol{\theta}) = \sum_{p = 0}^P \theta_p b_p^P(t), \quad t \in [0,1]$$`

- The control points form a polygon, and the Bézier curve they define lies in the convex hull of these points.

]

.pull-right[

]
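
As an illustration of the definition above, the following minimal Python sketch evaluates a Bézier curve from its control points via Bernstein polynomials. The function names `bernstein` and `bezier` and the control points are hypothetical and chosen only for this example; the talk's own implementation is in R and may differ.

```python
from math import comb

def bernstein(p, P, t):
    """Bernstein polynomial b_p^P(t) = C(P, p) * t^p * (1 - t)^(P - p)."""
    return comb(P, p) * t**p * (1 - t)**(P - p)

def bezier(t, control_points):
    """Evaluate B(t, theta) for a list of (x, y) control points."""
    P = len(control_points) - 1
    x = sum(bernstein(p, P, t) * cx for p, (cx, _) in enumerate(control_points))
    y = sum(bernstein(p, P, t) * cy for p, (_, cy) in enumerate(control_points))
    return x, y

# Hypothetical control points, chosen only for illustration.
theta = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
start, mid, end = bezier(0.0, theta), bezier(0.5, theta), bezier(1.0, theta)
```

The curve starts at the first control point (`t = 0`) and ends at the last one (`t = 1`); the interior control points only pull the curve towards them.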

---
# Preprocessing Data

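- The trajectory of each observation `$\boldsymbol{y}_i$` is represented by a 2-dimensional Bézier curve of degree `$P = m_i - 1$`:

`$$\boldsymbol{y}_i(t) = \sum_{p = 0}^{P} b_p^P(t)\,(x_{p+1},y_{p+1}), \quad P = m_i - 1,$$`

  where `$(x_{p+1},y_{p+1})$` is the `$(p+1)$`-th observed point of route `$i$`.

- The control vector `$\boldsymbol{\theta}$` is given by the `$m_i$` observed `$(x,y)$` points of `$\boldsymbol{y}_i$`.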

.pull-left[

- In theory **any form** of interpolation between the points could be used. **Advantages** of Bézier curves:
  - Natural starting ( `$t = 0$` ) and end point ( `$t = 1$` ).
  - `$\Rightarrow$` Bézier curves allow for a very general smooth representation of the curve (helpful in the context of football routes).
  - Flexible usage for curves of different lengths.
]
.pull-right[
- **Usage** for `$K$`-means clustering:

- Use a prespecified number `$M$` of equidistant points on the above curve to represent `$\boldsymbol{y}_i$`, such that each observation has dimension `$M \times 2$`.
  - Aggregate weights to get `$M$` new weights by considering proximity and order of points.

]
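
The preprocessing step can be sketched as follows. This is a minimal Python illustration under two loudly stated assumptions: the `$M$` points are taken equidistant in the curve parameter `$t$`, and the weights are aggregated by linear interpolation between neighbouring original points. The slides do not specify the exact aggregation rule, so `resample` and the example route are hypothetical.

```python
from math import comb

def bezier(t, pts):
    """Bézier curve defined by control points pts = [(x, y), ...] at parameter t."""
    P = len(pts) - 1
    b = [comb(P, p) * t**p * (1 - t)**(P - p) for p in range(P + 1)]
    x = sum(bp * px for bp, (px, _) in zip(b, pts))
    y = sum(bp * py for bp, (_, py) in zip(b, pts))
    return x, y

def resample(route, M):
    """route: list of m >= 2 rows (x, y, w); returns M >= 2 rows (x, y, w)."""
    pts = [(x, y) for x, y, _ in route]
    ws = [w for _, _, w in route]
    m = len(route)
    out = []
    for j in range(M):
        t = j / (M - 1)              # grid equidistant in t (assumption)
        x, y = bezier(t, pts)
        s = t * (m - 1)              # fractional index into the original points
        lo = min(int(s), m - 1)
        hi = min(lo + 1, m - 1)
        frac = s - lo
        # Assumed aggregation rule: interpolate the weights of the two
        # neighbouring original points, respecting their order.
        w = (1 - frac) * ws[lo] + frac * ws[hi]
        out.append((x, y, w))
    return out

# Hypothetical route with m = 5 tracked points (x, y, weight).
route = [(0, 0, 0.1), (1, 2, 0.3), (3, 2, 0.5), (4, 0, 0.9), (5, -1, 0.7)]
curve = resample(route, M=3)
```

After this step every route has the same shape `$M \times 3$`, regardless of how many frames the original play lasted.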
---
# From `$K$`-means to weighted `$K$`-means

.panelset[
.panel[.panel-name[K-means]

- Given a set of observations `$\boldsymbol{Y}$`, find a partition into `$K$` clusters by minimizing the within-cluster sum of squares

`$$S = \sum_{k = 1}^K \sum_{i:g_i = k} (x_i-p_k)^2$$`
- Prototypes (centers) `$p_k$` are given as cluster means, where `$N_k$` is the number of observations in cluster `$k$`:

`$$p_k = \frac{1}{N_k}\sum_{i:g_i = k} x_i$$`
- Basic implementation: Use an iterative refinement procedure, which alternates between an assignment step and cluster update step.

]

.panel[.panel-name[weighted K-means]

- Given a set of observations `$\boldsymbol{Y}$`, where each `$y_i \in \boldsymbol{Y}$` can be assigned a weight `$w_i$`, find a partition into `$K$` clusters by minimizing the weighted within-cluster sum of squares

`$$S_w = \sum_{k = 1}^K \sum_{i:g_i = k} w_i(x_i-p_k)^2$$`
- Prototypes (centers) `$p_k$` are given as weighted cluster means:

`$$p_k = \frac{\sum_{i:g_i = k} w_ix_i}{\sum_{i:g_i = k} w_i}$$`
- Adapt the iterative refinement procedure to the case of weighted observations.

]
]
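
A short check of why the weighted mean is the optimal prototype, assuming a fixed cluster assignment and strictly positive weights: setting the derivative of `$S_w$` with respect to `$p_k$` to zero gives

`$$\frac{\partial S_w}{\partial p_k} = -2\sum_{i:g_i = k} w_i(x_i-p_k) = 0 \quad\Longleftrightarrow\quad p_k = \frac{\sum_{i:g_i = k} w_ix_i}{\sum_{i:g_i = k} w_i}.$$`

With `$w_i = 1$` for all `$i$`, the denominator equals `$N_k$` and the usual cluster mean is recovered.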

---
# Weighted `$K$`-means for Curves

- Transform the observations from preprocessing: each `$\boldsymbol{y}_i \in \mathbb{R}^{M \times 3}$` is split into two vectors `$\boldsymbol{z}_i, \boldsymbol{w}_i \in \mathbb{R}^{2M}$`.

`$$\boldsymbol{z}_i = (x_1,\dots,x_M,y_1,\dots,y_M)$$`
`$$\boldsymbol{w}_i = (w_1,\dots,w_M,w_1,\dots,w_M)$$`
- Find clusters and prototypes such that (with `$\tilde{M} = 2M$`):

`$$\min_{(p_{k,j}),(g_i)}\sum_{k = 1}^K \sum_{i:g_i = k} \sum_{j = 1}^{\tilde{M}} w_{i,j}(z_{i,j}-p_{k,j})^2.$$`
- Find clusters by iteratively alternating between finding the optimal prototypes for given cluster assignments and finding the optimal cluster assignments for given prototypes.
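
The split of a preprocessed observation into the two vectors described on this slide can be sketched in a few lines of Python. The helper name `split_observation` and the example route are hypothetical.

```python
def split_observation(y):
    """y: list of M rows (x, y, w). Returns (z, w_vec), each of length 2M."""
    xs = [row[0] for row in y]
    ys = [row[1] for row in y]
    ws = [row[2] for row in y]
    z = xs + ys        # (x_1, ..., x_M, y_1, ..., y_M)
    w_vec = ws + ws    # weights repeated for the x-block and the y-block
    return z, w_vec

# Hypothetical preprocessed route with M = 3 points.
route = [(0.0, 1.0, 0.2), (1.0, 2.0, 0.5), (2.0, 4.0, 0.9)]
z, w_vec = split_observation(route)
```

Repeating the weights means that the `$x$`- and `$y$`-coordinate of the same point always enter the objective with the same weight.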

---
# Weighted `$K$`-means Algorithm

1. Initialize the cluster assignments with appropriate starting values.

2. For the given cluster assignments, find the optimal prototypes:
`$$p_{k,j} = \frac{\sum_{i:g_i = k} w_{i,j}z_{i,j}}{\sum_{i:g_i = k} w_{i,j}}$$`
3. Given the prototypes, find the optimal cluster assignment of each observation `$i$` by minimizing
`$$\sum_{j = 1}^{2M} w_{i,j}(z_{i,j}-p_{k,j})^2$$`
over `$k$`.

4. Repeat steps 2-3 until convergence, i.e. until the change in the objective function falls below some tolerance level.
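
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the R implementation from the talk: the function name `weighted_kmeans`, the handling of empty clusters (prototype set to zero) and the toy data are assumptions made for this example.

```python
def weighted_kmeans(Z, W, g, K, tol=1e-9, max_iter=100):
    """Z, W: lists of n coordinate and weight vectors of equal length.
    g: starting assignments in 0..K-1. Returns (g, prototypes, objective)."""
    n, d = len(Z), len(Z[0])

    def cost(i, p):
        # Weighted squared distance of observation i to prototype p.
        return sum(W[i][j] * (Z[i][j] - p[j]) ** 2 for j in range(d))

    prev = float("inf")
    for _ in range(max_iter):
        # Step 2: weighted mean per cluster and coordinate.
        protos = []
        for k in range(K):
            members = [i for i in range(n) if g[i] == k]
            p = []
            for j in range(d):
                den = sum(W[i][j] for i in members)
                num = sum(W[i][j] * Z[i][j] for i in members)
                p.append(num / den if den > 0 else 0.0)  # empty cluster: assumption
            protos.append(p)
        # Step 3: assign each observation to its cheapest prototype.
        g = [min(range(K), key=lambda k: cost(i, protos[k])) for i in range(n)]
        # Step 4: stop once the objective no longer decreases noticeably.
        obj = sum(cost(i, protos[g[i]]) for i in range(n))
        if prev - obj < tol:
            break
        prev = obj
    return g, protos, obj

# Toy data: two well-separated groups, unit weights, poor starting assignment.
Z = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
W = [[1.0, 1.0]] * 4
g, protos, obj = weighted_kmeans(Z, W, g=[0, 1, 0, 1], K=2)
```

For curves, each row of `Z` would hold the stacked coordinates `$\boldsymbol{z}_i$` and each row of `W` the repeated weights `$\boldsymbol{w}_i$`.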

---
# Weighted `$K$`-means Recap

- Weighted `$K$`-means algorithm provides clustering of curves, i.e. `$(x,y)$`-coordinates, by accounting for weights `$w$`.

- Spatial distance between curves is **less** important when weights are small and **more** important when weights are high.

- **Irrelevant** routes are likely to be clustered together, even if spatially far apart.

- The algorithm is implemented in the R programming language.
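
A tiny numerical illustration of the effect of the weights, written in Python with made-up numbers: a route far from a prototype but with low weights can have a smaller weighted distance than a nearby route with high weights.

```python
def weighted_sq_dist(z, w, p):
    """Weighted squared distance sum_j w_j * (z_j - p_j)^2."""
    return sum(wj * (zj - pj) ** 2 for zj, wj, pj in zip(z, w, p))

prototype = [0.0, 0.0]
# Hypothetical values: spatially far away, but low weights.
far_low = weighted_sq_dist([10.0, 10.0], [0.01, 0.01], prototype)
# Hypothetical values: spatially close, but high weights.
near_high = weighted_sq_dist([2.0, 2.0], [0.9, 0.9], prototype)
```

Here `far_low` is about 2 and `near_high` about 7.2, so the far-away low-weight route counts as closer to the prototype. This is why irrelevant routes tend to end up in the same cluster.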

---
class: inverse, center, middle

# Application

---
name: res1
# Clustering Results
<img src="nessis_RB_files/figure-html/plot_clusters-1.png" width="1008" style="display: block; margin: auto;" />
[More comparisons](#res_pr_only1)

---
# Clustering Results
<img src="nessis_RB_files/figure-html/plot_clusters2-1.png" width="1008" style="display: block; margin: auto;" />

---
# Analysing Teams

.panelset[
.panel[.panel-name[Defense]
<img src="nessis_RB_files/figure-html/plot_teams-1.png" width="1008" style="display: block; margin: auto;" />
]
.panel[.panel-name[Offense]
<img src="nessis_RB_files/figure-html/plot_teams_def_more-1.png" width="1008" style="display: block; margin: auto;" />
]
]

---
# Analysing Players

.panelset[
.panel[.panel-name[Profiles]
<img src="nessis_RB_files/figure-html/plot_players-1.png" width="1008" style="display: block; margin: auto;" />
]
.panel[.panel-name[Radars]
<img src="nessis_RB_files/figure-html/plot_radars-1.png" width="1008" style="display: block; margin: auto;" />
]
]

---
# Play Database for Pass Rush

.panelset[
.panel[.panel-name[Idea]

.pull-left[
Easy filtering of pass rush plays by several kinds of information:
- Team
- Players
- Routes/Clusters
- Play-specific information (yard line, down, pressure, etc.)

Useful for:
- Preparation for game
- Analysis of strengths and weaknesses
- Scouting 
]
.pull-right[
Example:
- Plays with Myles Garrett from the right and Jadeveon Clowney from the left.
- Both on high-pressure routes (cluster 10 on the right, cluster 7 on the left).
- Pressure exerted on the QB.
]
]
.panel[.panel-name[Example]
<div class="reactable html-widget html-fill-item-overflow-hidden html-fill-item" id="htmlwidget-7973f1737f8cc9070e5f" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-7973f1737f8cc9070e5f">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"team":["CLE","CLE","CLE","CLE","CLE"],"uId":["2021091209_2293","2021091209_3639","2021092601_693","2021101708_1650","2021103103_1080"],"possessionTeam":["KC","KC","CHI","ARI","PIT"],"quarter":[3,4,1,2,2],"down":[3,3,1,3,2],"gameClock":["10:31","03:41","04:18","10:12","10:39"],"plays":[{"nflId":[38667,41227,42816,44813,44847,44903,46073,46122,47829,53455,53481],"cl":[1,7,3,10,1,2,3,2,3,4,8],"pressure":[0,1,0,0,0,0,0,0,0,0,0],"coln":["X7591","X7592","X7593","X7594","X7595","X7596","X7597","X7598","X7599","X7600","X7601"],"positionGroup":["DL","DL","Backs","DL","DL","S","Backs","S","Backs","Backs","LB"],"pff_role_orig":["Pass Rush","Pass Rush","Coverage","Pass Rush","Pass Rush","Coverage","Coverage","Coverage","Coverage","Coverage","Pass Rush"],"displayName":["Malik Jackson","Jadeveon Clowney","Troy Hill","Myles Garrett","Malik McDowell","John Johnson","Denzel Ward","M.J. Stewart","Greedy Williams","Greg Newsome","Jeremiah Owusu-Koramoah"]},{"nflId":[37317,38667,41227,42816,44813,44847,44903,44974,46073,46122,53455],"cl":[2,1,7,4,10,7,2,4,3,2,4],"pressure":[0,0,1,0,1,1,0,0,0,0,0],"coln":["X7756","X7757","X7758","X7759","X7760","X7761","X7762","X7763","X7764","X7765","X7766"],"positionGroup":["LB","DL","DL","Backs","DL","DL","S","LB","Backs","S","Backs"],"pff_role_orig":["Coverage","Pass Rush","Pass Rush","Coverage","Pass Rush","Pass Rush","Coverage","Coverage","Coverage","Coverage","Coverage"],"displayName":["Malcolm Smith","Malik Jackson","Jadeveon Clowney","Troy Hill","Myles Garrett","Malik McDowell","John Johnson","Anthony Walker","Denzel Ward","M.J. 
Stewart","Greg Newsome"]},{"nflId":[37317,38667,41227,42816,44813,44847,44903,46073,46162,53455,53481],"cl":[2,1,7,3,10,7,2,4,4,3,4],"pressure":[0,0,1,0,0,0,0,0,0,0,0],"coln":["X22331","X22332","X22333","X22334","X22335","X22336","X22337","X22338","X22339","X22340","X22341"],"positionGroup":["LB","DL","DL","Backs","DL","DL","S","Backs","S","Backs","LB"],"pff_role_orig":["Coverage","Pass Rush","Pass Rush","Coverage","Pass Rush","Pass Rush","Coverage","Coverage","Coverage","Coverage","Coverage"],"displayName":["Malcolm Smith","Malik Jackson","Jadeveon Clowney","Troy Hill","Myles Garrett","Malik McDowell","John Johnson","Denzel Ward","Ronnie Harrison","Greg Newsome","Jeremiah Owusu-Koramoah"]},{"nflId":[38667,41227,42816,44813,44847,44903,44974,46073,46162,53455,53481],"cl":[8,7,4,10,7,2,2,4,3,2,2],"pressure":[0,1,0,1,0,0,0,0,0,0,0],"coln":["X58246","X58247","X58248","X58249","X58250","X58251","X58252","X58253","X58254","X58255","X58256"],"positionGroup":["DL","DL","Backs","DL","DL","S","LB","Backs","S","Backs","LB"],"pff_role_orig":["Pass Rush","Pass Rush","Coverage","Pass Rush","Pass Rush","Coverage","Coverage","Coverage","Coverage","Coverage","Coverage"],"displayName":["Malik Jackson","Jadeveon Clowney","Troy Hill","Myles Garrett","Malik McDowell","John Johnson","Anthony Walker","Denzel Ward","Ronnie Harrison","Greg Newsome","Jeremiah Owusu-Koramoah"]},{"nflId":[37317,38667,41227,42816,44813,44847,44903,44974,47829,52452,53455],"cl":[3,8,7,4,10,10,2,2,4,4,3],"pressure":[0,0,0,0,1,1,0,0,0,0,0],"coln":["X71809","X71810","X71811","X71812","X71813","X71814","X71815","X71816","X71817","X71818","X71819"],"positionGroup":["LB","DL","DL","Backs","DL","DL","S","LB","Backs","S","Backs"],"pff_role_orig":["Coverage","Pass Rush","Pass Rush","Coverage","Pass Rush","Pass Rush","Coverage","Coverage","Coverage","Coverage","Coverage"],"displayName":["Malcolm Smith","Malik Jackson","Jadeveon Clowney","Troy Hill","Myles Garrett","Malik McDowell","John Johnson","Anthony Walker","Greedy 
Williams","Grant Delpit","Greg Newsome"]}]},"columns":[{"id":"team","name":"team","type":"character"},{"id":"uId","name":"uId","type":"character"},{"id":"possessionTeam","name":"possessionTeam","type":"character"},{"id":"quarter","name":"quarter","type":"numeric"},{"id":"down","name":"down","type":"numeric"},{"id":"gameClock","name":"gameClock","type":"character"},{"id":"plays","name":"plays","type":"list"}],"bordered":true,"dataKey":"9b6383b5e4d47a6717cb98e876320092"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
]
]

---
class: center, middle

# Thank You!

Contact: [robert.bajons@wu.ac.at](mailto:robert.bajons@wu.ac.at)

---
class: inverse, center, middle

# Appendix

---
name: wm
# Modeling Pressure Weights

.pull-left[
Modeling:
- Extreme gradient boosted trees.
- Hyperparameter tuning via 5-fold cross-validation on a grid.
- Model trained by minimizing log loss.
- Evaluation on a hold-out set.
]
.pull-right[
Features:
+ Distance to the Quarterback: distance of defensive player, average distance of offense (total and nearest 6), average distance of defense (total and nearest 6, considered player excluded).
+ Distance to opponents: Nearest opponent, 2nd nearest opponent, 3rd nearest opponent, average distance to nearest 6 opponents.
+ Speed: QB speed and defender speed.
+ Acceleration.
+ Distance traveled since snap.
+ Pressure on play (hit, hurry, sack).
]

[hop back](#prwslide)

---
name:res_pr_only1
# Pass Rush Routes only

.panelset[
.panel[.panel-name[Routes]
<img src="nessis_RB_files/figure-html/plot_clusters_pr_only-1.png" width="1008" style="display: block; margin: auto;" />
]
.panel[.panel-name[Pressure]
<img src="nessis_RB_files/figure-html/plot_clusters_pr_only2-1.png" width="1008" style="display: block; margin: auto;" />
]
]
[Back](#res1)

---
name:res_pr_only2
# Pass Rush Routes only

- Analyse the predictive performance of the clusters in terms of correctly classifying pressure on a play.

- Do the clusters help in predicting pressure?

- Fit a GLM to predict pressure for each individual route (with only the cluster as predictor).

- The table below shows results for several metrics (Brier score, accuracy, recall, precision, F-score, Matthews correlation coefficient, area under the ROC curve) for weighted `$K$`-means (wkm) and standard `$K$`-means (km).

<div class="reactable html-widget html-fill-item-overflow-hidden html-fill-item" id="htmlwidget-713d0b69ab0b3665a412" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-713d0b69ab0b3665a412">{"x":{"tag":{"name":"Reactable","attribs":{"data":{".rownames":["wkm","km"],"BS":[0.088,0.091],"acc":[0.833,0.784],"rec":[0.481,0.544],"prec":[0.346,0.281],"F":[0.403,0.371],"MCC":[0.013,0.012],"AUC":[0.763,0.75]},"columns":[{"id":".rownames","name":"","type":"character","align":"center","sortable":false,"filterable":false,"rowHeader":true},{"id":"BS","name":"BS","type":"numeric","align":"center","style":[{"color":"#008000","fontWeight":"bold"},{"color":"#e00000","fontWeight":"bold"}]},{"id":"acc","name":"acc","type":"numeric","align":"center","style":[{"color":"#008000","fontWeight":"bold"},{"color":"#e00000","fontWeight":"bold"}]},{"id":"rec","name":"rec","type":"numeric","align":"center","style":[{"color":"#e00000","fontWeight":"bold"},{"color":"#008000","fontWeight":"bold"}]},{"id":"prec","name":"prec","type":"numeric","align":"center","style":[{"color":"#008000","fontWeight":"bold"},{"color":"#e00000","fontWeight":"bold"}]},{"id":"F","name":"F","type":"numeric","align":"center","style":[{"color":"#008000","fontWeight":"bold"},{"color":"#e00000","fontWeight":"bold"}]},{"id":"MCC","name":"MCC","type":"numeric","align":"center","style":[{"color":"#008000","fontWeight":"bold"},{"color":"#e00000","fontWeight":"bold"}]},{"id":"AUC","name":"AUC","type":"numeric","align":"center","style":[{"color":"#008000","fontWeight":"bold"},{"color":"#e00000","fontWeight":"bold"}]}],"highlight":true,"bordered":true,"dataKey":"1490bfdc26e3a917d876dffff90602e2"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
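
The evaluation idea can be sketched in Python as follows. The sketch assumes a logistic GLM with the cluster as the only categorical predictor, in which case the fitted probability for a route is simply the observed pressure rate of its cluster. The classification threshold of 0.5, the helper names and the toy data are assumptions; the slides do not state the threshold used, and the Matthews correlation coefficient and AUC are omitted for brevity.

```python
def cluster_rates(clusters, pressure):
    """Observed pressure rate per cluster (fitted values of a logistic GLM
    with the cluster as the only categorical predictor)."""
    totals, hits = {}, {}
    for c, y in zip(clusters, pressure):
        totals[c] = totals.get(c, 0) + 1
        hits[c] = hits.get(c, 0) + y
    return {c: hits[c] / totals[c] for c in totals}

def metrics(probs, y, threshold=0.5):
    """Brier score plus confusion-matrix metrics at an assumed threshold."""
    pred = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, y) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, y) if p == 0 and t == 0)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    return {
        "BS": sum((p - t) ** 2 for p, t in zip(probs, y)) / len(y),
        "acc": (tp + tn) / len(y),
        "rec": rec,
        "prec": prec,
        "F": 2 * prec * rec / (prec + rec) if prec + rec else 0.0,
    }

# Toy data (hypothetical): cluster label and pressure indicator per route.
clusters = [1, 1, 1, 1, 2, 2, 2, 2]
pressure = [0, 0, 0, 1, 1, 1, 0, 1]
rates = cluster_rates(clusters, pressure)
result = metrics([rates[c] for c in clusters], pressure)
```

Running the same evaluation once with weighted and once with standard `$K$`-means cluster labels gives a comparison of the kind shown in the table.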