Curve clustering methods and their applications to sports analytics

class: center, middle
background-image: url(prp_qr.png)
background-size: 10 %

<br>

# Curve clustering methods and their applications to sports analytics

.large[Robert Bajons | Joint work with Kurt Hornik | 17 Dec. 2023]

Vienna University of Economics and Business

.large[CMStatistics 2023]

.pull-left[
<img src="cmstat.png" width="100px"/>
]
.pull-right[
<img src="WU_Logo.png" width="100px"/>
]

.pull-right-wide {
  float: right;
  width: 85%;
}

/* Clear floats after the columns */
.pull-right-wide + * { clear: both; }

.my-bold-col {
  font-weight: bold;
  font-size: 1.5em;
  color: #70e941;
}
.my-bold {
  font-weight: bold;
  font-size: 1em;
  color: black;
}

.my-red {
  font-weight: bold;
  font-size: 1em;
  color: red;
}

</style>

---
class: inverse, center, middle

# Motivation

---
# Motivation

- **Curves** (spatio-temporal observations) are common in sports analysis:
  
    - Movement of players on the pitch.
    - Sequences of actions on the pitch.  
    
  `$\Rightarrow$` studying patterns in curves allows for interesting insights.

- Depending on use case: assign weights to curves at each time point.

- Finding structure in augmented data `$\Rightarrow$` clustering algorithm for weighted curves needed.

- Weights helpful to distinguish curves according to their **worth** `$\Rightarrow$` obtain more representative patterns.

- Examples: 
  - **Pass rush in the NFL**: Derive pressure weights for each defender at each point in time.
  - **Possessions in soccer**: Assign a threat value at each point in time, e.g. probability of scoring a goal after `$i$` more actions.

---
# Motivation

Routes in the NFL: focus on **pass rush**.

---
name: prwslide
# Motivation

Pass rush in the NFL: add **pressure weights**.

---
# Motivation

Possessions in soccer.

---
# Motivation

Possessions in soccer with **goal scoring weights**.

---
# Motivation

Possessions in soccer with **goal scoring weights**.

.panelset[
.panel[.panel-name[Poss1]

]
.panel[.panel-name[Poss2]

]
.panel[.panel-name[Poss3]

]
]

---
class: inverse, center, middle

# Methodology

---
# Formal Description of the Problem

- Observed data: `$\boldsymbol{Z} = \{\boldsymbol{z}_1,\dots,\boldsymbol{z}_n\}$`, set of `$n$` curves.

- Each `$\boldsymbol{z_i}$` is comprised of observations `$z_{it}$`, `$t = 1,\dots,T$`.
  
  - In our use cases: `$z_{it} = (x_i,y_i)_t$` 2-dimensional.

- At each `$t$`: a weight `$w_{it}$` is assigned to `$z_{it}$`.

**Formal goal:** Cluster curves `$\boldsymbol{z_i}$` ( `$(x,y)$`-coordinates) while accounting for information from weights `$w$` `$\Rightarrow$` estimate *good* prototypes (cluster representants).

- Initial idea: Use a `$K$`-means type of algorithm for clustering.

- Two adjustment of usual procedure necessary:

- Curve data adjustment (multiple time point).
  - Weighted `$K$`-means adjustment (weights at each time point).

- Note: In many applications data preprocessing necessary to use `$K$`-means algorithm.

---
# From `$K$`-means to weighted `$K$`-means

.panelset[
.panel[.panel-name[K-means]

- Given a set of observations `$\boldsymbol{X}$`. Find partition into `$K$` clusters represented by centroids/prototypes `$p_k$`, `$k = 1,\dots,K$` , by minimizing the within cluster sum of squares

`$$S = \sum_{k = 1}^K \sum_{i:g_i = k} (x_i-p_k)^2 \rightarrow \min_{(p_k),(g_i)}$$`

.pull-left[
- Hard problem to solve `$\Rightarrow$` iterative procedure:

- Fix cluster assignment: each observation is assigned to the closest cluster
  - Update optimal prototypes `$p_k$` for a given assignment:

`$$p_k = \frac{1}{N_k}\sum_{i:g_i = k} x_i$$`
]
.pull-right[
- Repeat until convergence:

- Guaranteed to not increase objective function `$S$`.
  - Convergence only to *local* optimum `$\Rightarrow$` *vary* starting assignments or use heuristics / prior information for starting values.
  
]

]

.panel[.panel-name[weighted K-means]

- Given a set of observations `$\boldsymbol{X}$`, with each `$x_i \in \boldsymbol{X}$` associated with weight `$w_i$`. Find partition into `$K$` clusters by minimizing the weighted within cluster sum of squares

`$$S_w = \sum_{k = 1}^K \sum_{i:g_i = k} w_i(x_i-p_k)^2.$$`

- Solve via iterative procedure:
  
  - assignment based on weighted squared distance.
  - Prototypes (centers) `$p_k$` are weighted cluster means:

`$$p_k = \frac{\sum_{i:g_i = k} w_ix_i}{\sum_{i:g_i = k} w_i}.$$`

- Weights emphasize clustering on *important* observations.
]
]

---
# `$K$`-means for weighted curves

- Observations: time series objects of the same length `$\boldsymbol{z}_i = (z_{i1},\dots,z_{iT})$` (component-wise weighted object).

- Additional information from weights `$\boldsymbol{w}_i = (w_{i1},\dots,w_{iT})$`:

- Indicate how *interesting* observations are (use case dependent).
  - Differ at each time point `$\Rightarrow$` more and less *important* time points.
  
  
- In the spirit of `$K$`-means type of algorithms:

`$$\sum_{k = 1}^K \sum_{i:g_i = k} \sum_{t = 1}^{T} w_{it}(z_{it}-p_{tk})^2 \rightarrow \min_{(p_{tk}),(g_i)}.$$`

---
# `$K$`-means for weighted curves

Iterative refinement procedure:

1. Initialize appropriate starting assignments of clusters.

2. For given cluster assignment find the optimal prototypes (component-wise):
`$$p_{kt} = \frac{\sum_{i:g_i = k} w_{it}z_{it}}{\sum_{i:g_i = k} w_{it}}.$$`
3. Given prototypes find optimal cluster assignments by minimizing 
`$$\sum_{t = 1}^T w_{it}(z_{it}-p_{kt})^2$$`
over `$k$`.

4. Repeat steps 2-3 until convergence, i.e. until change in function to optimize is below some tolerance level.

---
# Implementation in R

Implementation hinges on three functions:
  
  - Consensus function `C`: computes the optimal prototypes for given cluster assignment.
  - Dissimilarity function `D`: computes (weighted) distance for each observation to cluster.
  - Error function `E`: evaluates objective function for given assignments and prototypes.
  
Advantages:

- **Simplicity**: easy implementation, easily adoptable for use cases with weighted curves (component-wise weighted objects).
  - **Efficiency**: Only looping over clusters necessary, making use of vectorized `R` operation.
  - **Flexibility**: 
    - Extendable to allow for soft clustering.
    - Implementation of other distance measures intuitive (adjustment of `C`, `D` and `E`).

---
# Extensions

**What about soft clustering?**

- Fuzzy `$K$`-means: derive optimal prototypes and allow for mixed membership of the observations to clusters.

- Fuzzy extension of weighted `$K$`-means:

`$$\sum_{i=1}^N\sum_{k = 1}^K \sum_{t = 1}^{T} u_{ik}^m w_{it}(z_{it}-p_{kt})^2 \rightarrow \min_{(u_{ik}),(p_{tk})}$$`
--

- Solve via adapted iterative procedure:

- optimal prototypes given membership degrees: `$p_{kt} = \frac{\sum_{i = 1}^N u_{ik}^m w_{it}z_{it}}{\sum_{i = 1}^N u_{ik}^m w_{it}}$`.
  - derive membership degrees given prototypes: `$u_{ik} = \big(\sum_{j = 1}^K (\frac{d(z_i,w_i,p_k)}{d(z_i,w_i,p_{j})})^{\frac{1}{m-1}}\big)^{-1}$`, where `$d(z_i,w_i,p_k) = \sum_{t = 1}^T w_{it}(z_{it}-p_{kt})^2$`.

- Implementation in `R` follows similar idea as before `$\Rightarrow$` adjust `C` and `E`, implement function `U` for membership degrees.

---
class: inverse, center, middle

# Application

---
# Pass rush in the NFL

- NFL tracking data:

- Routes for every player on every passing play of the first half of the 2021/22 season (~ 7200 plays).
  - Route information from start of the play until relevant event (pass/fumble).
  - `$\sim$` 80000 Routes of defensive players to cluster.

- Initial preprocessing:
  
  - Standardize each play to line of scrimmage ( `$x$`-axis) and football ( `$y$`-axis).
  - Plays of different lengths `$\Rightarrow$` Use Bézier approximations to transform data to same dimension (necessary for `$K$`-means type algorithms).
  
- Augment data with pressure weights:

- Estimate probability of defender being able to pressure the quarterback  `$\Rightarrow$` gradient boosting tree model.

---
# Pass rush in the NFL

.panelset[
.panel[.panel-name[Clusters]

]
.panel[.panel-name[Teams]

]

.panel[.panel-name[Players]

]
]

---
# Pass rush in the NFL

---
# Possessions in soccer

- Event stream data:

- Events from all matches from the 2017/18 season of 5 top leagues in Europe (Bundesliga, La Liga, Ligue 1, Premier League, Seria A)

- Preprocessing:
  
  - Standardize such that offense always plays from left to right.
  - Build data set of possession: Sequence of consecutive actions by one team.
  - Consider only possession of length 6 `$\Rightarrow$` around 27000 possessions to cluster.
  
- Augment data with goal scoring weights:

- Combine models that value actions (VAEP, xT and xG).

---
# Possessions in soccer

---
class: inverse, center, middle

# Conclusion

---
# Conclusion

- Sports data is full of spatio-temporal measurements (curves) `$\Rightarrow$` Finding patterns in the data allows interesting insights.

- In most cases: augmenting curves with weights is quite intuitive `$\Rightarrow$` allows for clustering of interesting routes.

- We derived a weighted `$K$`-means type of algorithm for component-wise weighted curves/objects.
  
  - Derivation of iterative refinement procedure.
  - Implementation in `R`.
  - Extension to fuzzy `$K$`-means for weighted curve data.

- Applications:

- Pass rush in the NFL, Possessions in soccer.
  - Weighted `$K$`-means provides more desirable clustering.
  - Clustering can be used for analyses or in downstream task.
  
- Future work:
  
  - Time dependent fuzzy clustering.
  
    
---
class: center, middle

# Thank You!

Contact: [robert.bajons@wu.ac.at](mailto:robert.bajons@wu.ac.at)