class: center, middle background-image: url(prp_qr.png) background-size: 10 % <br> # Curve clustering methods and their applications to sports analytics .large[Robert Bajons | Joint work with Kurt Hornik | 17 Dec. 2023] Vienna University of Economics and Business .large[CMStatistics 2023] .pull-left[ <img src="cmstat.png" width="100px"/> ] .pull-right[ <img src="WU_Logo.png" width="100px"/> ] <!-- <img src="Nessis.png" width="300px"/> --> <style type="text/css"> .pull-left-narrow { float: left; width: 10%; } .pull-right-wide { float: right; width: 85%; } /* Clear floats after the columns */ .pull-right-wide + * { clear: both; } .my-bold-col { font-weight: bold; font-size: 1.5em; color: #70e941; } .my-bold { font-weight: bold; font-size: 1em; color: black; } .my-red { font-weight: bold; font-size: 1em; color: red; } </style> --- class: inverse, center, middle # Motivation --- # Motivation - **Curves** (spatio-temporal observations) are common in sports analysis: - Movement of players on the pitch. - Sequences of actions on the pitch. `\(\Rightarrow\)` studying patterns in curves allows for interesting insights. -- - Depending on use case: assign weights to curves at each time point. - Finding structure in augmented data `\(\Rightarrow\)` clustering algorithm for weighted curves needed. - Weights helpful to distinguish curves according to their **worth** `\(\Rightarrow\)` obtain more representative patterns. -- - Examples: - **Pass rush in the NFL**: Derive pressure weights for each defender at each point in time. - **Possessions in soccer**: Assign a threat value at each point in time, e.g. probability of scoring a goal after `\(i\)` more actions. --- # Motivation Routes in the NFL: focus on **pass rush**. <img src="cmstat_bajons_files/figure-html/plot1.5-1.png" width="1350px" height="433px" style="display: block; margin: auto;" /> --- name: prwslide # Motivation Pass rush in the NFL: add **pressure weights**. <img src="cmstat_bajons_files/figure-html/plot2.5-1.png" width="576" style="display: block; margin: auto;" /> --- # Motivation Possessions in soccer. <img src="cmstat_bajons_files/figure-html/poss1-1.png" width="1008" style="display: block; margin: auto;" /> --- # Motivation Possessions in soccer with **goal scoring weights**. <img src="cmstat_bajons_files/figure-html/poss1.2-1.png" width="1008" style="display: block; margin: auto;" /> --- # Motivation Possessions in soccer with **goal scoring weights**. .panelset[ .panel[.panel-name[Poss1] <img src="cmstat_bajons_files/figure-html/poss1.3-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[Poss2] <img src="cmstat_bajons_files/figure-html/poss2-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[Poss3] <img src="cmstat_bajons_files/figure-html/poss3-1.png" width="1008" style="display: block; margin: auto;" /> ] ] --- class: inverse, center, middle # Methodology --- # Formal Description of the Problem - Observed data: `\(\boldsymbol{Z} = \{\boldsymbol{z}_1,\dots,\boldsymbol{z}_n\}\)`, set of `\(n\)` curves. - Each `\(\boldsymbol{z_i}\)` is comprised of observations `\(z_{it}\)`, `\(t = 1,\dots,T\)`. - In our use cases: `\(z_{it} = (x_i,y_i)_t\)` 2-dimensional. - At each `\(t\)`: a weight `\(w_{it}\)` is assigned to `\(z_{it}\)`. -- **Formal goal:** Cluster curves `\(\boldsymbol{z_i}\)` ( `\((x,y)\)`-coordinates) while accounting for information from weights `\(w\)` `\(\Rightarrow\)` estimate *good* prototypes (cluster representants). -- - Initial idea: Use a `\(K\)`-means type of algorithm for clustering. - Two adjustment of usual procedure necessary: - Curve data adjustment (multiple time point). - Weighted `\(K\)`-means adjustment (weights at each time point). - Note: In many applications data preprocessing necessary to use `\(K\)`-means algorithm. --- # From `\(K\)`-means to weighted `\(K\)`-means .panelset[ .panel[.panel-name[K-means] - Given a set of observations `\(\boldsymbol{X}\)`. Find partition into `\(K\)` clusters represented by centroids/prototypes `\(p_k\)`, `\(k = 1,\dots,K\)` , by minimizing the within cluster sum of squares `$$S = \sum_{k = 1}^K \sum_{i:g_i = k} (x_i-p_k)^2 \rightarrow \min_{(p_k),(g_i)}$$` .pull-left[ - Hard problem to solve `\(\Rightarrow\)` iterative procedure: - Fix cluster assignment: each observation is assigned to the closest cluster - Update optimal prototypes `\(p_k\)` for a given assignment: `$$p_k = \frac{1}{N_k}\sum_{i:g_i = k} x_i$$` ] .pull-right[ - Repeat until convergence: - Guaranteed to not increase objective function `\(S\)`. - Convergence only to *local* optimum `\(\Rightarrow\)` *vary* starting assignments or use heuristics / prior information for starting values. ] ] .panel[.panel-name[weighted K-means] - Given a set of observations `\(\boldsymbol{X}\)`, with each `\(x_i \in \boldsymbol{X}\)` associated with weight `\(w_i\)`. Find partition into `\(K\)` clusters by minimizing the weighted within cluster sum of squares `$$S_w = \sum_{k = 1}^K \sum_{i:g_i = k} w_i(x_i-p_k)^2.$$` - Solve via iterative procedure: - assignment based on weighted squared distance. - Prototypes (centers) `\(p_k\)` are weighted cluster means: `$$p_k = \frac{\sum_{i:g_i = k} w_ix_i}{\sum_{i:g_i = k} w_i}.$$` - Weights emphasize clustering on *important* observations. ] ] --- # `\(K\)`-means for weighted curves - Observations: time series objects of the same length `\(\boldsymbol{z}_i = (z_{i1},\dots,z_{iT})\)` (component-wise weighted object). - Additional information from weights `\(\boldsymbol{w}_i = (w_{i1},\dots,w_{iT})\)`: - Indicate how *interesting* observations are (use case dependent). - Differ at each time point `\(\Rightarrow\)` more and less *important* time points. - In the spirit of `\(K\)`-means type of algorithms: `$$\sum_{k = 1}^K \sum_{i:g_i = k} \sum_{t = 1}^{T} w_{it}(z_{it}-p_{tk})^2 \rightarrow \min_{(p_{tk}),(g_i)}.$$` --- # `\(K\)`-means for weighted curves Iterative refinement procedure: 1. Initialize appropriate starting assignments of clusters. 2. For given cluster assignment find the optimal prototypes (component-wise): `$$p_{kt} = \frac{\sum_{i:g_i = k} w_{it}z_{it}}{\sum_{i:g_i = k} w_{it}}.$$` 3. Given prototypes find optimal cluster assignments by minimizing `$$\sum_{t = 1}^T w_{it}(z_{it}-p_{kt})^2$$` over `\(k\)`. 4. Repeat steps 2-3 until convergence, i.e. until change in function to optimize is below some tolerance level. --- # Implementation in R Implementation hinges on three functions: - Consensus function `C`: computes the optimal prototypes for given cluster assignment. - Dissimilarity function `D`: computes (weighted) distance for each observation to cluster. - Error function `E`: evaluates objective function for given assignments and prototypes. Advantages: - **Simplicity**: easy implementation, easily adoptable for use cases with weighted curves (component-wise weighted objects). - **Efficiency**: Only looping over clusters necessary, making use of vectorized `R` operation. - **Flexibility**: - Extendable to allow for soft clustering. - Implementation of other distance measures intuitive (adjustment of `C`, `D` and `E`). --- # Extensions **What about soft clustering?** - Fuzzy `\(K\)`-means: derive optimal prototypes and allow for mixed membership of the observations to clusters. - Fuzzy extension of weighted `\(K\)`-means: `$$\sum_{i=1}^N\sum_{k = 1}^K \sum_{t = 1}^{T} u_{ik}^m w_{it}(z_{it}-p_{kt})^2 \rightarrow \min_{(u_{ik}),(p_{tk})}$$` -- - Solve via adapted iterative procedure: - optimal prototypes given membership degrees: `\(p_{kt} = \frac{\sum_{i = 1}^N u_{ik}^m w_{it}z_{it}}{\sum_{i = 1}^N u_{ik}^m w_{it}}\)`. - derive membership degrees given prototypes: `\(u_{ik} = \big(\sum_{j = 1}^K (\frac{d(z_i,w_i,p_k)}{d(z_i,w_i,p_{j})})^{\frac{1}{m-1}}\big)^{-1}\)`, where `\(d(z_i,w_i,p_k) = \sum_{t = 1}^T w_{it}(z_{it}-p_{kt})^2\)`. - Implementation in `R` follows similar idea as before `\(\Rightarrow\)` adjust `C` and `E`, implement function `U` for membership degrees. --- class: inverse, center, middle # Application --- # Pass rush in the NFL - NFL tracking data: - Routes for every player on every passing play of the first half of the 2021/22 season (~ 7200 plays). - Route information from start of the play until relevant event (pass/fumble). - `\(\sim\)` 80000 Routes of defensive players to cluster. - Initial preprocessing: - Standardize each play to line of scrimmage ( `\(x\)`-axis) and football ( `\(y\)`-axis). - Plays of different lengths `\(\Rightarrow\)` Use Bézier approximations to transform data to same dimension (necessary for `\(K\)`-means type algorithms). - Augment data with pressure weights: - Estimate probability of defender being able to pressure the quarterback `\(\Rightarrow\)` gradient boosting tree model. --- # Pass rush in the NFL .panelset[ .panel[.panel-name[Clusters] <img src="cmstat_bajons_files/figure-html/main_res-1.png" width="864" style="display: block; margin: auto;" /> ] .panel[.panel-name[Teams] <img src="cmstat_bajons_files/figure-html/plot_teams-1.png" width="1008" style="display: block; margin: auto;" /> ] .panel[.panel-name[Players] <img src="cmstat_bajons_files/figure-html/plot_radars-1.png" width="1008" style="display: block; margin: auto;" /> ] ] --- # Pass rush in the NFL <img src="cmstat_bajons_files/figure-html/main_res_fuzzy-1.png" width="864" style="display: block; margin: auto;" /> --- # Possessions in soccer - Event stream data: - Events from all matches from the 2017/18 season of 5 top leagues in Europe (Bundesliga, La Liga, Ligue 1, Premier League, Seria A) - Preprocessing: - Standardize such that offense always plays from left to right. - Build data set of possession: Sequence of consecutive actions by one team. - Consider only possession of length 6 `\(\Rightarrow\)` around 27000 possessions to cluster. - Augment data with goal scoring weights: - Combine models that value actions (VAEP, xT and xG). --- # Possessions in soccer <img src="cmstat_bajons_files/figure-html/poss_clustering-1.png" width="1008" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Conclusion --- # Conclusion - Sports data is full of spatio-temporal measurements (curves) `\(\Rightarrow\)` Finding patterns in the data allows interesting insights. - In most cases: augmenting curves with weights is quite intuitive `\(\Rightarrow\)` allows for clustering of interesting routes. - We derived a weighted `\(K\)`-means type of algorithm for component-wise weighted curves/objects. - Derivation of iterative refinement procedure. - Implementation in `R`. - Extension to fuzzy `\(K\)`-means for weighted curve data. - Applications: - Pass rush in the NFL, Possessions in soccer. - Weighted `\(K\)`-means provides more desirable clustering. - Clustering can be used for analyses or in downstream task. - Future work: - Time dependent fuzzy clustering. --- class: center, middle # Thank You! <img src="prp_qr_light.png" width="300px"/> Contact: [robert.bajons@wu.ac.at](mailto:robert.bajons@wu.ac.at)