NBA Game Prediction

Abstract

When working with an NBA dataset, we wanted to find the best way to represent a network-like structure among the teams, and we reasoned that the amount of time players spend on the court with one another would prove useful. After extracting this network and attaching player statistics to each node, we use GraphSAGE, a framework that learns an embedding for each player from its node features, and aggregate each team's players to predict whether or not the team can make the playoffs.

1. Introduction

GraphSAGE proposes a framework that can identify the structural properties of a node's neighborhood as well as the node's role in the overall graph.[1] GraphSAGE uses node features to learn a general embedding function that simultaneously encodes the structure of a node's neighborhood and the distribution of features within it.[1] Additionally, GraphSAGE can be compared to graph convolutional networks (GCNs), which extend the convolutional neural networks used on image data in computer vision; images are just special cases of graphs. GCNs apply simple "convolutions" to graph data to learn predictive features from weighted sums of progressively more complicated nonlinear relationships between nodes. GCNs share weights among the nodes in a graph to more easily learn their inherently uneven distribution. GraphSAGE allows GCNs to be applied in an inductive, unsupervised manner with trainable aggregation functions, which facilitates machine learning tasks on graphs that are not fixed and necessarily evolve over time, such as social networks.

In this paper, we investigate the social network of individual NBA players and the relationship between each team. We address how to implement GraphSAGE to learn over the NBA players dataset and find what separates teams that make the playoffs from those that do not. We will then run this model on current NBA players and their season statistics to see which team is predicted to win given its current lineup. Our dataset suits this application because its graph has sparse, high-dimensional, but meaningful connections that we can compress into a low-dimensional embedding. For this problem statement, we perform graph classification. The dataset has a total of 120 data points, each with a set of node features: every season contributes 30 data points (one per team), and we use 4 seasons in total to train our model. We use the team name as our top-level node, and each such node contains the players of that team, each carrying that player's statistics, with edges between players representing the amount of time they are on the court together. We label each data point 1 or 0 to indicate whether or not the team made the playoffs.
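As a rough sketch of the data layout described above, one data point could be stored as follows. The class name `TeamGraph`, its field names, and all values are hypothetical placeholders chosen for illustration; they are not taken from our code or from real statistics.

```python
from dataclasses import dataclass, field

@dataclass
class TeamGraph:
    """One data point: a single team in a single season (hypothetical layout)."""
    team: str
    season: int
    made_playoffs: int  # label: 1 = made the playoffs, 0 = did not
    # player name -> list of per-player statistics (the node features)
    node_features: dict = field(default_factory=dict)
    # (player, player) -> time the pair spent on the court together
    edges: dict = field(default_factory=dict)

# Placeholder values only; these are not real statistics.
example = TeamGraph(team="AAA", season=2016, made_playoffs=1)
example.node_features["player_1"] = [20.0, 5.0, 7.0]
example.node_features["player_2"] = [12.0, 3.0, 4.0]
example.edges[("player_1", "player_2")] = 1500.0

# 30 teams per season over 4 seasons gives the 120 data points.
seasons = [1, 2, 3, 4]  # season indices standing in for the actual years
dataset = [
    TeamGraph(team=f"team_{i}", season=s, made_playoffs=0)  # dummy labels
    for s in seasons
    for i in range(30)
]
```

Keeping the label on the team rather than on each player matches the graph classification framing, where the whole team graph receives a single 1 or 0.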

The final goal is to use team and player statistics to predict which teams will make the playoffs, leveraging the player and team statistics from the 2015-2018 season to predict the probability of making the NBA playoffs for each team given its roster. In order to have enough training data, we chose the data from the 2015-2019 season (we may use more data as the project progresses). Because it is hard to predict a future season (there is not enough data for the current season), we chose to predict the 2018-2019 season. We also use the data we found to create edges between the players on a team, so that players are connected by how much playing time they share with one another.

2. Datasets and Data Preparation

a) NBA_Players.csv
This CSV file contains the statistics of all NBA players for the 2016-2021 NBA seasons. It contains features such as player name, team, ppg, apg, rpg, and so on. We use these features later as the players' statistics in the training part.

b) NBA_Teams.csv
This CSV file contains the statistics of all NBA teams for the 2016-2020 NBA seasons. It contains features such as team name, abbrev, year, playoff, and standing (labeled 0 or 1 for did not make the playoffs and made the playoffs, respectively). We use these features later in the testing part.

c) events_[]-[]_pbp.cs
This CSV file contains a historical account of the NBA play-by-play data by season. The dataset contains home and away player IDs as well as the time during the game, which we use to cross-reference with the player statistics. We have devised an algorithm that uses this dataset to determine the total minutes that each pair of players spends on the court together. Our script downloads the dataset from the website and reads it in as a CSV, so the user does not have to handle the large dataset manually.

3. Data Analysis

We spent a lot of time on data scraping because the datasets we use are very complicated. It was important for us to decide which feature vectors to use as the players' and teams' statistics, since we rely on those features later on.

When thinking about what makes an NBA team great, we looked through different statistics and settled on individual points per game, assists, rebounds, field goal percentage, minutes, and games played. The aggregated totals of each player on the team, in conjunction with the amount of time played together, would help us determine how successful a team's network of players can be. These made up our initial datasets: one in which we collected individual player statistics, and another with just the teams and a boolean value indicating whether or not they made the playoffs.

A main chunk of our project then revolved around cleaning and extracting the information we wanted from a large play-by-play dataset that covers each and every NBA game in a single season. For each team network that had to be created, we gathered all the games the team participated in and analyzed the times at which players subbed out for one another. Using Python and pandas, we created an algorithm that adds an edge between every pair of players in the lineup and adjusts its weight according to the time in the game. Once we had collected all edges from every game in the NBA season, we grouped the duplicate edges and summed them to get the total playing time shared between each pair of players on the team. These are the edges between players that we needed to create each individual team's network.
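The edge-building step can be illustrated with a minimal sketch. It uses only the Python standard library rather than pandas, and it assumes the play-by-play data has already been reduced to "stints", meaning a lineup together with the number of seconds it stayed on the court between two substitutions. The function name `coplay_edges`, the stint format, and the sample values are assumptions made for illustration, not our actual script.

```python
from collections import defaultdict
from itertools import combinations

def coplay_edges(stints):
    """Sum shared court time for every pair of teammates.

    `stints` is a hypothetical pre-processed form of the play-by-play data:
    each item is (lineup, seconds), where `lineup` is the set of player IDs
    on the court for one team between two substitutions.
    """
    totals = defaultdict(float)
    for lineup, seconds in stints:
        # Sorting makes (a, b) and (b, a) collapse into one undirected edge.
        for a, b in combinations(sorted(lineup), 2):
            totals[(a, b)] += seconds
    return dict(totals)

# Made-up stints with three players each, to keep the example short.
stints = [
    ({"irving", "james", "love"}, 120.0),
    ({"irving", "james", "smith"}, 60.0),
]
edges = coplay_edges(stints)
```

Summing into a dictionary keyed by the sorted pair plays the same role as grouping duplicate edges and summing them: the pair that appears in both stints ends up with one edge whose weight is the combined time.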

4. Methods

4.1 GraphSAGE

GraphSAGE allows machine learning methods to effectively utilize graph data. With GraphSAGE, it is no longer necessary to spend extra time training a new embedding for a graph that is seen only at run time and might need to be classified quickly. The GraphSAGE embedding learned from a large, representative training graph is serviceable for the given task.

4.1.1 Embedding generation algorithm

We first use GraphSAGE's embedding algorithm to embed the nodes. It takes as input a graph G(V, E) and vector representations of the node features X_v, ∀v ∈ V, and outputs embeddings z_v for every node in the graph.[1] For our graph classification task, we use 30 nodes that represent the 30 NBA teams as individual data points, and each node is filled with players' statistics. There are no edges between these nodes. Each node contains the 15 players of that team with the players' statistics as their features, and we use the amount of time players are on the court together as edges to connect the players.
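To make the embedding step concrete, here is a minimal sketch of a single GraphSAGE layer with a mean aggregator, written in plain Python. It follows the general form in [1] (aggregate the neighbors, concatenate with the node's own vector, apply a linear map and nonlinearity, then normalize), but it omits neighbor sampling and training, and it ignores edge weights. The function name, the tiny graph, and the weight matrix are assumptions for illustration only.

```python
import math

def sage_mean_layer(features, neighbors, weight):
    """One GraphSAGE layer with a mean aggregator (no sampling, no training).

    features:  {node: [float, ...]} input vectors of length d
    neighbors: {node: [node, ...]} adjacency lists
    weight:    list of rows, each of length 2 * d
    """
    out = {}
    for v, h_v in features.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            # Element-wise mean of the neighbors' feature vectors.
            agg = [sum(features[u][i] for u in nbrs) / len(nbrs)
                   for i in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        concat = h_v + agg  # the node's own vector followed by the aggregate
        # Linear map followed by ReLU.
        z = [max(0.0, sum(w * x for w, x in zip(row, concat)))
             for row in weight]
        # L2-normalize the result; leave all-zero vectors unchanged.
        norm = math.sqrt(sum(x * x for x in z))
        out[v] = [x / norm for x in z] if norm > 0 else z
    return out

# Toy graph with three players and two features each (placeholder values).
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
weight = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
emb = sage_mean_layer(features, neighbors, weight)
```

In a trained model the weight matrix would be learned, and stacking two such layers would let each player's embedding reflect teammates up to two hops away.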

Figure 1: Graph Classification example for Cleveland Cavaliers' 2016 season

Based on the example in Figure 1, we can see that each player on the team is connected with the others, and among all the links, the darkest 5 paths show the players who carry a larger weight in playing time with one another. As a result, the figure can indicate who the main 5 players on that team are. Going off of this example, we can see Kyrie Irving, Tristan Thompson, LeBron James, Kevin Love, and J.R. Smith being a large presence within this team. We believe that this unique combination of mutual playing time and categorical statistics will allow GraphSAGE to use its aggregation to the fullest extent and control its weights based on the impact each player has within their team's node.

Figure 2: Player network for the 2016 season.

Our graph in Figure 2 represents all the players within the same network for the 2016 NBA season. Each visible clump represents a team, while the outer nodes are players who were traded or moved during the season. We learn over a network such as this using GraphSAGE, predicting a label for each "clump" and trying to figure out which characteristics of an NBA team help it reach the playoffs.

4.2 Loss Function

To train our model, we apply a loss function to the graph data and use stochastic gradient descent to propagate the error back through the network, progressively learning better representations even in the absence of labels. The softmax function outputs a vector that represents a probability distribution over the classes. It is applied to the output of the last layer of our graph classification network, which is the last layer of our GraphSAGE model. The cross-entropy loss we use already performs the softmax calculation in addition to taking the negative log-likelihood over the data, so an early mistake in our implementation was including a separate softmax layer when it was in fact unnecessary [6].
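The softmax mistake described above can be reproduced with a small standard-library sketch. It is not our training code: it simply computes cross-entropy as the negative log-likelihood of a log-softmax, which is how the loss documented in [6] treats raw scores, and then shows what happens when probabilities that have already been through softmax are fed in as if they were raw scores. The function names and the logit values are illustrative assumptions.

```python
import math

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class, computed from raw scores."""
    return -log_softmax(logits)[target]

# Raw scores for classes (0 = no playoffs, 1 = playoffs); values are made up.
logits = [0.0, 2.0]
correct = cross_entropy(logits, target=1)

# Mistake: apply softmax first, then pass the probabilities in as raw scores,
# so softmax is effectively applied twice.
probs = [math.exp(v) for v in log_softmax(logits)]
double = cross_entropy(probs, target=1)
```

Because the probabilities all lie between 0 and 1, the second softmax flattens the distribution, so in this example the loss for a confident, correct prediction comes out larger than it should.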

5. Result

We report the results of graph classification performed by our models on the CSV datasets. Our baseline models used a learning rate of 0.05 and were trained for 200 epochs to calculate our model's accuracy.

GraphSAGE helps to infer node label probabilities for unlabeled nodes by specifying what a neighborhood in the graph should look like and building a model based on those neighborhoods to capture both the local and global structure of node features. GraphSAGE builds an impressively versatile model: it can be applied to multiple graph machine learning tasks, it can classify data using far fewer labels, and it can be trained on large-scale datasets, since the training data does not need to be held in memory for prediction. Its learned aggregation functions allow it to create node embeddings that range in specificity, learning to recognize simpler or more complex relationships depending on the task at hand. Graph classification with GraphSAGE can therefore make predictions with a simpler model, using lower-dimensional feature representations extracted from a graph.

6. Conclusion

In this paper, we have outlined graph classification over graph data. GraphSAGE identifies important node communities whose features can help learn representations of never-before-seen node distributions. We thought that using a network of NBA players would create a more emphasized community of about 5-8 core players within a team and place less emphasis on the role players. GraphSAGE can be applied to completely unlabeled data, in addition to being a practical semi-supervised approach that enables analysis of large-scale, dynamic graph datasets like social networks. For our report, we use GraphSAGE to embed the neighborhoods and aggregate each team in order to predict which teams can go to the playoffs.

7. Future Work

In GraphSAGE, a node's neighbors do not have a natural ordering, so the aggregation must operate over an unordered set of vectors. Because the aggregator is symmetric, it can be applied to the neighborhood features in an arbitrary order while remaining trainable and maintaining a high representational capacity.

Pooling aggregator

The pooling aggregator is also symmetric and trainable. The pooling approach is parametric: it feeds each of a node's neighbor vectors through a trainable fully connected layer and then pools them using a max operation. The result is then concatenated with the original node's vector and sent through the neural network layer and nonlinearity to become the node's representation in the next iteration. Unlike the mean aggregator, pooling requires the network to take time to learn extra weights and biases, with the benefit of creating more custom embeddings for a more specific task. Through these aggregation functions, we can train our GraphSAGE model to learn the optimal way to compute a weighted average or pooling of node features, while at the same time learning to generate node embeddings that represent the graph data well.
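A minimal sketch of the max-pooling aggregation step, assuming a ReLU nonlinearity and a hand-picked weight matrix and bias in place of learned parameters (all names and values here are illustrative):

```python
def pool_aggregate(neighbor_vecs, weight, bias):
    """Max-pooling aggregator: ReLU(W x + b) per neighbor, then element-wise max."""
    transformed = [
        [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
         for row, b in zip(weight, bias)]
        for vec in neighbor_vecs
    ]
    # Element-wise max across neighbors; the result ignores neighbor order.
    return [max(col) for col in zip(*transformed)]

# Two neighbors with two features each (placeholder values).
vecs = [[1.0, 2.0], [3.0, -1.0]]
weight = [[1.0, 0.0],
          [0.0, 1.0]]
bias = [0.0, 0.5]

pooled = pool_aggregate(vecs, weight, bias)
pooled_reversed = pool_aggregate(list(reversed(vecs)), weight, bias)
```

Reversing the neighbor list gives the same pooled vector, which is the symmetry property discussed above. The concatenation with the node's own vector and the second layer are left out to keep the sketch short.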

In terms of the coding portion, we were not able to fully represent and compute the mean aggregates of a team's nodes. Currently, running our GraphSAGE model predicts, for each player on a team, whether or not they made the playoffs. For future work, we would like to incorporate the mean pooling aggregator so that we can reduce the dimensionality of all players into a single team node. This would likely increase our test accuracy and provide a more reliable model to build on.
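One simple way to collapse player-level embeddings into a single team-level vector is a mean readout, sketched below. This is an assumption about how the planned aggregation could look, not code from our project, and the embedding values are placeholders.

```python
def mean_readout(player_embeddings):
    """Average player embeddings into one fixed-length team vector."""
    n = len(player_embeddings)
    return [sum(col) / n for col in zip(*player_embeddings)]

# Three players with two-dimensional embeddings (placeholder values).
team_vector = mean_readout([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
```

The resulting vector has the same length regardless of roster size, so a classifier on top of it would make one playoff prediction per team rather than one per player.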

More to improve

Pertaining specifically to our project, we believe there were other factors that we could have changed in order to produce a higher accuracy. The biggest issue was the data that we used from eightthirtyfour: because it did not come from an official NBA source, there were a couple of things that we had to revise manually. The algorithm that we created had to be revised multiple times because of errors in the data. In addition, although we did our best to clean the data that we needed to cross-reference, there may have been names that were misspelled or that contained punctuation that did not match. We believe that combining a playing-time network with statistics is a nice proof of concept to build on, and with cleaner data there are definitely ways to improve our use of GraphSAGE. Later on, we can also implement a GCN for our model so that we can compare its results with those of the GraphSAGE model used in this report.

8. Responsibility

- MengYuan Shi: Responsible for researching the paper and writing the report, as well as the visualizations.

- Austin Le: Responsible for data scraping, data cleaning, and the coding, as well as the report.

References

[1] William L. Hamilton, Rex Ying, and Jure Leskovec. (2018). Inductive Representation Learning on Large Graphs.

[2] Hongwei Wang and Jure Leskovec. (2020). Unifying Graph Convolutional Neural Networks and Label Propagation.

[3] Relational Dataset Website. Retrieved from https://eightthirtyfour.com/data.

[4] Relational Dataset Website. Retrieved from:
https://www.basketball-reference.com/leagues/NBA2021.html.
https://www.basketball-reference.com/leagues/NBA2020.html.
https://www.basketball-reference.com/leagues/NBA2019.html.
https://www.basketball-reference.com/leagues/NBA2018.html.
https://www.basketball-reference.com/leagues/NBA2017.html.
https://www.basketball-reference.com/leagues/NBA2016.html.

[5] T.N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.

[6] CrossEntropyLoss. PyTorch. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html.

Appendix

A. Project Proposal

For our project, we will investigate the social network of individual NBA players and the relationship between each team. The goal of this project is to use team statistics or players' statistics and analysis to predict who wins games, leveraging the player and team statistics from the 2015-2019 season to predict the winners of NBA games for the 2019-2020 season. In order to have enough training data, we chose the data from the 2015-2019 season (we may use more data as we do the project). Because it is hard to predict a future season (there is not enough data for the current season), we chose to predict the 2019-2020 season.

For our domain this quarter, we discussed graph analysis and learned methods such as GCN, node2vec, random walks, and so on. In the first week we learned how to measure different kinds of centrality in a graph. We can use degree centrality to get the top player prediction as part of our project. A top player's node will generally sit at the “center” of its team, because such players are considered the core that the team builds around. This topic is interesting because we are not investigating the prediction of nodes and their labels, but rather creating a social network from our NBA players dataset. We will use this network to compare teams against one another and see which averages of our selected statistics indicate a more successful team. There have been other forms of prediction regarding NBA players and their standings, as well as betting, but none that we found have used GCNs and neural networks to produce their results. In the replication project, we focused on passing information through a network and trying to predict labels based on the aggregated features. For our project we want to do something of a reverse: we construct a graph based on features from past seasons and use this network to learn which aggregated features can help produce an NBA championship-caliber team.
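As a small illustration of the centrality idea, the sketch below ranks players by weighted degree, that is, the sum of the weights of all edges touching a player. Plain degree centrality counts edges instead; using the shared playing time as the weight is an assumption made here, and the function name, player names, and values are placeholders.

```python
from collections import defaultdict

def weighted_degree(edges):
    """Sum of edge weights touching each player (also called node strength)."""
    strength = defaultdict(float)
    for (a, b), w in edges.items():
        strength[a] += w
        strength[b] += w
    return dict(strength)

# (player, player) -> shared playing time; values are made up.
edges = {("p1", "p2"): 30.0, ("p1", "p3"): 20.0, ("p2", "p3"): 5.0}
strength = weighted_degree(edges)
top_player = max(strength, key=strength.get)
```

Under this measure, the player who shares the most court time with teammates ranks highest, which is one way to make the “center” of a team explicit.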

The methods with which we will attempt to solve the problem are:
1) GraphSAGE: We use players' statistics and teams' statistics from the 2015-2019 season to build a graph classification task. The dataset will have a total of 120 data points, each containing a set of node features. For each season we will have 30 data points, and we decided to use 4 seasons in total to train our model. For player statistics we will use PPG, AST, TRB, MP, GP, and so on. Each of the 30 teams will be filled with its players' statistics as an individual data point. We will use the team name as our node, and there are no edges between teams. Each node will contain the players of that team, and the edges between players will be the amount of time they are on the court together. We will also label each data point as 1, 2, 3, 4, or 5: finals, conference, semi, playoffs, none (showing how far the team can go in the NBA season).

2) GCN: Using previous NBA seasons and the corresponding champions, we will use a GCN to learn over the NBA players dataset and find a correlation with the NBA champions. We will then run this model across the current NBA players and their projected stats to see which team is predicted to win.

3) We believe an implementation that revolves around label propagation will be useful when comparing different teams. As an overall goal, teams try to find special “unicorn” and “generational” talent and add other players who complement that playstyle. In terms of a graphical representation, we believe that we could use label propagation among teams to determine their neighborhood characteristics and compare these to previous successful teams. If a team has a similar makeup, it may have a shot at the title in the upcoming season.

We would like to produce a report-style paper in which we analyze the NBA datasets that we will scrape ourselves. We also want to provide visualizations of interesting findings about NBA teams and the characteristics that have separated the winners from the rest of the teams. If we are not able to find some sort of standings at the end of the season, we will try to predict just one overall champion. Additionally, as a stretch goal, we believe that creating a website that lets users pick certain players and teams and displays their most important characteristics would be an interesting challenge!