Finding The K Nearest Neighbor Of Nba Teams

13 minute read | Updated:

Finding the k-nearest neighbor of NBA teams' offense, defense, and overall.

Finding the k-nearest neighbor of NBA teams' offense, defense, and overall.

### Preface: I've had a python script scraping NBA player & team stats since the start of the 2015-16 NBA season that shoves the data into a digital ocean database. In this post we will query that database, use pandas to clean it up and insert necessary columns, and use scipy to find an NBA team's nearest offensive, defensive and overall neighbors.

Preface: I've had a python script scraping NBA player & team stats since the start of the 2015-16 NBA season that shoves the data into a digital ocean database. In this post we will query that database, use pandas to clean it up and insert necessary columns, and use scipy to find an NBA team's nearest offensive, defensive and overall neighbors.

### First, a short note on <a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" style="text-decoration:none">k-nearest neighbors</a>. Simply put, it's a basic machine-learning algorithm to predict unknown values by matching them with the most similar known values. In this post, we'll focus on clustering the neighbors around the input team and finding the <a href="https://en.wikipedia.org/wiki/Euclidean_distance" style="text-decoration:none">euclidean distance</a> of all other teams.

First, a short note on k-nearest neighbors. Simply put, it's a basic machine-learning algorithm to predict unknown values by matching them with the most similar known values. In this post, we'll focus on clustering the neighbors around the input team and finding the euclidean distance of all other teams.

### First let's query our <a href="https://www.digitalocean.com" style="text-decoration:none;"/>digitalocean</a>  database. We'll pass our connection string to SQLAlchemy's ```create_engine``` function, then use Pandas' ```.read_sql_query``` method to get the data.

First let's query our digitalocean database. We'll pass our connection string to SQLAlchemy's create_engine function, then use Pandas' .read_sql_query method to get the data.

In [3]:
from sqlalchemy import create_engine
import pandas as pd
import pandas.io.sql as psql
pd.options.display.max_columns = None
sql = """select distinct dateof, playername, team, 
         starterorbench, pt, r, a, pf, fg, fgatt, fg3, 
         fgatt3, ft, ftatt, to_, stl, bk, min, fpt, opp
         from nbaplayers1516 
         order by dateof asc"""
engine = create_engine('postgres://postgres@localhost:5432/sports1')
dfp = pd.read_sql_query(sql, engine)
### Here we're getting 3-point percentage and counting the games payed by each player.

Here we're getting 3-point percentage and counting the games payed by each player.

In [5]:
#fg3_p
dfp['fg3p'] = dfp['fg3'] / dfp['fgatt3']
#count games played
dfp['gamesplayed'] = dfp[dfp['min'] > 0].groupby(['playername','team']).cumcount()+1
xxxxxxxxxx
### But some players only play a few minutes per game, if at all, therefore we'll calculate actual games played by counting those games where the player played more minutes than the average minutes played by a player per game using Pandas' awesome ```groupby``` method and ```cumcount```, which performs a cumulative count of all rows that meet the criteria (```+1``` since it start at zero). Then we get a player's total minutes played using ```.cumsum()```.

But some players only play a few minutes per game, if at all, therefore we'll calculate actual games played by counting those games where the player played more minutes than the average minutes played by a player per game using Pandas' awesome groupby method and cumcount, which performs a cumulative count of all rows that meet the criteria (+1 since it start at zero). Then we get a player's total minutes played using .cumsum().

In [6]:
#account for low minutes
dfp['agamesplayed'] = dfp[dfp['min'] > dfp['min'].mean()].groupby(['playername','team']).cumcount()+1
# total minutes
dfp['totalmin'] = dfp.groupby(['playername','team'])['min'].cumsum()
#convert datetimes to strings
dfp['dateof'] = dfp['dateof'].apply(lambda x: x.strftime('%Y-%m-%d'))
### Next we'll query team data. When I first wrote the web-scraping script, I pushed the data into the database with only team data, not ooponent data; that is, each row contains the team name, opp name and only team stats. Therefore, I have to join the table to itself to output rows that contain team and opp data so we can more easily calculate how a team performs against its opponents. At first, we exclude the all star teams so our data doesn't get skewed, but we'll also measure up the teams against them later on.

Next we'll query team data. When I first wrote the web-scraping script, I pushed the data into the database with only team data, not ooponent data; that is, each row contains the team name, opp name and only team stats. Therefore, I have to join the table to itself to output rows that contain team and opp data so we can more easily calculate how a team performs against its opponents. At first, we exclude the all star teams so our data doesn't get skewed, but we'll also measure up the teams against them later on.

In [8]:
sql = """select distinct h.dateof, h.team, h.opp, h.points, 
            h.fg, h.fgatt, h.ft, h.ftatt, 
            h.fg3, h.fg3att, h.offrebounds, h.defrebounds, h.totalrebounds,
            h.assists, h.blocks, h.fouls, h.steals, h.turnovers,
            a.points as opoints, a.fg as ofg, a.fgatt as ofgatt, 
            a.ft as oft, a.ftatt as oftatt, 
            a.fg3 as ofg3, a.fg3att as ofg3att, 
            a.offrebounds as ooffrebounds, a.defrebounds as odefrebounds,
            a.totalrebounds as ototalrebounds, a.assists as oassists, 
            a.blocks as oblocks, a.fouls as ofouls,
            a.steals as osteals, a.turnovers as oturnovers, 
            (h.fgatt + h.turnovers + (0.475*h.ftatt) - h.offrebounds) as poss_calc,
            (a.fgatt + a.turnovers + (0.475*a.ftatt) - a.offrebounds) as oposs_calc
        from nbateamstats1516 h
        join nbateamstats1516 a 
            on h.opp = a.team 
            and h.team = a.opp
        where h.dateof = a.dateof
        and (h.team <> 'WES' or a.team <> 'EAS')
        and (a.team <> 'WES' or h.team <> 'EAS')
        order by h.dateof asc"""
df = pd.read_sql_query(sql, engine)
In [9]:
# create fg3 percentage column
df['fg3p'] = df['fg3'] / df['fg3att']
#create fg percentage column
df['fgp'] = df['fg'] / df['fgatt']
### To accuratley compare the teams, we'll use Dean Oliver's <a href="http://www.basketball-reference.com/about/factors.html" style="text-decoration:none;">four factors</a>.

To accuratley compare the teams, we'll use Dean Oliver's four factors.

In [10]:
# four factors
df['efgp'] = (df['fg'] + (0.5 * df['fg3'])) / df['fgatt']
df['oefgp'] = (df['ofg'] + (0.5 * df['ofg3'])) / df['ofgatt']
df['ftrate'] = df['ft'] / df['fgatt']
df['oftrate'] = df['oft'] / df['ofgatt']
df['torate'] = df['turnovers'] / (df['fgatt'] + (0.44 * df['ftatt']) + df['turnovers'])
df['otorate'] = df['oturnovers'] / (df['ofgatt'] + (0.44 * df['oftatt']) + df['oturnovers'])
df['offrebound_p'] = df['offrebounds'] / (df['offrebounds'] + df['odefrebounds'])
df['ooffrebound_p'] = df['ooffrebounds'] / (df['ooffrebounds'] + df['defrebounds'])
# gamecount
df['gamecount'] = df.groupby(['team']).cumcount()+1
df['ogamecount'] = df.groupby(['opp']).cumcount()+1
### Let's take the mean of all teams using ```groupby('team').mean()```, then add the team column created from the index.

Let's take the mean of all teams using groupby('team').mean(), then add the team column created from the index.

In [11]:
nba = df.groupby('team').mean()
nba['team'] = nba.index.get_level_values('team')
#convert datetimes to strings for easy comparison
df['dateof'] = df['dateof'].apply(lambda x: x.strftime('%Y-%m-%d'))
xxxxxxxxxx
### Next, we need to sum player mintues so we can calculate the <a href="http://www.sportingcharts.com/dictionary/nba/pace-factor.aspx" style="text-decoration:none">pace factor</a>, which is ```((team poss + opp poss) / (2 * (team min played / 5)))```. But to do that we also need to calculate <a href="http://www.basketball-reference.com/about/glossary.html" style="text-decoration:none;">team and opp possessions</a>: ```0.5 * ((Tm FGA + 0.4 * Tm FTA - 1.07 * (Tm ORB / (Tm ORB + Opp DRB)) * (Tm FGA - Tm FG) + Tm TOV) + (Opp FGA + 0.4 * Opp FTA - 1.07 * (Opp ORB / (Opp ORB + Tm DRB)) * (Opp FGA - Opp FG) + Opp TOV))```.
### Pandas makes this really, *really* easy. We just use ```.groupby(['team', 'dateof'])['min']``` to group the data, then ```.apply``` a ```lambda``` function to ```.sum()``` all individual minutes played per team per game. Then we take the ```.mean()```. The rest is just plug-n-chug.
### Damn! Thank you, <a href="https://twitter.com/Python_Pandas" style="text-decoration:none;">Pandas</a>.

Next, we need to sum player mintues so we can calculate the pace factor, which is ((team poss + opp poss) / (2 * (team min played / 5))). But to do that we also need to calculate team and opp possessions: 0.5 * ((Tm FGA + 0.4 * Tm FTA - 1.07 * (Tm ORB / (Tm ORB + Opp DRB)) * (Tm FGA - Tm FG) + Tm TOV) + (Opp FGA + 0.4 * Opp FTA - 1.07 * (Opp ORB / (Opp ORB + Tm DRB)) * (Opp FGA - Opp FG) + Opp TOV)).

Pandas makes this really, really easy. We just use .groupby(['team', 'dateof'])['min'] to group the data, then .apply a lambda function to .sum() all individual minutes played per team per game. Then we take the .mean(). The rest is just plug-n-chug.

Damn! Thank you, Pandas.

In [12]:
teamMP = dfp.groupby(['team', 'dateof'])['min'].apply(lambda x: x.sum()).mean()
    
df['poss'] =  0.5 * ((df['fgatt'] + 0.4 * df['ftatt'] - 1.07 * (df['offrebounds'] / (df['offrebounds'] + df['odefrebounds'])) * (df['fgatt'] - df['fg']) + df['turnovers']) + (df['ofg'] + 0.4 * df['oftatt'] - 1.07 * (df['ooffrebounds'] / (df['ooffrebounds'] + df['defrebounds'])) * (df['ofgatt'] - df['ofg']) + df['oturnovers']))
df['oposs'] =  0.5 * ((df['ofgatt'] + 0.4 * df['oftatt'] - 1.07 * (df['ooffrebounds'] / (df['ooffrebounds'] + df['defrebounds'])) * (df['ofgatt'] - df['ofg']) + df['oturnovers']) + (df['fg'] + 0.4 * df['ftatt'] - 1.07 * (df['offrebounds'] / (df['offrebounds'] + df['odefrebounds'])) * (df['fgatt'] - df['fg']) + df['turnovers']))
df['pace'] = 48 * ((df['poss'] + df['oposs']) / (2 * ( teamMP/ 5)))
xxxxxxxxxx
### Now for the fun stuff. Here is where we define our nearest neighbor function. We give the user the option to choose a team, a side and wheather the data should be weighted.
### Let's take a close look at what's going on.
### First we create the columns to include in the <a href="https://en.wikipedia.org/wiki/Metric_space" style="text-decoration:none;">metric space</a>; this is where we use Oliver's four factors. Based on the ```side```, we tell the formula to use offensive or defensive or *all* columns. Next we get the ```.mean()``` of all the columns in the dataframe and create a new dataframe called ```nba_norm``` which we will normalize by setting the mean of all columns to 0 and the standard deviation to 1. This ensures that no column has a dominant impact on the euclidean distance since points scored is higher than blocks, steals or turnovers, etc. Next we weight our columns; I give effective field-goal percentage a slightly higher weighting than Oliver (his 40% to my 50%). Then we use the specified team as our point of comparison and calculate ```euclidean_distances``` using SciPy's ```.euclidean()``` method. Lastly, we return a dataframe that contains the teams and euclidean_distances.

Now for the fun stuff. Here is where we define our nearest neighbor function. We give the user the option to choose a team, a side and wheather the data should be weighted.

Let's take a close look at what's going on.

First we create the columns to include in the metric space; this is where we use Oliver's four factors. Based on the side, we tell the formula to use offensive or defensive or all columns. Next we get the .mean() of all the columns in the dataframe and create a new dataframe called nba_norm which we will normalize by setting the mean of all columns to 0 and the standard deviation to 1. This ensures that no column has a dominant impact on the euclidean distance since points scored is higher than blocks, steals or turnovers, etc. Next we weight our columns; I give effective field-goal percentage a slightly higher weighting than Oliver (his 40% to my 50%). Then we use the specified team as our point of comparison and calculate euclidean_distances using SciPy's .euclidean() method. Lastly, we return a dataframe that contains the teams and euclidean_distances.

In [13]:
from scipy.spatial import distance
def nn(team, side=None, weight=False):
    
    # four factors: off, def
    dist_cols = {"off": ['efgp', 'ftrate', 'torate', 'offrebound_p', 'pace'],
                 "def":['oefgp', 'oftrate', 'otorate', 'ooffrebound_p', 'oposs']}
    
    if side == 'off':
        cols = dist_cols['off']
    elif side == 'def':
        cols = dist_cols['def']
    elif side == 'overall':
        cols = dist_cols['off'] + dist_cols['def']
        
    nba = df.groupby('team').mean()
    nba['team'] = nba.index.get_level_values('team')
    nba_num = nba[cols]
    
    nba_norm = (nba_num - nba_num.mean())/nba_num.std()
    
    if weight:
        if side == 'off':
            nba_norm['efgp'] = nba_norm['efgp'] * 1.5
            nba_norm['ftrate'] = nba_norm['ftrate'] * 1.1
            nba_norm['torate'] = nba_norm['torate'] * 1.25
            nba_norm['offrebound_p'] = nba_norm['offrebound_p'] * 1.15
        elif side == 'def':
            nba_norm['oefgp'] = nba_norm['oefgp'] * 1.5
            nba_norm['oftrate'] = nba_norm['oftrate'] * 1.1
            nba_norm['otorate'] = nba_norm['otorate'] * 1.25
            nba_norm['ooffrebound_p'] = nba_norm['ooffrebound_p'] * 1.15
            
    team_norm = nba_norm[nba.team==team]
    euclidean_distances = nba_norm.apply(lambda row: distance.euclidean(row, team_norm), axis=1)
    
    distance_frame = pd.DataFrame(data={"euclidean_distance": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort("euclidean_distance", inplace=True)
    return distance_frame
### Here are the 5 most similar teams to Golden State. It's tempting to say that because GS's offense is *so* good that this can be used as a ranking tool, but that's not the case. These teams are just the most *similar* offensive teams, which doesn't mean that OKC is worse offensively than SA. Different teams play at different paces, which would impact their scoring and the opp's scoring which would impact their euclidean distance.

Here are the 5 most similar teams to Golden State. It's tempting to say that because GS's offense is so good that this can be used as a ranking tool, but that's not the case. These teams are just the most similar offensive teams, which doesn't mean that OKC is worse offensively than SA. Different teams play at different paces, which would impact their scoring and the opp's scoring which would impact their euclidean distance.

### Let's plot this. It is interesting to note just *how* different even SA is than GS. SAC is *much* closer to SA than SA is to GS, which implies that there is no team that is all that close to GS. Yea. I like GS, but so do the numbers. And the LAL are, of course, much more different than GS than even NY is; just look at that spike!

Let's plot this. It is interesting to note just how different even SA is than GS. SAC is much closer to SA than SA is to GS, which implies that there is no team that is all that close to GS. Yea. I like GS, but so do the numbers. And the LAL are, of course, much more different than GS than even NY is; just look at that spike!

In [15]:
import matplotlib.pyplot as plt
%matplotlib inline
results = nn('GS', side='off', weight=True)
plt.xkcd()
xticks = results.index.values.tolist()
results.plot(figsize=(17, 8))
plt.xticks(range(len(results)), xticks)
plt.xlabel('team', fontsize=18)
plt.ylabel('euclidean_distance', fontsize=16);
/usr/local/lib/python2.7/site-packages/IPython/kernel/__main__.py:38: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
### Ok so let's forget about GS. Let's now look at the all-powerful SA defense (though, not that powerful since they got smoked by GS, but moving on...). Just as it was clear that no team is that similar to GS, it's also apparent that SA stands apart. Again, no matter how tempting it is to say that SA is defensively *better* because of what this graph illustrates, we must remember that that's not what this formula determines.

Ok so let's forget about GS. Let's now look at the all-powerful SA defense (though, not that powerful since they got smoked by GS, but moving on...). Just as it was clear that no team is that similar to GS, it's also apparent that SA stands apart. Again, no matter how tempting it is to say that SA is defensively better because of what this graph illustrates, we must remember that that's not what this formula determines.

In [17]:
results = nn('GS', side='def', weight=True)
xticks = results.index.values.tolist()
results.plot(figsize=(17, 8))
plt.xkcd()
plt.xticks(range(len(results)), xticks)
plt.xlabel('team', fontsize=18)
plt.ylabel('euclidean_distance', fontsize=16);
/usr/local/lib/python2.7/site-packages/IPython/kernel/__main__.py:38: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
xxxxxxxxxx
### Just for fun, let's take a look at the all-star teams. In the all star games, the pace is fast and the defense is all but non-existent (even that's a stretch). So let's take a look at the most offensively and defensively similar teams to the EAST and WEST all-star teams.
### Since the west scored many more points than the east, let's take a look at how other teams match up against them offensively.
### And we can see that of course GS is the most similar, regardless of the *huge* spike. 

Just for fun, let's take a look at the all-star teams. In the all star games, the pace is fast and the defense is all but non-existent (even that's a stretch). So let's take a look at the most offensively and defensively similar teams to the EAST and WEST all-star teams.

Since the west scored many more points than the east, let's take a look at how other teams match up against them offensively.

And we can see that of course GS is the most similar, regardless of the huge spike.

In [21]:
results = nn('GS', side='off', weight=True)
xticks = results.index.values.tolist()
results.plot(figsize=(17, 8))
plt.xticks(range(len(results)), xticks)
plt.xlabel('team', fontsize=18)
plt.ylabel('euclidean_distance', fontsize=16);
/usr/local/lib/python2.7/site-packages/IPython/kernel/__main__.py:38: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
xxxxxxxxxx
### Now to the EAST's defense, if we can call it that. 
### I think it's safe to say that the EAST's defense, even though not meant to be over-bearing (since it is the all-star game afterall), is the worst defense we'll see all year round. So let's compare.
### And...I don't think it's a surprise that the most similar team is the LAL, while SA is the most dissimilar.

Now to the EAST's defense, if we can call it that.

I think it's safe to say that the EAST's defense, even though not meant to be over-bearing (since it is the all-star game afterall), is the worst defense we'll see all year round. So let's compare.

And...I don't think it's a surprise that the most similar team is the LAL, while SA is the most dissimilar.

In [22]:
results = nn('GS', side='def', weight=True)
xticks = results.index.values.tolist()
results.plot(figsize=(17, 8))
plt.xticks(range(len(results)), xticks)
plt.xlabel('team', fontsize=18)
plt.ylabel('euclidean_distance', fontsize=16);
/usr/local/lib/python2.7/site-packages/IPython/kernel/__main__.py:38: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)