NFL Salary Scraping - Part 1

NFL Team Salary Scraping - Part 1

Background and problem statement

I first started thinking about this problem over the 2015-2106 NFL offseason. During that time one of the biggest stories was the holdout and potential free agency of then Quarterback Ryan Fitzpatrack. Over the course of the 2015 season Fitzpatrick helped the Jets win 10 games and had career year where he enjoyed personal bests, with nearly 4,000 yards passing and 31 TD passes (a franchise record). However, the Jets missed the playoffs that year and some argue that Fitzpatrick's play was in a week 17 loss at Buffalo was the contributing factor. With that in the recordbooks do you really have the ability to demand more tha $8 Million a season?

The questions I have been looking to answer are:

  • What is the value of an NFL Player in relation to some statistic? Win/Loss ratio, QBR, TD/Interception Ratio?
  • What are the salaries of the top NFL teams?

I'm sure there are many more questions that could be asked of a comprehensive NFL salary dataset:

  • What are the top earning positions of a particular team and relate that to the teams performance in those categories?
  • And maybe stretching the limits to making a correlation between the number of twitter mentions of a particular player and their salary.

I have found there to be shockingly little information about NFL teams and their player salaries on the internet. While Fantasy football has made player and game statistical information abundantly available the connection of the salary data and the statistics is lacking.

Getting Started

spotrac has good salary, bonus, and contract information. They claim to be:

Spotrac has become the largest online sports team, & player contract resource on the internet.

I used this site to get started with the 2017 salary of the NY Jets. It is my intention to revist this with Part 2 of this post to scrape data for all 32 teams and make a complete NFL Salary dataset.

Reading Data and Writing to a File

I leveraged the BeautifulSoup library to accomplish the webscraping and then wrote the soup object out to a file. This allowed me to just execute the next cell over and over again instead of needing to continually make calls to the website.

from urllib.request import urlopen
from bs4 import BeautifulSoup
base_url = "http://www.spotrac.com/nfl/new-york-jets/cap/"

def get_page(base_url):
    page = urlopen(base_url)
    soup = BeautifulSoup(page)
    file = open("sample_page.txt", 'w')
    file.write(str(soup))
    file.close()

Reading Data from a File

Execute the above cell once and then start here

with open("sample_page.txt", 'rb') as file:
    for line in file:
        line = line.strip()

Parse the HTML

Import the BeautifulSoup library and use the find method to get the datatable. It is important to note here, that I opened the sample_page with read binary flags "rb". Leaving off the binary was causing the parser to puke 6 ways to Sunday.

# Reimport BeautifulSoup so that I can start here
from bs4 import BeautifulSoup
page = open("sample_page.txt", 'rb')
soup = BeautifulSoup(page, "html.parser")
table = str(soup.find("table","datatable"))

Render the table

Display the HTML table - this is point of reference to understand source data

from IPython.core.display import HTML
HTML(table)
Active Players (53) Pos. Base Salary Signing Bonus Roster Bonus Option Bonus Workout Bonus Restruc. Bonus Misc. Dead Cap Cap Hit Cap %
Wilkerson Muhammad Wilkerson DE $14,750,000 $3,000,000 - - $250,000 - - ($27,000,000) $18,000,000 12.17
Skrine Buster Skrine CB $6,000,000 $1,250,000 - - - $1,250,000 - ($11,000,000) $8,500,000 5.75
Winters Brian Winters G $1,000,000 - $7,000,000 - - - - ($22,000,000) $8,000,000 5.41
Carpenter James Carpenter G $4,450,000 $875,000 $250,000 - - $1,230,000 - ($8,660,000) $6,805,000 4.60
McCown Josh McCown QB $3,000,000 $3,000,000 - - - - $500,000 ($6,000,000) $6,500,000 4.39
Williams Leonard Williams DE $615,000 $2,952,430 $1,513,716 - - - - ($11,009,149) $5,081,146 3.44
Forte Matt Forte RB $4,000,000 $1,000,000 - - - - - ($6,000,000) $5,000,000 3.38
Beachum Kelvin Beachum LT $1,500,000 $1,500,000 $2,000,000 - - - - ($12,000,000) $5,000,000 3.38
Claiborne Morris Claiborne CB $2,500,000 $2,000,000 $218,750 - - - - ($4,500,000) $4,718,750 3.19
Powell Bilal Powell RB $3,750,000 $883,333 - - - - - ($5,516,668) $4,633,333 3.13
Ijalana Benjamin Ijalana RT $2,000,000 $1,000,000 $1,250,000 - - - - ($4,000,000) $4,250,000 2.87
Adams Jamal Adams S $465,000 $3,582,038 - - - - - ($14,328,154) $4,047,038 2.74
McLendon Steve McLendon DT $2,875,000 $750,000 $171,875 $125,000 - - - ($4,500,000) $3,921,875 2.65
Johnson Wesley Johnson C $2,746,000 - - - - - - - $2,746,000 1.86
Williams Marcus Williams CB $2,746,000 - - - - - - - $2,746,000 1.86
Lee Darron Lee ILB $914,620 $1,408,481 - - - - - ($6,519,304) $2,323,101 1.57
Davis Demario Davis ILB $900,000 $100,000 - - - - $1,225,000 ($1,000,000) $2,225,000 1.50
Kearse Jermaine Kearse WR $2,200,000 - - - - - - ($2,200,000) $2,200,000 1.49
Martin Josh Martin OLB $1,550,000 $225,000 - $75,000 - - - ($2,000,000) $1,850,000 1.25
Maye Marcus Maye S $465,000 $726,643 - - - - - ($3,669,485) $1,191,643 0.81
Hackenberg Christian Hackenberg QB $661,832 $397,327 - - - - - ($1,853,816) $1,059,159 0.72
Seferian-Jenkins Austin Seferian-Jenkins TE $936,690 - - - $75,000 - - - $1,011,690 0.68
Pennel Mike Pennel DE $690,000 - $171,875 - $50,000 - - ($50,000) $911,875 0.62
Catanzaro Chandler Catanzaro K $900,000 - - - - - - ($900,000) $900,000 0.61
Harrison Jonotthan Harrison C $700,000 $50,000 $81,250 - - - - ($50,000) $831,250 0.56
Ealy Kony Ealy DE $803,660 - - - - - - - $803,660 0.54
Carter Bruce Carter ILB $775,000 - - - - - - ($775,000) $775,000 0.52
Kerley Jeremy Kerley WR $775,000 - - - - - - ($775,000) $775,000 0.52
Stanford Julian Stanford OLB $775,000 - - - - - - ($775,000) $775,000 0.52
Dozier Dakota Dozier G $690,000 $75,146 - - - - - ($75,146) $765,146 0.52
Petty Bryce Petty QB $615,000 $138,424 - - - - - ($276,848) $753,424 0.51
Jenkins Jordan Jenkins OLB $560,000 $180,980 - - - - - ($542,940) $740,980 0.50
Brooks Terrence Brooks FS $705,000 - - - - - - - $705,000 0.48
Bass David Bass DE $683,824 - - - - - - - $683,823 0.46
Burris Juston Burris CB $540,000 $137,782 - - - - - ($413,349) $677,782 0.46
Stewart ArDarius Stewart WR $465,000 $211,581 - - - - - ($846,324) $676,581 0.46
Miles Rontez Miles FS $615,000 - - - - - - - $615,000 0.42
Qvale Brent Qvale RT $615,000 - - - - - - - $615,000 0.42
Roberts Darryl Roberts CB $615,000 - - - - - - - $615,000 0.42
Sterling Neal Sterling TE $615,000 - - - - - - - $615,000 0.42
Tye Will Tye TE $615,000 - - - - - - - $615,000 0.42
Shell Brandon Shell T $540,000 $58,991 - - - - - ($176,975) $598,991 0.41
Hansen Chad Hansen WR $465,000 $104,912 - - - - - ($419,648) $569,912 0.39
Edwards Lac Edwards P $540,000 $18,948 - - - - - ($56,847) $558,948 0.38
Peake Charone Peake WR $540,000 $17,095 - - - - - ($51,287) $557,095 0.38
Anderson Robby Anderson WR $540,000 $3,333 - - - - - ($6,667) $543,333 0.37
Tomlinson Eric Tomlinson TE $540,000 - - - - - - - $540,000 0.37
Leggett Jordan Leggett TE $465,000 $70,511 - - - - - ($282,045) $535,511 0.36
Donahue Dylan Donahue OLB $465,000 $50,484 - - - - - ($201,937) $515,484 0.35
McGuire Elijah McGuire RB $465,000 $42,126 - - - - - ($168,505) $507,126 0.34
Jones Derrick Jones CB $465,000 $36,261 - - - - - ($145,047) $501,261 0.34
Hennessy Thomas Hennessy LS $465,000 - - - - - - - $465,000 0.31
Thomas Lawrence Thomas DE $437,647 - - - - - - - $437,647 0.30

Table Rows

rows = [row for row in soup.find("table", "datatable").find_all("tr")]
players = []
for row in rows:
    if row.get_text("tr") is not None:
        players.append(row)

len(players)        
54

Column Headers

Get the 12 column headers and put them into a list using list compression - this will be used later to rename the column headers in the dataframe

columns_headers = [col.get_text() for col in players[0].find_all("th") if col.get_text()]
columns_headers
['Active Players  (53)',
 'Pos.',
 'Base Salary',
 'Signing Bonus',
 'Roster Bonus',
 'Option Bonus',
 'Workout Bonus',
 'Restruc. Bonus',
 'Misc.',
 'Dead Cap',
 'Cap Hit',
 'Cap %']

List of player names

Get the player names from the td tags that are of class="player"

I sliced this to get only the first 110 records because the rest have the same labels as above. This should be a list of 53 to reflect the 53 man roster. Some records are last name only like "Wilkerson" and then "Muhammad Wilkerson" so there are duplicates.

td_tags = soup.find_all("td", {"class":"player"})[:110]
player_names = []

for row in td_tags:
    player_names.append(row.get_text())


final_player_names = []
for names in player_names:
        names = names.strip().split("\n")
        if len(names[1:]) > 0:
            final_player_names.append(names[1:])

#slice the list to show only the top 10 names
final_player_names[:10]
[['Muhammad Wilkerson'],
 ['Buster Skrine'],
 ['Brian Winters'],
 ['James Carpenter'],
 ['Josh McCown'],
 ['Leonard Williams'],
 ['Matt Forte'],
 ['Kelvin Beachum'],
 ['Morris Claiborne'],
 ['Bilal Powell']]

Get ALL the data

This pattern gets all the TD cells from ALL the rows. Earlier, I was using a pattern to extract one column at a time and then smash them all back together, but this is way more effecient.

player_data = [[td.getText() for td in rows[i].findAll('td')]
            for i in range(len(rows))]

#Slice the list to show only the 1st record
player_data[1]
['Wilkerson\nMuhammad Wilkerson ',
 'DE',
 '$14,750,000',
 '$3,000,000',
 '-',
 '-',
 '$250,000',
 '-',
 '-',
 '($27,000,000)',
 '$18,000,000 ',
 '12.17']
import pandas as pd
df2 = pd.DataFrame(player_data, columns=None)
df2 = df2.drop(df2.index[[0]])
df2[0] = final_player_names

Clean the Data

Now that we have all the data we need to clean it before it will be in it's final usable format

#Update the column headers
df2.columns = [columns_headers]
df2.head()
Active Players (53) Pos. Base Salary Signing Bonus Roster Bonus Option Bonus Workout Bonus Restruc. Bonus Misc. Dead Cap Cap Hit Cap %
1 [Muhammad Wilkerson] DE $14,750,000 $3,000,000 - - $250,000 - - ($27,000,000) $18,000,000 12.17
2 [Buster Skrine] CB $6,000,000 $1,250,000 - - - $1,250,000 - ($11,000,000) $8,500,000 5.75
3 [Brian Winters] G $1,000,000 - $7,000,000 - - - - ($22,000,000) $8,000,000 5.41
4 [James Carpenter] G $4,450,000 $875,000 $250,000 - - $1,230,000 - ($8,660,000) $6,805,000 4.60
5 [Josh McCown] QB $3,000,000 $3,000,000 - - - - $500,000 ($6,000,000) $6,500,000 4.39
#Use a regex to turn the Cap Hit column into a foat so that we can do math on it
df2['Cap Hit'] = (df2['Cap Hit'].replace( '[\$,)]','', regex=True ).replace( '[(]','-', regex=True ).astype(float))
#Sanity check to make sure the above worked
df2['Cap Hit'].sum()
121989564.0

Mo Wilkerson makes $1.25M Per Game

Assuming 16 games in a season and he makes all of his bonuses.

df2['Player Salary Per Game'] = df2['Cap Hit']/16
df2.head()
Active Players (53) Pos. Base Salary Signing Bonus Roster Bonus Option Bonus Workout Bonus Restruc. Bonus Misc. Dead Cap Cap Hit Cap % Player Salary Per Game
1 [Muhammad Wilkerson] DE $14,750,000 $3,000,000 - - $250,000 - - ($27,000,000) 18000000.0 12.17 1125000.0
2 [Buster Skrine] CB $6,000,000 $1,250,000 - - - $1,250,000 - ($11,000,000) 8500000.0 5.75 531250.0
3 [Brian Winters] G $1,000,000 - $7,000,000 - - - - ($22,000,000) 8000000.0 5.41 500000.0
4 [James Carpenter] G $4,450,000 $875,000 $250,000 - - $1,230,000 - ($8,660,000) 6805000.0 4.60 425312.5
5 [Josh McCown] QB $3,000,000 $3,000,000 - - - - $500,000 ($6,000,000) 6500000.0 4.39 406250.0
#Plot the salaries for each player in a barchart
#Created a new dataframe with only those columns needed - this may not have been necessary
barchart_columns = ['Active Players  (53)', 'Cap Hit']
df_for_chart = df2[barchart_columns]
#Reindex the datafram so that the labels on the chart would show up
df_for_chart = df_for_chart.set_index(df2['Active Players  (53)'])
barchart = df_for_chart.plot(kind='bar', figsize=(12, 9))
barchart.set_xlabel("final_player_names")
<matplotlib.text.Text at 0x117fef208>
#Display the chart
import matplotlib.pyplot as plt
plt.show()

png

Average Salary By Position

Thought it would be interesting to group the players by position and see the spend by position. Perhaps I'll create another higher level grouping by creating a new variable and taging all of the defensive positions as 'DEF' and all the offensive positions as 'OFF' and then group them that way. With that I'll see how the team ranks overall as on defense as compared to their spend.

grouped_data = df2.groupby(df2['Pos.']).sum()
grouped_data = grouped_data.sort_values(by='Cap Hit', ascending = False)
barchart_by_position = grouped_data['Cap Hit'].plot(kind='bar', figsize=(12, 9))
plt.show()

png

blogroll

social