NFL Salary Scraping - Part 2

Background and problem statement

This is the second post in a multi-part series. If you missed the first post, you can get it here. In part 2 I'm looking to collect the NFL salary cap data for all 32 teams, spread over individual webpages on www.spotrac.com, and merge it into one large data set. That final data set can then be used to answer questions such as:

- Who are the highest paid players in the NFL?
- Which positions are the highest paid?
- For the players that are highest paid, where do they rank statistically?
- For the teams that have the highest paid positions, where do they rank statistically? Football is, after all, a team sport :football:
- And to push the limits, is there a correlation between the number of times a player is mentioned on Twitter and their salary?

Imports and Function definitions

Throughout this post I'm using the BeautifulSoup library to parse the webpages and the urllib library to fetch them. I have also defined 2 functions to support this:

- get_page fetches a page and saves its HTML to a file; from that file I build the list of 32 URLs, one for each team, which will then be iterated over and parsed using the next function.
- get_team_table accepts a URL, opens it as a page, and then uses BeautifulSoup to parse it and return a soup object.

from urllib.request import urlopen
from bs4 import BeautifulSoup
base_url = "http://www.spotrac.com/nfl/"

def get_page(url):
    # Fetch the given page and save its HTML to a file for later parsing
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    with open("spotrac_urls.txt", 'w') as file:
        file.write(str(soup))

def get_team_table(url):
    # Fetch a team page and return it as a parsed soup object
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    return soup
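One gotcha worth flagging: some sites reject urllib's default user agent with an HTTP 403. If urlopen errors out on you, sending a browser-like User-Agent header usually helps; a hedged sketch (get_page_with_headers and the header string are my own illustration, not part of the scraper):

from urllib.request import Request

def get_page_with_headers(url):
    # Same as get_page, but with a browser-like User-Agent header
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req), 'lxml')
    with open("spotrac_urls.txt", 'w') as file:
        file.write(str(soup))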

Get all the base URLs

The page at http://www.spotrac.com/nfl/ has links to each of the 32 NFL teams, and each team page has a datatable with that team's salary cap data. To collect those links so they can be used later, I call the get_page function, which writes the page contents to a file.

get_page(base_url)

# Re-open the saved HTML and parse it
with open("spotrac_urls.txt", 'r') as page:
    soup = BeautifulSoup(page, "lxml")

# The team links live in the "subnav-posts" div.
# Loop through its anchor tags, get each href, and print them out.
div = soup.find("div", "subnav-posts")
links = div.find_all('a')
for link in links:
    print(link.get('href'))

len(links)
http://www.spotrac.com/nfl/arizona-cardinals/cap/
http://www.spotrac.com/nfl/atlanta-falcons/cap/
http://www.spotrac.com/nfl/baltimore-ravens/cap/
http://www.spotrac.com/nfl/buffalo-bills/cap/
http://www.spotrac.com/nfl/carolina-panthers/cap/
http://www.spotrac.com/nfl/chicago-bears/cap/
http://www.spotrac.com/nfl/cincinnati-bengals/cap/
http://www.spotrac.com/nfl/cleveland-browns/cap/
http://www.spotrac.com/nfl/dallas-cowboys/cap/
http://www.spotrac.com/nfl/denver-broncos/cap/
http://www.spotrac.com/nfl/detroit-lions/cap/
http://www.spotrac.com/nfl/green-bay-packers/cap/
http://www.spotrac.com/nfl/houston-texans/cap/
http://www.spotrac.com/nfl/indianapolis-colts/cap/
http://www.spotrac.com/nfl/jacksonville-jaguars/cap/
http://www.spotrac.com/nfl/kansas-city-chiefs/cap/
http://www.spotrac.com/nfl/los-angeles-chargers/cap/
http://www.spotrac.com/nfl/los-angeles-rams/cap/
http://www.spotrac.com/nfl/miami-dolphins/cap/
http://www.spotrac.com/nfl/minnesota-vikings/cap/
http://www.spotrac.com/nfl/new-england-patriots/cap/
http://www.spotrac.com/nfl/new-orleans-saints/cap/
http://www.spotrac.com/nfl/new-york-giants/cap/
http://www.spotrac.com/nfl/new-york-jets/cap/
http://www.spotrac.com/nfl/oakland-raiders/cap/
http://www.spotrac.com/nfl/philadelphia-eagles/cap/
http://www.spotrac.com/nfl/pittsburgh-steelers/cap/
http://www.spotrac.com/nfl/san-francisco-49ers/cap/
http://www.spotrac.com/nfl/seattle-seahawks/cap/
http://www.spotrac.com/nfl/tampa-bay-buccaneers/cap/
http://www.spotrac.com/nfl/tennessee-titans/cap/
http://www.spotrac.com/nfl/washington-redskins/cap/

32
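The hrefs on this page happen to be absolute URLs already. If a site returned relative links instead, urllib.parse.urljoin would normalize them against the base; a minimal sketch:

from urllib.parse import urljoin

absolute_links = [urljoin(base_url, link.get('href')) for link in links]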
# Redefine get_team_table so that it returns the rows of the team's
# salary cap table rather than the whole soup object.
def get_team_table(url):
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    data_rows = [row for row in soup.find("table", "datatable").find_all("tr")]
    return data_rows
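As a quick sanity check, the function can be run against a single team before looping over all 32; a minimal sketch, using the first link from the list above:

rows = get_team_table("http://www.spotrac.com/nfl/arizona-cardinals/cap/")
print(len(rows))                # number of rows in that team's table
print(rows[0].get_text()[:80])  # peek at the header row's text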

Get the data

For all 32 teams, we get the table rows using the get_team_table function.

# create an empty list
team_data = []

for link in links:
    team_data.append(get_team_table(link.get('href')))
#Make sure there are 32 elements in team_data
len(team_data)
32
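If you run this yourself, it's polite (and less likely to get you rate-limited) to pause between requests. A hedged variant of the same loop, with an arbitrary 1-second delay:

import time

team_data = []
for link in links:
    team_data.append(get_team_table(link.get('href')))
    time.sleep(1)  # arbitrary pause so we don't hammer the server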

The heavy lifting

Build player_data from the combined table rows that were parsed from all the links above. This takes some time to execute.

table_data = []

# team_data is a list of lists: each inner list holds the BeautifulSoup
# row elements for one team, hence the nested for loop.
for rows in team_data:
    for element in rows:
        table_data.append(element.get_text())

# Split each row's text into its individual fields
player_data = []
for row in table_data:
    player_data.append(row.split("\n"))

len(player_data)
1725
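As an aside, the same flattening can be written more compactly with list comprehensions; a minimal equivalent sketch:

table_data = [element.get_text() for rows in team_data for element in rows]
player_data = [text.split("\n") for text in table_data]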

Load a DataFrame and reshape the data

In the next cells I load the data into a DataFrame and reshape it so that it has 12 columns, fix the header, and remove the invalid rows.

import pandas as pd
df = pd.DataFrame(player_data)

# Drop the empty leading and trailing columns left over from the split,
# then drop the first row (the header row from the first table).
df = df.drop(columns=[0, 1, 14])
df = df.drop(df.index[[0]])

print(df.shape)
df.head()
(1724, 12)
2 3 4 5 6 7 8 9 10 11 12 13
1 Larry Fitzgerald WR $11,000,000 $2,500,000 - - - $2,350,000 - - $15,850,000 9.44
2 Patrick Peterson CB $9,750,000 $3,072,377 - - $250,000 $634,588 - ($18,048,518) $13,706,965 8.16
3 Chandler Jones OLB $7,000,000 $3,000,000 - - - - - ($31,000,000) $10,000,000 5.96
4 Jared Veldheer RT $6,500,000 $1,250,000 $250,000 - - $2,000,000 - ($6,500,000) $10,000,000 5.96
5 Tyrann Mathieu FS $4,750,000 $3,100,000 - - $250,000 - - ($17,400,000) $8,100,000 4.82
# Grab the rows of the first team's table; the first row holds the header.
players = []
for row in team_data[0]:
    if row.get_text():
        players.append(row)

column_headers = [col.get_text() for col in players[0].find_all("th") if col.get_text()]
len(column_headers)
12
df.columns = column_headers
df.head()
Active Players (53) Pos. Base Salary Signing Bonus Roster Bonus Option Bonus Workout Bonus Restruc. Bonus Misc. Dead Cap Cap Hit Cap %
1 Larry Fitzgerald WR $11,000,000 $2,500,000 - - - $2,350,000 - - $15,850,000 9.44
2 Patrick Peterson CB $9,750,000 $3,072,377 - - $250,000 $634,588 - ($18,048,518) $13,706,965 8.16
3 Chandler Jones OLB $7,000,000 $3,000,000 - - - - - ($31,000,000) $10,000,000 5.96
4 Jared Veldheer RT $6,500,000 $1,250,000 $250,000 - - $2,000,000 - ($6,500,000) $10,000,000 5.96
5 Tyrann Mathieu FS $4,750,000 $3,100,000 - - $250,000 - - ($17,400,000) $8,100,000 4.82
# The header row repeated itself in the data. This didn't reveal itself until
# the data type conversion step below, but this fixes all occurrences of it.
rows_to_be_dropped = df.loc[df['Cap Hit'] == 'Cap %'].index
df = df.drop(rows_to_be_dropped)

# Convert the 'Cap Hit' column from a string to a float: strip '$', commas,
# and ')', and turn '(' into a leading minus sign.
df['Cap Hit'] = (df['Cap Hit'].replace(r'[\$,)]', '', regex=True)
                              .replace(r'[(]', '-', regex=True)
                              .astype(float))

#Sanity check to make sure it worked.
df['Cap Hit'].sum()
4159332264.0
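Only the 'Cap Hit' column has been converted; the other dollar columns are still strings. A small helper could apply the same cleanup to them; a sketch, assuming they follow the same $/comma/parentheses formatting (money_to_float and the column list are hypothetical, for illustration):

import numpy as np

def money_to_float(series):
    # Strip '$', commas, and ')', turn '(' into a minus sign, and map '-' to NaN
    cleaned = (series.replace('-', np.nan)
                     .replace(r'[\$,)]', '', regex=True)
                     .replace(r'[(]', '-', regex=True))
    return pd.to_numeric(cleaned)

for col in ['Base Salary', 'Signing Bonus', 'Roster Bonus', 'Dead Cap']:
    df[col] = money_to_float(df[col])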
df.shape
# 53 players * 32 teams = 1696, so there should be 1696 rows in the data set.
# How did I end up with 3 rows missing? That's such an odd number.
(1693, 13)
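One way to chase down the missing rows is to check what each team contributed in the first place; a quick diagnostic sketch against the team_data list that's still in memory:

# Rows scraped per team, including each table's header row
per_team = [len(rows) for rows in team_data]
print(per_team)
print(sum(per_team))  # compare with 32 teams * 54 rows (53 players + a header)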
import matplotlib.pyplot as plt
df['Cap Hit'].head(25).plot(kind="bar",figsize=(15,12))
plt.show()

Figure: bar chart of the 'Cap Hit' values for the first 25 players.
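The default plot labels the bars with the DataFrame's row index; setting the player names as the index makes the chart far more readable. A sketch:

ax = df.head(25).set_index('Active Players (53)')['Cap Hit'].plot(kind="bar", figsize=(15,12))
ax.set_ylabel('Cap Hit ($)')
plt.tight_layout()
plt.show()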

df['Cap Average By Game'] = df['Cap Hit']/16
df.head(25)
# Assuming 16 games & all his bonuses - Larry Fitzgerald makes over $990K a game!!
Active Players (53) Pos. Base Salary Signing Bonus Roster Bonus Option Bonus Workout Bonus Restruc. Bonus Misc. Dead Cap Cap Hit Cap % Cap Average By Game
1 Larry Fitzgerald WR $11,000,000 $2,500,000 - - - $2,350,000 - - 15850000.0 9.44 990625.0000
2 Patrick Peterson CB $9,750,000 $3,072,377 - - $250,000 $634,588 - ($18,048,518) 13706965.0 8.16 856685.3125
3 Chandler Jones OLB $7,000,000 $3,000,000 - - - - - ($31,000,000) 10000000.0 5.96 625000.0000
4 Jared Veldheer RT $6,500,000 $1,250,000 $250,000 - - $2,000,000 - ($6,500,000) 10000000.0 5.96 625000.0000
5 Tyrann Mathieu FS $4,750,000 $3,100,000 - - $250,000 - - ($17,400,000) 8100000.0 4.82 506250.0000
6 Drew Stanton QB $2,550,000 $1,000,000 $500,000 - - - - ($3,550,000) 4050000.0 2.41 253125.0000
7 Jermaine Gresham TE $2,000,000 $1,750,000 - - - - - ($13,000,000) 3750000.0 2.23 234375.0000
8 Corey Peters DT $2,750,000 $666,668 $93,750 - $150,000 - - ($3,566,668) 3660418.0 2.18 228776.1250
9 Justin Bethel CB $2,000,000 $1,500,000 - - - - - ($3,500,000) 3500000.0 2.08 218750.0000
10 Antoine Bethea SS $2,000,000 $666,666 $500,000 - $250,000 - - ($4,250,000) 3416666.0 2.04 213541.6250
11 Phil Dawson K $2,000,000 $500,000 - - - - - ($3,000,000) 2500000.0 1.49 156250.0000
12 Josh Mauro DT $750,000 $500,000 $1,250,000 - - - - ($3,000,000) 2500000.0 1.49 156250.0000
13 Deone Bucannon ILB $1,471,255 $981,673 - - - - - ($981,673) 2452928.0 1.46 153308.0000
14 Haason Reddick ILB $465,000 $1,985,539 - - - - - ($7,942,157) 2450539.0 1.46 153158.6875
15 Karlos Dansby ILB $1,000,000 $700,000 $200,000 - $100,000 - $250,000 ($1,800,000) 2250000.0 1.34 140625.0000
16 Robert Nkemdiche DE $840,936 $1,113,745 - - - - - ($6,536,854) 1954681.0 1.16 122167.5625
17 Tramon Williams CB $1,100,000 $500,000 $350,000 - - - - ($1,600,000) 1950000.0 1.16 121875.0000
18 Alex Boone G $900,000 - $437,500 - - - $600,000 ($1,400,000) 1937500.0 1.15 121093.7500
19 John Brown WR $1,797,000 $135,725 - - - - - ($135,725) 1932725.0 1.15 120795.3125
20 A.Q. Shipley C $775,000 $362,500 $250,000 - - - $250,000 ($1,500,000) 1637500.0 0.98 102343.7500
21 Jaron Brown WR $775,000 $317,500 $218,750 - - - - ($1,092,500) 1311250.0 0.78 81953.1250
22 Troy Niklas TE $968,632 $311,509 - - - - - ($311,509) 1280141.0 0.76 80008.8125
23 Budda Baker S $465,000 $776,819 - - - - - ($3,882,732) 1241819.0 0.74 77613.6875
24 Frostee Rucker DT $1,000,000 $80,000 - - - - - ($1,080,000) 1080000.0 0.64 67500.0000
25 Andy Lee P $1,000,000 - - - - - - ($1,000,000) 1000000.0 0.60 62500.0000
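With 'Cap Hit' numeric, the first questions from the problem statement are each one line away; for example:

# Top 10 cap hits across the whole league
print(df.sort_values('Cap Hit', ascending=False)
        .head(10)[['Active Players (53)', 'Pos.', 'Cap Hit']])

# Total cap spend by position
print(df.groupby('Pos.')['Cap Hit'].sum().sort_values(ascending=False))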
