Background
Sentiment analysis is a topic that has fascinated me for some time. Examples such as the analysis of @realDonaldTrump, which classifies tweets as positive or negative, captivated me. To try a simple analysis of my own, I collected tweets during the England v Croatia World Cup semi-final and worked through the steps of pre-processing, processing, and sentiment analysis.
The tweets were collected by a Python script that I ran on an AWS instance. I used the tweepy API to collect 376 records into a JSON file, which is the input for the following steps.
import os, json, pprint
import pandas as pd
from nltk.corpus import stopwords
from nltk import sent_tokenize
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)  # show full tweet text (older pandas versions used -1)
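tweet_data = []
# Each line of the file is expected to hold one JSON-encoded tweet
with open('world_cup_tweets.txt', 'r') as tweet_file:
    for line in tweet_file:
        try:
            tweet_data.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip blank or malformed lines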
len(tweet_data)
376
df = pd.DataFrame(tweet_data)
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset')
df.columns
There are 376 rows and 37 columns in this dataset
Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
'entities', 'extended_entities', 'extended_tweet', 'favorite_count',
'favorited', 'filter_level', 'geo', 'id', 'id_str',
'in_reply_to_screen_name', 'in_reply_to_status_id',
'in_reply_to_status_id_str', 'in_reply_to_user_id',
'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'limit', 'place',
'possibly_sensitive', 'quote_count', 'quoted_status',
'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
'reply_count', 'retweet_count', 'retweeted', 'retweeted_status',
'source', 'text', 'timestamp_ms', 'truncated', 'user'],
dtype='object')
Drop unnecessary columns
There are a significant number of attributes in this data set that will not be used. This step removes the unnecessary columns.
df.drop(['contributors', 'coordinates', 'extended_tweet', 'geo', 'created_at', 'display_text_range',
'entities', 'extended_entities', 'favorite_count',
'favorited', 'filter_level', 'id', 'id_str',
'in_reply_to_screen_name', 'in_reply_to_status_id',
'in_reply_to_status_id_str', 'in_reply_to_user_id',
'in_reply_to_user_id_str', 'is_quote_status', 'limit',
'possibly_sensitive', 'quote_count', 'quoted_status',
'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
'reply_count', 'retweet_count', 'retweeted', 'retweeted_status',
'source','timestamp_ms', 'truncated', 'user'], axis=1, inplace=True)
df.shape
(376, 3)
df.dtypes
lang object
place object
text object
dtype: object
Fill in NaNs
There are rows where the text value is NaN, which caused the pre-processing steps to fail with a data type error. I could have filtered those rows out, but I chose to use fillna to replace the missing values with empty strings.
df['text'] = df['text'].fillna("")
df.text.dtype
dtype('O')
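The filtering alternative mentioned above could look something like the sketch below. It assumes the rows without text come from records in the raw file that have no `text` key at all (the `limit` column hints at non-tweet notices from the stream, but that is a guess about the data, not something checked here). `load_tweets` and the sample lines are hypothetical.

```python
import json

def load_tweets(lines):
    """Parse JSON lines, keeping only records that carry a non-empty 'text' field."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        # Keep the record only if it is a JSON object with usable text
        if isinstance(record, dict) and record.get("text"):
            tweets.append(record)
    return tweets

# Made-up sample input: one tweet, one non-tweet notice, one blank line, one bad line
sample = [
    '{"text": "Come on England!", "lang": "en"}',
    '{"limit": {"track": 5}}',
    '',
    'not json',
]
kept = load_tweets(sample)
print(len(kept))
```

Filtering at load time keeps empty rows out of the later counts, whereas fillna keeps all 376 rows and scores the empty ones as neutral.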
#df['cleaned'] = df['text'].str.replace('[^\w\s]', '')
#from nltk.corpus import stopwords
#stop = stopwords.words('english')
#df['text'] = df['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
#df['text'].head()
df['word_count'] = df['text'].apply(lambda x: len(str(x).split(" ")))
df.head(10)
index | lang | place | text | word_count |
---|---|---|---|---|
0 | en | None | This. https://t.co/UNKep8L5EY | 2 |
1 | en | None | Wanna see England in the final but Croatia is clearly the better side\n\n#ENGCRO #WorldCup | 14 |
2 | en | None | C’mon #England we really can do this. Dig deep. #EnglandvsCroatia #ThreeLionsOnAShirt | 11 |
3 | en | None | RT @FlickSaudi: If England lose This is How The Streets Of London Will Look https://t.co/V8AT2RRn5p | 15 |
4 | en | None | RT @hermannkelly: .@TonightShowTV3 I’m into the breach on the TV3 Tonight Show, Wednesday 11th July at 11pm. #WorldCup \nAs we already… | 23 |
5 | th | None | RT @mthai: เกมส์ยังไม่จบ... ต่อเวลาพิเศษ 30 นาที\n\nโครเอเชีย 🇭🇷 1 : 1 🏴 อังกฤษ\n เปริซิช 68' ⚽… | 47 |
6 | en | None | William made an England flag at nursery today. I hope he gets the need for it on Sunday. #ENGvCRO… https://t.co/Cjf76wWtPG | 20 |
7 | en | None | This is why england can never win d world cup again argue with ur ancestors #ENGCRO | 16 |
8 | en | None | RT @PurelyFootball: Hyde Park when Kieran Trippier scored for England🏴\n\nAbsolute scenes!🍻 https://t.co/SLb6dFcG0g | 12 |
9 | en | None | Honestly? I’m ashamed about the protests in England... I’m ashamed we aren’t as organized and doing as much protest… https://t.co/JBLcWtfgsX | 20 |
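One detail of the word count is worth a quick check: splitting on a single space treats newlines as part of a word, and an empty string still counts as one word. The snippet below is a small illustration using the text of row 1 from the table above; `split()` with no argument is shown only as a possible alternative.

```python
# Text of row 1, with the two real newlines before the hashtags
text = "Wanna see England in the final but Croatia is clearly the better side\n\n#ENGCRO #WorldCup"

by_space = len(text.split(" "))   # what the word_count column uses
by_whitespace = len(text.split()) # splits on any run of whitespace

print(by_space, by_whitespace)

# Rows whose text was filled with "" still get a count of 1 with split(" ")
empty_by_space = len("".split(" "))
empty_by_whitespace = len("".split())
print(empty_by_space, empty_by_whitespace)
```

With `split(" ")`, "side\n\n#ENGCRO" stays a single token, which is why row 1 shows 14 rather than 15.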
Punctuation removal
import string
df['no_punctuation'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df.head()
index | lang | place | text | word_count | no_punctuation |
---|---|---|---|---|---|
0 | en | None | This. https://t.co/UNKep8L5EY | 2 | This httpstcoUNKep8L5EY |
1 | en | None | Wanna see England in the final but Croatia is clearly the better side\n\n#ENGCRO #WorldCup | 14 | Wanna see England in the final but Croatia is clearly the better side\n\nENGCRO WorldCup |
2 | en | None | C’mon #England we really can do this. Dig deep. #EnglandvsCroatia #ThreeLionsOnAShirt | 11 | Cmon England we really can do this Dig deep EnglandvsCroatia ThreeLionsOnAShirt |
3 | en | None | RT @FlickSaudi: If England lose This is How The Streets Of London Will Look https://t.co/V8AT2RRn5p | 15 | RT FlickSaudi If England lose This is How The Streets Of London Will Look httpstcoV8AT2RRn5p |
4 | en | None | RT @hermannkelly: .@TonightShowTV3 I’m into the breach on the TV3 Tonight Show, Wednesday 11th July at 11pm. #WorldCup \nAs we already… | 23 | RT hermannkelly TonightShowTV3 Im into the breach on the TV3 Tonight Show Wednesday 11th July at 11pm WorldCup \nAs we already |
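As row 0 shows, stripping punctuation first turns a link into a meaningless token such as httpstcoUNKep8L5EY. The sketch below reproduces that with the standard `re` module and then removes the URL before the punctuation. The `url_pattern` used here is an assumption for illustration, not the pattern used later in this post.

```python
import re

tweet = "This. https://t.co/UNKep8L5EY"

# Punctuation removal alone glues the URL pieces together
no_punct = re.sub(r"[^\w\s]", "", tweet)
print(no_punct)

# Removing the URL first avoids the leftover token
url_pattern = r"https?://\S+"  # assumed pattern: scheme followed by non-space characters
no_url = re.sub(url_pattern, "", tweet)
cleaned = re.sub(r"[^\w\s]", "", no_url).strip()
print(cleaned)
```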
from bs4 import BeautifulSoup
import re
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
pat1 = r'@[A-Za-z0-9]+'
pat2 = r'https?://[A-Za-z0-9./]+'
combined_pat = r'|'.join((pat1, pat2))
def tweet_cleaner(text):
    # Strip any HTML markup and decode HTML entities
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # Remove @mentions and URLs
    stripped = re.sub(combined_pat, '', souped)
    try:
        # Only applies to byte strings (Python 2); a Python 3 str has no .decode()
        clean = stripped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        clean = stripped
    # Replace everything except ASCII letters with a space
    letters_only = re.sub("[^a-zA-Z]", " ", clean)
    lower_case = letters_only.lower()
    # The letters-only substitution above leaves runs of unnecessary whitespace,
    # so tokenize and re-join with single spaces to remove them
    words = tok.tokenize(lower_case)
    return (" ".join(words)).strip()
# Apply the cleaner to every tweet
df['cleaned'] = df['text'].apply(tweet_cleaner)
df.head(20)
index | lang | place | text | word_count | no_punctuation | cleaned |
---|---|---|---|---|---|---|
0 | en | None | This. https://t.co/UNKep8L5EY | 2 | This httpstcoUNKep8L5EY | this |
1 | en | None | Wanna see England in the final but Croatia is clearly the better side\n\n#ENGCRO #WorldCup | 14 | Wanna see England in the final but Croatia is clearly the better side\n\nENGCRO WorldCup | wanna see england in the final but croatia is clearly the better side engcro worldcup |
2 | en | None | C’mon #England we really can do this. Dig deep. #EnglandvsCroatia #ThreeLionsOnAShirt | 11 | Cmon England we really can do this Dig deep EnglandvsCroatia ThreeLionsOnAShirt | c mon england we really can do this dig deep englandvscroatia threelionsonashirt |
3 | en | None | RT @FlickSaudi: If England lose This is How The Streets Of London Will Look https://t.co/V8AT2RRn5p | 15 | RT FlickSaudi If England lose This is How The Streets Of London Will Look httpstcoV8AT2RRn5p | rt if england lose this is how the streets of london will look |
4 | en | None | RT @hermannkelly: .@TonightShowTV3 I’m into the breach on the TV3 Tonight Show, Wednesday 11th July at 11pm. #WorldCup \nAs we already… | 23 | RT hermannkelly TonightShowTV3 Im into the breach on the TV3 Tonight Show Wednesday 11th July at 11pm WorldCup \nAs we already | rt i m into the breach on the tv tonight show wednesday th july at pm worldcup as we already |
5 | th | None | RT @mthai: เกมส์ยังไม่จบ... ต่อเวลาพิเศษ 30 นาที\n\nโครเอเชีย 🇭🇷 1 : 1 🏴 อังกฤษ\n เปริซิช 68' ⚽… | 47 | RT mthai เกมสยงไมจบ ตอเวลาพเศษ 30 นาท\n\nโครเอเชย 1 1 องกฤษ\n เปรซช 68 | rt |
6 | en | None | William made an England flag at nursery today. I hope he gets the need for it on Sunday. #ENGvCRO… https://t.co/Cjf76wWtPG | 20 | William made an England flag at nursery today I hope he gets the need for it on Sunday ENGvCRO httpstcoCjf76wWtPG | william made an england flag at nursery today i hope he gets the need for it on sunday engvcro |
7 | en | None | This is why england can never win d world cup again argue with ur ancestors #ENGCRO | 16 | This is why england can never win d world cup again argue with ur ancestors ENGCRO | this is why england can never win d world cup again argue with ur ancestors engcro |
8 | en | None | RT @PurelyFootball: Hyde Park when Kieran Trippier scored for England🏴\n\nAbsolute scenes!🍻 https://t.co/SLb6dFcG0g | 12 | RT PurelyFootball Hyde Park when Kieran Trippier scored for England\n\nAbsolute scenes httpstcoSLb6dFcG0g | rt hyde park when kieran trippier scored for england absolute scenes |
9 | en | None | Honestly? I’m ashamed about the protests in England... I’m ashamed we aren’t as organized and doing as much protest… https://t.co/JBLcWtfgsX | 20 | Honestly Im ashamed about the protests in England Im ashamed we arent as organized and doing as much protest httpstcoJBLcWtfgsX | honestly i m ashamed about the protests in england i m ashamed we aren t as organized and doing as much protest |
10 | en | None | RT @SavageLord10: Snap back thou this an old video guyz help me go viral tag @Aylo_SA @tloucolt @shelm_eric @flickice #imsorrychallenge #M… | 22 | RT SavageLord10 Snap back thou this an old video guyz help me go viral tag Aylo_SA tloucolt shelm_eric flickice imsorrychallenge M | rt snap back thou this an old video guyz help me go viral tag sa eric imsorrychallenge m |
11 | en | None | RT @petertimmins3: If Croatia win by cheating tonight, I expect at least 17million people to respect the result. Especially you, @JuliaHB1 | 21 | RT petertimmins3 If Croatia win by cheating tonight I expect at least 17million people to respect the result Especially you JuliaHB1 | rt if croatia win by cheating tonight i expect at least million people to respect the result especially you |
12 | en | None | RT @KEEMSTAR: England wins \n\nI saw it in a dream | 10 | RT KEEMSTAR England wins \n\nI saw it in a dream | rt england wins i saw it in a dream |
13 | en | None | We’re going a bit Spursy here England 🙈 | 8 | Were going a bit Spursy here England | we re going a bit spursy here england |
14 | en | None | RT @Predictionhq: correct score 1-1✔️\nCroatia over 0.5 Team Goals ✔️\nCroatia over 3.5 corners ✔️\nOver 1.5 FT goals \n100% Record https://t.c… | 20 | RT Predictionhq correct score 11\nCroatia over 05 Team Goals \nCroatia over 35 corners \nOver 15 FT goals \n100 Record httpstc | rt correct score croatia over team goals croatia over corners over ft goals record |
15 | en | None | RT @alexandramusic: Literally can’t cope. COME ON ENGLAND !!! 🏴 | 10 | RT alexandramusic Literally cant cope COME ON ENGLAND | rt literally can t cope come on england |
16 | en | None | Genuinely feel sick right now #ENGCRO #WorldCup | 7 | Genuinely feel sick right now ENGCRO WorldCup | genuinely feel sick right now engcro worldcup |
17 | en | None | Okay the pace seems to be slightly better for #ENGCRO - not sure it's going to be enough at this stage though. Typi… https://t.co/CXsEmkrss4 | 24 | Okay the pace seems to be slightly better for ENGCRO not sure its going to be enough at this stage though Typi httpstcoCXsEmkrss4 | okay the pace seems to be slightly better for engcro not sure it s going to be enough at this stage though typi |
18 | en | None | Ok extra time let’s wrap this up and return England to their natural state; moaning into their pints… https://t.co/8qQLHyjHxN | 19 | Ok extra time lets wrap this up and return England to their natural state moaning into their pints httpstco8qQLHyjHxN | ok extra time let s wrap this up and return england to their natural state moaning into their pints |
19 | en | None | I’d love some sideline reporting from @BarstoolBigCat in the next World Cup | 12 | Id love some sideline reporting from BarstoolBigCat in the next World Cup | i d love some sideline reporting from in the next world cup |
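Row 10 keeps the fragments "sa" and "eric" because the mention pattern `@[A-Za-z0-9]+` stops at an underscore, so handles like @Aylo_SA are only partly removed. The sketch below shows the difference with a wider pattern; `@\w+` is a suggested alternative, not what the cleaner above uses.

```python
import re

tweet = "help me go viral tag @Aylo_SA @tloucolt @shelm_eric"

# Pattern from the cleaner: letters and digits only, so "_SA" and "_eric" survive
narrow = re.sub(r"@[A-Za-z0-9]+", "", tweet)

# Assumed alternative: \w also matches the underscore, so the whole handle goes
wide = re.sub(r"@\w+", "", tweet)

# Collapse leftover whitespace before printing
print(" ".join(narrow.split()))
print(" ".join(wide.split()))
```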
Stopwords Removal
#stop = set(stopwords.words('english'))
#df['final'] = df['cleaned'].apply(lambda x: [item for item in x if item not in stop])
#df['text'].apply(lambda x: [item for item in x if item not in stop])
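The commented-out attempts above loop over the string itself, which yields single characters rather than words, so the text has to be split first. Below is a minimal sketch of that idea. The `STOP` set is a tiny hand-written list used only for illustration; `set(stopwords.words('english'))` from NLTK could be swapped in once the corpus is downloaded. `remove_stopwords` is a hypothetical helper name.

```python
# Tiny illustrative stop-word list (an assumption for this sketch)
STOP = {"a", "an", "the", "in", "is", "but", "to", "of", "we", "this", "it"}

def remove_stopwords(cleaned_text, stop=STOP):
    """Drop stop words from an already lower-cased, whitespace-separated string."""
    return " ".join(word for word in cleaned_text.split() if word not in stop)

example = "wanna see england in the final but croatia is clearly the better side"
filtered = remove_stopwords(example)
print(filtered)
```

On the DataFrame this could be applied as `df['cleaned'].apply(remove_stopwords)`.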
Sentiment Analysis
TextBlob's sentiment property is used to detect the sentiment of each cleaned tweet. It returns a tuple of polarity and subjectivity. The polarity indicates the sentiment: a positive tweet has a value closer to 1, while a negative tweet has a value closer to -1.
from textblob import TextBlob
train = df.cleaned[:100]
#classifier = NaiveBayesClassifier(train, format=None)
#train.apply(lambda x: TextBlob(x).sentiment)
df['sentiment'] = df['cleaned'].apply(lambda x: TextBlob(x).sentiment[0])
df['positive'] = df['sentiment'] > 0
df['negative'] = df['sentiment'] < 0
df['neutral'] = df['sentiment'] == 0
labels = ['Negative', 'Positive', 'Neutral']
df_summary = pd.DataFrame([df['negative'].sum(), df['positive'].sum(), df['neutral'].sum()], index=labels)
df_summary
index | 0 |
---|---|
Negative | 47 |
Positive | 124 |
Neutral | 205 |
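To make the thresholds and the summary easier to read, the sketch below maps a polarity score to a label in the same way as the three boolean columns and turns the counts from the table above into percentages. `label_polarity` is a hypothetical helper and the `scores` list is made-up example data, not output from TextBlob.

```python
from collections import Counter

def label_polarity(polarity):
    """Map a polarity score in [-1, 1] to a label, using 0 as the cut-off."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# Made-up scores, only to show the mapping
scores = [0.5, -0.2, 0.0, 0.8, 0.0]
counts = Counter(label_polarity(s) for s in scores)
print(counts)

# Counts taken from the summary table above
summary = {"Negative": 47, "Positive": 124, "Neutral": 205}
total = sum(summary.values())
shares = {label: round(100 * n / total, 1) for label, n in summary.items()}
print(total, shares)
```

More than half of the tweets score exactly 0, which is worth keeping in mind when reading the bar chart.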
from matplotlib import pyplot as plt
%matplotlib inline
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 12
fig_size[1] = 9
plt.rcParams["figure.figsize"] = fig_size
df_summary.plot(kind='bar');
fig = plt.gcf()
fig.set_size_inches(12, 9)
# Polarity of each tweet, in collection order
plt.scatter(df.index.values, df['sentiment']);
from wordcloud import WordCloud, STOPWORDS
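wc_stopwords = set(STOPWORDS)  # separate name so the NLTK stopwords import is not shadowed
# Join all cleaned tweets into one string; str(df['cleaned']) would pass the
# printed form of the Series (index numbers included) instead of the tweet text
wordcloud = WordCloud(
    background_color='white',
    stopwords=wc_stopwords,
    max_words=200,
    max_font_size=40,
    random_state=42
).generate(' '.join(df['cleaned']))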
print(wordcloud)
fig = plt.figure(1)
plt.imshow(wordcloud)
plt.axis('off')
plt.show();
<wordcloud.wordcloud.WordCloud object at 0x111d36be0>