For today’s challenge, we will continue to navigate through our data using exploratory techniques.
Over the next days, I will explore TensorFlow for at least one hour per day and post the notebooks, data and models to this repository.
Today’s notebook is available here.
Exploring our data
As mentioned before, our models have produced pretty poor results so far. Instead of trying to tweak the model again, let’s learn about the data exploration process.
Let’s continue to explore our data and see if we can improve our model later.
Get data and do imports
# imports
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
    data = pd.read_csv(path, index_col=0)
    return data

# get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install package requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia
Last time, we saw what seemed to be an outlier sentence, much longer than the average. Let’s look at it again.
# get a list with all the texts
texts = data.text

# start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens

# plot
plt.figure(figsize=(10,10))
plt.boxplot(data.lengths)
plt.show()
Most of the sentences are short. They range between ~5 and 25 tokens.
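To put numbers on “short”, a five-number summary of the token counts gives the quartiles directly. This is just a sketch: the list of lengths below is hypothetical, whereas in the notebook they come from `data['lengths']`.

```python
import pandas as pd

# hypothetical token counts per sentence; in the notebook these
# come from data['lengths'] computed with atalaia.tokenize
lengths = pd.Series([5, 8, 12, 15, 18, 22, 25, 90])

# five-number summary: min, quartiles, max
summary = lengths.describe()
print(summary[['min', '25%', '50%', '75%', 'max']])
```

The same `describe()` call on the real `data.lengths` column is a quick sanity check before plotting the boxplot.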
Before choosing a method to detect and exclude outliers, it’s a good idea to check whether the data follows a normal distribution. As the test below shows, it does not.
import scipy.stats

# null hypothesis: the sentences come from a normal distribution
k2, p = scipy.stats.normaltest(lens)
alpha = 0.05

if p < alpha:
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")

>>> The null hypothesis can be rejected
For the cases where we have a non-Gaussian distribution sample of data, we can use the Interquartile Range (IQR). It is calculated as the difference between the 75th and the 25th percentiles of the data.
# calculate the interquartile range
q25 = np.percentile(data.lengths, 25)
q75 = np.percentile(data.lengths, 75)
iqr = q75 - q25

# calculate the outlier cutoff
cutoff = iqr * 1.5
lower = q25 - cutoff
upper = q75 + cutoff

# identify outliers
texts = data.text
outliers = [texts[i] for i, text_len in enumerate(lens)
            if text_len < lower or text_len > upper]
outliers

>>> ["Best I've found so far .... I've tried 2 other bluetooths and this one has the best quality (for both me and the listener) as well as ease of using.", 'I even fully charged it before I went to bed and turned off blue tooth and wi-fi and noticed that it only had 20 % left in the morning.', 'My experience was terrible..... This was my fourth bluetooth headset and while it was much more comfortable than my last Jabra (which I HATED!!!', 'But now that it is "out of warranty" the same problems reoccure.Bottom line... put your money somewhere else... Cingular will not support it.', "Bland... Not a liking this place for a number of reasons and I don't want to waste time on bad reviewing.. 
I'll leave it at that...", 'As for the "mains also uninspired.\t0\nThis is the place where I first had pho and it was amazing!!\t1\nThis wonderful experience made this place a must-stop whenever we are in town again.\t1\nIf the food isn\'t bad enough for you, then enjoy dealing with the world\'s worst/annoying drunk people.\t0\nVery very fun chef.\t1\nOrdered a double cheeseburger & got a single patty that was falling apart (picture uploaded) Yeah, still sucks.\t0\nGreat place to have a couple drinks and watch any and all sporting events as the walls are covered with TV\'s.\t1\nIf it were possible to give them zero stars, they\'d have it.\t0\nThe descriptions said "yum yum sauce" and another said "eel sauce yet another said "spicy mayo"...well NONE of the rolls had sauces on them.', 'This is was due to the fact that it took 20 minutes to be acknowledged then another 35 minutes to get our food...and they kept forgetting things.', 'a drive thru means you do not want to wait around for half an hour for your food but somehow when we end up going here they make us wait and wait.', "Paying $7.85 for a hot dog and fries that looks like it came out of a kid's meal at the Wienerschnitzel is not my idea of a good meal.", 'So good I am going to have to review this place twice - once hereas a tribute to the place and once as a tribute to an event held here last night.', 'The problem I have is that they charge $11.99 for a sandwich that is no bigger than a Subway sub (which offers better and more amount of vegetables).']
Notice that one of the observations is not a single sentence! If you take a look at the 6th one, you will see that it contains multiple sentences and multiple sentiment labels (\t0 or \t1).
Let’s check which sentiment was assigned to this long “sentence” in our dataset:
# the long multi-sentence row is the 6th outlier (index 5)
data[data.text == outliers[5]]
As you can see:
- We have 8 individual sentences in a single row.
- 3 of them are positive
- But all of them are considered as negative…
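Since each embedded sentence carries its own tab-separated label, a small regex can split the row back into (sentence, label) pairs and make the mismatch visible. This is only a sketch: the `row` string below is a shortened, hypothetical stand-in for the actual outlier.

```python
import re

# shortened stand-in for the malformed row: several sentences were
# concatenated, each followed by its tab-separated label (\t0 or \t1)
row = 'mains also uninspired.\t0\nVery very fun chef.\t1\nYeah, still sucks.\t0'

# split into (sentence, label) pairs using the embedded markers
pairs = re.findall(r'(.+?)\t([01])\n?', row)
for sentence, label in pairs:
    print(label, sentence)
```

Run on the real row, this would recover 8 pairs, 3 of them labeled positive, even though the whole row was stored as a single negative example.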
What if this is happening in other rows too? Let’s try to find other rows with a similar problem.
data[data.text.str.contains(r'\t1|\t0', regex=True, na=False)]
So, only this row has this problem. Let’s exclude it to keep it simple.
# find the index of the malformed row and drop it
bad_index = data[data.text.str.contains(r'\t1|\t0', regex=True, na=False)].index
data = data.drop(index=bad_index)
If we replot the boxplot, we can see that the really long sentence is gone. We didn’t remove the other outliers, which is why you can still see the four dots above the box.
# plot
plt.figure(figsize=(10,10))
plt.boxplot(data.lengths)
plt.show()