# Exploratory analysis (NLP) Part 2

###### Wilame
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send me a message. Want to know more? Visit the about page.


For today’s challenge, we will continue to navigate through our data using exploratory techniques.

Over the next few days, I will explore TensorFlow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

## Exploring our data

As I said before, our models have given pretty poor results so far. Instead of tweaking the model yet again, let’s learn about the data exploration process.

Let’s continue to explore our data and see if we can improve our model later.

## Get data and do imports

```
# imports
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \

# define get_data function
def get_data(path):
    # assumption: the function body is not shown here;
    # a plain CSV read is the minimum it needs to return a DataFrame
    data = pd.read_csv(path)
    return data

# get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia
```

Last time, we saw what seemed to be an outlier: a sentence much longer than average. Let’s look at it again.

```
# get a list with all the texts
texts = data.text

# start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens

# plot
plt.figure(figsize=(10,10))
plt.boxplot(data.lengths)
plt.show()
```

Most of the sentences are short. They range between ~5 and 25 tokens.

Before choosing a method to find and exclude outliers, it’s a good idea to check whether the lengths follow a normal distribution. Judging from the boxplot, they probably don’t, so let’s test it.

```
import scipy.stats

# D'Agostino-Pearson normality test on the sentence lengths
k2, p = scipy.stats.normaltest(lens)
alpha = 0.05

if p < alpha:  # null hypothesis: the sentences come from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")

>>> The null hypothesis can be rejected
```
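To get a feeling for what this test reports, here is a minimal sketch on synthetic data. The helper name `rejects_normality` and the exponential sample are assumptions made for illustration; they are not part of the notebook. The sample is deliberately right-skewed, a shape that length distributions often have.

```python
import numpy as np
from scipy import stats

def rejects_normality(sample, alpha=0.05):
    """Return True when scipy's normaltest rejects the normality hypothesis."""
    _, p_value = stats.normaltest(sample)
    return bool(p_value < alpha)

# synthetic, strongly right-skewed sample (illustration only, not our data)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10, size=1000)

print(rejects_normality(skewed))
```

A small p-value only tells us the sample is unlikely to come from a normal distribution; it says nothing about which observations are the outliers.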

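When a sample does not follow a Gaussian distribution, we can use the Interquartile Range (IQR) to find outliers. The IQR is the difference between the 75th and the 25th percentiles of the data, and any value more than 1.5 times the IQR below the 25th percentile or above the 75th percentile is flagged as an outlier.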

```
# calculate interquartile range
q25 = np.percentile(data.lengths, 25)
q75 = np.percentile(data.lengths, 75)
iqr = q75 - q25

# calculate the cutoff
cutoff = iqr * 1.5
lower  = q25 - cutoff
upper  = q75 + cutoff

# identify outliers
texts = data.text
outliers = [texts[i] for i, text_len in enumerate(lens) if text_len < lower or text_len > upper]
outliers

>>> ["Best I've found so far .... I've tried 2 other bluetooths and this one has the best quality (for both me and the listener) as well as ease of using.",
'I even fully charged it before I went to bed and turned off blue tooth and wi-fi and noticed that it only had 20 % left in the morning.',
'My experience was terrible..... This was my fourth bluetooth headset and while it was much more comfortable than my last Jabra (which I HATED!!!',
'But now that it is "out of warranty" the same problems reoccure.Bottom line... put your money somewhere else... Cingular will not support it.',
"Bland... Not a liking this place for a number of reasons and I don't want to waste time on bad reviewing.. I'll leave it at that...",
'As for the "mains also uninspired.\t0\nThis is the place where I first had pho and it was amazing!!\t1\nThis wonderful experience made this place a must-stop whenever we are in town again.\t1\nIf the food isn\'t bad enough for you, then enjoy dealing with the world\'s worst/annoying drunk people.\t0\nVery very fun chef.\t1\nOrdered a double cheeseburger & got a single patty that was falling apart (picture uploaded) Yeah, still sucks.\t0\nGreat place to have a couple drinks and watch any and all sporting events as the walls are covered with TV\'s.\t1\nIf it were possible to give them zero stars, they\'d have it.\t0\nThe descriptions said "yum yum sauce" and another said "eel sauce yet another said "spicy mayo"...well NONE of the rolls had sauces on them.',
'This is was due to the fact that it took 20 minutes to be acknowledged then another 35 minutes to get our food...and they kept forgetting things.',
'a drive thru means you do not want to wait around for half an hour for your food but somehow when we end up going here they make us wait and wait.',
"Paying $7.85 for a hot dog and fries that looks like it came out of a kid's meal at the Wienerschnitzel is not my idea of a good meal.",
'So good I am going to have to review this place twice - once hereas a tribute to the place and once as a tribute to an event held here last night.',
'The problem I have is that they charge $11.99 for a sandwich that is no bigger than a Subway sub (which offers better and more amount of vegetables).']
```

Notice that one of the observations is not a single sentence! If you take a look at the 6th one, you will see that it contains multiple sentences and multiple sentiments (\t0 or \t1).

Let’s check which sentiment is assigned to this long “sentence” in our dataset.

``data[data.text == outliers[5]]``

As you can see:

• We have 9 individual reviews in a single row (8 of them carry their own \t0 or \t1 label inside the text).
• 4 of them are labeled as positive.
• But the row as a whole is considered negative…

What if this is happening in other rows too? Let’s try to find other rows with a similar problem.

``data[data.text.str.contains('\t1|\t0', regex= True, na=False)]``

So, only this row has this problem. Let’s exclude it to keep it simple.

``data = data.drop(index = [1228])``
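Dropping the row is the simplest fix. If we wanted to keep those reviews instead, the packed text could be split back into individual (text, label) pairs. The sketch below is only an illustration under assumptions: `split_packed_row` and `row_label` are hypothetical names, the toy string is made up, and the code assumes that every embedded review ends with a tab followed by its label, as in the outlier above.

```python
def split_packed_row(text, row_label):
    """Split a text that swallowed several 'review<TAB>label' lines.

    Every chunk except the last one carries its own label after a tab.
    The last chunk has no embedded label, so it takes the row's label.
    """
    records = []
    for chunk in text.split("\n"):
        review, sep, label = chunk.rpartition("\t")
        if sep:
            # the chunk ends with an embedded "\t0" or "\t1"
            records.append((review, int(label)))
        else:
            # no tab found: the whole chunk is the review text
            records.append((chunk, row_label))
    return records

# toy example with the same shape as the problematic row (made-up content)
packed = "Great pho!\t1\nVery very fun chef.\t1\nStill sucks.\t0\nNo sauce on the rolls."
recovered = split_packed_row(packed, row_label=0)
print(recovered)
```

The recovered pairs could then be appended to the DataFrame as new rows. Since only one row is affected here, dropping it costs us very little.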

If we replot the boxplot, we can see that the really long sentence is gone. We didn’t get rid of the other outliers, which is why you can still see the 4 dots above the box.

```
# plot
plt.figure(figsize=(10,10))
plt.boxplot(data.lengths)
plt.show()
```
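If we later decide to remove the remaining outliers as well, the IQR rule can be wrapped in a small helper that returns a boolean mask. This is a sketch rather than part of the notebook: `iqr_mask` is a hypothetical name, and the toy list only stands in for `data.lengths`.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Boolean mask: True where a value lies inside [q25 - k*IQR, q75 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    cutoff = k * (q75 - q25)
    return (values >= q25 - cutoff) & (values <= q75 + cutoff)

# toy token counts standing in for data.lengths
toy_lengths = [5, 8, 10, 12, 12, 14, 15, 18, 20, 150]
mask = iqr_mask(toy_lengths)
print(mask)
```

With our DataFrame, something like `data[iqr_mask(data.lengths)]` would keep only the rows inside the bounds. Whether those slightly longer reviews should be removed at all is a separate question, since they look like perfectly valid sentences.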