Lima Vallantin
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send me a message. Want to know more? Visit the about page.



For today’s challenge, we will start to navigate through our data using exploratory techniques.

During the next days, I will explore Tensorflow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

Exploring our data

So far, we have been getting pretty poor results with our models. Instead of trying to tweak the model again, let's learn about the data exploration process.

Usually, this is done before even starting to build a model. Exploration will help you to decide how you should preprocess the text and may give you important insights about your data.

Let’s begin by downloading our data and importing packages.

Get data and do imports

# imports
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

#get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!pip install .

# import it
from atalaia.atalaia import Atalaia

Now, let's check a few basic stats for this data. The first thing is to understand the average size of our sentences.

# get a list with all the texts
texts = data.text
#start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
# get the lengths
lens = [len(atalaia.tokenize(t)) for t in texts]
# plot
plt.hist(lens, bins=30)

Most of the sentences are short. They range between ~5 and 25 tokens.
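To put numbers on that spread, we can look at the quartiles of the length distribution. A minimal sketch with NumPy (the toy values below stand in for the real `lens` list computed above):

```python
import numpy as np

# token counts per sentence (toy stand-in for the `lens` list above)
lens = [5, 7, 12, 18, 25, 9, 6, 14]

# quartiles summarize the length distribution numerically:
# half the sentences fall between q1 and q3
q1, median, q3 = np.percentile(lens, [25, 50, 75])
print(q1, median, q3)  # → 6.75 10.5 15.0
```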

We can use a boxplot to better visualize the distribution of sentence lengths and to spot outliers.
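The boxplot cell is not reproduced in the post; a minimal sketch with matplotlib's `boxplot` could look like the following (the toy lengths, the simulated outlier, and the `Agg` backend are my assumptions, not the notebook's code):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for scripted runs
import matplotlib.pyplot as plt

# toy stand-in for the real `lens` list; 60 simulates an outlier
lens = [5, 7, 12, 18, 25, 9, 6, 14, 60]

fig, ax = plt.subplots()
ax.boxplot(lens, vert=False)  # horizontal box; outliers drawn as points
ax.set_xlabel('tokens per sentence')
fig.savefig('/tmp/lengths_boxplot.png')
```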


Which words account for 30% of the text?

#create corpus
corpus = atalaia.create_corpus(texts)
# let's lowercase everything first
texts_lower = atalaia.lower_remove_white(corpus)
# plot
token_data = atalaia.representative_tokens(0.3, texts_lower)

token_data     = token_data.items()
token_data     = list(token_data)[:10]
tokens, counts = zip(*token_data)

# plot
plt.bar(tokens, counts)
plt.xticks(rotation=45)
plt.show()

If we remove the stop words, which words are the most representative?

# let's remove the stop words...
texts_no_stopwords = atalaia.remove_stopwords(texts_lower)
# and the punctuation.
texts_no_stopwords = atalaia.remove_punctuation(texts_no_stopwords)

token_data_no_stop = atalaia.representative_tokens(0.3, texts_no_stopwords)

token_data_no_stop = token_data_no_stop.items()
token_data_no_stop = list(token_data_no_stop)[:10]
tokens, counts     = zip(*token_data_no_stop)

# plot
plt.bar(tokens, counts)
plt.xticks(rotation=45)
plt.show()

We start to see some words carrying sentiment with them, like “good” or “great”. “Not” by itself is not a good indicator of sentiment, since it can invert the meaning of the word that follows it. Example:

  • This is not bad -> Positive sentiment
  • This is not good -> Negative sentiment
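One common workaround, not used in this notebook, is to merge “not” with the token that follows it so that the pair becomes a single feature the model can learn from. A hand-rolled sketch (the `mark_negation` helper below is hypothetical, not part of atalaia):

```python
def mark_negation(tokens):
    """Merge 'not' with the next token, so 'not good' becomes 'not_good'."""
    out = []
    skip = False
    for i, tok in enumerate(tokens):
        if skip:          # current token was already merged into 'not_...'
            skip = False
            continue
        if tok == 'not' and i + 1 < len(tokens):
            out.append('not_' + tokens[i + 1])
            skip = True
        else:
            out.append(tok)
    return out

print(mark_negation(['this', 'is', 'not', 'bad']))
# → ['this', 'is', 'not_bad']
```

This way, “not bad” and “bad” produce different tokens instead of sharing the strongly negative “bad”.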

What about hapaxes? These are words that occur only once within a given context. In this case, we will look for hapaxes across the whole corpus. Do we have any?

hapaxes = atalaia.hapaxes(texts_no_stopwords)
print('Found {} hapaxes. Showing the first 50.'.format(len(hapaxes)))
print(hapaxes[:50])


Texts from the internet are often misspelled. Some misspellings end up as hapaxes, like “unacceptible” and “couldnt”.
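Atalaia's `hapaxes` helper does this for us; as an illustration of what it computes, here is a minimal stand-alone version using only `collections.Counter` (the `find_hapaxes` name is mine, not part of atalaia):

```python
from collections import Counter

def find_hapaxes(tokenized_texts):
    """Return tokens that appear exactly once across the whole corpus."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    return [tok for tok, n in counts.items() if n == 1]

corpus = [['good', 'case', 'good', 'value'],
          ['unacceptible', 'service'],   # misspelling -> likely a hapax
          ['great', 'case']]
print(find_hapaxes(corpus))
# → ['value', 'unacceptible', 'service', 'great']
```

Counting hapaxes like this is also a cheap way to estimate how much noise (typos, rare names) a corpus contains.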

Another way to visualize things is using wordclouds.

def preprocess(texts):
  # preprocess 
  texts = list(texts)
  # lower
  texts = [atalaia.lower_remove_white(t) for t in texts]
  # remove punctuation
  texts = [atalaia.remove_punctuation(t) for t in texts]
  # remove numbers
  texts = [atalaia.remove_numbers(t) for t in texts]
  # remove stopwords
  texts = [atalaia.remove_stopwords(t) for t in texts]
  # tokenize 
  texts = [atalaia.tokenize(t) for t in texts]

  return texts

texts_preprocessed = preprocess(texts)
# preview
texts_preprocessed[:4]

>>> [['no', 'way', 'plug', 'us', 'unless', 'go', 'converter'],
 ['good', 'case', 'excellent', 'value'],
 ['great', 'jawbone'],
 ['mic', 'great']]
# wordcloud
def gen_wordcloud(texts_preprocessed):
  # join the token lists back into one string
  text = ' '.join(' '.join(t) for t in texts_preprocessed)
  wordcloud = WordCloud(background_color='white',
                        stopwords=STOPWORDS).generate(text)
  fig = plt.figure(1, figsize=(20, 12))
  plt.imshow(wordcloud)
  plt.axis('off')
  plt.show()

# generate wordcloud
gen_wordcloud(texts_preprocessed)

We will continue to explore our data in the next days.
