Lima Vallantin
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German.



For today’s challenge, we will continue to navigate through our data using exploratory techniques. Textual data is so rich that it is worth this kind of in-depth analysis.

Over the next few days, I will explore TensorFlow for at least 1 hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

Exploring our data

Today, let’s apply other exploratory techniques to better understand our corpus.

As usual, we will start by importing the packages, then remove the stop words and the extra sentence we saw yesterday.

# imports
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import numpy as np
from numpy import mean
from numpy import std
from numpy import percentile
import pandas as pd
import scipy

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

#get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python install

# import it
from atalaia.atalaia import Atalaia

# get a list with all the texts
texts = data.text

#start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
# get the lengths
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens

#delete outliers
data = data.drop(index = [1228])

# lower everything
data['text'] = [atalaia.lower_remove_white(t) for t in data['text']]

# exclude stopwords
data['text'] = [atalaia.remove_stopwords(t) for t in data['text']]
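
For readers who don't have Atalaia installed, here is a minimal sketch of what these two cleaning steps might look like in plain Python. The helper names and the tiny stop-word list are hypothetical, and Atalaia's own `lower_remove_white` and `remove_stopwords` may behave differently.

```python
import re

# tiny illustrative stop-word list (an assumption, not the list Atalaia uses)
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it"}

def lower_and_squeeze(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def drop_stop_words(text, stop_words=STOP_WORDS):
    """Remove stop words from an already normalized sentence."""
    return " ".join(t for t in text.split() if t not in stop_words)

# normalize first, then filter, in the same order as the notebook
cleaned = drop_stop_words(lower_and_squeeze("  The   phone is  GREAT "))
print(cleaned)
```

Lowercasing before filtering matters here: otherwise "The" would not match "the" in the stop-word list.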

Let’s try to get the most representative words for our positive and for our negative corpus.

def plot_representative_words_for_sentiment(sentences, data):
  # start atalaia
  atalaia = Atalaia('en')

  # transform into corpus
  sentences = atalaia.create_corpus(sentences)

  # get the representative words for 80% of the corpus
  token_data = atalaia.representative_tokens(0.8, 

  full_token_data                     = token_data.items()
  full_token_data_tokens, full_counts = zip(*full_token_data)

  token_data                          = list(full_token_data)[:10]
  tokens, counts                      = zip(*token_data)

  # plot

  # return tokens list
  return full_token_data_tokens
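
One way to read "representative words for 80% of the corpus" is: the most frequent tokens whose counts, added together, cover 80% of all token occurrences. The sketch below implements that reading with `collections.Counter`. It is an assumption about what the call does, not Atalaia's actual implementation, and `representative_tokens_sketch` is a hypothetical helper.

```python
from collections import Counter

def representative_tokens_sketch(sentences, coverage=0.8):
    """Return {token: count} for the most frequent tokens that together
    account for at least `coverage` of all token occurrences."""
    # naive whitespace tokenization; Atalaia's tokenizer may differ
    counts = Counter(token for s in sentences for token in s.split())
    total = sum(counts.values())
    selected = {}
    running = 0
    for token, count in counts.most_common():
        # stop once the tokens kept so far reach the requested coverage
        if total == 0 or running / total >= coverage:
            break
        selected[token] = count
        running += count
    return selected

# toy sentences, not the sentiment dataset
sample = ["good phone", "good food", "good place", "bad phone"]
top = representative_tokens_sketch(sample, coverage=0.5)
print(top)
```

Because the result keeps the most frequent tokens first, taking the first ten items gives the same kind of "top 10" slice the function above uses for plotting.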

Plot positive first.

# get sentences
positive        = list(data[data.sentiment == 1]['text'])
positive_tokens = plot_representative_words_for_sentiment(positive, data)

Now, plot negative.

# get sentences
negative        = list(data[data.sentiment == 0]['text'])
negative_tokens = plot_representative_words_for_sentiment(negative, data)

We saw that some tokens are common to both groups of sentences. “Phone”, for instance, appears a lot in both positive and negative sentences. The same goes for “place” and “food”.

Which of the tokens that account for 80% of each corpus appear in both sets of sentences?

A few things to think about:

  • Should we filter these words from our corpus?
  • Should we exclude punctuation?
  • What other analyses could we run on this corpus?

# get the intersection of negative and positive tokens
intersection = list(set(positive_tokens) & set(negative_tokens))
>>> ['tell',
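
To make the first two questions above more concrete, here is a small sketch of how the shared tokens (and, optionally, punctuation) could be removed from each sentence. The token lists are toy data and `remove_shared_tokens` is a hypothetical helper, not part of Atalaia.

```python
import string

def remove_shared_tokens(sentences, shared, strip_punctuation=True):
    """Drop every token found in `shared` from each sentence."""
    shared = set(shared)
    # translation table that deletes all ASCII punctuation characters
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for sentence in sentences:
        if strip_punctuation:
            sentence = sentence.translate(table)
        kept = [t for t in sentence.split() if t not in shared]
        cleaned.append(" ".join(kept))
    return cleaned

# toy token lists standing in for positive_tokens and negative_tokens
toy_positive = ["great", "phone", "food", "love"]
toy_negative = ["bad", "phone", "food", "slow"]

# sets have no guaranteed order, so sort to get a stable list
shared = sorted(set(toy_positive) & set(toy_negative))
filtered = remove_shared_tokens(["great phone, love it!", "bad food."], shared)
print(shared, filtered)
```

Whether this filtering helps is an open question: tokens such as "phone" carry little sentiment on their own, but removing them also shortens sentences that are already brief.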
