#100DaysOfCode, #100DaysOfData, #100DaysOfTensorflow

More exploratory analysis (NLP) Part 3

For today’s challenge, we will continue to navigate through our data using exploratory techniques. Textual data is so rich that it is worth this kind of in-depth analysis.

During the next days, I will explore Tensorflow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

Exploring our data

Today, let’s apply other exploratory techniques to better understand our corpus.

As usual, we will start by importing the packages, excluding stop-words and the extra sentence we saw yesterday.

# imports
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

import numpy as np
from numpy import mean
from numpy import std
from numpy import percentile
import pandas as pd
import scipy

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

#get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# get a list with all the texts
texts = data.text

#start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
# get the lengths
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens
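In case you are wondering how an outlier like the one we drop below could be detected, a common approach is the interquartile range (IQR) rule on the sentence lengths. Here is a minimal sketch with a toy list of lengths (not the real data):

```python
import numpy as np

# toy sentence lengths; in the notebook, these come from tokenizing the corpus
lengths = [8, 10, 12, 9, 11, 10, 95]

# IQR rule: anything far outside the middle 50% of the data is suspect
q1, q3 = np.percentile(lengths, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# indices of outlying sentences
outliers = [i for i, l in enumerate(lengths) if l < lower or l > upper]
print(outliers)  # → [6]
```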

# delete the outlier we detected yesterday
data = data.drop(index=[1228])

# lower everything
data['text'] = [atalaia.lower_remove_white(t) for t in data['text']]

# exclude stopwords
data['text'] = [atalaia.remove_stopwords(t) for t in data['text']]
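For reference, the two steps above (lower-casing and stop-word removal) do something along these lines — a plain-Python sketch with a toy stop-word list, not Atalaia's actual implementation:

```python
# tiny toy stop-word list, for illustration only
stopwords = {'the', 'is', 'a', 'and'}

def preprocess(text):
    # lower-case and collapse extra whitespace
    tokens = text.lower().split()
    # drop stop-words
    return ' '.join(t for t in tokens if t not in stopwords)

print(preprocess('The   food is GREAT and the place is nice'))
# → food great place nice
```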

Let’s try to get the most representative words for our positive and for our negative corpus.

def plot_representative_words_for_sentiment(sentences, data):
  # start atalaia
  atalaia = Atalaia('en')

  # transform into corpus
  sentences = atalaia.create_corpus(sentences)

  # get the representative words for 80% of the corpus
  token_data = atalaia.representative_tokens(0.8, sentences)

  full_token_data                     = token_data.items()
  full_token_data_tokens, full_counts = zip(*full_token_data)

  # keep only the 10 most frequent tokens for plotting
  token_data                          = list(full_token_data)[:10]
  tokens, counts                      = zip(*token_data)

  # plot the top tokens as a bar chart
  plt.bar(tokens, counts)
  plt.xticks(rotation=45)
  plt.show()

  # return the full tokens list
  return full_token_data_tokens
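The key call here is `representative_tokens`, which returns the most frequent tokens covering a given share of all token occurrences. The idea can be sketched in plain Python (an approximation of the behavior, not Atalaia's actual code):

```python
from collections import Counter

def representative_tokens(threshold, corpus):
    # count tokens, then keep the most frequent ones until their
    # cumulative share of all occurrences reaches the threshold
    counts = Counter(corpus.split())
    total = sum(counts.values())
    kept, cumulative = {}, 0
    for token, count in counts.most_common():
        if cumulative / total >= threshold:
            break
        kept[token] = count
        cumulative += count
    return kept

corpus = 'phone great phone bad food great place phone'
print(representative_tokens(0.8, corpus))
```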

Plot positive first.

  # get sentences
  positive        = list(data[data.sentiment  == 1]['text'])
  positive_tokens = plot_representative_words_for_sentiment(positive,data)

Now, plot negative.

  # get sentences
  negative        = list(data[data.sentiment  == 0]['text'])
  negative_tokens = plot_representative_words_for_sentiment(negative,data)

We saw that some tokens are common to both groups of sentences. “Phone”, for instance, appears a lot in both positive and negative sentences. The same goes for “place” and “food”.

Which tokens account for 80% of the corpus and appear in both sets of sentences?

A few things to think about:

  • Should we filter these words from our corpus?
  • Should we exclude punctuation?
  • What other analyses could we run on this corpus?

# get the intersection of negative and positive tokens
intersection = list(set(positive_tokens) & set(negative_tokens))
>>> ['tell',

Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
