Lima Vallantin
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send me a message. Want to know more? Visit the about page.


For today’s challenge, we won’t make many changes beyond the ones we have already made. Let’s try one last thing with our corpus before retraining the model.

Over the coming days, I will explore TensorFlow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

Preprocess our data

Now that we have explored our dataset, let’s make a few changes to it and retrain the model.

We will:

  • Exclude the outlier sentence
  • Exclude stop words

By the way: I forgot to mention that I made a few changes to the atalaia package so that it matches more stop words.

In the last exercise, we extracted the tokens that appear in both the positive and the negative sentences. There were 349 such tokens (excluding stop words). The whole corpus (also excluding stop words) is composed of 3109 tokens.

So, 11% of the tokens are present in both sets… What would happen if we excluded these ambiguous tokens? Could this help our model generalize better?
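
As a quick illustration of where that percentage comes from, here is a minimal sketch using plain Python sets. The two token lists are made up for the example (they are not the real corpus); only the 349 and 3109 counts come from the previous exercise.

```python
# Toy token lists standing in for the positive and negative vocabularies
# (hypothetical data, not the real corpus).
positive_tokens = ["great", "phone", "love", "battery", "works"]
negative_tokens = ["bad", "phone", "battery", "waste", "broken"]

# Tokens that appear in both classes
ambiguous = set(positive_tokens) & set(negative_tokens)

# All distinct tokens across both classes
vocabulary = set(positive_tokens) | set(negative_tokens)

# Share of the vocabulary that is ambiguous
share = len(ambiguous) / len(vocabulary)
print(sorted(ambiguous), share)

# The same ratio with the counts reported in the text
reported_share = 349 / 3109
print('{:.1f}%'.format(reported_share * 100))
```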

Remember: the goal of this challenge is testing and exploring possibilities. Let’s see where this leads us.

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np
from numpy import mean
from numpy import std
from numpy import percentile
import pandas as pd
import scipy

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

#get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python install

# import it
from atalaia.atalaia import Atalaia

# get a list with all the texts
texts = data.text

#start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
# get the lengths
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens

#delete outliers
data = data.drop(index = [1228])

# lower everything
data['text'] = [atalaia.lower_remove_white(t) for t in data['text']]

# expand contractions
data['text'] = [atalaia.expand_contractions(t) for t in data['text']]

# exclude punctuation
data['text'] = [atalaia.remove_punctuation(t) for t in data['text']]

# exclude numbers
data['text'] = [atalaia.remove_numbers(t) for t in data['text']]

# exclude stopwords
data['text'] = [atalaia.remove_stopwords(t) for t in data['text']]

# exclude excessive spaces
data['text'] = [atalaia.remove_excessive_spaces(t) for t in data['text']]

Get the intersection tokens.

# create function to get positive and negative tokens
def representative_words_for_sentiment(sentences, data):
  # start atalaia
  atalaia = Atalaia('en')

  # transform into corpus
  sentences = atalaia.create_corpus(sentences)

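  # get the representative words for 80% of the corpus
  # (the call is cut off in the source; passing the corpus as the
  # second argument is an assumption)
  token_data = atalaia.representative_tokens(0.8, sentences)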

  full_token_data                     = token_data.items()
  full_token_data_tokens, full_counts = zip(*full_token_data)

  token_data                          = list(full_token_data)[:10]
  tokens, counts                      = zip(*token_data)

  # return tokens list
  return full_token_data_tokens

# get positive sentences
positive        = list(data[data.sentiment  == 1]['text'])
positive_tokens = representative_words_for_sentiment(positive,data)

# get negative sentences
negative        = list(data[data.sentiment  == 0]['text'])
negative_tokens = representative_words_for_sentiment(negative,data)

#get the intersection of negative and positive tokens
intersection = list(set(positive_tokens) & set(negative_tokens))

Now, let’s remove these tokens from the sentences.

def exclude_intersection_tokens(sentence, intersection):
  atalaia = Atalaia('en')
  sentence = [token for token in atalaia.tokenize(sentence) if token not in intersection]
  return ' '.join(sentence)

# get the sentences without these intersection tokens
preprocessed = [exclude_intersection_tokens(sentence, intersection) for sentence in data.text]
>>> ['unless converter',
 'excellent value',
 'tied conversations lasting major',
 'dozen hundred contacts imagine',
 'needless wasted money',
 'waste money']

Now, we have a problem: some of the sentences are empty… It seems that we will have to work with these tokens anyway.
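
To see how a sentence ends up empty, here is a small sketch of the same filtering step followed by a check for empty results. It assumes a plain whitespace split in place of atalaia.tokenize, and the sentences and the intersection set are invented for the example.

```python
def exclude_tokens(sentence, excluded):
    # Whitespace split is a stand-in for atalaia.tokenize (assumption)
    kept = [token for token in sentence.split() if token not in excluded]
    return " ".join(kept)

# Hypothetical sentences and intersection set, for illustration only
sentences = ["good phone", "waste money", "phone works great"]
intersection = {"good", "phone", "works"}

filtered = [exclude_tokens(s, intersection) for s in sentences]

# Positions of the sentences that lost every token
empty_idx = [i for i, s in enumerate(filtered) if not s.strip()]

print(filtered)
print(empty_idx)
```

Using a set for the excluded tokens keeps each membership test cheap, which matters more as the intersection grows.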

So we are back at the starting point… The only real changes we made were increasing the number of stop words matched and excluding one outlier.

Let’s split the data into train and test again and retrain the model.

# split train/test
# shuffle the dataset
data = data.sample(frac=1)

# separate all classes present on the dataset
classes_dict = {}
for label in [0,1]:
  classes_dict[label] = data[data['sentiment'] == label]

# get 80% of each label
size = int(len(classes_dict[0].text) * 0.8)
X_train = list(classes_dict[0].text[0:size])      + list(classes_dict[1].text[0:size])
X_test  = list(classes_dict[0].text[size:])       + list(classes_dict[1].text[size:])
y_train = list(classes_dict[0].sentiment[0:size]) + list(classes_dict[1].sentiment[0:size])
y_test  = list(classes_dict[0].sentiment[size:])  + list(classes_dict[1].sentiment[size:])

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)
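
One thing to keep in mind about the split above: the cut point is computed from class 0 and reused for class 1, so both classes get an exact 80/20 split only when they have the same size. Here is a hedged sketch of a per-class split in plain Python. The stratified_split helper and its toy data are hypothetical and not part of the notebook.

```python
import random

def stratified_split(texts, labels, train_frac=0.8, seed=42):
    """Split each class separately so every class keeps the same train share."""
    rng = random.Random(seed)

    # Group the texts by label
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    X_train, y_train, X_test, y_test = [], [], [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * train_frac)  # cut point computed per class
        X_train += items[:cut]
        y_train += [label] * cut
        X_test += items[cut:]
        y_test += [label] * (len(items) - cut)
    return X_train, y_train, X_test, y_test

# Toy, deliberately unbalanced data (hypothetical)
texts = ["neg {}".format(i) for i in range(10)] + ["pos {}".format(i) for i in range(5)]
labels = [0] * 10 + [1] * 5

X_train, y_train, X_test, y_test = stratified_split(texts, labels)
print(len(X_train), len(X_test))
```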

# Let's consider the vocab size as the number of words
# that compose 90% of the vocabulary
atalaia    = Atalaia('en')
# the call is cut off in the source after the text argument;
# closing it here is an assumption
vocab_size = len(atalaia.representative_tokens(0.9, 
                                               ' '.join(X_train)))
oov_tok = "<OOV>"

# start tokenize
# oov_token is assumed to be the oov_tok defined above
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# fit on training
# we don't fit on test because, in real life, our model will have to deal with
# words it never saw before. So, it makes sense fitting only on training.
# when it finds a word it never saw before, it will assign the 
# <OOV> tag to it.
tokenizer.fit_on_texts(X_train)

# get the word index
word_index = tokenizer.word_index

# transform into sequences
# this will assign an index to the tokens present in the corpus
sequences = tokenizer.texts_to_sequences(X_train)

# define max_length 
max_length = 100

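# post: pad or truncate after sentence.
# pre: pad or truncate before sentence.
# the padding arguments are cut off in the source; only maxlen is set here
# (an assumption), which leaves Keras on its default 'pre' mode
padded = pad_sequences(sequences, maxlen=max_length)

# tokenize and pad test sentences
# these will be used later on the model for accuracy test
X_test_sequences = tokenizer.texts_to_sequences(X_test)

X_test_padded    = pad_sequences(X_test_sequences, maxlen=max_length)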

# create the reverse word index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# create the decoder
def text_decoder(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
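
To picture what the tokenizing and padding steps do, here is a simplified pure-Python imitation. It is only a sketch: the real Keras Tokenizer also lowercases and strips punctuation, the helper names are hypothetical, and "post" padding is used here purely for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # Count words over the training texts only
    counts = Counter(word for text in texts for word in text.split())
    # Index 0 is left for padding and index 1 for the OOV token;
    # the remaining words are numbered by decreasing frequency
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, oov_token="<OOV>"):
    # Words never seen during fitting fall back to the OOV index
    return [word_index.get(w, word_index[oov_token]) for w in text.split()]

def pad_post(sequence, max_length):
    # Pad with zeros after the sentence, or truncate after the sentence
    return (sequence + [0] * max_length)[:max_length]

# Hypothetical training sentences
train = ["waste money", "excellent value", "waste time"]
word_index = build_word_index(train)

# "purchase" is not in the training texts, so it maps to the OOV index
seq = to_sequence("excellent waste purchase", word_index)
print(seq)
print(pad_post(seq, 5))
```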

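# Build network
embedding_dim = 16

# the Flatten layer, the compile settings and the fit call are missing from
# the source; the ones below are assumptions (a common setup for a binary
# classifier), not necessarily the original choices
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train the model
num_epochs = 10
model.fit(padded, y_train, epochs=num_epochs)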

Get the predictions and analyse the results.

# predict
y_pred = model.predict(X_test_padded)

# round
y_pred =[1 if y > 0.5 else 0 for y in y_pred]

# confusion matrix
# rows are the real labels, columns are the predictions;
# class 0 (negative) comes first
matrix = tf.math.confusion_matrix(y_test, y_pred)

matrix = np.array(matrix)

matrix = pd.DataFrame(matrix, 
                      columns=['Negative (predicted)', 'Positive (predicted)'],
                      index=['Negative (real)', 'Positive (real)'])

# accuracy
test_loss, test_acc = model.evaluate(X_test_padded, y_test, verbose=2)
print('\nModel accuracy: {:.0f}%'.format(test_acc*100))
13/13 - 0s - loss: 0.5202 - accuracy: 0.7794

Model accuracy: 78%
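
Accuracy alone hides how the errors are distributed between the two classes. As a hedged sketch, precision and recall can be read from a 2x2 confusion matrix laid out with the real labels as rows and the predictions as columns. The counts below are invented for the example and are not the results of this model.

```python
def binary_metrics(matrix):
    # matrix[i][j]: number of samples with real label i predicted as j,
    # with class 0 (negative) first and class 1 (positive) second
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts, for illustration only
example = [[40, 10],
           [5, 45]]

accuracy, precision, recall = binary_metrics(example)
print('accuracy: {:.2f}, precision: {:.2f}, recall: {:.2f}'.format(
    accuracy, precision, recall))
```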
