Lima Vallantin
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send me a message. Want to know more? Visit the about page.


For today’s challenge, we will start to explore a technique called data augmentation.

Over the next few days, I will explore TensorFlow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

What's data augmentation?

Sometimes, we don't have enough data to train a model well.

Data augmentation is very useful in these cases. One of the techniques used for this in Natural Language Processing is called subwords: words are broken into smaller pieces, so the model can still work with parts of words it has rarely or never seen.

To better understand it, visit this link.
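To get a feeling for what "subwords" means before using the real tokenizer, here is a minimal sketch in plain Python. It greedily cuts a word into the longest pieces it finds in a small vocabulary. The function name `split_into_subwords` and the toy vocabulary are made up for illustration; the TensorFlow Datasets encoder used later learns its vocabulary from the corpus and works differently in its details.

```python
def split_into_subwords(word, vocab, max_len=5):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Falls back to a single character when no longer piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        # try the longest candidate first, then shrink it one character at a time
        for end in range(min(len(word), start + max_len), start, -1):
            piece = word[start:end]
            if piece in vocab or end - start == 1:
                pieces.append(piece)
                start = end
                break
    return pieces


# hypothetical toy vocabulary, not the one learned in this post
toy_vocab = {"un", "less", "con", "ver", "ter", "plug"}

print(split_into_subwords("unless", toy_vocab))     # ['un', 'less']
print(split_into_subwords("converter", toy_vocab))  # ['con', 'ver', 'ter']
```

Even if the model never saw "converter" as a whole word, it can still use what it learned about the pieces.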

Get data and do imports

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

# load the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package (assuming the repository ships a standard setup.py)
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# define the pre-processing function
def preprocess(panda_series):
  atalaia = Atalaia('en')

  # lowercase everything and remove double spaces
  panda_series = [atalaia.lower_remove_white(t) for t in panda_series]

  # expand contractions
  panda_series = [atalaia.expand_contractions(t) for t in panda_series]

  # remove punctuation
  panda_series = [atalaia.remove_punctuation(t) for t in panda_series]

  # remove numbers
  panda_series = [atalaia.remove_numbers(t) for t in panda_series]

  # remove stopwords
  panda_series = [atalaia.remove_stopwords(t) for t in panda_series]

  # remove excessive spaces
  panda_series = [atalaia.remove_excessive_spaces(t) for t in panda_series]

  return panda_series

# preprocess it
preprocessed_text = preprocess(data.text)
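If you don't want to install Atalaia, most of these steps can be roughly approximated with the standard library. This is only a sketch and not Atalaia's implementation: the function `simple_preprocess` and its tiny stopword list are hypothetical, and contraction expansion is left out.

```python
import re
import string


def simple_preprocess(text, stopwords=frozenset({"the", "a", "is"})):
    """Rough stand-in for the cleaning steps above (no contraction expansion)."""
    # lowercase everything
    text = text.lower()
    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # remove stopwords; split() also takes care of excessive spaces
    words = [w for w in text.split() if w not in stopwords]
    return " ".join(words)


print(simple_preprocess("The  phone is GREAT, 10 out of 10!"))
# phone great out of
```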

Replace the regular tokenizer with a subword tokenizer.

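# get a smaller vocab size
vocab_size = 1000

# define the maximum size of a subword
sub_length = 5

# out of vocabulary replacement (only used by the regular tokenizer)
oov_tok = "<OOV>"

# regular word-level tokenizer, replaced by the subword encoder right below
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# build the subword tokenizer from the corpus
# (arguments after the first are assumed from the variables defined above;
#  newer tensorflow_datasets releases expose this class as tfds.deprecated.text.SubwordTextEncoder)
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(preprocessed_text,
                                                                    vocab_size,
                                                                    max_subword_length=sub_length)

# encode the whole dataset
encoded_texts = [tokenizer.encode(sentence) for sentence in preprocessed_text]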

Visualize what’s happening:

# get the first encoded sentence to see what this tokenizer does
view = [tokenizer.decode([i]) for i in encoded_texts[0]]
view = ' '.join(view)
print(view)

>> so  there   is  no  way  for  me  plug   it  in  here  in  us  un less  i  go  by  con ver ter

Let’s pad everything and split into test and train. To keep things simple, we won’t try to balance the dataset this time.

# pad sequences and split into test and train
max_length = 100

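# Pad all sentences (only maxlen is set here; keeping the other arguments at their defaults is an assumption)
sentences_padded = pad_sequences(encoded_texts, maxlen=max_length)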

# get the labels
labels  = data.sentiment

# Separate out the sentences and labels into training and test sets
size    = int(len(data.text) * 0.8)

X_train = sentences_padded[0:size]
X_test  = sentences_padded[size:]
y_train = labels[0:size]
y_test  = labels[size:]

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)
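To make the padding step more concrete, here is a small plain-Python sketch of what Keras' pad_sequences does with its default settings: sequences that are too long lose tokens from the front, and sequences that are too short get zeros added at the front. The helper name `pad_pre` is made up for this example.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults: truncate at the front, pad at the front."""
    # keep only the last `maxlen` tokens
    trimmed = sequence[-maxlen:]
    # fill the missing positions with `value` before the tokens
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([5, 8, 2], maxlen=5))           # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

After this step every sentence has the same length, which is what the network expects as input.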

Finally, let's use a smaller embedding dimension and compile the model.

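# Build network
embedding_dim = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # assumed: pool over the sequence so each sentence yields a single prediction
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# compile (loss and optimizer are assumptions; accuracy is the metric reported below)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])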

Train the model.

# train the model
num_epochs = 23
history = model.fit(X_train, y_train, epochs=num_epochs,
                    validation_data=(X_test, y_test))

Plot accuracy and loss.

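def plots(history, string):
  # plot the training and validation curves for one metric
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.legend([string, 'val_'+string])
  plt.show()
plots(history, "accuracy")
plots(history, "loss")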

Check the loss and the accuracy:

accuracy = model.evaluate(X_test, y_test)[1]
print('Model accuracy is {:.2f}%'.format(accuracy*100))

>> 13/13 [==============================] - 0s 2ms/step - loss: 0.5193 - accuracy: 0.7845
>> Model accuracy is 78.45%
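For a sigmoid output, the accuracy metric boils down to a simple rule: a probability above 0.5 counts as class 1, anything else as class 0, and the score is the share of predictions that match the labels. Here is a minimal sketch of that calculation with made-up numbers; `binary_accuracy` is a helper written for this example, not a call into Keras.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Share of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# made-up model outputs and labels, only for illustration
toy_probabilities = [0.91, 0.12, 0.55, 0.40]
toy_labels = [1, 0, 0, 0]

print('{:.2f}%'.format(binary_accuracy(toy_probabilities, toy_labels) * 100))
# 75.00%
```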
