#100daysoftensorflow


Data Augmentation for NLP

For today’s challenge, we will start to explore a technique called data augmentation.

Over the next days, I will explore TensorFlow for at least one hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

What’s data augmentation?

Sometimes, we don’t have enough data to perfectly train a model.

Data augmentation is very useful in these cases. One of the techniques used for this in Natural Language Processing is subword tokenization, often simply called subwords.

To better understand it, visit this link.
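
Before building the real tokenizer below, here is a tiny hand-rolled sketch (with a made-up vocabulary, purely for illustration) of why subword pieces help: a word-level vocabulary simply loses a word it has never seen, while subword pieces can still represent it by composition.

# made-up vocabularies, just to illustrate the idea
word_vocab    = {'plug', 'it', 'in', 'here'}
subword_vocab = {'plug', 'it', 'in', 'here', 'con', 'ver', 'ter'}

# 'converter' was never seen as a whole word...
print('converter' in word_vocab)               # False -> becomes <OOV>

# ...but it can still be represented by the pieces 'con', 'ver', 'ter'
print({'con', 'ver', 'ter'} <= subword_vocab)  # True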

Get data and do imports

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

# get the data
data = get_data('/tmp/sentiment.csv')
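
# quick sanity check on the loaded data
# (the 'text' and 'sentiment' columns are used later on)
print(data.shape)
data.head()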

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# define the pre-processing function
def preprocess(panda_series):
  atalaia = Atalaia('en')

  # lower case everything and remove double spaces
  panda_series = [atalaia.lower_remove_white(t) for t in panda_series]

  # expand contractions
  panda_series = [atalaia.expand_contractions(t) for t in panda_series]

  # remove punctuation
  panda_series = [atalaia.remove_punctuation(t) for t in panda_series]

  # remove numbers
  panda_series = [atalaia.remove_numbers(t) for t in panda_series]

  # remove stopwords
  panda_series = [atalaia.remove_stopwords(t) for t in panda_series]

  # remove excessive spaces
  panda_series = [atalaia.remove_excessive_spaces(t) for t in panda_series]

  return panda_series

# preprocess it
preprocessed_text = preprocess(data.text)
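
To see what the cleaning does, we can compare a raw document with its pre-processed version (the exact output depends on the dataset):

# compare a raw document with its pre-processed version
print(data.text.iloc[0])
print(preprocessed_text[0])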

Replace the regular tokenizer with a subword tokenizer.

# get a smaller vocab size
vocab_size = 1000

# define the maximum size of a subword
sub_length = 5

# out of vocabulary replacement token (used by the regular word-level tokenizer)
oov_tok = "<OOV>"

# the regular word-level tokenizer we have been using so far
tokenizer = Tokenizer(num_words=vocab_size,
                      oov_token=oov_tok)

# replace it with a subword tokenizer built from our own corpus
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(preprocessed_text,
                                                                    vocab_size,
                                                                    max_subword_length=sub_length)

# encode the whole dataset
encoded_texts = [tokenizer.encode(sentence) for sentence in preprocessed_text]

Visualize what’s happening:

# get the first encoded sentence to see what this tokenizer does
view = [tokenizer.decode([i]) for i in encoded_texts[0]]
view = ' '.join(view)
view

>> so  there   is  no  way  for  me  plug   it  in  here  in  us  un less  i  go  by  con ver ter
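
The tokenizer can also encode words it never saw as whole tokens, which is exactly what makes it useful on a small dataset. Here is a quick check with an arbitrary word (the pieces you get depend on the learned vocabulary):

# encode a word that is unlikely to be in the vocabulary as a whole token
check = [tokenizer.decode([i]) for i in tokenizer.encode('unplugging')]
' '.join(check)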

Let’s pad everything and split into test and train. To keep things simple, we won’t try to balance the dataset this time.

# pad sequences and split into test and train
max_length = 100
trunc_type='post'
padding_type='post'

# Pad all sentences
sentences_padded = pad_sequences(encoded_texts, 
                                 maxlen=max_length, 
                                 padding=padding_type, 
                                 truncating=trunc_type)

# get the labels
labels  = data.sentiment

# Separate out the sentences and labels into training and test sets
size    = int(len(data.text) * 0.8)

X_train = sentences_padded[0:size]
X_test  = sentences_padded[size:]
y_train = labels[0:size]
y_test  = labels[size:]

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)
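
A quick look at the resulting shapes (the exact numbers depend on the dataset size):

# each split keeps the padded length; 80% of the rows go to training
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)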

Finally, let’s map the tokens to a low-dimensional embedding space and compile the model.

# Build network
embedding_dim = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Train the model.

# train the model
num_epochs = 23
history = model.fit(X_train, 
                    y_train, 
                    epochs=num_epochs,
                    validation_data=(X_test, y_test))

Plot accuracy and loss.

def plots(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plots(history, "accuracy")
plots(history, "loss")

Check the loss and the accuracy:

accuracy = model.evaluate(X_test, y_test)[1]
print('Model accuracy is {:.2f}%'.format(accuracy*100))

>> 13/13 [==============================] - 0s 2ms/step - loss: 0.5193 - accuracy: 0.7845
>> Model accuracy is 78.45%
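
As a final sanity check, here is a sketch of how the trained model could score new sentences. The two sentences below are made up for illustration; they go through the same preprocessing, subword encoding and padding steps as the training data.

# hypothetical new sentences to score
new_sentences = ['the sound quality is amazing',
                 'it broke after two days']

# same pipeline as before: preprocess, encode with the subword tokenizer, pad
new_encoded = [tokenizer.encode(s) for s in preprocess(new_sentences)]
new_padded  = pad_sequences(new_encoded,
                            maxlen=max_length,
                            padding=padding_type,
                            truncating=trunc_type)

# the sigmoid output is close to 1 for the class labelled 1 in the training data
model.predict(new_padded)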


Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
