For today’s challenge, we will start exploring a technique called data augmentation.
Over the next days, I will explore TensorFlow for at least one hour per day and post the notebooks, data and models to this repository.
Today’s notebook is available here.
What’s data augmentation?
Sometimes, we don’t have enough data to train a model properly.
Data augmentation is very useful in these cases. One of the data augmentation techniques used in Natural Language Processing is subword tokenization: instead of keeping only whole words, the vocabulary is built from smaller, reusable pieces of words.
To better understand it, visit this link.
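To get a feel for what a subword vocabulary looks like, here is a minimal sketch that builds one from a made-up toy corpus (toy_corpus and toy_encoder are illustrative names, not part of today’s notebook):
# minimal sketch: build a tiny subword vocabulary from a made-up toy corpus
import tensorflow_datasets as tfds
toy_corpus = ['i plugged the converter in',
              'the plug does not fit here']
toy_encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(toy_corpus,
                                                                      target_vocab_size=100,
                                                                      max_subword_length=5)
# a word that is rare in the corpus gets split into smaller, reusable pieces
print([toy_encoder.decode([i]) for i in toy_encoder.encode('converter')])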
Get data and do imports
# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd
# get data
!wget --no-check-certificate \
-O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P
# define get_data function
def get_data(path):
    data = pd.read_csv(path, index_col=0)
    return data

# get the data
data = get_data('/tmp/sentiment.csv')
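Before preprocessing, it’s worth a quick look at what was loaded; the columns used later in this notebook are text and sentiment:
# inspect the shape and the first rows of the dataset
print(data.shape)
data.head()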
# clone package repository
!git clone https://github.com/vallantin/atalaia.git
# navigate to atalaia directory
%cd atalaia
# install packages requirements
!pip install -r requirements.txt
# install package
!python setup.py install
# import it
from atalaia.atalaia import Atalaia
# define preprocess function
def preprocess(panda_series):
    atalaia = Atalaia('en')
    # lower case everything and remove double spaces
    panda_series = [atalaia.lower_remove_white(t) for t in panda_series]
    # expand contractions
    panda_series = [atalaia.expand_contractions(t) for t in panda_series]
    # remove punctuation
    panda_series = [atalaia.remove_punctuation(t) for t in panda_series]
    # remove numbers
    panda_series = [atalaia.remove_numbers(t) for t in panda_series]
    # remove stopwords
    panda_series = [atalaia.remove_stopwords(t) for t in panda_series]
    # remove excessive spaces
    panda_series = [atalaia.remove_excessive_spaces(t) for t in panda_series]
    return panda_series
# preprocess it
preprocessed_text = preprocess(data.text)
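To check the effect of the cleaning steps, you can compare a raw sentence with its preprocessed version (just an inspection snippet, not part of the original pipeline):
# compare a raw sentence with its cleaned version
print(data.text.iloc[0])
print(preprocessed_text[0])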
Replace the regular tokenizer with a subword tokenizer.
# get a smaller vocab size
vocab_size = 1000
# define the maximum size of a subword
sub_length = 5
# out of vocabulary replacement
oov_tok = "<OOV>"
# this is the regular Keras tokenizer we have been using so far
tokenizer = Tokenizer(num_words=vocab_size,
                      oov_token=oov_tok)
# replace it with a subword tokenizer built from the corpus
tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(preprocessed_text,
                                                                    vocab_size,
                                                                    max_subword_length=sub_length)
# encode the whole dataset
encoded_texts = [tokenizer.encode(sentence) for sentence in preprocessed_text]
Visualize what’s happening:
# get the first encoded sentence to see what this tokenizer does
view = [tokenizer.decode([i]) for i in encoded_texts[0]]
view = ' '.join(view)
view
>> so there is no way for me plug it in here in us un less i go by con ver ter
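Decoding each id separately exposes the subword boundaries. Decoding the whole list at once stitches the pieces back together into the preprocessed sentence:
# decode the full list of ids at once to recover the preprocessed sentence
print(tokenizer.decode(encoded_texts[0]))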
Let’s pad everything and split the data into training and test sets. To keep things simple, we won’t try to balance the dataset this time.
# pad sequences and split into test and train
max_length = 100
trunc_type='post'
padding_type='post'
# Pad all sentences
sentences_padded = pad_sequences(encoded_texts,
                                 maxlen=max_length,
                                 padding=padding_type,
                                 truncating=trunc_type)
# get the labels
labels = data.sentiment
# Separate out the sentences and labels into training and test sets
size = int(len(data.text) * 0.8)
X_train = sentences_padded[0:size]
X_test = sentences_padded[size:]
y_train = labels[0:size]
y_test = labels[size:]
# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)
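A quick sanity check on the shapes confirms that features and labels line up after the split (illustrative snippet):
# features and labels should have matching lengths
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)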
Finally, let’s build the network with a small embedding dimension and compile the model.
# Build network
embedding_dim = 16
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
Train the model.
# train the model
num_epochs = 23
history = model.fit(X_train,
                    y_train,
                    epochs=num_epochs,
                    validation_data=(X_test, y_test))
Plot accuracy and loss.
def plots(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()
plots(history, "accuracy")
plots(history, "loss")
Check the loss and the accuracy:
accuracy = model.evaluate(X_test, y_test)[1]
print('Model accuracy is {:.2f}%'.format(accuracy*100))
>> 13/13 [==============================] - 0s 2ms/step - loss: 0.5193 - accuracy: 0.7845
>> Model accuracy is 78.45%
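As a final sketch, here is one way to score a new, made-up sentence with the trained model, reusing the same preprocessing, subword tokenizer and padding settings. Assuming the positive class is labelled 1 in this dataset, outputs close to 1 mean positive sentiment:
# score a new, made-up sentence with the trained pipeline
new_sentences = pd.Series(['the converter works really well with this plug'])
# reuse the same cleaning, subword encoding and padding as for training
new_clean = preprocess(new_sentences)
new_encoded = [tokenizer.encode(s) for s in new_clean]
new_padded = pad_sequences(new_encoded,
                           maxlen=max_length,
                           padding=padding_type,
                           truncating=trunc_type)
print(model.predict(new_padded))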