#100daysoftensorflow

in #100DaysOfCode, #100DaysOfData, #100DaysOfTensorflow

Tweak the model

For today’s challenge, we will try to tweak the model to see if we get better results.

Over the coming days, I will explore TensorFlow for at least one hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

Why tweak the model?

It’s very rare to get great results on your first try. So, sometimes you will have to make small changes to the data or to the model to improve its performance.

Today, we will try to change 4 parameters in our model:

  • vocab_size
  • embedding_dim
  • max_length
  • some model layers

Redo the pre-processing

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

# get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# define preprocess function
def preprocess(panda_series):
  atalaia = Atalaia('en')

  # lowercase everything and remove double spaces
  panda_series = (atalaia.lower_remove_white(t) for t in panda_series)

  # expand contractions
  panda_series = (atalaia.expand_contractions(t) for t in panda_series)

  # remove punctuation
  panda_series = (atalaia.remove_punctuation(t) for t in panda_series)

  # remove numbers
  panda_series = (atalaia.remove_numbers(t) for t in panda_series)

  # remove stopwords
  panda_series = (atalaia.remove_stopwords(t) for t in panda_series)

  # remove excessive spaces
  panda_series = (atalaia.remove_excessive_spaces(t) for t in panda_series)

  return panda_series

# preprocess it
preprocessed_text = preprocess(data.text)

# assign preprocessed texts to dataset
data['text']      = list(preprocessed_text)

# split train/test
# shuffle the dataset
data = data.sample(frac=1)

# separate the classes present in the dataset
classes_dict = {}
for label in [0,1]:
  classes_dict[label] = data[data['sentiment'] == label]

# get 80% of each label
# (note: the cut-off size is computed from class 0 and reused for class 1,
# which assumes the dataset is balanced)
size = int(len(classes_dict[0].text) * 0.8)
X_train = list(classes_dict[0].text[0:size])      + list(classes_dict[1].text[0:size])
X_test  = list(classes_dict[0].text[size:])       + list(classes_dict[1].text[size:])
y_train = list(classes_dict[0].sentiment[0:size]) + list(classes_dict[1].sentiment[0:size])
y_test  = list(classes_dict[0].sentiment[size:])  + list(classes_dict[1].sentiment[size:])

# convert labels to NumPy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)
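
Before moving on, a quick sanity check on the split (remember it assumes both classes contribute the same number of rows, since the cut-off size is computed from class 0):

# check split sizes and class balance
print(len(X_train), len(X_test))
print(np.bincount(y_train), np.bincount(y_test))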

Let’s change the vocab_size, embedding_dim and max_length to see how they affect performance. We will decrease all three: the vocabulary size, the embedding dimensions and the maximum sequence length.

# get a smaller vocab size
vocab_size = 500
oov_tok = "<OOV>"

# start tokenize
tokenizer = Tokenizer(num_words=vocab_size, 
                      oov_token=oov_tok)

# fit on training data only
# we don't fit on the test set because, in real life, our model will have to
# deal with words it never saw before. So, it makes sense to fit only on the
# training data. When the tokenizer finds a word it never saw before, it will
# assign the <OOV> token to it.
tokenizer.fit_on_texts(X_train)

# get the word index
word_index = tokenizer.word_index

# transform into sequences
# this will assign an index to each token present in the corpus
sequences = tokenizer.texts_to_sequences(X_train)
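
Since the tokenizer was fitted on the training set only, any word outside the learned vocabulary is mapped to the <OOV> index. A quick check (the nonsense token below is only for illustration):

# unseen tokens map to the <OOV> index (1, as the OOV token is added first)
print(word_index[oov_tok])                           # -> 1
print(tokenizer.texts_to_sequences(['qwertyuiop']))  # -> [[1]]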

# we will also decrease the max_length 
max_length = 50

# post: pad or truncate at the end of the sequence.
# pre: pad or truncate at the beginning of the sequence.
trunc_type='post'
padding_type='post'

padded = pad_sequences(sequences,
                       maxlen=max_length, 
                       padding=padding_type, 
                       truncating=trunc_type)
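
To see the difference between the two modes, here is a toy example (illustrative only):

# 'post' pads or truncates at the end, 'pre' at the beginning
print(pad_sequences([[1, 2, 3]], maxlen=5, padding='post'))  # -> [[1 2 3 0 0]]
print(pad_sequences([[1, 2, 3]], maxlen=5, padding='pre'))   # -> [[0 0 1 2 3]]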

# tokenize and pad the test sentences
# these will be used later to test the model's accuracy
X_test_sequences = tokenizer.texts_to_sequences(X_test)

X_test_padded    = pad_sequences(X_test_sequences,
                                 maxlen=max_length, 
                                 padding=padding_type, 
                                 truncating=trunc_type)

# create the reverse word index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# create the decoder
def text_decoder(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
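
For example, decoding the first padded training sequence gives back the preprocessed text, with padding positions (index 0, which is absent from the word index) shown as '?':

# decode the first padded training sequence back to text
print(text_decoder(padded[0]))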

# Build network
embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
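
The Flatten layer turns the (50, 100) embedding output into 5,000 features, so the first Dense layer alone holds 5,000 × 256 + 256 = 1,280,256 weights. A quick way to check the layer shapes and parameter counts:

# print layer output shapes and parameter counts
model.summary()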

Now, we can train the model. Let’s increase the number of epochs.

# train the model
num_epochs = 30
history = model.fit(padded, 
                    y_train, 
                    epochs=num_epochs,
                    batch_size=32,
                    validation_split=0.2,
                    shuffle=True)

Now, we can plot accuracy and loss.

def plots(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plots(history, "accuracy")
plots(history, "loss")

Let’s check the accuracy on the test set. It actually got worse.

accuracy = model.evaluate(X_test_padded, y_test)[1]
print('Model accuracy is {:.2f}%'.format(accuracy*100))

>> 13/13 [==============================] - 0s 3ms/step - loss: 1.8175 - accuracy: 0.7525
>> Model accuracy is 75.25%

Now, let’s try changing the model layers. We will use a GlobalAveragePooling1D layer instead of a Flatten layer.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
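
Unlike Flatten, GlobalAveragePooling1D averages the embedding vectors over the sequence axis, so the first Dense layer now receives 100 features instead of 5,000. A quick shape check (illustrative only):

# pooling collapses (batch, time, features) into (batch, features)
x = tf.random.uniform((1, max_length, embedding_dim))
print(tf.keras.layers.GlobalAveragePooling1D()(x).shape)  # -> (1, 100)
print(tf.keras.layers.Flatten()(x).shape)                 # -> (1, 5000)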

# train the model
num_epochs = 30
history = model.fit(padded, 
                    y_train, 
                    epochs=num_epochs,
                    batch_size=32,
                    validation_split=0.2,
                    shuffle=True)

Plot accuracy and loss.

plots(history, "accuracy")
plots(history, "loss")

And recheck the loss and the accuracy:

accuracy = model.evaluate(X_test_padded, y_test)[1]
print('Model accuracy is {:.2f}%'.format(accuracy*100))

>> 13/13 [==============================] - 0s 3ms/step - loss: 0.8918 - accuracy: 0.7850
>> Model accuracy is 78.50%

Again, the model didn’t improve (in comparison to yesterday’s results).

What did we learn?

Tweaking a model is a trial-and-error process. Sometimes you will make changes that don’t improve the model at all.

Some of these changes may actually decrease performance.

A good practice is to document the changes you make and to benchmark them, so you can see which tweaks produce good results.
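
For example, a simple way to keep such a log is a list of dictionaries, one entry per experiment, built from today’s two runs:

# minimal experiment log (accuracies taken from the two runs above)
experiments = [
    {'layers': 'Flatten + Dense',                'vocab_size': 500, 'max_length': 50, 'test_accuracy': 0.7525},
    {'layers': 'GlobalAveragePooling1D + Dense', 'vocab_size': 500, 'max_length': 50, 'test_accuracy': 0.7850},
]
print(pd.DataFrame(experiments))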


Do you want to connect? It would be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
