
Ensemble modelling

Sometimes it’s hard to get good predictions from a single model. In those cases, you can train several models with different architectures, submit the same examples to all of them, and combine their predictions.

Over the next days, I will explore TensorFlow for at least one hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

What we’ll do today

In machine learning, ensemble learning uses several learning algorithms to obtain better predictions.

We will test this concept today to see if we can get better predictions.

For this, let’s train two models with different configurations. Then we will submit the same examples to both models, get their predictions, and compute the mean of those predictions.
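To make the averaging step concrete, here is a minimal sketch with made-up probabilities. The function name `average_predictions` and the numbers are assumptions for illustration only; they do not come from the notebook.

```python
import numpy as np

def average_predictions(*model_probs):
    """Average the probabilities predicted by several models."""
    stacked = np.stack([np.asarray(p, dtype=float) for p in model_probs])
    return stacked.mean(axis=0)

# made-up probabilities for four examples from two hypothetical models
probs_a = np.array([0.9, 0.4, 0.2, 0.6])
probs_b = np.array([0.7, 0.2, 0.4, 0.3])

# mean probability per example, then a 0.5 threshold to get class labels
ensemble = average_predictions(probs_a, probs_b)
labels = (ensemble > 0.5).astype(int)

print(ensemble)
print(labels)
```

In the last example the two models disagree (0.6 versus 0.3), and the averaged probability falls below the threshold, so the ensemble sides with the more confident model.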

Let’s start with imports and preprocessing.

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt

import numpy as np
from numpy import mean
from numpy import std
from numpy import percentile
import pandas as pd
import scipy

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

#get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# get a list with all the texts
texts = data.text

#start atalaia
atalaia = Atalaia('en')

# get the number of tokens in each sentence
# get the lengths
lens = [len(atalaia.tokenize(t)) for t in texts]
data['lengths'] = lens

#delete outliers
data = data.drop(index = [1228])

# lower everything
data['text'] = [atalaia.lower_remove_white(t) for t in data['text']]

# expand contractions
data['text'] = [atalaia.expand_contractions(t) for t in data['text']]

# exclude punctuation
data['text'] = [atalaia.remove_punctuation(t) for t in data['text']]

# exclude numbers
data['text'] = [atalaia.remove_numbers(t) for t in data['text']]

# exclude stopwords
data['text'] = [atalaia.remove_stopwords(t) for t in data['text']]

# exclude excessive spaces
data['text'] = [atalaia.remove_excessive_spaces(t) for t in data['text']]
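If you don't have atalaia installed, the cleaning steps above can be roughly approximated with the standard library. This is only a sketch of the idea, not atalaia's implementation: `clean_text` and `STOPWORDS` are hypothetical names, the stopword list is deliberately tiny, and contraction expansion is left out.

```python
import re
import string

# tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "is", "it", "and", "to", "of"}

def clean_text(text):
    """Lowercase, strip punctuation and digits, drop stopwords, squeeze spaces."""
    text = text.lower().strip()
    # remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # split() also collapses any run of whitespace
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("  The food is GREAT, and it cost 20 dollars!  ")
print(cleaned)
```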

Then, let’s split the dataset.

# split train/test
# shuffle the dataset
data = data.sample(frac=1)

# separate all classes present on the dataset
classes_dict = {}
for label in [0,1]:
  classes_dict[label] = data[data['sentiment'] == label]

# get 80% of each label
X_train, X_test, y_train, y_test = [], [], [], []
for label in [0,1]:
  subset = classes_dict[label]
  # compute the cut per class, so each label really contributes 80%
  size = int(len(subset) * 0.8)
  X_train += list(subset.text[:size])
  X_test  += list(subset.text[size:])
  y_train += list(subset.sentiment[:size])
  y_test  += list(subset.sentiment[size:])

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

# Let's consider the vocab size as the number of words
# that compose 90% of the vocabulary
atalaia    = Atalaia('en')
# assumption: representative_tokens needs only the coverage and the text
vocab_size = len(atalaia.representative_tokens(0.9, 
                                               ' '.join(X_train)))
oov_tok = "<OOV>"

# start tokenize
tokenizer = Tokenizer(num_words=vocab_size, 
                      oov_token=oov_tok)

# fit on training
# we don't fit on test because, in real life, our model will have to deal with
# words it never saw before. So, it makes sense fitting only on training.
# when it finds a word it never saw before, it will assign the 
# <OOV> tag to it.
tokenizer.fit_on_texts(X_train)

# get the word index
word_index = tokenizer.word_index

# transform into sequences
# this will assign a index to the tokens present on the corpus
sequences = tokenizer.texts_to_sequences(X_train)

# define max_length 
max_length = 100

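# post: pad or truncate after sentence.
# pre: pad or truncate before sentence.
# assumption: 'post' is used for both here; switch to 'pre' if you prefer
trunc_type   = 'post'
padding_type = 'post'

padded = pad_sequences(sequences,
                       maxlen=max_length,
                       padding=padding_type,
                       truncating=trunc_type)

# tokenize and pad test sentences
# these will be used later on the model for accuracy test
X_test_sequences = tokenizer.texts_to_sequences(X_test)

X_test_padded    = pad_sequences(X_test_sequences,
                                 maxlen=max_length,
                                 padding=padding_type,
                                 truncating=trunc_type)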
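To see what the tokenizing and padding steps do to a sentence, here is a small pure-Python sketch. The helpers `texts_to_ids` and `pad_post` and the toy vocabulary are hypothetical stand-ins, not the Keras implementation, and the sketch assumes padding and truncation happen at the end of the sentence.

```python
def texts_to_ids(texts, word_index, oov_id=1):
    """Map each word to its index, falling back to the OOV index."""
    return [[word_index.get(w, oov_id) for w in t.split()] for t in texts]

def pad_post(sequences, max_length, pad_id=0):
    """Truncate or pad every sequence at the end to exactly max_length."""
    return [seq[:max_length] + [pad_id] * (max_length - len(seq))
            for seq in sequences]

# toy vocabulary learned from training text only; 0 is kept for padding
word_index = {"<OOV>": 1, "good": 2, "food": 3, "bad": 4}

ids = texts_to_ids(["good food", "bad service overall today"], word_index)
padded_ids = pad_post(ids, max_length=3)
print(padded_ids)
```

The short sentence is filled up with zeros, while the long one is cut to three tokens and its unseen words all share the `<OOV>` index.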

And finally, let’s build two different networks and train them.

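# Model 1
embedding_dim = 16

model_1 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # assumption: flatten the embedded sequence before the dense layers
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# binary cross-entropy matches the single sigmoid output
# assumption: the Adam optimizer is used
model_1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train the model
num_epochs = 100
history_1 = model_1.fit(padded, 
                        y_train,
                        epochs=num_epochs,
                        validation_data=(X_test_padded, y_test))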

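# define the plot function
def plots(history, string):
  # training curve and validation curve for the chosen metric
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

plots(history_1, "accuracy")
plots(history_1, "loss")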

For model 2, we keep the same architecture and only add a few dropout layers.
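As a reminder of what a dropout layer does during training, here is a minimal NumPy sketch of the usual "inverted dropout" scheme: a random share of the activations is set to zero and the rest is scaled up so the expected value stays the same. The function name `dropout` and the 0.2 rate are assumptions chosen for illustration.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero out roughly `rate` of the activations and rescale the rest."""
    keep_prob = 1.0 - rate
    # True where the unit is kept, False where it is dropped
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 10))
dropped = dropout(x, rate=0.2, rng=rng)

# the mean stays close to 1 even though about 20% of the values are now 0
print(dropped.mean())
```

At prediction time no units are dropped, which is why the rescaling is done during training.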

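# Model 2
embedding_dim = 16

model_2 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # assumption: flatten the embedded sequence before the dense layers
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    # assumption: the position and rate (0.2) of the dropout layers are placeholders
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# binary cross-entropy matches the single sigmoid output
# assumption: the Adam optimizer is used
model_2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train the model
num_epochs = 100
history_2 = model_2.fit(padded, 
                        y_train,
                        epochs=num_epochs,
                        validation_data=(X_test_padded, y_test))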

plots(history_2, "accuracy")
plots(history_2, "loss")

Get the predictions and analyse the results.


Now we can calculate the precision and the recall for each one of the models separately.

Then, we will get the average of the predictions made by these two models and compare its recall and precision.

Recall measures the number of correct positive predictions out of all the examples that are actually positive.

Precision measures the number of correct positive predictions out of all the positive predictions made.

Higher precision means fewer false positives. Higher recall means fewer false negatives.
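These two definitions can be written in a few lines of plain Python. This is a small sketch on made-up labels, assuming that label 1 is the positive class; `precision_recall` is a hypothetical helper name.

```python
def precision_recall(true_labels, predicted):
    """Return (precision, recall), treating label 1 as the positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# made-up example: three positive and three negative comments
true_labels = [1, 1, 1, 0, 0, 0]
predicted   = [1, 0, 0, 1, 0, 0]

precision, recall = precision_recall(true_labels, predicted)
print('Precision: {:.2f}%'.format(precision * 100))
print('Recall: {:.2f}%'.format(recall * 100))
```

Here one of the two positive predictions is right (precision of 50%), but only one of the three real positives was found (recall of about 33%).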

Since negative comments are more harmful than positive ones, it’s important to minimize the number of false positives (the customer is unhappy, but the system classifies their comment as a good one).

# predict
y_pred_1 = model_1.predict(X_test_padded)
y_pred_2 = model_2.predict(X_test_padded)

# get the mean
y_pred   = (y_pred_1 + y_pred_2)/2

# round
y_pred   = [1 if y > 0.5 else 0 for y in y_pred]

def get_matrix(y_true, y_pred):
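  # confusion matrix: rows are the real labels, columns the predicted ones,
  # both ordered by label value (0, then 1)
  matrix = tf.math.confusion_matrix(y_true, 
                                    y_pred)

  # flip both axes so that label 1 comes first and matches the names below
  # (assumption: 1 marks a positive comment in this dataset)
  matrix = np.array(matrix)[::-1, ::-1]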

  matrix = pd.DataFrame(matrix, 
                        columns=['Positive (predicted)', 'Negative (predicted)'],
                        index=['Positive (real)', 'Negative (real)'])
  # extract the four counts by position
  tp = matrix['Positive (predicted)'].iloc[0] #true positives
  tn = matrix['Negative (predicted)'].iloc[1] #true negatives
  fp = matrix['Positive (predicted)'].iloc[1] #false positives
  fn = matrix['Negative (predicted)'].iloc[0] #false negatives

  # get recall
  recall = tp/(tp + fn)
  recall = recall * 100
  print('Recall: {:.2f}%'.format(recall))

  # get precision
  precision = tp/(tp + fp)
  precision = precision * 100
  print('Precision: {:.2f}%'.format(precision))

  return matrix

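# print matrix for each one of the models, including the ensemble predictions
matrix = get_matrix(y_test, y_pred)

# threshold each model's probabilities first: confusion_matrix casts
# predictions to integers, so raw sigmoid outputs would all truncate to 0
get_matrix(y_test, (y_pred_1 > 0.5).astype(int).flatten())
get_matrix(y_test, (y_pred_2 > 0.5).astype(int).flatten())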

In this run, each model on its own showed good recall but poor precision. Their average, however, showed pretty good precision but poor recall.

Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
