#100daysoftensorflow

in #100DaysOfCode, #100DaysOfData, #100DaysOfTensorflow

Text pre-processing (part 2)

For today’s challenge, we will continue working with the pre-processed text. Now, it’s time to build the model, train it and make some predictions.

Over the next days, I will explore TensorFlow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

Continue to work on the model

TensorFlow can be used to build models for sentiment analysis, text summarization, translation, etc.

Today, we will complete this notebook, building the network and training the model.

In the previous notebook, we pre-processed the text. Now, it’s time to feed it to the network.

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

# get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
#!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
#%cd atalaia

# install packages requirements
#!pip install -r requirements.txt

# install package
#!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# define the pre-process function
def preprocess(panda_series):
  atalaia = Atalaia('en')

  # lower case everything and remove double spaces
  panda_series = (atalaia.lower_remove_white(t) for t in panda_series)

  # expand contractions
  panda_series = (atalaia.expand_contractions(t) for t in panda_series)

  # remove punctuation
  panda_series = (atalaia.remove_punctuation(t) for t in panda_series)

  # remove numbers
  panda_series = (atalaia.remove_numbers(t) for t in panda_series)

  # remove stopwords
  panda_series = (atalaia.remove_stopwords(t) for t in panda_series)

  # remove excessive spaces
  panda_series = (atalaia.remove_excessive_spaces(t) for t in panda_series)

  return panda_series

# preprocess it
preprocessed_text = preprocess(data.text)

# assign preprocessed texts to dataset
data['text']      = list(preprocessed_text)
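
# quick look at one pre-processed review (an illustrative check, not in the original notebook)
print(data['text'].iloc[0])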

# split train/test
# shuffle the dataset
data = data.sample(frac=1)

# separate all classes present on the dataset
classes_dict = {}
for label in [0,1]:
  classes_dict[label] = data[data['sentiment'] == label]

# get 80% of each label
# (size is computed from class 0; this assumes both classes have the same number of examples)
size = int(len(classes_dict[0].text) * 0.8)
X_train = list(classes_dict[0].text[0:size])      + list(classes_dict[1].text[0:size])
X_test  = list(classes_dict[0].text[size:])       + list(classes_dict[1].text[size:])
y_train = list(classes_dict[0].sentiment[0:size]) + list(classes_dict[1].sentiment[0:size])
y_test  = list(classes_dict[0].sentiment[size:])  + list(classes_dict[1].sentiment[size:])

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

# Let's consider the vocab size as the number of words
# that compose 90% of the vocabulary
atalaia    = Atalaia('en')
vocab_size = len(atalaia.representative_tokens(0.9, 
                                               ' '.join(X_train),
                                               reverse=False))
oov_tok = "<OOV>"

# create the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, 
                      oov_token=oov_tok)

# fit on training
# we don't fit on the test set because, in real life, our model will have to deal with
# words it never saw before, so it makes sense to fit only on the training data.
# when it finds a word it never saw before, it will assign the 
# <OOV> tag to it.
tokenizer.fit_on_texts(X_train)

# get the word index
word_index = tokenizer.word_index
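
# quick OOV check (illustrative, not in the original notebook): words the
# tokenizer never saw during fitting are replaced by the <OOV> index
print(word_index[oov_tok])
print(tokenizer.texts_to_sequences(['a completely unseen madeupword'])[0])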

# transform into sequences
# this will assign an index to each token present in the corpus
sequences = tokenizer.texts_to_sequences(X_train)

# define max_length 
max_length = 100

# post: pad or truncate at the end of the sentence.
# pre: pad or truncate at the beginning of the sentence.
trunc_type='post'
padding_type='post'

padded = pad_sequences(sequences,
                       maxlen=max_length, 
                       padding=padding_type, 
                       truncating=trunc_type)
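
# quick illustration of the difference between post and pre padding
# (not in the original notebook; maxlen=8 is an arbitrary choice for display)
example = tokenizer.texts_to_sequences(['the product is great'])
print(pad_sequences(example, maxlen=8, padding='post'))
print(pad_sequences(example, maxlen=8, padding='pre'))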

# tokenize and pad test sentences
# these will be used later to evaluate the model's accuracy
X_test_sequences = tokenizer.texts_to_sequences(X_test)

X_test_padded    = pad_sequences(X_test_sequences,
                                 maxlen=max_length, 
                                 padding=padding_type, 
                                 truncating=trunc_type)

# create the reverse word index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# create the decoder
def text_decoder(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
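
# sanity check (illustrative, not in the original notebook): decode the first
# padded training sequence back to words; padding indices show up as '?'
print(text_decoder(padded[0]))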

We will build a simple neural network with:

  • An Embedding layer
  • A Flatten layer
  • A hidden Dense layer
  • An output Dense layer

# Build network
embedding_dim = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Now, it’s time to fit the model on the training data.

num_epochs = 10
model.fit(padded, 
          y_train, 
          epochs=num_epochs, 
          validation_data=(X_test_padded, 
                           y_test))

Let’s check the accuracy.

test_loss, test_acc = model.evaluate(X_test_padded, y_test, verbose=2)
print('\nModel accuracy: {:.0f}%'.format(test_acc*100))

And do some predictions.

Don’t forget to pre-process the sentences.

# Use the model to predict new reviews   
new_reviews = ['Nothing could smell better than this fragrance.', 
               'Everything was perfect',
               'They respect the environment.', 
               'The cake was a little dry',
               'Everything was terrible.',
               'it didn\'t work as expected']

# preprocess the texts
new_reviews = list(preprocess(new_reviews))
print(new_reviews)

Also, create the padded sequences for these new predictions.

⚠️ Use the same padding configuration as before. You also have to use the same tokenizer that was fitted on the training data.

# Create the sequences
padding_type     = 'post'
new_sequences    = tokenizer.texts_to_sequences(new_reviews)
new_padded       = pad_sequences(new_sequences, 
                                 padding=padding_type, 
                                 maxlen=max_length)           

# predict
y_pred           = model.predict(new_padded)

The predictions come back as an array. Each element is the probability that the corresponding sentence is positive. The lower the probability, the more negative the model considers the sentence to be.

# See the predictions
for x in range(len(new_reviews)):
  print(new_reviews[x])
  print(y_pred[x])
  print('\n')
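
Since the output layer is a single sigmoid unit, you can also turn these probabilities into hard labels by thresholding them. A minimal sketch, assuming a 0.5 cut-off (an arbitrary choice, not something fixed by the model):

# turn probabilities into labels (the 0.5 threshold is an assumption; tune it as needed)
for review, prob in zip(new_reviews, y_pred.flatten()):
  label = 'positive' if prob >= 0.5 else 'negative'
  print('{} -> {} ({:.2f})'.format(review, label, prob))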

What we learned today

Today we revisited the basics of building a model, but the key thing to retain is that you can always tweak your data to try to get better results.

If you are getting poor results, before tweaking your model, try cleaning your data or exploring methods such as data augmentation or class balancing to see if the results improve.
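
As a rough sketch of what class balancing could look like (this is not part of today's notebook; it simply reuses the data frame and the pandas import from above), you could upsample the minority class before splitting:

# naive upsampling sketch: duplicate random minority-class rows until both
# classes have the same size, then reshuffle (assumes the 'sentiment' column)
counts   = data['sentiment'].value_counts()
minority = data[data['sentiment'] == counts.idxmin()]
extra    = minority.sample(counts.max() - counts.min(), replace=True)
balanced = pd.concat([data, extra]).sample(frac=1)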


Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
