#100daysoftensorflow

in #100DaysOfCode, #100DaysOfData, #100DaysOfTensorflow

Tensorflow for text classification

For today’s challenge, let’s move on to the next official Tensorflow tutorial and explore a sentiment analysis problem.

Over the next days, I will explore Tensorflow for at least one hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

Step 1: Download the dataset and understand it

Today’s challenge is based on the colab “Text classification with TensorFlow Hub: Movie reviews” proposed by Tensorflow. The original colab can be accessed here.

This tutorial uses the IMDB dataset, which contains the text of 50,000 movie reviews. We will split the 25,000 training reviews 60/40, giving 15,000 examples for training and 10,000 for validation; the remaining 25,000 reviews are used for testing.

There are two labels: 0 for negative sentiment and 1 for positive sentiment.

The training and testing sets are balanced – they contain an equal number of positive and negative reviews.

# imports
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from pprint import pprint

# define a function to load the datasets
def load_ds(set_name, train_split, validation_split):
  train_data, validation_data, test_data = tfds.load(
    name=set_name,  # in this case, the set name will be "imdb_reviews"
    # the validation set starts where the training split ends
    split=(f'train[:{train_split}%]',
           f'train[{validation_split}%:]',
           'test'),
    as_supervised=True)

  return train_data, validation_data, test_data
  
# load data
train_data, validation_data, test_data = load_ds('imdb_reviews', 60, 60)

Tensorflow then uses the tf.data API to represent its datasets. It allows us to handle big datasets that don’t fit in memory (amongst many other things).

I have avoided this format in previous tutorials, preferring Python lists, dictionaries, or pandas to handle data. But let’s use this opportunity to gain some knowledge of the tf.data API.

By the way, the dataset is iterable: call iter() on it to get an iterator and next() to retrieve the next element. The batch() method groups consecutive examples into batches for you.
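To make this concrete, here is a tiny toy example (not the IMDB data, just an illustration) of iterating a tf.data dataset in batches:

# toy dataset, just to illustrate iter, next and batch
toy_ds = tf.data.Dataset.from_tensor_slices([10, 20, 30, 40, 50])

# take the first batch of 2 consecutive examples
first_batch = next(iter(toy_ds.batch(2)))
print(first_batch)  # tf.Tensor([10 20], shape=(2,), dtype=int32)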

def echo_batch(dataset, examples_qty):
  # print the data type
  print('Data type:')
  print(type(dataset))

  # print the number of examples in the dataset
  print('\nData shape:')
  print(tf.data.experimental.cardinality(dataset))

  # take one batch and unpack it into texts and labels
  texts, labels = next(iter(dataset.batch(examples_qty)))

  # print the texts in the batch
  print('\nTexts:')
  pprint(texts)

  # print the labels in the batch
  print('\nLabels:')
  pprint(labels)

# print the first 5 examples and labels
echo_batch(train_data, 5)

For text problems, we usually apply pre-processing. This includes steps such as tokenization, special-character removal, normalization, etc. But let’s keep things simple and focus on one concept at a time; we can revisit pre-processing later.
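Just to give an idea of what such a step could look like, here is a minimal sketch using Python’s re module (this is not part of the tutorial, only an illustration):

import re

def simple_clean(text):
  # illustrative pre-processing: lowercase, strip HTML tags,
  # remove special characters and collapse whitespace
  text = text.lower()
  text = re.sub(r"<[^>]+>", " ", text)       # remove tags like <br />
  text = re.sub(r"[^a-z0-9' ]", " ", text)   # keep letters, digits, apostrophes
  return re.sub(r"\s+", " ", text).strip()

print(simple_clean("This movie was <br /> GREAT!!!"))  # this movie was great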

Step 2: Build the model

The official tutorial includes the concept of transfer learning. It means that you will use a pre-trained model’s weights to improve the performance of your own model.

This will save you time and resources. To learn more about this concept, read this and this.

The model will use an embedding layer based on pre-trained Google News embeddings (the gnews-swivel-20dim module from TensorFlow Hub).
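To get a feel for what this embedding does, you can apply the TF Hub layer directly to raw strings; each string is mapped to a 20-dimensional vector. A quick, purely illustrative check (the sample sentences are made up; the module is the same one used in the model below):

# map two raw strings to their 20-dimensional embeddings
sample_layer = hub.KerasLayer(
  "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
  input_shape=[], dtype=tf.string)
print(sample_layer(tf.constant(["a great movie", "a terrible movie"])).shape)  # (2, 20)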

Let’s create the model, compile it and train it.

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()
# compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

# train the model
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Step 3: Evaluate the model

Evaluation shows an accuracy of 86% on sentiment detection.

# evaluate
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

Conclusion: what we learned today

Today, we saw that we can use pre-trained models to improve our own model. Since training Tensorflow models from scratch is expensive in terms of time and hardware, transfer learning will save you both.

We started to explore the tf.data API. Let’s try to dive deeper into this API in the next days.

We also started to talk about text pre-processing. I have created a library for personal use for this purpose.

Finally, we saw that we can achieve great results with seemingly simple architectures.


Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
