For today’s challenge, let’s move on to the next of TensorFlow’s official tutorials and explore a sentiment analysis problem.
Over the coming days, I will explore TensorFlow for at least one hour per day and post the notebooks, data and models to this repository.
Today’s notebook is available here.
Step 1: downloading the dataset and understanding it
Today’s challenge is based on the colab “Text classification with TensorFlow Hub: Movie reviews” proposed by Tensorflow. The original colab can be accessed here.
This tutorial uses the IMDB dataset, which contains the text of 50,000 movie reviews: 25,000 for training and 25,000 for testing. We will split the original training set 60%/40%, giving 15,000 examples for training, 10,000 examples for validation and 25,000 examples for testing.
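As a quick sanity check on the arithmetic, the split sizes above follow directly from the 60/40 division of the original training set:

```python
# the IMDB dataset ships with 25,000 training and 25,000 test reviews
train_total = 25_000
test_total = 25_000

# a 60/40 split of the original training set
train_examples = int(train_total * 0.60)            # 15,000 for training
validation_examples = train_total - train_examples  # 10,000 for validation

print(train_examples, validation_examples, test_total)
```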
There are 2 labels: 0 for a negative sentiment and 1 for positive sentiment.
The training and testing sets are balanced – they contain an equal number of positive and negative reviews.
```python
# imports
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from pprint import pprint

# define a function to load the datasets
def load_ds(set_name, train_split, validation_split):
    train_data, validation_data, test_data = tfds.load(
        name=set_name,  # in this case, the set name will be "imdb_reviews"
        split=('train[:' + str(train_split) + '%]',
               'train[' + str(validation_split) + '%:]',
               'test'),
        as_supervised=True)
    return train_data, validation_data, test_data

# load the data
train_data, validation_data, test_data = load_ds('imdb_reviews', 60, 60)
```
TensorFlow then uses the tf.data API to encode its datasets. Among many other things, it lets us handle big datasets that don’t fit in memory.
In previous tutorials I avoided this format and preferred plain Python lists, dictionaries or pandas to handle data. But let’s use this opportunity to gain some knowledge of the tf.data API.
By the way, the dataset is iterable: call iter to get an iterator over it and next to do the actual iteration. The batch method will group consecutive examples into batches for you.
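To make the batching behaviour concrete without loading TensorFlow, here is a pure-Python sketch of what a batch method conceptually does (the helper name `batch_iter` is mine, not a tf.data API):

```python
def batch_iter(iterable, size):
    """Yield consecutive items grouped into lists of length `size`
    (the last batch may be shorter), mimicking Dataset.batch."""
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:  # emit the final, possibly partial, batch
        yield buf

reviews = ['great movie', 'terrible plot', 'loved it', 'boring', 'a classic']
it = iter(batch_iter(reviews, 2))
print(next(it))  # ['great movie', 'terrible plot']
print(next(it))  # ['loved it', 'boring']
```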
```python
def echo_batch(dataset, examples_qty):
    # print data type
    print('Data type:')
    print(type(dataset))
    # print the number of examples in the dataset
    print('\nData shape:')
    print(tf.data.experimental.cardinality(dataset))
    # take one batch and unpack it into texts and labels
    texts, labels = next(iter(dataset.batch(examples_qty)))
    # print the texts in the batch
    print('\nTexts:')
    pprint(texts)
    # now, print the labels in the batch
    print('\nLabels:')
    pprint(labels)

# print the first 5 examples and labels
echo_batch(train_data, 5)
```
On text problems, we usually apply pre-processing. This includes steps such as tokenizing, special character removal, normalization, etc. But let’s keep things simple and focus on one concept at a time. We can revisit pre-processing later.
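For reference, a minimal version of those pre-processing steps could look like the sketch below (a hypothetical helper, not part of the notebook; real pipelines would use a proper tokenizer):

```python
import re

def preprocess(text):
    # normalization: lowercase everything
    text = text.lower()
    # remove the HTML line breaks that appear in raw IMDB reviews
    text = text.replace('<br />', ' ')
    # special character removal: keep only letters, digits and whitespace
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # naive whitespace tokenization
    return text.split()

print(preprocess("This movie was GREAT!<br />10/10, would watch again."))
```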
Step 2: Build the model
The official tutorial introduces the concept of transfer learning: reusing a pre-trained model’s weights to improve the performance of your own model.
The model will create an embedding layer using the Google News Swivel embeddings from TensorFlow Hub.
Let’s create the model, compile it and train it.
```python
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()
```
```python
# compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

# train the model
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)
```
Step 3: Evaluate the model
Evaluation on the test set shows an accuracy of about 86% on sentiment detection.
```python
# evaluate
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))
Conclusion: what we learned today
Today, we saw that we can use already-trained models to improve our own. Since training TensorFlow models is expensive in terms of time and hardware, transfer learning can save you both.
We started to explore the tf.data API. Let’s try to dive more into this API in the coming days.
We also started to talk about text pre-processing. I have created a library for personal use for this purpose.
Finally, we have seen that we are able to achieve great results with seemingly simple architectures.
Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.