#100daysoftensorflow

in #100DaysOfCode, #100DaysOfData, #100DaysOfTensorflow

Pre-processing text

For today’s challenge, we will work with text pre-processing. We will pre-process some Amazon and Yelp reviews.

Over the next few days, I will explore TensorFlow for at least 1 hour per day and post the notebooks, data and models to this repository.

Today’s notebook is available here.

Why is pre-processing important?

TensorFlow is capable of building models for sentiment analysis, text summarization, translation, etc.

But, like other types of data, text has to be pre-processed first.

Pre-processing is the same as normalizing your data. You are telling the computer that some tokens are the same. You are also removing noise and helping the machine pick what’s really important in the dataset.

One example of text normalization is setting everything to lowercase. Example:

# for Python, "Mary" and "mary" are two different entities
print('Is "Mary" the same as "mary"?')
print('Mary' == 'mary')
>> False

# now, let's remove the case of Mary
print('Is "Mary.lower()" the same as "mary"?')
print('Mary'.lower() == 'mary')
>> True

This is only one example of pre-processing.

If you are dealing with a language other than English, you could also remove accents, for example.
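
As a quick sketch of that idea (using Python's built-in unicodedata module, not something from today's notebook), accents can be stripped by decomposing the characters and dropping the combining marks:

# decompose accented characters and drop the non-ASCII combining marks
import unicodedata

def remove_accents(text):
  nfkd_form = unicodedata.normalize('NFKD', text)
  return nfkd_form.encode('ascii', 'ignore').decode('ascii')

print(remove_accents('café crème'))
# expected output: cafe creme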

Do the imports

As usual, we start by importing all the libraries that we will need.

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import numpy as np
import pandas as pd

Get the data

Let’s use the Amazon and Yelp reviews dataset. You can see it here.

!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

Load the data using Pandas.

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

# get the data
data = get_data('/tmp/sentiment.csv')

# check the data
data.head(5)

This dataset has 1992 reviews: 996 negative and 996 positive.

Now, we can pre-process it using atalaia. Let’s install it in the Colab notebook.

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

Define the pre-processing function

Our preprocess function will:

  • Lowercase everything
  • Expand contractions such as i’ll -> i will
  • Remove punctuation from sentences.
    ⚠️ Punctuation is sometimes important for sentiment analysis problems.
  • Remove numbers
  • Remove common stop words such as ‘the’ from the corpus
  • After these removals, sentences may be left with extra blank spaces. Let’s remove those too.

The last step is assigning the pre-processed text to the corresponding dataset column.

def preprocess(panda_series):
  atalaia = Atalaia('en')

  # lower case everything and remove double spaces
  panda_series = (atalaia.lower_remove_white(t) for t in panda_series)

  # expand contractions
  panda_series = (atalaia.expand_contractions(t) for t in panda_series)

  # remove punctuation
  panda_series = (atalaia.remove_punctuation(t) for t in panda_series)

  # remove numbers
  panda_series = (atalaia.remove_numbers(t) for t in panda_series)

  # remove stopwords
  panda_series = (atalaia.remove_stopwords(t) for t in panda_series)

  # remove excessive spaces
  panda_series = (atalaia.remove_excessive_spaces(t) for t in panda_series)

  return panda_series

# preprocess it
preprocessed_text = preprocess(data.text)

# assign preprocessed texts to dataset
data['text']      = list(preprocessed_text)

# see data
data.head(5)

⚠️ Atalaia is a personal package I have created for study purposes and is NOT READY FOR PRODUCTION.

Separate train and test

Let’s get 80% of the sentences for training and 20% for testing.

Let’s also try to keep the same proportion of labels in each of them:

  • 50% of negatives
  • 50% of positives
# shuffle the dataset
data = data.sample(frac=1)

# separate all classes present on the dataset
classes_dict = {}
for label in [0,1]:
  classes_dict[label] = data[data['sentiment'] == label]

# get 80% of each label
size = int(len(classes_dict[0].text) * 0.8)
X_train = list(classes_dict[0].text[0:size])      + list(classes_dict[1].text[0:size])
X_test  = list(classes_dict[0].text[size:])       + list(classes_dict[1].text[size:])
y_train = list(classes_dict[0].sentiment[0:size]) + list(classes_dict[1].sentiment[0:size])
y_test  = list(classes_dict[0].sentiment[size:])  + list(classes_dict[1].sentiment[size:])

# print the lengths
print('X_train len is {}'.format(len(X_train)))
print('y_train len is {}'.format(len(y_train)))
print('X_test len is {}'.format(len(X_test)))
print('y_test len is {}'.format(len(y_test)))

# print X_train first sentence and its label
print(X_train[0])
print(y_train[0])

# print X_test first sentence and its label
print(X_test[0])
print(y_test[0])

In order to use this data for training, we need to convert the labels to NumPy arrays.

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)
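
As a side note, if scikit-learn happens to be available in your environment, the manual split above could also be written in one call. This is just a sketch, not what this notebook uses: train_test_split with the stratify argument keeps the 50/50 label balance in both splits.

# sketch: stratified 80/20 split with scikit-learn instead of the manual slicing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(list(data.text),
                                                     np.array(data.sentiment),
                                                     test_size=0.2,
                                                     stratify=data.sentiment)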

Tokenizing

The next step of the pre-processing phase is tokenizing the corpus.

Tokenization is the same as dividing a sentence into smaller pieces of information. Generally, we consider a word as being one token.
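
As a tiny illustration of the idea in plain Python, the simplest possible tokenizer just splits a sentence on whitespace:

# the simplest possible tokenizer: split on whitespace
sentence = 'i love this product'
print(sentence.split())
# expected output: ['i', 'love', 'this', 'product']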

TensorFlow provides a Tokenizer you can use right away.

You must provide:

  1. The size of the vocabulary to keep
  2. A token that will be assigned to every word that falls outside the vocabulary (out of vocabulary, or OOV).
# Let's consider the vocab size as the number of words
# that compose 90% of the vocabulary
atalaia    = Atalaia('en')
vocab_size = len(atalaia.representative_tokens(0.9, 
                                               ' '.join(X_train),
                                               reverse=False))
oov_tok = "<OOV>"

# create the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, 
                      oov_token=oov_tok)

# fit on training
# we don't fit on test because, in real life, our model will have to deal with
# words it never saw before. So, it makes sense to fit only on training.
# when it finds a word it never saw before, it will assign the 
# <OOV> tag to it.
tokenizer.fit_on_texts(X_train)

# get the word index
word_index = tokenizer.word_index

# transform into sequences
# this will assign an index to each token present in the corpus
sequences = tokenizer.texts_to_sequences(X_train)

# see the first sequence
sequences[0]
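
To see how the out-of-vocabulary handling behaves, we can feed the fitted tokenizer a sentence containing a made-up word (qwertyzz below is just a hypothetical example); the unknown word should be mapped to the index assigned to the <OOV> token:

# sketch: unknown words are replaced by the <OOV> index
print(word_index[oov_tok])
print(tokenizer.texts_to_sequences(['this product is qwertyzz']))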

Pad sentences

Every sentence you feed to your network has to have the same length: one sentence cannot be shorter (or longer) than another.

To make all sentences the same length, we pad them.

We define a maximum length.

  • Every sentence longer than this number will be truncated.
  • Every sentence shorter than this number will be padded.
# define max_length 
max_length = 100

# 'post': pad or truncate at the end of the sequence.
# 'pre': pad or truncate at the beginning of the sequence.
trunc_type='post'
padding_type='post'

padded = pad_sequences(sequences,
                       maxlen=max_length, 
                       padding=padding_type, 
                       truncating=trunc_type)

# tokenize and pad test sentences
# these will be used later to test the model's accuracy
X_test_sequences = tokenizer.texts_to_sequences(X_test)

X_test_padded    = pad_sequences(X_test_sequences,
                                 maxlen=max_length, 
                                 padding=padding_type, 
                                 truncating=trunc_type)

# check the first padded sentence. Notice that 0s were added to it
# because it was shorter than 100
padded[0]

Today’s last step is creating a “decoder” function. The decoder will read a padded sequence and “transform” it back into a real textual sentence.

Our model will output sequences of numbers. We will need to translate them later to real sentences.

We do it by creating a reverse word dict based on the word_index dictionary we just got after fitting the tokenizer.

# create the reverse word index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# create the decoder
def text_decoder(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

# print the decoder output for one sentence and compare it to original
print('Decoded sentence:')
print(text_decoder(padded[1]))
print('\nOriginal sentence')
print(X_train[1])

Example of output:

>> Decoded sentence:
>> i live in neighborhood so i am disappointed i will not be back here because it is convenient location ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

>> Original sentence
>> i live in neighborhood so i am disappointed i will not be back here because it is convenient location

What we learned today

Every type of data has to be pre-processed. Today we only did it for text, but the same concept applies to images, videos, documents…

Images, for instance, can be rotated, cropped, resized, recolored, converted to black and white, and so on.
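
As a rough sketch of that idea (applied to a random dummy tensor rather than a real image, and not part of today's notebook), TensorFlow ships helpers for these transformations in tf.image:

# a dummy 128x128 RGB "image" just to illustrate the calls
image = tf.random.uniform([128, 128, 3])

resized   = tf.image.resize(image, [64, 64])    # resize to 64x64
rotated   = tf.image.rot90(image)               # rotate by 90 degrees
flipped   = tf.image.flip_left_right(image)     # mirror horizontally
grayscale = tf.image.rgb_to_grayscale(image)    # convert to black and white
cropped   = tf.image.central_crop(image, 0.5)   # keep the central 50%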

If you want to read more about pre-processing, read this article.


Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
