For today’s challenge, we will visualize our model’s weights using a tool called Embedding Projector.

Over the next days, I will explore TensorFlow for at least one hour per day and post the notebooks, data, and models to this repository.

Today’s notebook is available here.

Visualisation with Embedding Projector

Last time, I talked about the importance of knowing what your model is doing.

TensorFlow has the Embedding Projector, a tool that lets you see visually how the model groups information.

Today, we will load the weights and the words found in Amazon reviews and see how the model groups them.

Redo the pre-processing and model training.

Retrain everything, but use 100 dimensions for your vectors this time.

# imports
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import os
import io

import numpy as np
import pandas as pd

# get data
!wget --no-check-certificate \
    -O /tmp/sentiment.csv https://drive.google.com/uc?id=13ySLC_ue6Umt9RJYSeM2t-V0kCv-4C-P

# define get_data function
def get_data(path):
  data = pd.read_csv(path, index_col=0)
  return data

# get the data
data = get_data('/tmp/sentiment.csv')

# clone package repository
!git clone https://github.com/vallantin/atalaia.git

# navigate to atalaia directory
%cd atalaia

# install packages requirements
!pip install -r requirements.txt

# install package
!python setup.py install

# import it
from atalaia.atalaia import Atalaia

# define the pre-process function
def preprocess(panda_series):
  atalaia = Atalaia('en')

  # lower case everything and remove double spaces
  panda_series = (atalaia.lower_remove_white(t) for t in panda_series)

  # expand contractions
  panda_series = (atalaia.expand_contractions(t) for t in panda_series)

  # remove punctuation
  panda_series = (atalaia.remove_punctuation(t) for t in panda_series)

  # remove numbers
  panda_series = (atalaia.remove_numbers(t) for t in panda_series)

  # remove stopwords
  panda_series = (atalaia.remove_stopwords(t) for t in panda_series)

  # remove excessive spaces
  panda_series = (atalaia.remove_excessive_spaces(t) for t in panda_series)

  return panda_series

# preprocess it
preprocessed_text = preprocess(data.text)

# assign preprocessed texts to dataset
data['text']      = list(preprocessed_text)

# split train/test
# shuffle the dataset
data = data.sample(frac=1)

# separate all classes present on the dataset
classes_dict = {}
for label in [0,1]:
  classes_dict[label] = data[data['sentiment'] == label]

# get 80% of each label
size = int(len(classes_dict[0].text) * 0.8)
X_train = list(classes_dict[0].text[0:size])      + list(classes_dict[1].text[0:size])
X_test  = list(classes_dict[0].text[size:])       + list(classes_dict[1].text[size:])
y_train = list(classes_dict[0].sentiment[0:size]) + list(classes_dict[1].sentiment[0:size])
y_test  = list(classes_dict[0].sentiment[size:])  + list(classes_dict[1].sentiment[size:])

# Convert labels to Numpy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

# Let's consider the vocab size as the number of words
# that compose 90% of the vocabulary
atalaia    = Atalaia('en')
vocab_size = len(atalaia.representative_tokens(0.9, 
                                               ' '.join(X_train),
                                               reverse=False))
oov_tok = "<OOV>"

# start tokenize
tokenizer = Tokenizer(num_words=vocab_size, 
                      oov_token=oov_tok)

# fit on training
# we don't fit on test because, in real life, our model will have to deal with
# words it never saw before, so it makes sense to fit only on the training set.
# when the tokenizer finds a word it has never seen, it will assign the
# <OOV> tag to it.
tokenizer.fit_on_texts(X_train)
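
# quick illustration (not part of the original pipeline): any token the
# tokenizer never saw during fitting is mapped to the <OOV> index
print(tokenizer.word_index[oov_tok])                      # index reserved for <OOV>
print(tokenizer.texts_to_sequences(['xyzzy neverseen']))  # hypothetical unseen tokens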

# get the word index
word_index = tokenizer.word_index

# transform into sequences
# this will assign an index to each token present in the corpus
sequences = tokenizer.texts_to_sequences(X_train)

# define max_length 
max_length = 100

# post: pad or truncate at the end of the sequence.
# pre: pad or truncate at the beginning of the sequence.
trunc_type='post'
padding_type='post'

padded = pad_sequences(sequences,
                       maxlen=max_length, 
                       padding=padding_type, 
                       truncating=trunc_type)

# tokenize and pad test sentences
# these will be used later to evaluate the model's accuracy
X_test_sequences = tokenizer.texts_to_sequences(X_test)

X_test_padded    = pad_sequences(X_test_sequences,
                                 maxlen=max_length, 
                                 padding=padding_type, 
                                 truncating=trunc_type)

# create the reverse word index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

# create the decoder
def text_decoder(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
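
# quick sanity check (optional): decode the first padded training sequence back
# into words; padded positions appear as '?' because index 0 is not in the
# reverse word index
print(text_decoder(padded[0]))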

# Build network
embedding_dim = 100

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train the model
num_epochs = 10
model.fit(padded, 
          y_train, 
          epochs=num_epochs, 
          validation_data=(X_test_padded, 
                           y_test))

Save the weights and metadata

Now, we can save the metadata and the vectors that will be used by the Embedding Projector.

This part is very well explained in this notebook.

# get the weights of the embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
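
Each row of this matrix is the learned vector for one word. As a quick optional check (the word "good" below is just an illustrative token; any word in your vocabulary works), you can look up a single embedding:

# look up the embedding vector for one token (illustrative example)
sample_word = 'good'
if sample_word in word_index and word_index[sample_word] < vocab_size:
  print(weights[word_index[sample_word]][:10]) # first 10 of the 100 dimensions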

We will save the files in a directory called “logs”. Let’s create it.

# create a logs directory 
%mkdir ../logs
%cd ..

Now, we save them…

# Write the embedding vectors and metadata
out_v = io.open('logs/vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('logs/meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
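
If you want a quick sanity check before downloading (optional, not part of the original steps), you can peek at the first few words written to the metadata file:

# print the first five lines of meta.tsv
with io.open('logs/meta.tsv', encoding='utf-8') as f:
  for _ in range(5):
    print(f.readline().strip())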

And download them.

# Download the files
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('logs/vecs.tsv')
  files.download('logs/meta.tsv')

Go to http://projector.tensorflow.org/ and load these files: upload the vectors file and the metadata file.

Then, in the left column, click on “Spherize data”.

Our model is not generalizing well yet, but you can see that it already groups the words it considers more negative on one side and the more positive ones on the other.

Figure: most positive words.
Figure: most negative words.

What we learned today

We continued to explore ways to see in detail how your model “thinks”.

We also saw that TensorFlow offers a lot of tools that we can use out of the box.
