Lima Vallantin
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send me a message. Want to know more about me? Visit the "About" page.



Struggling to get good results when training a text classifier for sentiment analysis? Fear no more! With a few changes, you can start getting better results.

For the past few weeks, I have been working on this special project called Essencialia. This is a website that I have created to display information about the Brazilian Real Estate market, specifically for the city of Aracaju.

For those who are new here, let me explain exactly what I am doing. First, with the help of Selenium, I scrape the content of real estate listings from sites dedicated to real estate.

Then, using NLP and Text Mining techniques, I extract information such as sales and rental prices, location, HOA fees, number of rooms, etc.

To estimate the prices of the houses in a given region, it’s important to know what’s being said about that neighborhood by the press or on social media. Are the majority of the news positive or negative? Are there reports of murders or violence?

This is where a text classifier comes in. For the time being, I am only interested in knowing the polarity of the sentiments about a region of the city. In other words, I want to get a grasp of the degree of positivity or negativity.

Got it?

So, now that you know what I am trying to achieve, let’s discuss the difficulties of this problem and how to overcome them.

Improving a text classifier’s performance: it all begins with the domain

The first tip to get better results when working on a text classifier’s performance is to use text specific to the domain that you are analyzing.

The press usually uses more elaborate language. The phrasal structures are different from the ones that you see on Twitter. Words are usually correctly spelled too, and the adopted tone is more neutral. Well, sometimes…

Social media, however, is more polarized in terms of sentiment and there’s a lot of gibberish going on there. Be prepared to handle misspellings and neologisms.

Another thing is the theme of the conversation. If you are training a text classifier for the biomedical field, it would be a good idea to use biomedical text to train it. In an ideal world, you could even use different classifiers for different themes.

But life is tough and data is not just there, sitting around, waiting for you. In my case, I will be using my classifier to handle texts that are more formal, written by the local press. And guess which corpus I have used to train it? Exactly, a Twitter corpus :D. Do what I say, not what I do.

But, why this crazy decision? Why? Whyyyyyy?

This leads us to the next topic…

Find the right data for training. Or collect it yourself

If you are working only with text written in the English language, congratulations: 80% of your problems are solved. If you are dealing with any other language, welcome to hell. It’s not always easy to find material to enhance your text classifier’s performance.

As you should imagine, there are not one billion Brazilian Portuguese datasets available out there. So, you have to do the best that you can with the available data.

For Brazilian Portuguese, there’s this great dataset with 800k Portuguese tweets classified as positive, negative, and neutral​1​. And it will just have to do the job of improving our text classifier’s performance.

Another option is creating your own dataset. This is no easy task: you will have to collect data, find suitable people to help you classify the training data manually (or you could do a pre-classification with any other available model), create a document to help people understand how to classify ambiguous text…

I don’t have time for this now, though I have done it in the past and it’s a great experience.

Now that you have the data, it’s time to spend some time preprocessing it.

Invest some time preprocessing the text to improve your text classifier’s performance

Preprocessing is a crucial step when dealing with textual data. Text is noisy, heavy, and full of ambiguity. So, it’s important to normalize your text before submitting it for training.

A few preprocessing steps include:

  • Dealing with stop words, numbers, social media hashtags, usernames, URLs, and HTML tags
  • Dealing with low-frequency terms
  • Removing personal and sensitive data
  • Expanding contractions (like “won’t” becoming “will not”)
  • Deciding whether to use stemming or lemmatization
  • Applying POS tagging
  • Dealing with diacritic signs and punctuation…

Deal with stop words

In fact, every problem is a problem and deserves different normalization. In the article “Why is removing stop words not always a good idea”​2​, for instance, I discuss some cases where you wouldn’t want to remove these words from the text.

And yet, sometimes, in order to improve your text classifier’s performance, you will need to expand your stop words list. In fact, there are several types of stop words:

  • The language ones: words like articles and prepositions
  • Location stop words: such as names of places and cities
  • Time stop words: days of the week, names of the months
  • Numerals: numbers and decimal, thousand, etc. indications
  • Domain-specific: these are the tricky ones. Domain-specific stop words are words that don’t really add any new information to the text, but belong only to the domain that you are analyzing. In our Real Estate problem, these would be words like “house”, “apartment”, “condo”…
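To make this concrete, here is a minimal sketch of how these categories could be combined into a single custom stop-word list. All the example words are illustrative (a handful of Brazilian Portuguese function words, place names, and real-estate terms), not an exhaustive list:

```python
# Hypothetical stop-word categories; adapt each set to your own corpus.
language_sw = {"a", "o", "de", "em", "para"}          # articles and prepositions (pt-BR)
location_sw = {"aracaju", "sergipe", "brasil"}        # place names
time_sw     = {"segunda", "janeiro", "2021"}          # days, months, years
domain_sw   = {"casa", "apartamento", "condominio"}   # real-estate terms

# Union of all categories into one extended stop-word list
all_stopwords = language_sw | location_sw | time_sw | domain_sw

# Filtering a token list with the extended set
tokens = ["casa", "bonita", "em", "aracaju"]
filtered = [t for t in tokens if t not in all_stopwords]
```

A set union keeps the categories separate in code, which makes it easy to toggle one category on or off per experiment.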

The normalization process should be adapted to your corpus and your goals.

Deal with low-frequency terms

If stop words are a problem because of their high frequency, tokens that don’t appear that much are also a problem. These tokens may be a misspelling or a very specific word that appeared once.

Therefore, for a sentiment analysis problem, they can be removed. Sklearn provides the TfidfVectorizer​3​, which converts the corpus to a TF-IDF matrix. You can use the parameters min_df and max_df to limit the term frequency.

You may decide to ignore any term that appears in fewer than n documents or in more than m documents.
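As a sketch, here is how min_df and max_df could be set on a toy corpus (the thresholds and corpus are illustrative; tune them to your data). With min_df=2, terms appearing in fewer than two documents are dropped; with max_df=0.95, terms appearing in more than 95% of documents are dropped:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the house is big",
    "the house is small",
    "the apartment is big",
]

# min_df=2: ignore terms present in fewer than 2 documents
# max_df=0.95: ignore terms present in more than 95% of documents
vectorizer = TfidfVectorizer(min_df=2, max_df=0.95)
X = vectorizer.fit_transform(corpus)

# "small" and "apartment" are dropped by min_df;
# "the" and "is" appear in every document and are dropped by max_df
print(sorted(vectorizer.vocabulary_))
```

Note that an integer max_df is an absolute document count, while a float is a proportion of the corpus, so the same value can behave very differently.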

Stem and lemmatize

You may notice that, in some cases, stemming and lemmatizing the corpus will help you improve your text classifier’s performance.

According to Manning, Raghavan, and Schütze (2008)​4​:

“Stemming usually refers to a crude heuristic process that chops off the ends of words (…) [while] lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word”

If you take the words “studies” and “studying” and stem them, you will get something like “studi” and “study”. If you lemmatize them, you get the lemma “study” for both.

While stemming is easier, lemmatization is cleaner. And, yes, you are right, there are a couple of stemmers and lemmatizers for English, but not that many for other languages…
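To illustrate the distinction, here is a deliberately crude suffix-chopping stemmer next to a toy lookup-based lemmatizer. Both are sketches of the idea in the Manning et al. quote, not production tools; in practice you would use something like NLTK’s stemmers or spaCy’s lemmatizer:

```python
def crude_stem(word):
    """A crude heuristic that chops suffixes off the end of a word."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# A lemmatizer relies on a vocabulary: here, a tiny hand-made lookup table.
LEMMA_TABLE = {"studies": "study", "studying": "study", "went": "go"}

def lemmatize(word):
    """Return the dictionary form of a word, falling back to the word itself."""
    return LEMMA_TABLE.get(word, word)
```

The stemmer needs no linguistic knowledge at all, which is why it is fast but produces non-words like “stud”; the lemmatizer is only as good as its vocabulary.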

That’s crazy, right?

If you use TfidfVectorizer, also consider using n-grams

The TfidfVectorizer has a parameter called ngram_range. You can use it to define if you want the vectorizer to account for word combinations instead of using single tokens only.

For instance, if you have the phrase “I love apples and bananas”, a 2-gram approach will consider every combination of 2 tokens:

[("I", "love"), ("love", "apples"), ("apples", "and"), ("and", "bananas")]

You may also want to consider using collocations, which will merge tokens that appear commonly together, such as “single_bedroom” or “beautiful_house”.
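Extracting n-grams from a token list can be sketched in a few lines of plain Python (TfidfVectorizer does this internally when you pass ngram_range):

```python
def ngrams(tokens, n=2):
    """Return every combination of n consecutive tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

ngrams("I love apples and bananas".split())
# → [('I', 'love'), ('love', 'apples'), ('apples', 'and'), ('and', 'bananas')]
```

With ngram_range=(1, 2), the vectorizer would keep both the single tokens and these pairs as features.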

Be careful, though, if you have removed stop words, numbers, or other elements from the text. Why? Because you may start merging tokens that originally didn’t appear together.

Some transformations require a specific order

While preprocessing is necessary, it cannot be done in any order. You don’t want to remove punctuation before removing URLs if you use regex to capture the latter. If you remove punctuation from the URL “https://vallant.in”, you will get “https vallant in”.

If you merge collocations before extracting stop words, some stop words may “glue” to important terms. If you remove them first, you may end up with fake collocations.

Preprocessing order matters and should be planned. Read the documentation for the library that you are using.

Weng (2019)​5​ suggests the preprocessing order below:

  • Remove HTML tags
  • Remove extra whitespaces
  • Convert accented characters to ASCII characters
  • Expand contractions
  • Remove special characters
  • Lowercase all texts
  • Convert number words to numeric form
  • Remove numbers
  • Remove stopwords
  • Lemmatization
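A minimal sketch of a pipeline applying some of these steps in Weng’s order. The stop-word list and the contraction handling are deliberately tiny placeholders; a real pipeline would use proper lists and a lemmatizer at the end:

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "an", "of", "to", "and"}  # tiny illustrative list

def preprocess(text):
    """Apply a few of Weng's preprocessing steps in the suggested order."""
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"\s+", " ", text).strip()            # remove extra whitespace
    text = (unicodedata.normalize("NFKD", text)         # accented chars -> ASCII
            .encode("ascii", "ignore").decode())
    text = text.replace("won't", "will not")            # expand contractions (toy)
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)          # remove special characters
    text = text.lower()                                 # lowercase
    text = re.sub(r"\d+", "", text)                     # remove numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    return " ".join(tokens)
```

Notice that contraction expansion has to happen before special characters are stripped: once the apostrophe in “won’t” is gone, the contraction can no longer be matched.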

Mayo​6​ suggests that the 3 main components of text preprocessing are:

  • Tokenization
  • Normalization
  • Substitution

Don’t go crazy removing everything

Sometimes, especially in the social media case, if you remove things such as usernames, hashtags, and URLs, you end up with nothing. This can be a problem if you have to predict a text composed entirely of these elements, like the tweet “@mary @john #awesome #checkitout”.

A better approach would be replacing these elements with placeholders. After replacement, you could have “user user url hashtag hashtag”.
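A sketch of this placeholder substitution with regular expressions. The patterns are simplified (real usernames and URLs are messier), and the URL pattern runs first so its punctuation survives long enough to be matched:

```python
import re

def add_placeholders(tweet):
    """Replace usernames, hashtags, and URLs with generic placeholder tokens."""
    tweet = re.sub(r"https?://\S+", "url", tweet)  # URLs first
    tweet = re.sub(r"@\w+", "user", tweet)         # @usernames
    tweet = re.sub(r"#\w+", "hashtag", tweet)      # #hashtags
    return tweet

add_placeholders("@mary @john #awesome #checkitout")
# → 'user user hashtag hashtag'
```

The classifier then learns from the shape of the tweet (how many mentions, links, hashtags) rather than from tokens it will never see again.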

It’s not the perfect solution because you lose the tweet’s meaning. You could, for instance, find a way to tokenize the hashtags to transform them into words. The approach really depends on your problem.

Use Part of Speech tags

Part of Speech tagging, A.K.A. POS tagging, is the process of assigning a tag to each token explaining its function in the sentence. For instance, a token may be a noun, a verb, an adjective… This helps the classifier better handle ambiguity.

Some words in Portuguese can be a noun or a verb. This happens in a lot of other languages too, such as English. By using POS tagging, the classifier will know the difference between tokens that look the same but have different morphological roles.
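One simple way to feed POS information to a classifier is to merge each token with its tag, so homographs become distinct features. In this sketch the tags are hand-written for illustration; in practice they would come from a tagger such as spaCy’s:

```python
def tag_tokens(tokens, tags):
    """Merge each token with its POS tag so homographs become distinct features."""
    return [f"{tok}_{tag}" for tok, tag in zip(tokens, tags)]

# Portuguese "casa" can be the noun "house" or the verb "marries";
# the merged features keep the two uses apart:
tag_tokens(["ele", "casa", "amanha"], ["PRON", "VERB", "ADV"])
# → ['ele_PRON', 'casa_VERB', 'amanha_ADV']
```

A vectorizer built on these merged tokens treats “casa_NOUN” and “casa_VERB” as two separate features.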

Nothing is helping me

If you have already tried everything, including training deep learning and classic machine learning models, maybe the problem is your corpus, or the question itself can’t be solved using a text classifier.

An example would be trying to find polarity in academic texts. You may find it, but it will be harder to classify an academic text as positive or negative. You may obtain better results using tweets or social media content.

So, what do you think? How do you improve a text classifier’s performance?


  1. Portuguese Tweets for Sentiment Analysis: 800k Portuguese tweets separated in positive, negative and neutral classes. Kaggle. Accessed February 19, 2021.
  2. Lima Vallantin W. Why is removing stop words not always a good idea. Medium. Published June 22, 2019. Accessed February 19, 2021.
  3. sklearn.feature_extraction.text.TfidfVectorizer. scikit-learn documentation. Accessed February 19, 2021.
  4. Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. Cambridge University Press; 2008.
  5. Weng J. NLP Text Preprocessing: A Practical Guide and Template. Medium. Published August 30, 2019. Accessed February 19, 2021.
  6. Mayo M. A General Approach to Preprocessing Text Data. KDnuggets. Accessed February 19, 2021.
