I have recently started to work with some social media comments (tweets, to be more specific), and I must confess that cleaning them to extract data is somewhat painful.
Yet it’s necessary. In fact, a 2016 report by CrowdFlower [1] stated that 60% of a data scientist’s work goes into cleaning, organizing, and preparing data. Interestingly, 57% of the data scientists surveyed said this was the least enjoyable part of the process.
There’s some discussion about how useful this kind of statistic is [2], but the point here is that you will hear a lot of negative things about the preprocessing step.
Yet I like it. It’s painful, but feeding in a raw string and getting back a list of beautiful tokens is very satisfying.
What kind of information can you get from social media?
Exploring social media is very rich from a human standpoint. Unlike newspapers, blogs, and academic papers, there’s almost no pressure to write in a structured way*.
The use of polished language, in itself, says a lot about the person writing. You can easily distinguish tweets coming from brands and the press from those written by teenagers, for instance.
Thanks to tiny keyboards and the highly celebrated multitasking, you will see a lot of abbreviations and new words being created on the fly. For lack of better ways to express emotion when you only have a few characters at your disposal, you will also see lots of emojis, gifs, and veeeeeeeeery looooooong words.
With this in mind, forget grammar and accept that your fancy regexes won’t always be useful. Work with languages like Portuguese and French and you will see the “bordel” (the mess) that it becomes.
Say hello to characters such as ^, ¨, `, ´. “That’s only a matter of removing them”, you say. Do it and you will see “à” becoming “a” in French and “é” becoming “e” in Portuguese. Needless to say, these are completely different things†. Let’s not even talk about verb conjugation, declensions, compound words, homophones, homonyms…
It’s very rich. And from this richness comes trouble.
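To make the accent problem concrete, here is a minimal sketch of a very common "just remove them" approach, using Python's standard `unicodedata` module. The function name `strip_accents` and the sample sentences are only illustrative assumptions, not something taken from a specific pipeline.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# French: the preposition "à" collapses into the verb form "a".
print(strip_accents("il va à Paris"))
# Portuguese: the verb "é" collapses into the conjunction "e".
print(strip_accents("ele é alto"))
```

If the distinction between these words matters for your analysis, it is probably safer to keep the accents and only normalize the text to a single Unicode form (NFC, for example).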
The place of encoding
While working with social media, you may analyze the textual data, the images, the choice of usernames, the dates, and the metadata associated with a comment (such as the choice of tone or the use of structured grammar).
The first thing that you notice when you are dealing with social media comments is that ideas can be conveyed not only through the text itself, but also with images, links, or emojis.
Emojis are figures that represent ideas visually. There were 3,304 emojis in the Unicode Standard as of March 2020, according to Emojipedia [3]. And they are all treated as text characters.
So, when you deal with social media text, you have to decide whether or not to include emojis in your analysis. Depending on this decision, you might want to reconsider how you remove punctuation or special characters from text, because some of these methods could lead to emoji destruction.
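As an illustration of how emojis get destroyed, the sketch below compares two hypothetical cleaning functions: one keeps only ASCII letters, digits, and whitespace, while the other removes only the characters that Unicode classifies as punctuation. Both function names and the sample tweet are assumptions made for this example.

```python
import re
import unicodedata


def strip_non_ascii_alnum(text: str) -> str:
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


def strip_punctuation(text: str) -> str:
    """Drop only characters whose Unicode category is punctuation (P*)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


tweet = "Très bien!!! 🙏"
print(repr(strip_non_ascii_alnum(tweet)))  # the accent and the emoji are lost
print(repr(strip_punctuation(tweet)))      # only the "!!!" is removed
```

The second function leaves symbols such as "$" or "+" untouched, since they are not in a punctuation category. Whether that is acceptable depends on your analysis.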
While the ASCII table contains only 128 characters, the Unicode standard has room for more than a million. More precisely, it defines 1,114,112 possible code points, which can represent visible and invisible characters.
Since people use more and more characters such as emojis, encoding plays an important role in NLP. While textual characters have to be glued together to form words, an emoji itself can replace a whole sentence. Think about “thank you” and “🙏”‡ and how they are more or less the same thing.
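The numbers above can be checked directly in Python. This is a small sketch with a hypothetical helper, `describe`, that returns the code point of a character and the number of bytes it takes in UTF-8.

```python
import sys


def describe(ch: str) -> tuple:
    """Return the code point (as U+XXXX) and the UTF-8 byte length of one character."""
    return (f"U+{ord(ch):04X}", len(ch.encode("utf-8")))


# Total number of code points available in Unicode.
print(sys.maxunicode + 1)

for ch in ["a", "é", "🙏"]:
    print(ch, describe(ch))

# An emoji with a skin tone modifier looks like one symbol,
# but it is made of two code points.
thumbs_up = "\U0001F44D\U0001F3FD"
print(len(thumbs_up))
```

This is one reason why counting characters or slicing strings can give surprising results on text that is full of emojis.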
Gifs and stickers. What to do with images?
Now that you have decided what to do with emojis, it’s time to figure out what can be done about images.
Think about this tweet:
It consists of an emoji and a gif image. Context matters here more than ever: what was the original message that started this thread? Is there a way to understand why someone sent this image instead of writing?
Sometimes, dealing with images is so hard that the only thing you can do is discard the whole tweet. Some social media platforms now allow users to write alt text describing the image. If it makes sense for your analysis, you could benefit from this.
I see misspellings everywhere
Typographical errors are something you will find a lot. They may come from tiny keyboards, autocorrect, or just from people trying to do many things at the same time.
Misspellings create many versions of the same word, which can be problematic if your work consists of getting frequencies for a given word, for instance.
The text below comes from a real tweet, in French, with all its original misspellings:
Mdr imagine tu le voit dehors et tout et il te bouscule je lui met une droite pas deux sa va être régler
While this would be the expected version:
(Je suis) Mort de rire. Imagine: tu le vois dehors et tout, et il te bouscule. Je lui mets une droite - pas deux - et ça va être réglé.
For starters, “Mdr” is not actually one word, but three (“mort de rire”, roughly “dying of laughter”). Then, you have some problematic conjugations that could be addressed with some stemming.
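One possible way to handle abbreviations like this one is a lookup table applied after tokenization. The sketch below assumes a small, hand-made dictionary (`ABBREVIATIONS`) and a hypothetical function name; a real project would need a much longer list, built for the language and the community being studied.

```python
# Hypothetical, hand-made table: abbreviation -> list of full words.
ABBREVIATIONS = {
    "mdr": ["mort", "de", "rire"],
}


def expand_abbreviations(tokens):
    """Replace each known abbreviation with the words it stands for."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token.lower(), [token]))
    return expanded


print(expand_abbreviations(["Mdr", "imagine", "tu", "le", "voit"]))
```

Expanding changes the number of tokens, so it is worth deciding whether you want “mdr” counted as one unit or as three words before applying something like this.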
But the word “sa”, in “sa va”, is particularly problematic. “Sa”, in French, means “his”, “her”, or “its” when the possessed noun is a feminine word§. Here, the right word is “ça”, forming the expression “ça va”. If for some reason you wanted to check the frequency of “ça” and “sa”, you’d get wrong results.
What could you do? You could try to use regex to catch very obvious mistakes. Some NLP articles, like the one written by Ameisen [4], propose merging all possible typos into one single representation, so that “cool”, “kewl”, and “cooool” are treated as the same word.
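Both ideas can be sketched in a few lines of Python. Everything named here (`SA_VA`, `ELONGATION`, `VARIANTS`, `normalize`) is an assumption made for illustration, and the rules are deliberately naive: one regex for a single obvious mistake, one regex that shortens runs of three or more identical letters to two, and a tiny hand-made table of known variants.

```python
import re
from collections import Counter

# "sa va" is almost certainly a misspelling of "ça va".
SA_VA = re.compile(r"\bsa va\b")
# Three or more repetitions of the same letter (digits are left alone).
ELONGATION = re.compile(r"([^\W\d_])\1{2,}")
# Hypothetical table of known variants -> canonical form.
VARIANTS = {"kewl": "cool"}


def normalize(text):
    """Lowercase, fix one obvious mistake, and merge spelling variants."""
    text = SA_VA.sub("ça va", text.lower())
    tokens = []
    for token in text.split():
        token = ELONGATION.sub(r"\1\1", token)
        tokens.append(VARIANTS.get(token, token))
    return tokens


print(Counter("cool kewl cooool Cool".split()))  # four different "words"
print(Counter(normalize("cool kewl cooool Cool")))  # a single one
print(normalize("pas deux sa va être régler"))
```

Shortening repeated letters to two works for “cooool”, but it turns “looooong” into “loong”, so a dictionary check afterwards is probably still needed.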
What the hell is this tweet about?
Social media threads allow users to engage in conversations about specific themes.
However, sometimes part of the posted text consists of ideas being shouted out, without any meaningful connection between them. Think of someone who writes about how hot his coffee is, followed by some random tweet about a new movie he wants to see.
And sometimes you get this:
The first tweet is from a user listing the fashion items she wanted to have: a Dior bag, another bag from the Jacquemus brand, and white crocs. We have a mix of references that makes the process of understanding this tweet very hard for someone who doesn’t know anything about purses and fashion brands.
The second tweet is about an episode of a TV show. The user says “tonight, we eat Gucci”. I still don’t quite understand what this means (if you know, please tell me).
Analyzing social media is, sometimes, having to deal with this. How do you do sentiment analysis here? Are these tweets positive or negative? What the hell are people talking about now?
This whole new way of writing, the missing punctuation, the misspellings… It all makes it very hard to understand what’s happening.
Just accept the irony…
I think that the biggest irony of social media is having so much data concentrated in one place, but only a little of it at your disposal.
When you decide to work with social media, you have to accept that a lot of preprocessing work is necessary before you can extract something meaningful.
Now it’s your turn: what are your strategies when preprocessing social media comments?
- *I am not referring to social networks like LinkedIn, where formal writing is mostly expected.
- †“À”, in French, sometimes means “to” in English, while “a” is the third person singular of the verb “to have” (Il/elle/on a). In Portuguese, “é” is the third person singular of the verb “to be” (Ele/ela/a gente é), while “e” means “and”.
- ‡Another problem is that, just like words with several meanings, this same symbol (🙏) can be used to convey gratitude, but it can also have a religious meaning.
- §In some languages, such as French, Portuguese, and German, words have a grammatical gender. This gender may vary between languages. For instance, “girl” is a feminine word in Portuguese and in French, but a neuter word in German. “Tree” is a feminine word in Portuguese, but a masculine word in French.
- 1. 2016 Data Science Report. CrowdFlower; 2016:6-7. Accessed January 14, 2021. https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
- 2. Dodds L. Do data scientists spend 80% of their time cleaning data? Turns out, no? Lost Boy. Published January 21, 2020. Accessed January 14, 2021. https://blog.ldodds.com/2020/01/31/do-data-scientists-spend-80-of-their-time-cleaning-data-turns-out-no/
- 3. FAQ. Emojipedia. Accessed January 14, 2021. https://emojipedia.org/faq/
- 4. Ameisen E. How to solve 90% of NLP problems: a step-by-step guide. KDnuggets. Published January 2019. Accessed January 14, 2021. https://www.kdnuggets.com/2019/01/solve-90-nlp-problems-step-by-step-guide.html