Latent Dirichlet Allocation (LDA) is a natural language processing technique used to discover the topics present in a collection of documents.
This process is called topic modeling. It consists of going through a text or a set of documents and sorting sentences (or the documents themselves) into groups, where each group is a topic.
In a way, topic modeling works like clustering: LDA is an unsupervised method, so it groups texts without being given any labeled examples.
Latent Dirichlet Allocation works by analyzing the words of each document and trying to identify which words appear more or less frequently in each one. It is, to a certain degree, a lexical approach.
In theory, each topic uses certain words more often than others. If a set of words appears together more frequently than another, we might say that these words are representative of a topic.
The problem with this method is that the order of the words and the context are discarded. Each word is considered as being a pure and disconnected entity in a bag of words.
So, words with the same spelling but different meanings are treated as the same thing, while synonyms are treated as unrelated words.
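To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. The helper name `bag_of_words` and the example sentences are made up for illustration, and real pipelines usually tokenize more carefully than a simple whitespace split.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


# Same words in a different order produce the same bag.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True

# "bank" is a single entry in both bags, whatever it means in context.
river = bag_of_words("we sat on the bank of the river")
money = bag_of_words("the bank approved the loan")
print(river["bank"], money["bank"])  # 1 1

# Synonyms end up as unrelated entries.
print(bag_of_words("car").keys() & bag_of_words("automobile").keys())  # set()
```

Everything LDA sees is a table of counts like these, which is why word order and word sense never reach the model.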
Another limitation is the need to know beforehand how many different topics are present in the text. It’s your task to define the number of topics, N.
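To show where the number of topics enters the picture, here is a toy sketch of LDA fitted with collapsed Gibbs sampling, one common way to estimate the model. It is not taken from any particular library: the function name `lda_gibbs`, the default values for `alpha`, `beta`, and `iterations`, and the four tiny documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # all tokens in topic k
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts before resampling it.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Score each topic by how much the document and the word favor it.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


docs = [
    "cat purr whiskers cat kitten".split(),
    "dog bark leash dog puppy".split(),
    "kitten purr cat whiskers".split(),
    "puppy bark dog leash".split(),
]
topics = lda_gibbs(docs, num_topics=2)  # N is chosen by you, not by the model
for k, counts in enumerate(topics):
    print(k, [w for w, _ in counts.most_common(3)])
```

Note that `num_topics` is a required argument: the sampler has no way of discovering it. Changing `seed` changes the random starting point, so the topics you get back can differ from one run to the next.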
For Latent Dirichlet Allocation to work well, you will need to remove the stopwords, that is, common words that don’t carry any meaning of their own. To read more about it, you may want to take a look at the article “A Beginner’s Guide to Latent Dirichlet Allocation(LDA)” [1].
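As a rough illustration of stopword removal, the sketch below filters tokens against a small hand-written set. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical; in practice you would likely use a fuller list from an NLP library.

```python
# A tiny, hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "that", "this"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = "the cat is on the mat and it is happy".lower().split()
print(remove_stopwords(tokens))  # ['cat', 'mat', 'happy']
```

Ten tokens shrink to three here, and those three are the ones that say something about the subject of the sentence.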
Does Latent Dirichlet Allocation work?
It depends on what you are trying to achieve. One limitation of the method is that you need to know how many topics to expect in a set of documents. Another is that context is lost and not taken into account.
Since this is an unsupervised method, it’s impossible to offer the computer some help and tips on what kind of topics you want to find. Supervised classification methods, by contrast, start with some clues about which documents are or are not part of a given topic.
Another disadvantage of LDA is that it’s hard to know when things are working, and you may find that the topics selected by the model are not very informative. Also, you may get different results for repeated runs.
But truth be told, LDA is cheaper to implement than other machine learning models. It may be a good way to start exploring documents before investing in something more powerful to do topic modeling.
And despite its limitations, LDA is widely used for topic modeling. That’s why Rieger, Rahnenführer, and Jentsch, who consider the method unstable, propose a way to improve its reliability and reproducibility [2] by using a modified version of Jaccard’s coefficient.
In a nutshell, George Ho argues that “LDA and topic modeling doesn’t work well with a) short documents, in which there isn’t much text to model, or b) documents that don’t coherently discuss a single topic.” [3]
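For intuition, the sketch below computes the plain Jaccard coefficient between the top words of one topic from two runs. This is the standard coefficient, not the modified version from the paper, and the word lists are invented for the example.

```python
def jaccard(a, b):
    """Plain Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical top words for the "same" topic in two separate LDA runs.
run_1_topic = ["cat", "kitten", "purr", "whiskers", "pet"]
run_2_topic = ["cat", "kitten", "purr", "dog", "pet"]

score = jaccard(run_1_topic, run_2_topic)
print(round(score, 2))  # 0.67
```

A score close to 1 means the two runs agree on what the topic is about, while a score close to 0 means they found something quite different.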
So, when should I use LDA?
Whether LDA works well depends on how many observations you have and on how coherently they fall into well-defined topics. LDA won’t work well with short texts!
You should also have a grasp of the topics present in your observations. For instance, if you want to apply LDA to documents talking about cats and dogs, it’s more or less obvious that you should have 2 clusters of information.
You shouldn’t try to use LDA on texts that are too short or that are just digressions or random musings about things (yes, I am talking about tweets, Instagram comments, or Reddit comments). LDA works better for long documents, such as newspaper articles or scientific articles.
Why is LDA not good with social media comments?
Short comments are rich in stopwords, and once those are removed there is little text left to model. Social media comments are also the result of passing thoughts and are sometimes confused and unrelated to a specific topic. Since they may also be a response to a previous comment, it’s hard to capture the context needed to make sense of them.
Don’t forget also that social media is very image-driven nowadays and that people may reply to a previous comment with a gif, a sticker, an emoji, or even a link to something.
- 1. Kulshrestha R. A Beginner’s Guide to Latent Dirichlet Allocation(LDA). Towards Data Science. Published July 19, 2019. Accessed January 13, 2021. https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2
- 2. Rieger J, Rahnenführer J, Jentsch C. Improving Latent Dirichlet Allocation: On Reliability of the Novel Method LDAPrototype. Métais E, Meziane F, Horacek H, Cimiano P, eds. Natural Language Processing and Information Systems. 2020;12089:118-125. doi:10.1007/978-3-030-51310-8_11
- 3. Ho G. Why Latent Dirichlet Allocation Sucks. Eigenfoo. Published March 6, 2018. Accessed January 13, 2021. https://eigenfoo.xyz/lda-sucks/