Today’s problem is a continuation of yesterday’s. Yesterday, I used Gensim to create a summary of some internet articles in order to create a business plan.
However, sometimes you end up with a lot of text. Some of it will make sense and some will be garbage.
I thought I could use clustering techniques to group the texts; the algorithm should be able to categorize them by theme.
That being said… Over the next days, I will explore data for at least 1 hour per day and post the notebooks, data and models, when available, to this repository.
There’s no notebook today :(.
K-means is one of the simplest – yet most widely used – clustering algorithms. Clustering means taking data you don’t actually know how to separate and letting the computer figure out ways to do it.
Sklearn has a great page showing results achieved by different algorithms on the same groups of data: https://scikit-learn.org/stable/modules/clustering.html.
However, k-means will only work if it’s able to find some “uniformity” in your dataset. For instance, you should have reasonably balanced data, with each theme represented in roughly equal numbers.
If you have 1,000 articles of theme A and 3 of theme B, it will just not work. Also, k-means works better with a low number of clusters. Just don’t try creating 1,000 clusters from your data with k-means.
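To make the idea concrete, here is a minimal sketch of k-means on a synthetic, balanced dataset (the numbers are made up for illustration; this is not my real data):

```python
# Minimal k-means sketch on two balanced, well-separated synthetic groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Two "themes" of equal size: points around (0, 0) and around (10, 10)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
X = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
# With balanced, separable data, the centers land near the true group means
print(kmeans.cluster_centers_)
```

With balanced groups like these, k-means recovers the structure easily; replace one group with only 3 points and the centers stop meaning anything.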
Today, I will not post any notebook, since the material is more or less sensitive. But I can show my findings from exploring my text data. There are great tutorials out there on how to use k-means with text. This one, for instance, is very simple to understand.
That being said…
K-means just didn’t work for me. Actually, no algorithm worked, because I was trying to cluster into different themes sentences that had already been summarized yesterday.
Therefore, the data is somehow already clustered! Why? Because when you summarize text, you expect the result to already be a homogeneous group. You will see it later.
After vectorizing my texts using sklearn.feature_extraction.text.TfidfVectorizer, you can see that the sentences form more or less a homogeneous cluster:
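The vectorization step looks roughly like this (the sentences below are placeholders for my summarized text, and the 2-D projection via TruncatedSVD is just one way to get plottable coordinates):

```python
# Sketch: turn sentences into TF-IDF vectors, then project to 2-D for plotting.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the market for this product is growing",
    "the product market keeps growing fast",
    "growth of the market drives the product",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)  # sparse matrix: sentences x terms

# TF-IDF space has too many dimensions to plot, so reduce to 2 components
coords = TruncatedSVD(n_components=2).fit_transform(X)
print(coords.shape)  # one (x, y) point per sentence
```

Plotting `coords` for my real sentences is what showed me a single blob instead of separate groups.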
In fact, after training a model using k-means, you will see that it places the center of each cluster very close and it finds very difficult to create new clusters (each center is a red cross):
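The clustering step itself is short. A sketch, again with placeholder sentences and an arbitrary number of clusters:

```python
# Sketch: fit k-means on the TF-IDF matrix and inspect the cluster centers.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "the market for this product is growing",
    "the product market keeps growing fast",
    "growth of the market drives the product",
    "our product fits a growing market",
]
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Each row of cluster_centers_ is a centroid in TF-IDF space; when the
# centroids sit very close together, the texts are likely one homogeneous group
print(km.cluster_centers_.shape)
```

Projecting those centroids onto the same 2-D plot is how I drew the red crosses.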
Just to be sure, I used the Elbow method to determine the optimal number of clusters. It’s called the elbow method because, when you plot inertia against the number of clusters, the optimal value should sit at the bend of the curve – the elbow of the arm. Notice that we have no arm in our data:
A “good” elbow graph would be this:
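The elbow computation is just k-means fitted repeatedly for increasing k. A sketch on synthetic data (three well-separated blobs, so the elbow should appear at k = 3):

```python
# Sketch of the elbow method: fit k-means for a range of k and record the
# inertia (within-cluster sum of squared distances).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three well-separated blobs around (0, 0), (5, 5) and (10, 10)
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in (0, 5, 10)])

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the drop flattens after the elbow
print(inertias)
```

Plotting `inertias` against k gives the elbow graph; on my summarized sentences the curve had no bend at all.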
I have also tried to use LDA to create clusters but the result was equally poor.
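For reference, the LDA attempt can be sketched like this – here with scikit-learn’s implementation, and with placeholder sentences standing in for my real data:

```python
# Sketch: fit LDA on word counts and inspect the topic-word matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the market for this product is growing",
    "the product market keeps growing fast",
    "growth of the market drives the product",
]
counts = CountVectorizer(stop_words="english").fit_transform(sentences)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
# Rows of components_ are topic-word weights; near-identical topics
# suggest the corpus really contains a single theme
print(lda.components_.shape)
```

On my data the topics came out nearly interchangeable, which matches the k-means result.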
In a certain way, this comforts me, since it shows that there’s a certain consistency in the data I am analysing.