#100daysofdata


Clustering text to find different themes

Today’s problem is a continuation of yesterday’s. Yesterday, I used Gensim to create a summary of some internet articles in order to create a business plan.

However, sometimes you end up with a lot of text. Some of it will make sense and some of it will be garbage.

I thought I could use clustering techniques to group the texts. The algorithm should be able to categorize them by theme.

That being said… Over the coming days, I will explore data for at least 1 hour per day and post the notebooks, data and models, when they are available, to this repository.

There’s no notebook today :(.

Meet k-means

K-means is one of the simplest, yet most widely used, clustering algorithms. Clustering means taking data you don’t know how to separate and letting the computer figure out ways to do it.
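To make the idea concrete, here is a minimal sketch of the classic k-means loop (often called Lloyd's algorithm) in NumPy. The `kmeans` helper and the toy data are made up for illustration; in practice you would normally reach for a library implementation such as scikit-learn's.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Hypothetical helper: assign points to the nearest center, then move the centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points picked at random from the data.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of the points assigned to it (keep the old one if empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Toy data (made up): two well-separated blobs of 50 points each.
data_rng = np.random.default_rng(42)
blob_a = data_rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = data_rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centers = kmeans(points, k=2)
print(centers)
```

On data this cleanly separated, each blob should end up with its own label. Real text data is rarely this friendly.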

Sklearn has a great page showing results achieved by different algorithms on the same groups of data: https://scikit-learn.org/stable/modules/clustering.html.

However, k-means only works well if it can find some “uniformity” in your dataset. For instance, your data should be reasonably balanced, with a similar number of examples for each theme.

If you have 1,000 articles on theme A and 3 on theme B, it will most likely not work. Also, k-means works better with a low number of clusters. Just don’t try creating 1,000 clusters from your data with k-means.

Today, I will not post any notebook, since the material is somewhat sensitive. But I can show what I found while exploring my text data. There are great tutorials out there on how to use k-means with text. This one, for instance, is very simple to understand.

That being said…

K-means just didn’t work for me. Actually, no algorithm worked, because I was trying to cluster into themes sentences that had already been summarized yesterday.

Therefore, the data is in a sense already clustered! Why? Because when you summarize text, you expect the result to be a homogeneous group already. You will see it below.

After vectorizing my texts with sklearn.feature_extraction.text.TfidfVectorizer, you can see that the sentences form a more or less homogeneous cluster:

[Image: plot of the vectorized sentences]
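A minimal sketch of the vectorizing step, with made-up sentences standing in for the real ones (which are not published). The projection to 2D with TruncatedSVD is an assumption, used here only so the vectors could be drawn on a scatter plot. The original plot may have been produced differently.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder sentences (made up for illustration).
sentences = [
    "Customers want faster delivery of online orders.",
    "Online orders should be delivered faster to customers.",
    "Delivery speed is the main complaint from customers.",
    "Faster delivery increases repeat online orders.",
    "Customers compare delivery times before placing orders.",
]

# One row per sentence, one column per term, weighted by TF-IDF.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)

# Assumption: reduce to 2 dimensions so each sentence becomes an (x, y) point.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(tfidf)

print(tfidf.shape)
print(coords.shape)
```

Each row of `coords` can then be passed to any plotting library as one point.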

In fact, after training a k-means model, you will see that it places the cluster centers very close to each other and finds it very difficult to create new clusters (each center is a red cross):

[Image: the vectorized sentences with each k-means cluster center marked as a red cross]
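One way to check this without a plot is to fit k-means and measure how far apart the centers are. This is only a sketch: the sentences are made up, and the choice of 2 clusters is an assumption, not the setting used for the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder sentences (made up for illustration).
sentences = [
    "Customers want faster delivery of online orders.",
    "Online orders should be delivered faster to customers.",
    "Delivery speed is the main complaint from customers.",
    "Faster delivery increases repeat online orders.",
    "Customers compare delivery times before placing orders.",
    "Slow delivery makes customers cancel online orders.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)

# Assumption: 2 clusters, just to have two centers to compare.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf)

# Euclidean distance between the two cluster centers.
centers = kmeans.cluster_centers_
center_distance = np.linalg.norm(centers[0] - centers[1])

print(labels)
print(center_distance)
```

TfidfVectorizer L2-normalizes each row by default, so two sentences can be at most sqrt(2) apart. That gives a rough scale for judging whether the centers are "close".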

Just to be sure, I used the elbow method to determine the optimal number of clusters. It’s called elbow because, when you plot the clustering error against the number of clusters, the curve looks like an arm and the optimal number sits at its elbow. Notice that there is no elbow in our data:

[Image: elbow plot for my data, with no visible elbow]

A “good” elbow graph would be this:

[Image: elbow plot with a clear elbow]
Source: https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f
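To see how such a curve is produced, here is a small sketch on synthetic data. The three blobs and their centers are made up so that the elbow is easy to spot. It uses `inertia_`, the within-cluster sum of squared distances that scikit-learn's KMeans reports after fitting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (made up): 3 well-separated groups of 100 points each.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, 0), (0, 5), (5, 0)],
    cluster_std=0.6,
    random_state=0,
)

# Fit k-means for k = 1..8 and record the inertia for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The inertia should fall sharply up to k = 3 and then flatten out. That bend is the elbow. On data like mine, the curve just slides down smoothly and there is nothing to pick.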

Insights

I also tried using LDA to create clusters, but the result was equally poor.
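For reference, a minimal sketch of that kind of experiment, assuming LDA here means Latent Dirichlet Allocation and using scikit-learn's implementation (the original attempt may have used another library, such as Gensim). The sentences and the number of topics are made up.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder sentences (made up for illustration).
sentences = [
    "Customers want faster delivery of online orders.",
    "Online orders should be delivered faster to customers.",
    "Delivery speed is the main complaint from customers.",
    "Faster delivery increases repeat online orders.",
    "Customers compare delivery times before placing orders.",
    "Slow delivery makes customers cancel online orders.",
]

# LDA is typically fit on raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(sentences)

# Assumption: 2 topics, chosen only for the example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per sentence, one column per topic

print(doc_topics.round(2))
```

Each row is a topic distribution for one sentence. When the texts are as homogeneous as mine, these rows tend not to split into clearly distinct themes.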

In a way, this comforts me, since it shows that there’s a certain consistency in the data I am analysing.


Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
