Lima Vallantin
Wilame
Marketing data scientist and Master's student interested in everything concerning data, text mining, and natural language processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? You can send me a message. For more information about me, you can visit this page.

Have you ever considered building graphs from a dataset that you own? Graphs are a great way to discover relationships, but not all data can be represented this way.

Imagine that you have an e-commerce site with thousands of products. A nice way to discover how similar a product is to another is using graphs.

Let’s take the fashion industry as an example. I have a dataset with information about fabrics: who designed them, the company that manufactured them, the collection, the composition, whether they are certified, the type of clothing you can make with them, the colors, the patterns, and so on.

I have transformed all this into a graph and this is the result (the whole network is huge, so some nodes and edges are hidden for better visualization).

Here, you can see that the fabric “Turo Tweed” is made of tweed and wool blends and can be used to make coats and trousers. It was designed by Turo Fabrics UK, which also produces the fabric “Superior Lining”, itself a great fit for coats and jackets.

This second cluster has four fabrics with a solid pattern, but with different applications. The “Solid Laguna Jersey” and the “Solid Laguna Jersey (Chocolate)” are great for designing leggings and are part of the same collection. The “Westerly Natural White” and the “Sweater Weight Wool Jersey” are manufactured by the same company but have different patterns.

So far, we have analysed only a fraction of the network. This is the full picture:

Full fabric network

So, a few tips while you analyse your data:

  • Ask yourself which relationships exist in your data.
  • Pre-process your data so these relationships are expressed in a “Subject -> verb -> object” format.
  • Transform these triples into a graph using a package like NetworkX.
  • Use visualization software (I am using Cytoscape) to explore your network.
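
To make the tips above more concrete, here is a minimal sketch of the “Subject -> verb -> object” step with NetworkX. The triples, the relation names (`made_of`, `used_for`, `designed_by`), and the helper names are my own assumptions for illustration, loosely based on the fabric example; they are not the real dataset or the code used to build the network shown above.

```python
import networkx as nx

# Hypothetical triples, loosely based on the fabric example.
# The relation names are illustrative assumptions.
triples = [
    ("Turo Tweed", "made_of", "Wool blend"),
    ("Turo Tweed", "used_for", "Coats"),
    ("Turo Tweed", "used_for", "Trousers"),
    ("Turo Tweed", "designed_by", "Turo Fabrics UK"),
    ("Superior Lining", "designed_by", "Turo Fabrics UK"),
    ("Superior Lining", "used_for", "Coats"),
    ("Superior Lining", "used_for", "Jackets"),
]


def build_graph(triples):
    """Turn (subject, verb, object) triples into a directed graph."""
    graph = nx.DiGraph()
    for subject, verb, obj in triples:
        # The verb is stored as an edge attribute named "relation".
        graph.add_edge(subject, obj, relation=verb)
    return graph


def shared_neighbours(graph, first, second):
    """Nodes that both items point to: a rough hint of similarity."""
    return sorted(set(graph.successors(first)) & set(graph.successors(second)))


def export_for_cytoscape(graph, path):
    """Write a GraphML file, a format Cytoscape can import."""
    nx.write_graphml(graph, path)


graph = build_graph(triples)
shared = shared_neighbours(graph, "Turo Tweed", "Superior Lining")
print(shared)
```

Calling `export_for_cytoscape(graph, "fabrics.graphml")` would write a file you can open in Cytoscape. The file name is just a placeholder.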

Build graphs using text

Sentences are great for expressing relationships. A standard sentence is composed of a subject, a verb and an object.

If you are able to transform data such as the fabric dataset into graphs, then texts should be an easy task for you.

However, when dealing with text, it’s always a good idea to preprocess it.

A good starting point is the article “Python NLP Tutorial: Building A Knowledge Graph using Python and SpaCy” by Marius Borcan.

Follow the steps proposed in the article, but change a few things:

  • Look for an article in your language.
  • Describe what’s different in comparison to the English language.
  • If you have not used the same tools, what have you used instead?
  • Think about the entities you are finding and start to create their ontologies in your head.

In my case, instead of using spaCy, I have created my own tagger and tokenizer. This was an “old” project: I did this when I started learning NLP to “live” the whole process.

Other pages you can use to accomplish this task are:

You should ask yourself a few questions:

  • How do I identify a sentence?
  • Where does a sentence end?
  • How do I deal with contractions?
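
The questions above can be explored with a very small experiment. The sketch below is only an illustration, not the tagger and tokenizer I built: it assumes English text, and the `ABBREVIATIONS` and `CONTRACTIONS` tables are tiny hypothetical lists that a real project would have to replace with language-specific ones.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for this sketch).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}
CONTRACTIONS = {"it's": "it is", "don't": "do not", "i'm": "i am"}

CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: a token ending in . ! or ? closes the sentence,
    unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


def expand_contractions(sentence):
    """Replace known contractions with their long form.
    The replacement is always lowercase, so capitalisation is lost."""
    sentence = sentence.replace("\u2019", "'")  # normalise curly apostrophes
    return CONTRACTION_PATTERN.sub(
        lambda match: CONTRACTIONS[match.group(0).lower()], sentence
    )


text = "Dr. Smith designs fabrics. It's a wool blend! Don't wash it hot."
sentences = split_sentences(text)
expanded = [expand_contractions(sentence) for sentence in sentences]
print(expanded)
```

Even this toy version shows why the questions matter: without the abbreviation list, “Dr.” would end a sentence, and every new language brings its own abbreviations and contractions.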

Text is complicated because there’s a lot to think about when you want to extract information from it using machine learning. But using graphs is a good way to convert semantic relationships into analytical data.

There’s a reason why controlled natural languages are based on active voice and short sentences: you can easily convert these structures to graph-like data.
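
As a rough illustration of that conversion, the sketch below turns short active-voice sentences into (subject, verb, object) triples with a single regular expression. The verb list and the example sentences are hypothetical; a real controlled language would define its own vocabulary, and free text would need a proper parser instead.

```python
import re

# Hypothetical verbs allowed in a tiny controlled language (an assumption).
VERBS = ["designs", "produces", "contains"]

# subject + one allowed verb + object, with an optional final period.
SENTENCE_PATTERN = re.compile(
    r"^(?P<subject>.+?)\s+(?P<verb>" + "|".join(VERBS) + r")\s+(?P<object>.+?)\.?$"
)


def sentence_to_triple(sentence):
    """Return (subject, verb, object), or None if the sentence
    does not follow the controlled pattern."""
    match = SENTENCE_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (match["subject"], match["verb"], match["object"])


examples = [
    "Turo Fabrics UK designs Turo Tweed.",
    "Turo Tweed contains wool.",
    "The weather was nice.",
]
parsed = [sentence_to_triple(sentence) for sentence in examples]
print(parsed)
```

Each triple maps directly to an edge: the subject and object become nodes, and the verb becomes the edge label. The third sentence returns `None` because it uses a verb outside the controlled vocabulary.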

You can read more about controlled languages in the article “Comment la génération automatique de contenu peut renforcer votre stratégie de contenu”.
