
in #100DaysOfCode, #100DaysOfData, #100DaysOfTensorflow

Create datasets using semantic web

Now that you know what the semantic web is, you may be asking yourself why it matters to a machine learning practitioner.

Semantically annotated pages can give you quick, normalized information without the need to build complex scraping bots and even more complex processing code.

Imagine that you have to scrape an e-commerce website. Some of these sites implement protection against automated scraping, and others generate a complex, constantly changing DOM that makes parsing them almost impossible.

Annotated pages give you the opportunity to get the information that the site’s owner wants to make available to you. Let’s take this Alibaba.com product as an example.

Go ahead and visit the page. Don’t worry, I don’t earn anything when you click on the link.

The page looks like any other page.


Now, install the OpenLink Structured Data Sniffer extension for your browser (I am using Chrome). Reload the page and click on the extension icon to see the structured data it found on the page.

Do you see what I mean? This particular site uses JSON-LD and RDFa for annotation and gives you the product description, the product URL, the type of object, the price, and so on.
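To make this concrete, here is a minimal sketch of how JSON-LD annotations could be turned into a dataset row using only the Python standard library. The sample HTML, the field names, and the `extract_jsonld` helper are made up for illustration; they are not taken from the Alibaba.com page, and a real page may nest its data differently.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page


def extract_jsonld(html):
    """Hypothetical helper: return every JSON-LD object found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up sample page (assumption); a real page would be downloaded first.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Example widget", "url": "https://example.com/widget",
 "offers": {"@type": "Offer", "price": "9.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Example widget</h1></body></html>
"""

products = extract_jsonld(SAMPLE_HTML)
row = {
    "type": products[0]["@type"],
    "name": products[0]["name"],
    "url": products[0]["url"],
    "price": float(products[0]["offers"]["price"]),
}
print(row)
```

Each row built this way can be appended to a list and loaded into whatever tabular tool you prefer.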

This is very handy! Inspect the code to see how the data was inserted on the page.


Everything is there, in the meta tags. No need to build anything too complex.
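As a rough sketch of how little code this takes, the snippet below collects `<meta>` tags into a dictionary with the Python standard library. The sample HTML and the property names in it are assumptions chosen for illustration (Open Graph style); the actual names on a given site may differ, and `extract_meta` is a hypothetical helper.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta> tags that carry a key (property/itemprop/name) and a content value."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        key = attr_map.get("property") or attr_map.get("itemprop") or attr_map.get("name")
        content = attr_map.get("content")
        if key and content is not None:
            # Keep the first value seen for each key.
            self.fields.setdefault(key, content)


def extract_meta(html):
    """Hypothetical helper: return a {key: content} dict for the page's meta tags."""
    collector = MetaTagCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


# Made-up sample markup (assumption), not copied from any real site.
SAMPLE_HTML = """
<html><head>
<meta charset="utf-8">
<meta property="og:type" content="product">
<meta property="og:title" content="Example widget">
<meta property="og:url" content="https://example.com/widget">
<meta name="description" content="A small example widget.">
</head><body></body></html>
"""

fields = extract_meta(SAMPLE_HTML)
for key, value in fields.items():
    print(f"{key}: {value}")
```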

Take the time to explore other products or other pages. Use the sniffer to discover whether these pages carry other kinds of annotation, and whether they use a single format or several.
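If you would rather check many pages from a notebook, the same question can be approximated in code. The sketch below counts elements that look like JSON-LD, RDFa, or Microdata annotations by their characteristic attributes. It is a heuristic under stated assumptions, not a full parser for any of these formats, and the sample HTML and the `detect_formats` helper are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Attributes that typically signal each annotation format (heuristic only).
RDFA_ATTRS = {"vocab", "typeof", "property", "resource"}
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}


class AnnotationDetector(HTMLParser):
    """Counts elements that appear to carry JSON-LD, RDFa, or Microdata annotations."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.counts["json-ld"] += 1
        if names & RDFA_ATTRS:
            self.counts["rdfa"] += 1
        if names & MICRODATA_ATTRS:
            self.counts["microdata"] += 1


def detect_formats(html):
    """Hypothetical helper: return {format_name: number_of_annotated_elements}."""
    detector = AnnotationDetector()
    detector.feed(html)
    detector.close()
    return dict(detector.counts)


# Made-up sample page (assumption) that mixes three annotation styles.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product"}</script>
<meta property="og:title" content="Example widget">
</head><body>
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Example widget</span>
</div>
</body></html>
"""

print(detect_formats(SAMPLE_HTML))
```

A page that returns an empty dictionary probably has no annotations of these three kinds, so it would need conventional scraping instead.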

Over the next days, I will explore data for at least one hour per day and post the notebooks, data, and models to this repository as they become available.


Do you want to connect? It will be a pleasure to discuss Machine Learning with you. Drop me a message on LinkedIn.
