Lima Vallantin
Wilame
Marketing data scientist and Master's student interested in everything concerning data, text mining, and natural language processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German. Want to connect? Send a message. Want to know more about me? Visit the "About" page.

The Semantic Web is something wonderful. This “extension” of the Web uses standards to encourage data sharing and the spread of knowledge. You may be asking yourself why this matters to a machine learning practitioner and how to unlock its power.

The Semantic Web relies a lot on data annotation and on the creation of a strict vocabulary that helps machines to grasp the connections between ideas, concepts, and the pages themselves.

Technologies such as Resource Description Framework (RDF) and Web Ontology Language (OWL) allow information to be represented via metadata. This is done through the development of concepts represented by ontologies.

But first, a word about the Semantic Web

The Semantic Web is one of the most ambitious internet projects, and it is definitely a game-changer.

When we talk about the Semantic Web, we are actually referring to concepts like linked data, knowledge graphs, RDF, and abstract representations of concepts that may help computers understand human knowledge (although it is not limited to these).

Linked Data as it was originally imagined may never happen, but its current implementations can be very useful for machine learning. Discussions on this theme have been running since 1997.

With all the excitement around Deep Learning, it’s easy to forget about projects like this. In 2018 and 2019, Gartner’s Hype Cycle for Emerging Technologies listed Knowledge Graphs as something to keep an eye on.

What’s an ontology?

Ontology is a philosophical concept that deals with the nature of “being something”. In a purely Semantic Web context, it is the act of conceptualizing things and defining how they relate to each other.

Is this too metaphysical for you?

Don’t worry, here’s an example. If you come from the SEO field, you have surely heard about structured data. When search engines crawl your page, they also look for pieces of code that “explain” to them what your page is about.

These pieces of code are usually represented by a JSON-LD snippet embedded in the page, like this one:

<html>
<head>
<title>Magnolia (movie)</title>
<script type="application/ld+json">
{
    "@context": "https://schema.org/",
    "@type": "Movie",
    "name": "Magnolia",
    "director": {
        "@type": "Person",
        "name": "Paul Thomas Anderson"
    },
    "productionCompany": {
        "@type": "Organization",
        "name": "New Line Cinema"
    }
}
</script>
</head>
<body>
  <h2>Magnolia (1999)</h2>
  <p>
    Magnolia is a film written and directed by Paul Thomas Anderson, released in 1999.
  </p>
</body>
</html>

Have you noticed that for every piece of content you have to “explain” what type of information it is? For instance, you must say that the webpage is talking about a movie (“@type”: “Movie”) whose name is “Magnolia” (“name”: “Magnolia”) and that this movie is directed by a person (“@type”: “Person”) called Paul Thomas Anderson (“name”: “Paul Thomas Anderson”).
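To make the idea concrete, here is a minimal Python sketch that loads the JSON-LD payload from the page above and reads the types and names back. The `describe` helper is a name made up for this illustration, and the payload is simply pasted in as a string rather than fetched from a live page.

```python
import json

# The JSON-LD payload from the example page, copied as a plain string.
MOVIE_JSONLD = """
{
    "@context": "https://schema.org/",
    "@type": "Movie",
    "name": "Magnolia",
    "director": {"@type": "Person", "name": "Paul Thomas Anderson"},
    "productionCompany": {"@type": "Organization", "name": "New Line Cinema"}
}
"""

movie = json.loads(MOVIE_JSONLD)


def describe(entity):
    """Return a 'Type: name' label for a JSON-LD entity (hypothetical helper)."""
    return f'{entity.get("@type", "Thing")}: {entity.get("name", "?")}'


print(describe(movie))                       # Movie: Magnolia
print(describe(movie["director"]))           # Person: Paul Thomas Anderson
print(describe(movie["productionCompany"]))  # Organization: New Line Cinema
```

Because every entity announces its own type, the code never has to guess what a value means.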

The types you see here are abstractions of entities that exist in the real world. Each concept is defined in what’s called an ontology. Ontology definitions are documented, as are the connections between them.

For instance, a movie is a type that has a director, which is another type, a person.

In this case, the concepts of what a person or a movie is are defined and maintained by schema.org, while other internet players, like Google, agree to use these definitions.

When you create a webpage about a movie, it’s a good practice to include a code snippet with information about the entities present in your text.

Here’s a video that explains a little better what an ontology is.

Since ontologies create representations and definitions of a knowledge area, they can be used to capture relationships between elements.
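One way to picture “capturing relationships” is to flatten an annotated entity into (subject, property, object) triples, the basic shape of a knowledge graph edge. The sketch below is only an illustration: `iter_triples` is a hypothetical helper, entities are labeled by their `name`, and only nested entities are treated as relationships.

```python
def iter_triples(entity):
    """Yield (subject, property, object) tuples from a nested JSON-LD dict.

    Hypothetical helper: entities are labeled by their "name" (falling back
    to "@type"), and only dict-valued properties count as relationships.
    """
    subject = entity.get("name", entity.get("@type", "?"))
    for key, value in entity.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @context and @type
        if isinstance(value, dict):
            yield (subject, key, value.get("name", value.get("@type", "?")))
            yield from iter_triples(value)  # follow nested entities


movie = {
    "@context": "https://schema.org/",
    "@type": "Movie",
    "name": "Magnolia",
    "director": {"@type": "Person", "name": "Paul Thomas Anderson"},
    "productionCompany": {"@type": "Organization", "name": "New Line Cinema"},
}

triples = list(iter_triples(movie))
for triple in triples:
    print(triple)
```

Collect triples like these from many pages and you have the raw edges of a small knowledge graph.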

Ontologies are already applied in fields like biomedicine and pharmaceuticals, but they can be used in any industry.

Zalando, a fashion company operating in the European Union, created its own fashion knowledge graph. The company uses it with NLP to boost search power.

Semantic annotations are not only useful for SEO, but also for another thing…

What’s the big deal about semantic annotations?

Ontologies are not useful only in a web context. As you saw in the video, you can use annotated data and ontologies to create meaningful relationships between information inside a company. It’s a way to give structure to unstructured information.

Data on the Web is anything but well structured. When you find structured data, you should grab it.

Semantically annotated pages may give you quick, normalized information without the need to create complex scraping bots and even more complex processing code.

Imagine that you have to scrape an e-commerce website. Some of these sites implement protection against automated scraping, and others have a complex, constantly changing DOM that makes manipulating it almost impossible.

Annotated pages give you the opportunity to get information that the site’s owner wants to make available to you. Let’s take the example of this Alibaba.com product.

Go ahead and visit the page. Don’t worry, I don’t earn anything when you click on the link.

The page looks like any other page.

Now, install the OpenLink Structured Data Sniffer extension for your browser. I am using Chrome. Reload the page and click on the extension. This is what you will see:

Do you see what I am talking about? This specific site uses JSON-LD and RDFa for annotation and gives you the product description, the product URL, the type of object, the price, etc.

This is very handy! Inspect the code to see how the data was inserted on the page.

Everything is there, in the meta tags. No need to build anything too complex.
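As a rough sketch of how little code this takes, the snippet below reads `<meta>` tags with Python’s standard `html.parser` module. The sample HTML is invented for the illustration and is not the actual markup of the product page, and `MetaTagParser` is a hypothetical class name.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags into a dict keyed by their property or name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        if key and "content" in attrs:
            self.meta[key] = attrs["content"]


# Made-up page used only for illustration.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example product">
<meta property="og:type" content="product">
<meta name="description" content="A made-up product used for illustration.">
</head><body></body></html>
"""

parser = MetaTagParser()
parser.feed(SAMPLE_HTML)
print(parser.meta)
```

The same few lines work on any page that exposes its data this way, whatever the rest of the DOM looks like.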

Take the time to explore other products or other pages. Use the sniffer to discover whether these pages have other annotations, whether they use only one format or several, etc.

How to extract this information?

There are many ways to extract this data. You may use tools like Selenium or Python libraries like BeautifulSoup.

You can go ahead and create a small Python script that visits these pages, finds the annotated information, and saves it in a pandas DataFrame.

It’s as simple as that.
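Here is one possible sketch of such a script. To keep it free of dependencies it uses the standard library’s `html.parser` instead of BeautifulSoup, and it works on HTML that is already in memory: the `PAGES` dict, the example.com URL, and the `JsonLdParser` and `extract_rows` names are all assumptions made for this illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed annotations instead of crashing


def extract_rows(pages):
    """Turn {url: html} into flat rows; `pages` stands in for downloaded HTML."""
    rows = []
    for url, html in pages.items():
        parser = JsonLdParser()
        parser.feed(html)
        for item in parser.items:
            # A JSON-LD block may hold one entity or a list of entities.
            entities = item if isinstance(item, list) else [item]
            for entity in entities:
                if isinstance(entity, dict):
                    rows.append({
                        "url": url,
                        "type": entity.get("@type"),
                        "name": entity.get("name"),
                    })
    return rows


# Made-up page standing in for HTML you would download yourself.
PAGES = {
    "https://example.com/magnolia": """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Movie", "name": "Magnolia"}
</script></head><body><h2>Magnolia (1999)</h2></body></html>""",
}

rows = extract_rows(PAGES)
print(rows)
```

If pandas is installed, `pandas.DataFrame(rows)` turns the list of dictionaries into a data frame. Downloading the pages is left out on purpose, since how you fetch them depends on the site.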

Some tools to start exploring the semantic web

More tools can be found on the page Semantic Web and Linked Data.

What did you think? Had you thought about this possibility before? Have you explored this way of extracting data for a project? Share your experience!
