Have you ever thought about the amount of data that is generated each day through text? All this available data is very rich, but it is also very hard to capture and to extract knowledge from. Sometimes, you may have to measure how similar texts are in order to group them. This is called text similarity.
Natural Language Processing (NLP) is the Swiss Army knife of Artificial Intelligence and allows us to transform any text into structured data. In modern SEO, the technique can be used, for instance, to group similar keywords or search intentions, or even to cluster similar content.
Today, we will be discussing how important text similarity is and why you should include it in your SEO strategy.
What’s text similarity?
Text similarity is a subprocess of NLP. It can be used to find how close one text is to another, for instance, to group sentences into clusters.
Let’s look at chatbots. In your opinion, would you say that the questions below are similar?
Sentence 1: 'What's the store address?'
Sentence 2: 'Where are you located?'
Sentence 3: 'How far is this street from my current position?'
In terms of meaning, yes. All three sentences express the same idea: someone wants to know where something is located. But in terms of word choice, things change.
For a chatbot, it would be crucial to identify that the three questions are about business location. The intention behind them is the same: to find out the address for a store.
You can use only the words in the text to determine whether one sentence is similar to another, but you shouldn't.
As we just saw, sentences with the same meaning (or with a similar meaning) may come in many flavors. You may also have to deal with misspellings, which can affect the similarity score if you are computing it from tokens alone.
Another way to determine similarity is by focusing on meaning or intentions (don’t forget this word, as it is very important in a modern SEO context). In our example, the meaning or the intention was to know the store address.
To be clearer, when we focus on meaning, we are mostly talking about semantic textual similarity* (STS). When we want to know how similar the tokens are, we are talking about lexical textual similarity† (LTS).
By the way, STS and LTS can give you pretty different results in terms of similarity. Which one to use will depend on your problem.
Lexical and semantic similarity. Are they the same thing?
No. Yet, as Kavita Ganesan1 explains, they are sometimes treated as if they were. She explains that some people will try to achieve semantic similarity by analyzing lexical similarity.
It makes sense, in a way. If two sentences use very similar tokens, we could assume that they are similar. The theory would be great if it didn’t lead us to some problems. For example:
Sentence 1: 'It is raining cats and dogs today'
Sentence 2: 'Cats and dogs don't like when it is raining'
Are these sentences similar? It depends: they are lexically close, but not semantically. Depending on what you are trying to achieve, you may stick to lexical similarity instead of trying to find semantic similarity.
If you are working on a problem inside a very well defined context, chances are that repeated tokens will have the same meaning. Still, you can't rule out homonyms (tokens with the same spelling, but different meanings), even when the context is the same.
Even in these cases, problems may occur. Observe the use of the token “play” in this sentence:
I am going to play my new game while my parents are at play.
We could divide this into two sentences:
Sentence 1: I am going to play my new game
Sentence 2: My parents are at play.
These two sentences come from the same context and share the same token, but the token has a different meaning in each. That's why lexical similarity should not be used on its own to determine semantic similarity.
When you read or listen to someone talk about text similarity, they are normally referring to lexical similarity. It’s important to pay attention to the difference between these two levels of similarity because the applications may vary.
Why is finding text similarity important?
Now that you know that lexical and semantic similarity are two different levels of text similarity, we can start to discuss the application of this technique.
In the article “Semantic Textual Similarity Methods, Tools, and Applications: A Survey”2, the authors explain that measuring STS is important to tasks such as “document summarization, word sense disambiguation, short answer grading, information retrieval, and extraction”.
Identifying how similar two questions are is an important task for chatbots, but also for search engines. A search engine will do its best to match the search query with the indexed content. So, it's easy to understand how important it is to be able to measure STS.
BERT has been powering Google searches for more than a year now. Did you know that it can be used to determine how similar two sentences are? A practical example of how to accomplish it with Keras and Python is available here.
STS could potentially be used for product recommendation and for plagiarism identification. From a legal standpoint, the technique can also be used to identify document similarity.
How to measure the similarity of two documents?
There are several ways to calculate similarity. Some of the measures and techniques involved are Jaccard similarity, embeddings, k-means, cosine similarity, Word2Vec, Latent Dirichlet Allocation, latent semantic indexing, Word Mover's Distance, BERT embeddings, knowledge-based measures, etc.
Please note that some of these can be used together, such as different embeddings combined with a Siamese Manhattan LSTM, and other combinations described in “Text Similarities: Estimate the degree of similarity between two texts”3.
The best approach to measuring similarity depends on the problem that you are trying to solve. Approaches like the Jaccard measure are good if you need to find basic lexical similarity, while vectors are a better way to capture semantic similarity.
But some similarity approaches can be considered naïve, such as the Jaccard similarity. Let’s see why…
The Jaccard similarity: a simple approach
Jaccard similarity focuses on measuring the size of the intersection of two sets. Think about our cats and dogs example. Let's normalize the sentences by lowercasing them and stemming the tokens. If you don't remove the stopwords, you will have:
Sentence 1: 'it is rain cat and dog today'
Sentence 2: 'cat and dog do not like when it rain'
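By Jaccard's theory, in order to measure similarity, we have to find the tokens that both sentences have in common (the intersection) and then divide the size of that intersection by the size of the union of the two sets.
Schematically, you would have something like:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the sets of tokens of each sentence. In our example, the two sentences share 5 tokens out of 11 distinct ones, which gives 5 / 11 ≈ 0.45.
And the Python implementation for this would be: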
```python
def jaccard_similarity(first_sentence, second_sentence):
    # Split each sentence on spaces and keep the unique tokens
    first_tokens = set(first_sentence.split(' '))
    second_tokens = set(second_sentence.split(' '))
    intersection = first_tokens.intersection(second_tokens)
    union = first_tokens.union(second_tokens)
    # Shared tokens divided by all distinct tokens
    return len(intersection) / len(union)

# execute
first_sentence = 'it is rain cat and dog today'
second_sentence = 'cat and dog do not like when it rain'
jaccard_similarity(first_sentence, second_sentence)
# result: 0.45454545454545453
```
Notice a few things:
- If we don’t preprocess the sentences, the similarity measure can be lower than it really is. That’s why we lowercased the two sentences. But we could go further and remove stopwords, for instance. In this case, the similarity measure would change even more.
- This method does not consider the order of the words to decide if the sentences are similar.
- This method does not take context into account.
- If synonyms are used, you may receive a low score from the algorithm.
To illustrate these points, look at the example below:
```python
first_sentence = 'France is an European country'
second_sentence = 'The Hexagon is located in Europe'
jaccard_similarity(first_sentence, second_sentence)
# result: 0.1
```
As you can see, this approach is not very accurate and may lead to problems, especially if you are trying to determine semantic similarity.
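The note about preprocessing can be checked with a small sketch. It reuses the normalized cats and dogs sentences and compares the score with and without stopwords. The stopword list here is a hand-picked assumption made for this example only, and the helper `tokenize` is a hypothetical name, not part of any library.

```python
# Hypothetical, hand-picked stopword list for this example only;
# real projects usually rely on a larger, language-specific list.
STOPWORDS = {"it", "is", "and", "do", "not", "when"}


def tokenize(sentence, remove_stopwords=False):
    """Lowercase a sentence, split it on whitespace and optionally drop stopwords."""
    tokens = set(sentence.lower().split())
    if remove_stopwords:
        tokens -= STOPWORDS
    return tokens


def jaccard_similarity(first_tokens, second_tokens):
    """Size of the intersection divided by the size of the union of two token sets."""
    union = first_tokens | second_tokens
    if not union:
        return 0.0
    return len(first_tokens & second_tokens) / len(union)


first_sentence = "it is rain cat and dog today"
second_sentence = "cat and dog do not like when it rain"

with_stopwords = jaccard_similarity(tokenize(first_sentence), tokenize(second_sentence))
without_stopwords = jaccard_similarity(
    tokenize(first_sentence, remove_stopwords=True),
    tokenize(second_sentence, remove_stopwords=True),
)

print(with_stopwords)     # 5 shared tokens out of 11 distinct tokens
print(without_stopwords)  # 3 shared tokens out of 5 distinct tokens
```

With this particular stopword list the score goes up, because the remaining tokens ("rain", "cat", "dog") are mostly shared. A different list would give a different number, which is exactly why the preprocessing choices matter.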
Word embeddings: a better approach…?
Vectors, or word embeddings, are great for representing text since we are able to capture some degree of context with them. We have already talked about how we can use vectors to infer real-world representations of concepts and ideas.
Embedding models learn from the contexts in which each token appears across many sentences. In practice, they estimate how likely it is to find a given token in a given position, so tokens that tend to show up in the same positions, surrounded by the same neighbors, end up with similar vectors.
Complicated? Look at the gap in the example below. Which words would you use to fill it?
I love to hang out with my _____
You could have used “boyfriend”, “friends”, “BFF”, “family”. But you would probably not use “pants” or “drill”. On January 10, 2021, an exact-match search returned absolutely no results for these last two options, and about 1,130,000 for “I love to hang out with friends”.
In our previous example, words such as “friends”, “family”, “boyfriend” etc. would appear very close to each other in a multidimensional space, while “pants” and “drill” would be placed farther away.
It’s almost like you were creating clusters of words… So, you can imagine that if you have words being grouped more or less close to each other, you can also calculate their distances. Therefore, you could assume that distant tokens are less likely to be similar.
Another reason why vectorized text is great is that vectorization takes the sentence’s syntactic structure into consideration. Word embeddings capture not only lexical similarity but also syntax and some degree of semantics.
But a model needs a large set of sentences to be trained and to provide meaningful insights. It is possible to use a corpus of similar documents to get highly specialized word embeddings. For instance, you could train a model using only medical documents.
Meet cosine similarity
The biggest novelty of the vector approach is that it captures information about meaning and context, which traditional representations such as one-hot encoding and bag-of-words do not. Those rely almost exclusively on counting the common words present in different texts.
With word embeddings, we can use Euclidean distance or cosine similarity to discover if documents, tokens, or sentences are similar.
Cosine similarity basically consists of measuring the cosine of the angle between two vectors in an N-dimensional space. A smaller angle (a cosine closer to 1) indicates stronger similarity.
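Here is a minimal sketch of that calculation in plain Python. The three-dimensional vectors are invented for illustration and do not come from a trained model. Real embeddings usually have hundreds of dimensions, but the formula is the same.

```python
import math


def cosine_similarity(first_vector, second_vector):
    """Cosine of the angle between two vectors of equal length."""
    if len(first_vector) != len(second_vector):
        raise ValueError("vectors must have the same number of dimensions")
    dot_product = sum(a * b for a, b in zip(first_vector, second_vector))
    first_norm = math.sqrt(sum(a * a for a in first_vector))
    second_norm = math.sqrt(sum(b * b for b in second_vector))
    if first_norm == 0 or second_norm == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot_product / (first_norm * second_norm)


# Invented 3-dimensional vectors, for illustration only (not from a trained model).
toy_embeddings = {
    "friends": [0.9, 0.8, 0.1],
    "family": [0.8, 0.9, 0.2],
    "pants": [0.1, 0.2, 0.9],
}

friends_family = cosine_similarity(toy_embeddings["friends"], toy_embeddings["family"])
friends_pants = cosine_similarity(toy_embeddings["friends"], toy_embeddings["pants"])

print(friends_family)  # close to 1: the vectors point in almost the same direction
print(friends_pants)   # much lower: the vectors point in different directions
```

Because only the angle matters, two vectors of very different lengths can still be highly similar, which is one reason why cosine similarity is a common choice for comparing texts of different sizes.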
Word2Vec uses neural networks to calculate word embeddings and can use the CBOW‡ or skip-gram§ method to do it. It is fast, but one of its biggest limitations is its inability to deal with out-of-vocabulary (OOV) words. Word2Vec will give you errors if you try to get a vector for a token that it doesn’t know.
FastText is an alternative developed by Facebook that solves the out-of-vocabulary problem. Because it processes subword information, it can build a vector for a word it has never seen from the pieces of that word. It also presents good results when dealing with small datasets.
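To give an idea of what "subword information" means, the sketch below splits a token into character n-grams, after wrapping it in boundary markers. This is a simplified illustration of the idea, not FastText's actual implementation, and the function name `character_ngrams` is made up for this example.

```python
def character_ngrams(token, n=3):
    """Return the character n-grams of a token wrapped in boundary markers."""
    wrapped = f"<{token}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


print(character_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']

# A word missing from the vocabulary still shares most of its pieces
# with a known word, which is what makes a vector for it possible.
known_word = set(character_ngrams("keyword"))
unseen_word = set(character_ngrams("keywords"))
shared_pieces = known_word & unseen_word
print(len(shared_pieces))  # 6 shared n-grams
```

The same mechanism tends to help with misspellings, since a misspelled token usually keeps many of the n-grams of the correct spelling.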
GloVe is an alternative developed at Stanford University that addresses another limitation of Word2Vec: it does not take direct advantage of word co-occurrence statistics across the whole corpus4.
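The CBOW and skip-gram methods mentioned for Word2Vec differ mainly in what the network receives as input and what it has to predict. The sketch below only builds the training examples for each method, using a window of two tokens on each side. It does not train anything, and the helper names are hypothetical.

```python
def context_window(tokens, position, window=2):
    """Tokens within `window` positions of `position`, excluding the token itself."""
    start = max(0, position - window)
    return tokens[start:position] + tokens[position + 1:position + 1 + window]


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding tokens are the input, the middle token is the target."""
    return [(context_window(tokens, i, window), target) for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: the middle token is the input, each surrounding token is a target."""
    return [
        (center, neighbor)
        for i, center in enumerate(tokens)
        for neighbor in context_window(tokens, i, window)
    ]


tokens = "i love to hang out with my friends".split()

print(cbow_examples(tokens)[3])
# (['love', 'to', 'out', 'with'], 'hang')
print(skipgram_examples(tokens)[:2])
# [('i', 'love'), ('i', 'to')]
```

In both cases, tokens that appear in similar windows produce similar training examples, which is why they end up close to each other in the vector space.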
And what about BERT?
BERT and Word2Vec have the same mother: her name is Google. If you work with SEO, you have probably been introduced to BERT already. The biggest difference between them is that Word2Vec (like GloVe and FastText) is a context-free model, while BERT is a contextual model.
As explained by Devlin and Chang5:
Context-free models (…) generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
This changes everything because now the meaning of a word, the entity itself, matters. Older models have a hard time with homonyms, but BERT considers each token in its context, making it easier to identify the semantic aspect of the word.
To test BERT’s power, you can try the Google Natural Language API and check how the model is able to catch different entities with the same spelling.
The API was able to identify that Paris Hilton is a person and that Paris is a place in the sentence “Paris Hilton is spending some time in Paris today”.
From an SEO perspective, this is great because it rewards well-written text instead of the gibberish and word repetition that we have been seeing for years.
Think of this as a useful tool to finally determine text similarity with high accuracy: we can present results about Paris, for queries about the city, instead of including results about Paris, the person.
Uses in SEO: how to put all of this together
Modern SEO is all about discovering intentions instead of focusing only on keywords. Today’s search engines focus more on trying to identify concepts, intentions, and entities than on just checking if a token is present or not in a document.
It’s no longer recommended to repeat a set of keywords to rank high. Instead, one needs to focus on finding similarities between a query and a document.
Once you understand the many ways search engines identify text similarity, you can optimize a text for:
- A particular search intention
- A particular set of words with lexical and semantic similarities
When search engines get a query, they will attempt – in a nutshell – to:
- Discover what the query is about
- Decompose the query to see if it’s possible to find known entities
- Match the query with indexed text
- Present the texts that seem to be more similar to the query
There are more steps, of course, but this is the basic idea behind a search engine.
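The matching and ranking steps can be sketched in a few lines. The example below is a deliberately naive assumption: it represents the query and each text as a bag of words and ranks the texts by cosine similarity. Real search engines are far more sophisticated, and the mini "index" is invented for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(first_counts, second_counts):
    """Cosine similarity between two token-count vectors."""
    dot_product = sum(count * second_counts[token] for token, count in first_counts.items())
    first_norm = math.sqrt(sum(c * c for c in first_counts.values()))
    second_norm = math.sqrt(sum(c * c for c in second_counts.values()))
    if first_norm == 0 or second_norm == 0:
        return 0.0
    return dot_product / (first_norm * second_norm)


def rank_documents(query, documents):
    """Return (score, document) pairs sorted from most to least similar to the query."""
    query_counts = bag_of_words(query)
    scored = [(cosine_similarity(query_counts, bag_of_words(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical mini "index", invented for this example.
indexed_texts = [
    "our store address and opening hours",
    "how to train word embeddings",
    "find the store address on the map",
]

for score, text in rank_documents("store address", indexed_texts):
    print(round(score, 2), text)
```

Since this sketch is purely lexical, a query such as "where are you located" scores zero against every text above, even though the intention is the same. That gap is what embeddings and contextual models try to close.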
Project ideas: using text similarity to complement your SEO strategy
You can use your knowledge about text similarity to identify…:
- …what would be the best queries to focus on based on the content that you already have on your site (measuring the similarity of a query or keyword against your content).
- …the text gaps between your site and your competitors’ sites (how similar your text is to that of brand X or Y).
- …the similarity of a website X compared to yours in order to evaluate if an incoming link from that website could improve your rankings.
- …how similar your text is to the text that’s ranking higher for a given search query.
- …if it’s possible to improve your already existing text, by checking if entities are well-defined and if search engines would be able to capture the meaning behind each word that you have used.
At first, the practical use of semantic similarity in SEO may seem confusing and abstruse. But the important thing is to put yourself in the user’s shoes and consider how they understand the content of your site.
Search engines have been making more efforts to understand texts as human beings comprehend them. The era of repetition and keywords is almost over.
Are you ready for this? Do you have other ideas on how to use text similarity for SEO?
- *STS basically measures the distance between words based on their inferred meanings. It tries to find the similarities between concepts: for example, a “car” is similar to a “bus” and a “bus” is similar to a “taxi”.
- †LTS will try to measure the similarity of two word sets, being mostly focused on vocabulary similarity. It will try to find common words instead of common concepts. If two documents share some of the same words, LTS will capture the similarity between them by calculating which words they have in common. Meaning and word alignment are not taken into consideration.
- ‡The CBOW method considers the words or tokens around to predict a token in the middle. In other words, the network will try to predict a word based on its surrounding context.
- §The skip-gram method uses the target token to predict the context. That is: instead of using the surrounding tokens to predict the target token, it will use the target token to predict the surrounding tokens.
- 1.Ganesan K. What is text similarity? Kavita Ganesan, Ph.D. Accessed January 8, 2021. https://kavita-ganesan.com/what-is-text-similarity/#.X_jTwpNKiEs
- 2.Goutam M, Partha P, Alexander G, David P. Semantic Textual Similarity Methods, Tools, and Applications: A Survey. SciELO.
- 3.Sieg A. Text Similarities: Estimate the degree of similarity between two texts. Medium. Published July 4, 2018. Accessed January 9, 2021. https://medium.com/@adriensieg/text-similarities-da019229c894
- 4.Böhm T. The General Ideas of Word Embeddings. Towards Data Science. Published December 30, 2018. Accessed January 11, 2021. https://towardsdatascience.com/the-three-main-branches-of-word-embeddings-7b90fa36dfb9
- 5.Devlin J, Chang M-W. Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google Blog. Published November 2, 2018. Accessed January 11, 2021. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html