Today, I will show you how I use NLP and text mining in practice to build a Real Estate data website.
Learning how to tokenize sentences and create topic clusters is fun, but you will get bored if you never put that knowledge to work on a practical NLP project.
Today, I will introduce you to Essencialia, a website that showcases data about the city of Aracaju, in Brazil, and advises buyers and sellers about the best prices to buy and sell Real Estate properties in each neighborhood.
The motivation behind this project was to exercise my NLP knowledge, but also to make my life simpler. Besides NLP, I also have a crush on Real Estate and I was interested in finding the best investment opportunities in this particular city.
The first thing you have to know is that Real Estate sites in Brazil are TERRIBLE. Forget your Zillow experience. In Brazil, people use some sort of Craigslist to list their properties, and the listings give you only the most basic information.
Also, it’s almost impossible to find openly available historical data about properties. I honestly don’t know of any easy way to discover what a property cost 5 years ago.
With this mess in mind, in the next paragraphs I will discuss the implementation phases of this project, the technologies I used, and what I learned.
NLP in practice: part 1 – what I wanted to achieve
Simply put, I wanted to transform this:
From there, I wanted to use this data to calculate:
- The price quartiles;
- The average prices for buying and renting;
- The neighborhoods with the best investment rates;
- The average sizes of properties being sold right now;
- In which direction the city is expanding.
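To make the first two items concrete, here is a minimal sketch of how quartiles and averages per neighborhood could be computed with Python's standard library. The sample rows, the prices, and the function name summarize_prices are all made up for illustration; they are not the site's real data or code.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical sample: (neighborhood, asking price in BRL). Not real data.
listings = [
    ("Atalaia", 300_000),
    ("Atalaia", 400_000),
    ("Atalaia", 500_000),
    ("Atalaia", 600_000),
    ("Jardins", 700_000),
    ("Jardins", 800_000),
    ("Jardins", 900_000),
    ("Jardins", 1_000_000),
]

def summarize_prices(rows):
    """Group prices by neighborhood and return quartiles and mean for each."""
    by_area = defaultdict(list)
    for area, price in rows:
        by_area[area].append(price)

    summary = {}
    for area, prices in by_area.items():
        if len(prices) < 2:
            continue  # quartiles need at least two data points
        # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
        q1, q2, q3 = quantiles(prices, n=4, method="inclusive")
        summary[area] = {"q1": q1, "median": q2, "q3": q3, "mean": mean(prices)}
    return summary

summary = summarize_prices(listings)
for area, stats in summary.items():
    print(area, stats)
```

The same grouping works for rental prices and property sizes: only the value stored per neighborhood changes.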
NLP in practice: part 2 – Data collection
The data collection part consists of visiting the most popular listing websites in the city, as well as those of the real estate offices, and scraping their contents. There are a few options for this, but in my case I use BeautifulSoup and, when needed, Selenium for the scraping part.
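As an illustration of the scraping step, here is a small sketch that parses listing cards with BeautifulSoup. The HTML snippet, the class names (listing, title, price, description), and the parse_listings helper are assumptions made up for this example; every real site uses its own markup, and the page would normally be downloaded first.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; real sites will use different tags and classes.
SAMPLE_HTML = """
<div class="listing">
  <h2 class="title">Apartamento 3 quartos</h2>
  <span class="price">R$ 450.000</span>
  <p class="description">Sombra total, 90 m2, 2 vagas.</p>
</div>
<div class="listing">
  <h2 class="title">Casa no centro</h2>
  <span class="price">R$ 320.000</span>
  <p class="description">Sol da manhã, 120 m2.</p>
</div>
"""

def parse_listings(html):
    """Return one dict per listing card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for card in soup.select("div.listing"):
        results.append({
            "title": card.select_one(".title").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "description": card.select_one(".description").get_text(strip=True),
        })
    return results

rows = parse_listings(SAMPLE_HTML)
for row in rows:
    print(row)
```

The price and description are still raw strings at this point; turning them into numbers and entities is the job of the next phase.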
NLP in practice: part 3 – Data extraction and cleaning
As you saw, part of the data is already available in a structured way and can be extracted with BeautifulSoup and some use of
For the most complex data, I use regex in combination with a spaCy model that I trained myself to find the info that I need. After collecting the data, I label part of it to train the model, and later use the model to extract the entities hidden in the text.
So, I use, in this order:
- Basic string replacements;
- A NER model for more complex data.
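Here is a rough sketch of the string-replacement and regex part of this pipeline. The replacement table, the patterns, and the field names are assumptions chosen for this example, not my production rules, and the trained NER model is not shown because it depends on labeled data.

```python
import re

# Hypothetical replacement table; extend it with the variants found in your data.
REPLACEMENTS = {
    "metros quadrados": "m2",
    "m²": "m2",
    "qtos": "quartos",
}

# Patterns run on lowercased text. Brazilian prices use "." for thousands
# and "," for cents, e.g. "R$ 450.000,00".
PRICE_RE = re.compile(r"r\$\s*(\d[\d.]*)(?:,(\d{2}))?")
AREA_RE = re.compile(r"(\d+(?:,\d+)?)\s*m2")
ROOMS_RE = re.compile(r"(\d+)\s*quartos?")

def normalize(text):
    """Lowercase the ad and apply the basic string replacements."""
    text = text.lower()
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

def parse_price(match):
    """Convert a PRICE_RE match into a float."""
    whole = match.group(1).replace(".", "")  # drop thousands separators
    cents = match.group(2) or "00"
    return float(f"{whole}.{cents}")

def extract_fields(raw_text):
    """Extract price, area and bedroom count; missing fields become None."""
    text = normalize(raw_text)
    price = PRICE_RE.search(text)
    area = AREA_RE.search(text)
    rooms = ROOMS_RE.search(text)
    return {
        "price": parse_price(price) if price else None,
        "area_m2": float(area.group(1).replace(",", ".")) if area else None,
        "bedrooms": int(rooms.group(1)) if rooms else None,
    }

ad = "Apartamento com 3 qtos, 90 metros quadrados, R$ 450.000,00. Sombra total."
fields = extract_fields(ad)
print(fields)
```

Anything these patterns cannot catch, such as a neighborhood name buried in free text, is what the NER model is for.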
Let’s get this data cleaning done!
NLP in practice: part 4 – Exploratory Analysis
After doing this first preprocessing step, it’s important to do some exploratory analysis to find odd things.
For instance, I discovered that some listings had incredibly low prices for apartments that were supposed to be in a rich neighborhood. After taking a closer look, I saw that real estate agents often advertised properties in the wrong neighborhood just to attract attention to listings in other areas of the city.
It’s important to check for these things: once I found this out, I had to find a way to “estimate” whether a listing was in the right place. I did this by:
- Extracting information with my NER model to discover the actual location of the listing (and therefore assign it to the right neighborhood);
- Removing the listing as an outlier when that didn’t work.
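For the outlier step, one common option is the interquartile range (IQR) rule, sketched below. This is an assumption made for illustration, not necessarily the exact rule used on the site; the prices are invented and the multiplier k=1.5 is just the conventional default.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences of the IQR rule for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def drop_price_outliers(prices, k=1.5):
    """Keep only the prices that fall inside the IQR fences."""
    low, high = iqr_bounds(prices, k)
    return [p for p in prices if low <= p <= high]

# Hypothetical asking prices for one neighborhood; the last one looks misplaced.
prices = [480_000, 500_000, 510_000, 520_000, 530_000, 550_000, 90_000]
kept = drop_price_outliers(prices)
print(kept)
```

The filter should run per neighborhood, because a price that is normal in one area can be an outlier in another.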
NLP in practice: part 5 – Other insights
One thing that you may notice when analyzing the data is that apartments facing the south are more expensive than the ones facing the north.
But contrary to what you might expect if you live in Europe or in the USA, these properties are in higher demand not because they face the Sun, but for quite the opposite reason.
Remember: Brazil is in the southern hemisphere, so properties facing the north receive more sunlight. In Aracaju, having a property facing the north means living as if you were inside an oven, and potentially spending a lot of money on air conditioning.
Since electricity is expensive in Brazil, this can bring the price of your property down and, if you are an investor, decrease the odds of renting out your apartment.
Thus, the most expensive properties face the south.
Listings of apartments with this configuration will likely include the mention “sombra total” (something like “full shade” in English). However, this mention appears only when the property faces the southeast.
A property facing the southwest will only receive the mention “sombra” (“shade”), while properties facing the north almost never mention their orientation. Sometimes, you may find the mention “sol da manhã”, or “morning sun”, which means that the property faces the northeast.
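These conventions can be turned into a simple lookup. The sketch below encodes only the three mentions described above; the function name infer_orientation and the returned labels are made up for this example.

```python
def infer_orientation(ad_text):
    """Guess which way a property faces from the shade/sun mention in the ad.

    Returns "southeast", "southwest", "northeast", or None when the ad
    carries none of the known mentions.
    """
    text = ad_text.lower()
    # Check "sombra total" first: it contains the shorter mention "sombra".
    if "sombra total" in text:
        return "southeast"
    # Accept the spelling without the tilde too, since ads are often typed quickly.
    if "sol da manhã" in text or "sol da manha" in text:
        return "northeast"
    if "sombra" in text:
        return "southwest"
    return None

for ad in ["Apto com sombra total", "Casa com sombra", "Sol da manhã", "3 quartos"]:
    print(ad, "->", infer_orientation(ad))
```

The order of the checks matters: testing for plain “sombra” first would label every “sombra total” ad as southwest.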
This is only one of the many insights that you may have while analyzing the data. These are the kinds of things that you only learn by working with NLP in practice and having some knowledge of the field that you are analyzing.
NLP in practice: part 6 – The website
Now that we have the data and the insights, it’s time to build the website. For this, I used a regular CMS and a bunch of plugins to create the layout.
The next step is to think about User Experience and play around with the way the content is displayed on the page. I have also worked on the SEO to help the pages rank for a set of keywords (or user intentions) that I selected.
The last part of the project is doing some social media work to attract visitors. Here, I am also using some automation to create the posts and interact with users.
What about you? Have you been working on a practical NLP project? Can you give examples of text mining and Natural Language applications in real life?