Today’s challenge is a continuation of yesterday’s task. If you created a dataset, even a small one, today is the time to explore it.
Again, since every project is different and you may choose to build a dataset of basically anything, there’s no notebook to share today 🙁
From my side, I am starting to analyse the data I scraped yesterday from a social media website.
The challenge for me is that I used a Chrome extension called Web Scraper to get the data, and I have a lot of cleaning to do: I am using string manipulation and regex 😖 to normalize the data.
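To make the regex cleaning concrete, here is a small sketch of the kind of normalization I mean. The field name and formats below are hypothetical (the real fields depend on the site being scraped), but the pattern is typical: scraped counts arrive as strings like "1.2K followers" and need to become plain integers.

```python
import re

def normalize_count(raw):
    """Turn a scraped string like ' 1.2K followers ' into an integer.

    Hypothetical example: the actual fields and suffixes depend on
    the site you scraped.
    """
    text = raw.strip().lower()
    # Capture the number (digits, dots, commas) and an optional k/m suffix.
    match = re.search(r"([\d.,]+)\s*([km]?)", text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[match.group(2)]
    return int(round(number * multiplier))
```

For example, `normalize_count(" 1.2K followers ")` gives `1200`, and `normalize_count("3,400 likes")` gives `3400`.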
By the way, the site I am trying to scrape is full of traps for scrapers. That is only natural: websites are getting smarter and smarter about detecting people trying to get their content.
We are doing all this for purely educational purposes, but some people scrape websites to copy content or to do nasty things with it. So it’s important to learn to diversify the ways you get your data.
Again, explore alternatives like APIs and RSS feeds, or simply message the website and ask if they could provide the information you need.
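As a sketch of the RSS route: feeds are just XML, so Python’s standard library is enough to pull out titles and links. The feed below is a made-up sample; in practice you would first download the XML from the site’s feed URL (for example with `urllib.request.urlopen`).

```python
import xml.etree.ElementTree as ET

# A made-up minimal RSS 2.0 feed, standing in for the XML you
# would download from a real site's feed URL.
SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

def parse_feed(xml_text):
    """Return a list of {title, link} dicts, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]
```

Compared with scraping HTML, a feed gives you clean, stable structure that the site owner intends for machines to read.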
Also, keep data protection laws in mind and do everything you can to avoid scraping personal data by accident.
Over the next few days, I will explore data for at least one hour per day and post the notebooks, data and models, when they are available, to this repository.