Some people start to learn Machine Learning through tutorials, online courses and Kaggle competitions.
My path was a little bit different: my first contact with Data Science was in university.
But even there, teachers sometimes used datasets available on the internet to teach basic concepts quickly.
The problem with relying on universally known datasets is that, depending on who your teacher is (and keeping in mind that some people teach themselves machine learning), learning only this way will keep you from understanding how important acquiring data is as a step in Machine Learning.
This is what I call the spoon-feeding dataset problem.
My first dataset
At the beginning of my journey, I was always going to Kaggle to look for interesting notebooks and understand how people solved the competition problems.
Whilst following the step-by-step instructions provided to build and train a model was not a problem, I had no clue that the most difficult task in machine learning is actually getting and cleaning data.
I only discovered how hard getting and preprocessing data is when I tried to do it myself, for a sentiment analysis project in which I used some tweets.
The experience was HORRIBLE! Using APIs, understanding their limits, dealing with anonymization routines, discarding half of the tweets because they were nonsense – and later finding out that this nonsense was actually key to solving the problem that I was investigating…
People make mistakes when they write. Especially on social media, where you are free to write as you want.
Normalizing text, in these cases, is a nightmare. How do you deal with ⚛️🧴😊, loooooooong words, OMGs, sıɥʇ ǝʞıl sɓuıɥʇ, ≋o≋r≋ ≋t≋h≋i≋s≋, •´¯•. 🎀 🍪𝓇 𝑒𝓋𝑒𝓃 𝓉𝒽𝒾𝓈 🎀 .•¯´• (by the way, you can generate this kind of text here if this is your 𝖙𝖍𝖎𝖓𝖌).
Really? No one prepared me for this!
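To give an idea of what a first normalization pass over this kind of text might look like, here is a minimal sketch using only the Python standard library. It is not the routine I used in that project: the function name `normalize_text` and the rule of shrinking any run of three or more identical characters down to two are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """A rough first-pass normalizer for social media text (illustrative only)."""
    # NFKC folds compatibility characters, such as styled mathematical
    # letters, into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Shrink runs of 3+ identical characters to 2 (an assumed rule):
    # "loooooooong" becomes "loong".
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Collapse any whitespace run into a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()
```

With this sketch, "loooooooong" becomes "loong" and "𝖙𝖍𝖎𝖓𝖌" becomes "thing". Upside-down text, decorative separators and emoji pass through untouched, and that is the point: each of them needs its own decision, and throwing them away may discard exactly the signal you were looking for.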
And what about labelling data for classification? Labelling is literally hell. Labelling things like “is this good or bad?”, “is this too much?”, “is this hate speech?” can turn out to be a challenge if the rules are not set before you start the labelling process.
I am only talking about text because this is where I have the most experience. But images, videos, sounds and every other kind of data are challenging too.
So, it’s really a pity that people do not spend part of their time learning how to find, clean and make data available to others.
It’s easier when data comes already clean
I understand that it’s easier to start learning machine learning concepts using clean, beautiful, labelled data.
But what will you do when you start to work on real-world problems?
How do you decide which variables are important? Where do you find this data? CAN YOU USE THIS DATA? What are the risks of using this data? Is the data normalized? Is normalizing it hard? Is the cost of getting this data worth it? If you normalize it, might you lose information? And if you lose information, will your model still be able to generalize?
Ask yourself these questions.
While you do it, take a look at this notebook. It works on the MNIST handwritten digits problem using TensorFlow.
Try to identify which transformations were already applied to this dataset so that you can use it and achieve good results.
Did you think about this dataset? Here are the questions that I think you should be asking:
- Did you notice that all these images were already in a standard and beautiful square format?
- What would you have to do if one image is 28 x 28 pixels and another one is 640 x 520?
- All images are in black and white. What if you come across images with different colors?
- All backgrounds seem to be white. What would you have to do if your images had different backgrounds?
- What if you have more than one number per image?
- How can you get numbers like these to create your own dataset?
- If you want to make predictions using your own images, which transformations do you have to apply to an image before you can submit it for prediction?
- How do you import multiple files saved on your disk?
- How do you split data for training and for testing?
- When processing data, do you save the clean data and delete the original, do you keep everything, or do you apply transformations on the fly?
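Two of the questions above, importing many files from disk and splitting them for training and testing, can be sketched with the Python standard library alone. This is only an illustration under assumptions: the folder layout (one subfolder per label, for example `digits/7/img001.png`), the function names and the 80/20 ratio are hypothetical choices of mine, not something taken from the notebook.

```python
import random
from pathlib import Path


def list_labelled_files(root, pattern="*.png"):
    """Return sorted (path, label) pairs, assuming one subfolder per label."""
    root = Path(root)
    pairs = []
    for path in sorted(root.glob(f"*/{pattern}")):
        # The parent folder name is used as the label (an assumed convention).
        pairs.append((path, path.parent.name))
    return pairs


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items reproducibly and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed keeps the split reproducible, which matters when you compare models later. Nothing here touches the original files, so the question of keeping raw data, cleaned copies or on-the-fly transformations is still yours to answer.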
These are only a few questions to think about.
Remember that, when you move from a developer position to a machine learning or data scientist one, one of your tasks will be to care about data quality.
Are you using your time to learn more about this?