Learn how to build killer datasets by avoiding the most frequent mistakes in Data Science, plus tips, tricks and kittens.

Introduction

If you haven't heard it already, let me tell you a truth that you should, as a data scientist, always keep in a corner of your head:

"Your results are only as good as your data."

Many people make the mistake of trying to compensate for their ugly dataset by improving their model. This is the equivalent of buying a supercar because your old car doesn't perform well with cheap gasoline. It makes much more sense to refine the oil instead of upgrading the car. In this article, I will explain how you can easily improve your results by enhancing your dataset.

Note: I will take the task of image classification as an example, but these tips can be applied to all sorts of datasets.

The 6 most frequent mistakes, and how to fix them

1. Not enough data

If your dataset is too small, your model doesn't have enough examples to find the discriminative features that will be used to generalize. It will then overfit your data, resulting in a low training error but a high test error.

Solution #1: gather more data. You can try to find more from the same source as your original dataset, or from another source if the images are quite similar or if you absolutely want to generalize.

Caveats: this is usually not an easy thing to do, at least without investing time and money. Also, you might want to do an analysis to determine how much additional data you need. Compare your results with different dataset sizes, and try to extrapolate.

In this case, it seems that we would need 500k samples to reach our target error. That would mean gathering 50 times as much data as we have for the moment. It is probably more efficient to work on other aspects of the data, or on the model.

Solution #2: augment your data by creating multiple copies of the same image with slight variations.
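As a minimal sketch of this idea, here are a few augmentations written in pure Python on a toy grayscale "image" stored as nested lists. The helper names (`hflip`, `translate_right`, `add_noise`, `augment`) are my own illustrative choices, not from any library; in practice you would use a library such as Pillow or torchvision.

```python
import random

def hflip(img):
    """Horizontal flip: reverse each row of a 2D grayscale image."""
    return [row[::-1] for row in img]

def translate_right(img, shift, fill=0):
    """Shift pixels `shift` columns to the right, padding with `fill`."""
    return [[fill] * shift + row[:-shift] if shift else row[:] for row in img]

def add_noise(img, amplitude=10, seed=0):
    """Add small uniform noise, clamping to the 0..255 pixel range."""
    rng = random.Random(seed)
    return [[max(0, min(255, p + rng.randint(-amplitude, amplitude)))
             for p in row] for row in img]

def augment(img):
    """Turn one labelled sample into several variants of the same class."""
    return [hflip(img), translate_right(img, 1), add_noise(img)]

image = [[10, 20, 30],
         [40, 50, 60],
         [70, 80, 90]]
variants = augment(image)  # three extra "same class" samples from one original
```

Stacking these transforms (flip, then shift, then noise, and so on) is what produces the exponential growth in samples described below.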
This technique works wonders and produces tons of additional images at a really low cost. You can try to crop, rotate, translate, or scale your image. You can add noise, blur it, change its colors, or obstruct parts of it. In all cases, you need to make sure the data is still representing the same class.

All these images still represent the "cat" category.

This can be extremely powerful, as stacking these effects gives exponentially numerous samples for your dataset. Note that this is still usually inferior to collecting more raw data.

Combined data augmentation techniques. The class is still "cat" and should be recognized as such.

Caveats: all augmentation techniques might not be usable for your problem. For example, if you want to classify Lemons and Limes, don't play with the hue, as it would make sense that color is important for the classification. This type of data augmentation would make it harder for the model to find discriminating features.

2. Low quality classes

It's an easy one, but take the time to go through your dataset if possible, and verify the label of each sample. This might take a while, but having counter-examples in your dataset will be detrimental to the learning process.

Also, choose the right level of granularity for your classes. Depending on the problem, you might need more or fewer classes. For example, you can classify the image of a kitten with a global classifier to determine it's an animal, then run it through an animal classifier to determine it's a kitten. A huge model could do both, but it would be much harder.

Two-stage prediction with specialized classifiers.

3. Low quality data

As said in the introduction, low quality data will only lead to low quality results.

You might have samples in your dataset that are too far from what you want to use. These might be more confusing for the model than helpful.

Solution: remove the worst images. This is a lengthy process, but it will improve your results.
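Part of this triage can be pre-filtered automatically before the manual pass. The sketch below flags statistical outliers by mean brightness; this is a crude heuristic of my own for illustration, not a method from the article, and flagged samples should still be reviewed by a human before removal.

```python
def mean_brightness(img):
    """Average pixel value of a 2D grayscale image (nested lists)."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)

def flag_outliers(dataset, k=2.0):
    """Return indices of samples whose mean brightness lies more than
    k standard deviations from the dataset mean -- a crude proxy for
    'too far from the rest of the data'."""
    means = [mean_brightness(img) for img in dataset]
    mu = sum(means) / len(means)
    sigma = (sum((m - mu) ** 2 for m in means) / len(means)) ** 0.5
    if sigma == 0:
        return []
    return [i for i, m in enumerate(means) if abs(m - mu) > k * sigma]

# five ordinary dark images and one blown-out outlier
dataset = [[[10, 10], [10, 10]]] * 5 + [[[255, 255], [255, 255]]]
suspicious = flag_outliers(dataset)  # -> [5]
```

The same pattern works with any per-sample statistic (blur score, resolution, file size); brightness is just the simplest one to compute here.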
Sure, these three images represent cats, but the model might not be able to work with them.

Another common issue is when your dataset is made of data that doesn't match the real world application, for instance if the images are taken from completely different sources.

Solution: think about the long term application of your technology, and which means will be used to acquire data in production. If possible, try to find/build a dataset with the same tools.

Using data that doesn't represent your real world application is usually a bad idea. Your model is likely to extract features that won't work in the real world.

4. Unbalanced classes

If the number of samples per class isn't roughly the same for all classes, the model might have a tendency to favor the dominant class, as it results in a lower error. We say that the model is biased because the class distribution is skewed. This is a serious issue, and also why you need to take a look at precision, recall or confusion matrices.

Solution #1: gather more samples of the underrepresented classes. However, this is often costly in time and money, or simply not feasible.

Solution #2: over/under-sample your data. This means that you remove some samples from the over-represented classes, and/or duplicate samples from the under-represented classes. Better than duplication, use data augmentation as seen previously.

We need to augment the under-represented class (cat) and leave aside some samples from the over-represented class (lime). This will give a much smoother class distribution.

5. Unbalanced data

If your data doesn't have a specific format, or if the values don't lie in a certain range, your model might have trouble dealing with it. You will have better results with images that are consistent in aspect ratio and pixel values.

Solution #1: crop or stretch the data so that it has the same aspect ratio or format as the other samples.

Two possibilities to improve a badly formatted image.
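A minimal sketch of the cropping option, again on images stored as plain nested lists (the `center_crop_square` helper is illustrative; real pipelines would use a library resize/crop):

```python
def center_crop_square(img):
    """Center-crop a 2D image (list of rows) to a 1:1 aspect ratio,
    so every sample in the dataset shares the same aspect ratio."""
    h, w = len(img), len(img[0])
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return [row[left:left + side] for row in img[top:top + side]]

wide = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
square = center_crop_square(wide)  # -> [[2, 3], [6, 7]]
```

Cropping keeps proportions intact at the cost of losing border pixels; stretching keeps all pixels at the cost of distorting shapes, which is why both options are shown in the image above.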
Solution #2: normalize the data so that every sample has its data in the same value range.

The value range is normalized to be consistent across the dataset.

6. No validation or testing

Once your dataset has been cleaned, augmented, and properly labelled, you need to split it. Many people split it the following way: 80% for training, and 20% for testing, which allows you to easily spot overfitting. However, if you are trying multiple models on the same testing set, something else happens. By picking the model giving the best test accuracy, you are in fact overfitting the testing set. This happens because you are manually selecting a model not for its intrinsic value, but for its performance on a specific set of data.

Solution: split the dataset in three: training, validation and testing. This shields your testing set from being overfitted by the choice of the model. The selection process becomes:

Train your models on the training set.
Test them on the validation set to make sure you aren't overfitting.
Pick the most promising model. Test it on the testing set; this will give you the true accuracy of your model.

Note: once you have chosen your model for production, don't forget to train it on the whole dataset! The more data the better!

Conclusion

I hope by now you are convinced that you must pay attention to your dataset before even thinking about your model. You now know the biggest mistakes of working with data, how to avoid the pitfalls, plus tips and tricks on how to build killer datasets! In case of doubt, remember:

"The winner is not the one with the best model, it's the one with the best data."

🎉 You've reached the end! I hope you enjoyed this article. If you did, please like it, share it, subscribe to the newsletter, send me pizzas, follow me on medium, or do whatever you feel like doing! 🎉
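For reference, the three-way split from mistake #6 can be sketched in a few lines. The `split_dataset` helper name and the 80/10/10 fractions are my own illustrative choices, not a universal rule (libraries like scikit-learn provide equivalents such as `train_test_split`):

```python
import random

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle, then cut into training / validation / testing sets.
    Shuffling first avoids any ordering bias in the original data."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

train_set, val_set, test_set = split_dataset(list(range(100)))
```

Fixing the seed makes the split reproducible, which matters when comparing several models on the same validation set.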