Image Source: Fernando Arcos
Common Data and Model issues you should be aware of and check when working with data and when training Machine Learning or Deep Learning models.
Expect problems and eat them for breakfast. Alfred A. Montapert
In this article we will talk about the most common data problems you should know and check when conducting data analysis and modelling, as well as the most common model issues when training a Machine Learning or Deep Learning Model.
- Data Problems
  - Missing Data
  - Insufficient Data
  - Errors in Data
  - Imbalanced Data
  - Biased Sample
- Model Problems
  - Model Assumptions
  - Overfitting
  - Not Sustainable Model in Long Term
  - Not Compatible Model
Online, at universities, in courses and in bootcamps, we talk a lot about ML or DL models for solving various business and software problems. Oftentimes we assume that the chosen model is universal, or that the data is in perfect shape, cleaned and checked, stored somewhere and ready to be used.
Sadly, this couldn’t be further from the truth. Most of the time, the data is dirty and contains many issues, which should be checked and resolved before even considering using it for training a model and making formal recommendations to the Product Team.
Yet we rarely discuss the set of possible problems that one should be aware of and check when implementing an ML or DL model.
There are usually two types of problems:
1: Data Problems
2: Model Problems
💡Missing Data: use mean imputation, KNN Imputation, Single Imputation (SI), or Multiple Imputation (MI) to fill in missing values (if that makes sense for your data). Check out this case study, which covers all these imputation techniques and their checks:
[TatevKaren-data-science-portfolio/Missing-Data-Imputation-Case-Study](https://github.com/TatevKaren/TatevKaren-data-science-portfolio/tree/main/Missing-Data-Imputation-Case-Study): Data Science Portfolio of Tatev Karen Aslanyan including Case Studies and Research Projects that I have completed that… (github.com)
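To make the simplest of these techniques concrete, here is a minimal sketch of mean imputation using only the Python standard library. The function name `mean_impute` and the `ages` column are hypothetical and only serve as an illustration; this is not the code from the case study.

```python
import math
from statistics import mean


def mean_impute(values):
    """Replace missing entries (None or NaN) with the mean of the observed ones."""

    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if is_missing(v) else v for v in values]


# Hypothetical column with two missing entries (one None, one NaN).
ages = [25, None, 31, 40, float("nan"), 24]
imputed_ages = mean_impute(ages)
print(imputed_ages)
```

Keep in mind that mean imputation keeps the column mean unchanged but reduces its variance, which is one reason to also consider KNN or Multiple Imputation. In practice you would likely reach for a library such as scikit-learn, which ships `SimpleImputer` and `KNNImputer`.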
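💡Insufficient Data: use Bootstrapping or Monte Carlo (MC) Simulation to generate additional data that follows the distribution of the original sample. Keep in mind that resampling reuses the information already in the sample rather than adding new information. For Monte Carlo Simulation, check out the following detailed blog: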
[Monte Carlo Simulation and Variants with Python](https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f): Your Guide to Monte Carlo Simulation and Must Know Statistical Sampling Techniques With Python Implementation (towardsdatascience.com)
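As a rough illustration of bootstrapping, the sketch below resamples a small sample with replacement and uses the resampled means to build a percentile interval. The `salaries` values, the function names, and the choice of 1,000 resamples are all assumptions made for this example.

```python
import random
from statistics import mean, quantiles


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Draw resamples with replacement and return the mean of each one."""
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval_95(stats):
    """Approximate 95% interval: the 2.5% and 97.5% cut points of the bootstrap statistics."""
    cuts = quantiles(stats, n=40)  # 39 cut points, spaced 2.5% apart
    return cuts[0], cuts[-1]


# Hypothetical small sample (salaries in thousands).
salaries = [42, 48, 51, 55, 58, 60, 63, 70, 75, 90]
boot = bootstrap_means(salaries)
low, high = percentile_interval_95(boot)
print(f"sample mean = {mean(salaries):.1f}, bootstrap interval = ({low:.1f}, {high:.1f})")
```

The interval describes the uncertainty of the sample mean. It gets narrower only when you collect more real observations, not when you draw more resamples.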
💡Errors in Data: identify outliers and errors in the data and remove them. You can use boxplots to spot outliers visually. You can also use the IQR (interquartile range) approach to flag unusually small and large values and check whether they make sense. For instance, if the salary in a job post is equal to 0, this is most likely an error.
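The IQR approach can be sketched in a few lines. The example below flags values that fall more than 1.5 × IQR outside the first and third quartiles, which is the same rule a standard boxplot uses for its whiskers. The `salaries` list is made-up data, and the multiplier `k=1.5` is the usual convention rather than a fixed rule.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical yearly salaries from job posts; 0 and 250,000 look suspicious.
salaries = [0, 52_000, 55_000, 58_000, 60_000, 61_000, 63_000, 67_000, 70_000, 250_000]
print(iqr_outliers(salaries))  # → [0, 250000]
```

A flagged value is a candidate for inspection, not automatically an error: the 0 salary is almost certainly a data-entry problem, while a very high salary may be genuine.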
💡Imbalanced Data: in a classification problem, the data is imbalanced when the number of data points differs significantly between classes. Ideally, each class has approximately the same number of observations. You can use undersampling or oversampling, depending on the nature of your data.
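Here is a minimal sketch of random oversampling, where minority-class rows are duplicated at random until every class is as large as the biggest one. The function name, the `rows` and `labels` data, and the class names are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample the missing rows with replacement (none for the majority class).
        extra = rng.choices(members, k=target - len(members))
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical dataset: 8 "ok" rows and 2 "fraud" rows.
rows = [[i] for i in range(10)]
labels = ["ok"] * 8 + ["fraud"] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Apply the resampling to the training split only. If duplicated rows leak into the test set, the evaluation will look better than it really is.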
💡Biased Sample: your training sample is not a true representation of your population. To solve this problem, you can use more advanced sampling techniques such as Stratified Sampling. Check out my article on various Data Sampling methods here:
[Data Sampling Methods in Python](https://towardsdatascience.com/data-sampling-methods-in-python-a4400628ea1b): A ready-to-run code with different data sampling techniques to create a random and representative sample in Python (towardsdatascience.com)
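As an illustration of Stratified Sampling, the sketch below splits the population into strata and draws the same fraction from each one, so the sample keeps the population's proportions. The `population` records, the `region` field, and the 10% fraction are assumptions made for this example and are not taken from the linked article.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum (at least one record per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # without replacement
    return sample


# Hypothetical population: 70 records from the north, 30 from the south.
population = [{"id": i, "region": "north" if i < 70 else "south"} for i in range(100)]
sample = stratified_sample(population, key=lambda r: r["region"], fraction=0.1)
print(Counter(r["region"] for r in sample))
```

A simple random sample of 10 records could easily end up with 9 or 10 northern records by chance, while the stratified version always preserves the 70/30 split.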
💡Model Assumptions: every model rests on assumptions that you should check before selecting it as your final model. For instance, Linear Regression (LR), the simplest ML model, has 5 fundamental assumptions.
LR often fails even when the underlying relationship is linear, because one of these 5 assumptions is violated. To avoid this problem, make sure you check each of them.
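As one example of such a check, here is a rough sketch for homoskedasticity (constant error variance), which is commonly listed among the Linear Regression assumptions. It fits a one-feature regression and compares the residual variance for high and low values of x, in the spirit of the Goldfeld-Quandt test. The synthetic data, the function names, and the idea of reading a ratio far from 1 as a warning sign are assumptions of this sketch, not a formal test.

```python
from statistics import mean, variance


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_variance_ratio(x, y):
    """Residual variance in the upper half of x divided by that in the lower half."""
    intercept, slope = fit_simple_ols(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ordered = [r for _, r in sorted(zip(x, residuals))]  # residuals sorted by x
    half = len(ordered) // 2
    return variance(ordered[-half:]) / variance(ordered[:half])


# Synthetic data: same linear trend, deterministic alternating "noise".
x = list(range(1, 41))
signs = [1 if xi % 2 == 0 else -1 for xi in x]
y_constant = [2 + 3 * xi + s for xi, s in zip(x, signs)]           # constant noise size
y_growing = [2 + 3 * xi + s * xi / 10 for xi, s in zip(x, signs)]  # noise grows with x

ratio_constant = residual_variance_ratio(x, y_constant)
ratio_growing = residual_variance_ratio(x, y_growing)
print(f"constant noise: {ratio_constant:.2f}, growing noise: {ratio_growing:.2f}")
```

A ratio close to 1 is consistent with constant variance, while a much larger ratio suggests that the errors grow with x. For real projects, a statistics library with formal tests and residual plots is the safer route.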
For more about these and Linear Regression check this blog:
[Complete Guide to Linear Regression](https://pub.towardsai.net/complete-guide-to-linear-regression-86c5eddb7eda): Everything you need to know about the simplest yet the most popular Machine Learning regression model (pub.towardsai.net)
💡Overfitting: the model performs well on the training data but noticeably worse on unseen test data. Use regularization: Ridge or Lasso in common ML models, and dropout in DL models. Check out my article about overfitting and its remedy here:
[Bias-Variance Trade-Off, Overfitting and Regularization in Machine Learning](https://towardsdatascience.com/bias-variance-trade-off-overfitting-regularization-in-machine-learning-d79c6d8f20b4): Introduction to bias-variance trade-off, overfitting & how to solve overfitting using regularization: Ridge and Lasso… (towardsdatascience.com)
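To show what the Ridge penalty does, here is a small sketch for the one-feature case with an unpenalized intercept. Minimizing the squared error plus `lam * b**2` gives the closed-form slope `Sxy / (Sxx + lam)`, so a larger penalty shrinks the slope toward zero. The data points and the name `ridge_slope` are made up for this example.

```python
from statistics import mean


def ridge_slope(x, y, lam):
    """Ridge slope for one feature with an unpenalized intercept.

    Minimizes sum((y - a - b * x) ** 2) + lam * b ** 2, whose solution
    on centered data is b = Sxy / (Sxx + lam). lam = 0 gives plain OLS.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / (sxx + lam)


# Hypothetical training data with a roughly linear relationship.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
for lam in (0, 1, 10, 100):
    print(f"lambda = {lam:>3}: slope = {ridge_slope(x, y, lam):.3f}")
```

The shrinkage trades a little bias for lower variance, which is why it helps when a model fits the training data too closely. The penalty strength is usually chosen with cross-validation, and libraries such as scikit-learn provide `Ridge` and `Lasso` estimators for the multi-feature case.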
💡Not Sustainable Model in Long Term: the model is, for instance, part of a library that is no longer updated, such as Learning to Rank (LtR) in the RankyMcRankFace library. In that case, it is better to look for an alternative model.
💡Not Compatible Model: if your model is not available in the programming language or tool you rely on, e.g. PySpark, while your baseline models and your company's pipelines and APIs are all in PySpark and in the cloud, it will be hard to combine these models, and productionizing and supporting the model will require more engineering resources.
Hello, fellow enthusiasts! I am Tatev, the person behind the insights and information shared in this blog. My journey in the vibrant world of Data Science and AI has been nothing short of incredible, and it’s a privilege to be able to share this wealth of knowledge with all of you.