Common Data and Model Problems with Solutions in Data Analytics and Data Science
November 18, 2024

Common issues in Data Analytics and Data Science projects you should be aware of, how to check for them, and how to fix them

Image Source: Fernando Arcos

Common Data and Model issues you should be aware of and check when working with data and when training Machine Learning or Deep Learning models.

"Expect problems and eat them for breakfast." (Alfred A. Montapert)

In this article we will talk about the most common data problems you should know and check when conducting data analysis and modelling, as well as the most common model issues when training a Machine Learning or Deep Learning Model.

- Data Problems
  - Missing Data
  - Insufficient Data
  - Errors in Data
  - Imbalanced Data
  - Biased Sample
- Model Problems
  - Model Assumptions
  - Overfitting
  - Not Sustainable Model in Long Term
  - Not Compatible Model

Online, at universities, in courses, and in bootcamps, we talk a lot about ML and DL models for solving various business and software problems. Oftentimes we assume that the chosen model is universal, or that the data is in perfect shape, cleaned and checked, and stored somewhere ready to be used.

Sadly, this couldn’t be further from the truth. Most of the time, the data is dirty and contains many issues, which should be checked and resolved before even considering using it for training a model and making formal recommendations to the Product Team.

Yet we rarely discuss the set of possible problems that one should be aware of and check for when implementing an ML or DL model.

There are usually two types of problems:
1: Data Problems
2: Model Problems

Common Data Problems

💡Missing Data: use mean imputation, KNN imputation, Single Imputation (SI), or Multiple Imputation (MI) to fill in missing values, if imputation makes sense for your data. Check out this case study, which covers all of these imputation techniques and how to check them:

[TatevKaren-data-science-portfolio/Missing-Data-Imputation-Case-Study at main ·…
Data Science Portfolio of Tatev Karen Aslanyan including Case Studies and Research Projects that I have completed that… (github.com)](https://github.com/TatevKaren/TatevKaren-data-science-portfolio/tree/main/Missing-Data-Imputation-Case-Study "github.com/TatevKaren/TatevKaren-data-scien..")
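
As a minimal sketch of two of the techniques named above (mean imputation and KNN imputation), the snippet below uses scikit-learn's `SimpleImputer` and `KNNImputer`. The toy DataFrame and its column names are made up for illustration and are not taken from the case study.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data (made up for illustration) with missing values in both columns.
df = pd.DataFrame({
    "age":    [25.0, 32.0, np.nan, 41.0, 38.0, np.nan],
    "salary": [3000.0, 4200.0, 3900.0, np.nan, 5100.0, 3300.0],
})

# Mean imputation: replace each missing value with its column mean.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: fill each gap using the 2 most similar rows
# (similarity is computed on the features that are present).
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(df.isna().sum())          # missing values per column before imputation
print(mean_imputed.round(1))
print(knn_imputed.round(1))
```

In practice you would probably scale the features before KNN imputation, since the neighbor search is distance-based, and compare the distribution of each column before and after imputing.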

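💡Insufficient Data: use Bootstrapping or Monte Carlo (MC) Simulation, which preserve the data distribution while increasing the amount of data. Keep in mind that resampling reuses the observations you already have, so it helps you quantify uncertainty but does not add new information. For Monte Carlo Simulation, check out the following detailed blog: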

[Monte Carlo Simulation and Variants with Python
Your Guide to Monte Carlo Simulation and Must Know Statistical Sampling Techniques With Python Implementation (towardsdatascience.com)](https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f "towardsdatascience.com/monte-carlo-simulati..")
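
To make the bootstrapping idea concrete, here is a small sketch with NumPy. The sample is synthetic, and the choices of 2,000 resamples and a percentile confidence interval are assumptions made for illustration rather than a recommendation from the linked blog.

```python
import numpy as np

rng = np.random.default_rng(42)

# Small observed sample (synthetic, for illustration only).
sample = rng.normal(loc=50.0, scale=10.0, size=30)

# Draw 2,000 bootstrap resamples: each has the same size as the sample
# and is drawn from it with replacement.
n_resamples = 2000
idx = rng.integers(0, sample.size, size=(n_resamples, sample.size))
boot_means = sample[idx].mean(axis=1)

# Percentile-based 95% confidence interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}")
print(f"95% bootstrap CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The resampled means follow the spread of the original 30 observations, which is why the technique keeps the data distribution intact.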

💡Errors in Data: identify outliers and errors in the data and remove or correct them. You can use boxplots to spot outliers visually. You can also use the IQR (Interquartile Range) approach to flag values that lie unusually far below the first quartile or above the third quartile, and then check whether the data’s smallest and largest values make sense. For instance, if the salary in a job post is equal to 0, this is most likely an error.
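
The IQR check can be sketched in a few lines of pandas. The salary values are hypothetical, and the 1.5 × IQR fences are the common boxplot convention rather than a rule stated in this article.

```python
import pandas as pd

# Hypothetical salaries; 0 and 250000 are planted as suspicious values.
salaries = pd.Series([0, 42000, 45000, 47000, 50000, 52000, 55000, 58000, 250000])

# First and third quartiles, and the interquartile range between them.
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1

# Boxplot-style fences: anything outside them is flagged for review.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers.tolist())  # [0, 250000]
```

Flagged values are candidates for review, not automatic deletions: a salary of 0 is almost certainly an error, while a very high salary may be real.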

💡Imbalanced Data: in a classification problem, there is a significant difference in the number of data points per class. Ideally, the number of observations per class is approximately the same. You can use undersampling or oversampling, depending on the nature of your data.
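
Below is a minimal sketch of random oversampling and random undersampling using `sklearn.utils.resample`. The 90/10 class split and the column names are invented for illustration; dedicated libraries offer more elaborate strategies.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 90 negatives, 10 positives (made up).
df = pd.DataFrame({
    "feature": range(100),
    "label":   [0] * 90 + [1] * 10,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: draw minority rows with replacement up to the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_over = pd.concat([majority, minority_up])

# Random undersampling: draw majority rows without replacement down to the minority size.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_under = pd.concat([majority_down, minority])

print(balanced_over["label"].value_counts().to_dict())   # 90 rows per class
print(balanced_under["label"].value_counts().to_dict())  # 10 rows per class
```

Resampling is usually applied to the training split only, so that the test set still reflects the real class proportions.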

💡Biased Sample: your training sample is not a true representation of your population. You can use more advanced sampling techniques to solve this problem such as Stratified Sampling. Check out my article on various Data Sampling methods here:

[Data Sampling Methods in Python
A ready-to-run code with different data sampling techniques to create a random and representative sample in Python (towardsdatascience.com)](https://towardsdatascience.com/data-sampling-methods-in-python-a4400628ea1b "towardsdatascience.com/data-sampling-method..")
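
As a rough illustration of Stratified Sampling, the sketch below draws the same fraction from every stratum with pandas. The `region` column and the 70/20/10 split are hypothetical.

```python
import pandas as pd

# Hypothetical population: 70% of rows in region "A", 20% in "B", 10% in "C".
population = pd.DataFrame({
    "region": ["A"] * 700 + ["B"] * 200 + ["C"] * 100,
    "value":  range(1000),
})

# Stratified sample: draw 10% from every region, so the shares are preserved.
stratified = population.groupby("region").sample(frac=0.1, random_state=1)

print(stratified["region"].value_counts().to_dict())  # A: 70, B: 20, C: 10
```

A simple random sample of 100 rows would match these shares only on average, whereas the stratified draw matches them exactly.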

Model Problems

💡Model Assumptions: for instance, if you are using Linear Regression, you should check the 5 fundamental assumptions of this model before selecting it as your final model.

Example

For instance, suppose you wish to use the simplest ML model, Linear Regression (LR). LR often doesn’t work, even when the underlying relationship is linear, because one of the 5 LR assumptions is violated. To avoid this problem, make sure you check each of these assumptions. Here is how you can do it:

  • A1: Linearity. Plot the residuals against the fitted values. If the pattern is non-linear, the estimates will be biased → the linearity assumption is violated.
  • A2: Random Sample. Plot the residuals and check whether their mean is 0. Otherwise, the estimates will be biased and this assumption is violated; it suggests that you systematically over- or under-predict y.
  • A3: Exogeneity. Check for Reverse Causality and for Omitted Variable Bias, both of which can lead to a correlation between one or more independent variables and the error term. For Reverse Causality, use the Hausman Test; if the result is positive → use IV estimation to fix this problem. For Omitted Variable Bias, plot the residuals against each of the independent variables and see whether there is a clear pattern; if yes, there is correlation (endogeneity) → consider the Heckman 2-step procedure, which targets the case where the omitted factor comes from sample selection.
  • A4: Homoskedasticity. Plot the residuals against the fitted values and see whether there is a funnel-like shape. If there is, the variance of the error terms is not constant and you have heteroskedasticity. Think about using GLS, GMM, etc.
  • A5: No Perfect Multicollinearity. To check for multicollinearity, you can use the VIF (Variance Inflation Factor), which measures how strongly each independent variable is correlated with the other independent variables.
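
Two of these checks, the residual mean from A2 and the VIF from A5, can be sketched with plain NumPy. The data is synthetic, the helper names `ols_fit` and `vif` are hypothetical, and the VIF is computed from its definition, 1 / (1 - R²), rather than through a statistics library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: x2 is almost a copy of x1 (collinear); x3 is independent.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x3 + rng.normal(size=n)

def ols_fit(features, target):
    """Least-squares fit with an intercept; returns the fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ beta

def vif(features, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    resid = target - ols_fit(others, target)
    r2 = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

# A2: with an intercept in the model, OLS residuals average to ~0 by construction,
# so a clearly non-zero mean points to a missing intercept or a fitting problem.
residuals = y - ols_fit(X, y)
print(f"mean residual: {residuals.mean():.2e}")

# A5: large VIFs (a common rule of thumb is above 5 or 10) signal multicollinearity.
vifs = [vif(X, j) for j in range(X.shape[1])]
print("VIFs:", np.round(vifs, 1))
```

Here the two nearly identical columns get very large VIFs while the independent column stays close to 1. The residual plots for A1 and A4 would use the same `residuals` array plotted against the fitted values.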

For more about these assumptions and Linear Regression, check this blog:

[Complete Guide to Linear Regression
Everything you need to know about the simplest yet the most popular Machine Learning regression model (pub.towardsai.net)](https://pub.towardsai.net/complete-guide-to-linear-regression-86c5eddb7eda "pub.towardsai.net/complete-guide-to-linear-..")

💡Overfitting: the model performs well on the training data but poorly on unseen test data. Use regularization, such as Ridge or Lasso in common ML models, and Dropout in DL models. Check out my article about overfitting and its remedies here:

[Bias-Variance Trade-Off, Overfitting and Regularization in Machine Learning
Introduction to bias-variance trade-off, overfitting & how to solve overfitting using regularization: Ridge and Lasso… (towardsdatascience.com)](https://towardsdatascience.com/bias-variance-trade-off-overfitting-regularization-in-machine-learning-d79c6d8f20b4 "towardsdatascience.com/bias-variance-trade-..")
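
The sketch below shows one way to spot overfitting and apply Ridge regularization with scikit-learn. The dataset is synthetic, with many features and few rows on purpose, and `alpha=10.0` is an arbitrary value chosen for illustration; in practice it would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic setup prone to overfitting: 60 rows, 40 features, only 3 informative.
X = rng.normal(size=(60, 40))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] + rng.normal(size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Unregularized baseline versus an L2-penalized (Ridge) model.
ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=10.0).fit(X_train, y_train)

# A large gap between train and test R^2 is the usual symptom of overfitting.
for name, model in [("OLS", ols), ("Ridge", ridge)]:
    print(f"{name}: train R^2 = {model.score(X_train, y_train):.3f}, "
          f"test R^2 = {model.score(X_test, y_test):.3f}")

# Ridge shrinks the coefficient vector toward zero.
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

The penalty trades a slightly worse fit on the training data for smaller coefficients, which typically narrows the gap between train and test scores. Swapping in `Lasso` follows the same pattern and can additionally set some coefficients exactly to zero.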

💡Not Sustainable Model in Long Term: when the model is, for instance, part of a library that no longer gets updated, such as Learning to Rank (LtR) in the RankyMcRankFace library, it is better to look for an alternative model.

💡Not Compatible Model: if your model is not available in a given programming language or tool, e.g. PySpark, while your baseline models, company pipelines, and APIs are all in PySpark and in the cloud, it will be hard to combine these models, and productionizing and supporting the model will require more engineering resources.

If you liked this article, here are some other articles you may enjoy:

[Simple and Complete Guide to A/B Testing
End-to-end A/B testing for your Data Science experiments for non-technical and technical specialists with examples and… (towardsdatascience.com)](https://towardsdatascience.com/simple-and-complet-guide-to-a-b-testing-c34154d0ce5a "towardsdatascience.com/simple-and-complet-g..")

[Fundamentals Of Statistics For Data Scientists and Data Analysts
Key statistical concepts for your data science or data analytics journey (towardsdatascience.com)](https://towardsdatascience.com/fundamentals-of-statistics-for-data-scientists-and-data-analysts-69d93a05aae7 "towardsdatascience.com/fundamentals-of-stat..")

[Monte Carlo Simulation and Variants with Python
Your Guide to Monte Carlo Simulation and Must Know Statistical Sampling Techniques With Python Implementation (towardsdatascience.com)](https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f "towardsdatascience.com/monte-carlo-simulati..")

[My Strategy on Writing Blogs With More Than 50K Views and High Engagement
Uncovering the strategy I used in writing blogs that got me more than 50K views and consistent engagement over a year (medium.com)](https://medium.com/geekculture/how-to-write-blogs-with-more-than-50k-views-and-high-engagement-9d4e094686a5 "medium.com/geekculture/how-to-write-blogs-w..")

About the Author — That’s Me!

Hello, fellow enthusiasts! I am Tatev, the person behind the insights and information shared in this blog. My journey in the vibrant world of Data Science and AI has been nothing short of incredible, and it’s a privilege to be able to share this wealth of knowledge with all of you.

Connect and Learn More:

Feel free to connect; whether it’s to discuss the latest trends, seek career advice, or just to share your own exciting journey in this field. I believe in fostering a community where knowledge meets passion, and I’m always here to support and guide aspiring individuals in this vibrant industry.
