Image Source: Maksim Goncharenok
There are no solutions, there are only trade-offs!
— Thomas Sowell
Whenever you are using a Statistical, Econometric, or Machine Learning model, no matter how simple the model is, you should always evaluate it and check its error rate. In all these cases, it comes down to the trade-off you make between the variance of your model and its bias, because there is always a catch when it comes to model choice and performance. In this blog post, I will cover the following topics:
- Model Error Rate
- What is Overfitting
- Reducible vs Irreducible Error
- Model Bias
- Model Variance
- Bias-Variance Trade-Off
If you have no prior Statistical knowledge, or you want to refresh your knowledge of the essential statistical concepts before jumping to the formulas in this article and other Statistical and ML concepts, you can check this article: Fundamentals of statistics for Data Scientists and Data Analysts
In order to evaluate the performance of a model, we need to look at the amount of error it makes. For simplicity, let’s assume we have the following simple regression model, which uses a single independent variable X to model the numeric dependent variable Y. That is, we fit our model on the training observations {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} and obtain the estimate f̂ (f_hat).
We can then compute f̂(x_1), f̂(x_2), …, f̂(x_n). If these are approximately equal to y_1, y_2, …, y_n, then the training error rate (e.g., the MSE) will be small. However, we are not really interested in whether f̂(x_i) ≈ y_i; instead, we want to know whether f̂(x_0) is approximately equal to y_0, where (x_0, y_0) is an unseen test data point, not used during the training of the model. We want to choose the method that gives the lowest test error rate, as opposed to the lowest training error rate. Mathematically, the model error rate of this example can be expressed as follows:
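Reconstructing the formula in standard notation, consistent with the definitions above: the training error rate is the average squared difference between the observed responses and the fitted values, while the test error rate is the same average computed over unseen observations (x_0, y_0).

```latex
\mathrm{MSE}_{\mathrm{train}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2,
\qquad
\mathrm{MSE}_{\mathrm{test}} = \mathrm{Ave}\bigl(y_0 - \hat{f}(x_0)\bigr)^2
```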
The fundamental problem with using training error rate to evaluate the model performance is that there is no guarantee that the method with the lowest training error rate will also have the lowest test error rate. Roughly speaking, the problem is that many ML or statistical methods specifically estimate model coefficients or parameters to minimize the training error rate. For these methods, the training error rate can be quite small, but the test error rate is often much larger.
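To make this concrete, here is a minimal simulation sketch (my own illustration, not the article’s code) using NumPy polynomial fits: the most flexible model achieves the lowest training MSE, yet not the lowest test MSE. The true function sin(2πx), the noise level, and the sample sizes are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)  # the true (normally unknown) relationship

# Training and test sets drawn from y = f(x) + eps
x_train = rng.uniform(0, 1, 15)
y_train = f(x_train) + rng.normal(0, 0.3, size=15)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.3, size=200)

mse = {}
for degree in (1, 3, 12):  # increasing flexibility
    coefs = np.polyfit(x_train, y_train, degree)
    mse[degree] = (
        np.mean((np.polyval(coefs, x_train) - y_train) ** 2),  # training MSE
        np.mean((np.polyval(coefs, x_test) - y_test) ** 2),    # test MSE
    )
    print(f"degree={degree:2d}  train MSE={mse[degree][0]:.3f}  "
          f"test MSE={mse[degree][1]:.3f}")
```

The degree-12 polynomial nearly interpolates the 15 training points, so its training MSE is tiny, while its test MSE is much larger than that of the moderate degree-3 fit.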
The accuracy of ŷ as a prediction for y depends on two quantities, which we can call the reducible error and the irreducible error. In general, f̂ will not be a perfect estimate of f, and this inaccuracy will introduce some error. This error is reducible, since we can potentially improve the accuracy of f̂ by using a more appropriate Machine Learning model to estimate f. However, even if it were possible to find a model that estimated f perfectly, so that the estimated response took the form ŷ = f(x), our prediction would still have some error in it. This happens because y is also a function of the error term ε, which, by definition, cannot be predicted using x.
So, variability associated with error ε also affects the accuracy of the predictions. This is known as the irreducible error because no matter how well we estimate f, we cannot reduce the error introduced by ε.
Hence, irreducible error in the model is the variance of the error terms ε and can be described by the following formula.
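In standard notation (assuming, as usual for a regression error term, that E[ε] = 0), the formula referred to above is:

```latex
\text{Irreducible Error} = \mathrm{Var}(\varepsilon) = \sigma^{2}_{\varepsilon}
```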
Unlike the reducible error, the irreducible error cannot be reduced by choosing a better model, because it arises from randomness or natural variability in the system.
The inability of a model to capture the true relationship in the data is called bias. Hence, ML models that are able to detect the true relationship in the data have low bias. Usually, complex or more flexible models tend to have lower bias than simpler models. Mathematically, the bias of the model can be expressed as follows:
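In standard notation, the bias of the estimate f̂ at a point x_0 is the difference between its expected prediction and the true value:

```latex
\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr) = \mathbb{E}\bigl[\hat{f}(x_0)\bigr] - f(x_0)
```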
The inability of the Machine Learning model to capture the true relationship in the data is called bias.
The variance of a model is the degree to which its performance changes when it is applied to different data sets. When a model that is trained on the training data performs entirely differently on the test data, the model has high variance. Complex or more flexible models tend to have higher variance than simpler models.
The variance of a model is the degree to which its performance changes when it is applied to different data sets.
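One way to see this variance directly (a sketch I have added, not the article’s code) is to refit the same model class on many freshly simulated training sets and measure how much its prediction at a fixed point varies. The true function and all numeric settings below are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)  # assumed true relationship

def predictions_at_x0(degree, x0=0.25, n=20, n_datasets=300):
    """Refit a degree-`degree` polynomial on many simulated training
    sets and collect its prediction at the fixed point x0."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, 0.3, size=n)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    return np.asarray(preds)

var_simple = predictions_at_x0(degree=1).var()    # rigid model
var_flexible = predictions_at_x0(degree=9).var()  # flexible model
print(f"variance of simple model:   {var_simple:.4f}")
print(f"variance of flexible model: {var_flexible:.4f}")
```

The flexible degree-9 fit changes dramatically from one training set to the next, while the straight line barely moves: that spread is exactly the model variance.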
It can be mathematically proven that the expected test error rate of a Machine Learning model, for a given value x_0, can be decomposed into the variance of the model, the squared bias of the model, and the irreducible error. Hence, in order to minimize the expected test error rate, we need to select a Machine Learning method that simultaneously achieves low variance and low bias. However, the variance and the bias of a model are negatively related: reducing one tends to increase the other.
Complex or more flexible models tend to have lower bias, but at the same time they tend to have higher variance than simpler models.
Image Source: Author
As a general rule, as the flexibility of the methods increases, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test error rate will increase or decrease.
Mathematically, the expected test error of the supervised model is the sum of the squared bias of the model, the variance of the model, and the irreducible error. That is:
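In standard notation, and consistent with the bias and irreducible-error formulas above, the decomposition reads:

```latex
\mathbb{E}\Bigl[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\Bigr]
= \mathrm{Var}\bigl(\hat{f}(x_0)\bigr)
+ \Bigl[\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr)\Bigr]^2
+ \mathrm{Var}(\varepsilon)
```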
As we increase the flexibility of a class of methods, the bias initially tends to decrease faster than the variance increases, so the expected test error rate declines. At some point, however, increasing flexibility has little further impact on the bias but starts to significantly increase the variance. So it is all about finding that balance, the best-fit point where the test error rate is at its minimum and about to turn upwards.
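The same simulation idea can be extended to estimate the bias and variance terms empirically (again a sketch with assumed settings, not the article’s code): as the polynomial degree grows, the estimated squared bias shrinks while the variance grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sin(2 * np.pi * x)  # assumed true relationship

x0, sigma, n, reps = 0.25, 0.3, 20, 500  # evaluation point and noise level

terms = {}
for degree in (1, 3, 9):  # increasing flexibility
    preds = []
    for _ in range(reps):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, size=n)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.asarray(preds)
    bias_sq = (preds.mean() - f(x0)) ** 2  # squared bias at x0
    variance = preds.var()                 # variance at x0
    terms[degree] = (bias_sq, variance)
    # expected test error at x0 ~ bias^2 + variance + sigma^2
    print(f"degree={degree}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

Running this shows the trade-off numerically: the straight line has large squared bias and tiny variance, while the degree-9 fit has negligible bias and much larger variance, with the best total error somewhere in between.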
Image Source: Author
Based on the relationship between bias and variance, a Machine Learning model can fall into one of 4 possible scenarios:
- Low bias and low variance: the ideal model
- Low bias and high variance: an overfit model
- High bias and low variance: an underfit model
- High bias and high variance: the worst case
- [Data Sampling Methods in Python](https://tatev-aslanyan.medium.com/data-sampling-methods-in-python-a4400628ea1b): A ready-to-run code with different data sampling techniques to create a random sample in Python
- [Fundamentals Of Statistics For Data Scientists and Data Analysts](https://towardsdatascience.com/fundamentals-of-statistics-for-data-scientists-and-data-analysts-69d93a05aae7): Key statistical concepts for your data science or data analytics journey
- [Simple and Complete Guide to A/B Testing](https://towardsdatascience.com/simple-and-complet-guide-to-a-b-testing-c34154d0ce5a): End-to-end A/B testing for your Data Science experiments for non-technical and technical specialists with examples and…
- [Monte Carlo Simulation and Variants with Python](https://towardsdatascience.com/monte-carlo-simulation-and-variants-with-python-43e3e7c59e1f): Your Guide to Monte Carlo Simulation and Must-Know Statistical Sampling Techniques With Python Implementation
- [PySpark Cheat Sheet: Big Data Analytics](https://medium.com/analytics-vidhya/pyspark-cheat-sheet-big-data-analytics-161a8e1f6185): Here is a cheat sheet for the essential PySpark commands and functions. Start your big data analysis in PySpark.
Hello, fellow enthusiasts! I am Tatev, the person behind the insights and information shared in this blog. My journey in the vibrant world of Data Science and AI has been nothing short of incredible, and it’s a privilege to be able to share this wealth of knowledge with all of you.
Connect and Learn More:
Feel free to connect; whether it’s to discuss the latest trends, seek career advice, or just to share your own exciting journey in this field. I believe in fostering a community where knowledge meets passion, and I’m always here to support and guide aspiring individuals in this vibrant industry.
Want to learn everything about Data Science and how to land a Data Science job? Download this FREE Data Science and AI Career Handbook
I encourage you to join Medium today to have complete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.
Happy learning!