Image Source: Karolina-Grabowska
Correlation is not causation.
— Kenneth L. Woodward
The first thing you can read in every statistics book is that correlation is not causation. However, it’s also the first thing many students forget once they see their data and start to look for insights about it. Linear Regression is one of the most used and most popular Machine Learning and Statistical models, and it aims to detect causal relationships between a set of predictors and the response variable.
- Linear Regression Model Assumptions
- Checking Linear Regression Assumptions
- Ordinary Least Squares (OLS)
- OLS Estimates Properties (Bias, Consistency, Efficiency)
- Confidence Interval and Margin of Error
- Hypothesis testing
- Statistical significance testing
- Type I & Type II Errors
- Statistical tests (Student's t-test, F-test)
- Model Performance (Type I, Type II error, R-Squared, Adjusted R-Squared)
- Python Implementation
Causation between variables is present when a variable has a direct impact on another variable. When the relationship between two variables is linear, then Linear Regression is a statistical method that can help to model the impact of a unit change in a variable, the independent variable, on the values of another variable, the dependent variable.
Dependent variables are often referred to as response variables or explained variables, whereas independent variables are often referred to as regressors or explanatory variables. When the Linear Regression model is based on a single independent variable, then the model is called Simple Linear Regression and when the model is based on multiple independent variables, it’s referred to as Multiple Linear Regression. Simple Linear Regression can be described by the following expression:
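Y = β0 + β1·X + u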
where Y is the dependent variable, X is the independent variable that is part of the data, β0 is the intercept which is unknown and constant, β1 is the slope coefficient or a parameter corresponding to the variable X, which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values.
The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired ( X, Y ) data. One example of the Linear Regression application is modeling the impact of Flipper Length on penguins’ Body Mass, which is visualized below.
Image Source: The Author
Multiple Linear Regression with three independent variables can be described by the following expression:
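Y = β0 + β1·X1 + β2·X2 + β3·X3 + u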
The Linear Regression Machine Learning method makes the following assumptions, which need to be satisfied to get reliable prediction results:
A1: Linearity assumption states that the model is linear in parameters.
A2: Random Sample assumption states that all observations in the sample are randomly selected.
A3: Exogeneity assumption states that independent variables are uncorrelated with the error terms.
A4: Homoskedasticity assumption states that the variance of all error terms is constant.
A5: No Perfect Multi-Collinearity assumption states that none of the independent variables is constant and there are no exact linear relationships between the independent variables.
The ordinary least squares (OLS) is a method for estimating the unknown parameters such as β0 and β1 in a linear regression model. The model is based on the principle of least squares, which minimizes the sum of squares of the differences between the observed dependent variable and its values predicted by the linear function of the independent variable, often referred to as fitted values. The difference between the real and predicted values of the dependent variable Y is referred to as the residual, and what OLS does is minimize the sum of squared residuals. This optimization problem results in the following OLS estimates for the unknown parameters β0 and β1, which are also known as coefficient estimates:
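β̂1 = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²
β̂0 = Ȳ − β̂1·X̄
where the sums run over all N observations and X̄ and Ȳ denote the sample means of X and Y.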
Once these parameters of the Simple Linear Regression model are estimated, the fitted values of the response variable can be computed as follows:
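Ŷi = β̂0 + β̂1·Xi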
The residuals or the estimated error terms can be determined as follows:
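ûi = Yi − Ŷi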
It is important to keep in mind the difference between the error terms and residuals. Error terms are never observed, while the residuals are calculated from the data. The OLS residuals estimate the error term for each observation, but they are not the actual error terms, so the true error variance is still unknown. Moreover, these estimates are subject to sampling uncertainty. What this means is that we will never be able to determine the exact estimate, the true value, of these parameters from sample data in an empirical application. However, we can estimate the error variance by calculating the sample residual variance, using the residuals as follows:
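σ̂² = Σ ûi² / (N − 2)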
This estimate for the variance of sample residuals helps to estimate the variance of the estimated parameters, which is often expressed as follows:
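Var(β̂1) = σ̂² / Σ (Xi − X̄)²   (shown here for the slope coefficient β̂1)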
The squared root of this variance term is called the standard error of the estimate, which is a key component in assessing the accuracy of the parameter estimates. It is used to calculate test statistics and confidence intervals. The standard error can be expressed as follows:
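SE(β̂1) = √Var(β̂1) = σ̂ / √( Σ (Xi − X̄)² )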
It is important to keep in mind the difference between the error terms and residuals. Error terms are never observed, while the residuals are calculated from the data.
Under the assumption that the OLS criteria A1 — A5 are satisfied, the OLS estimators of coefficients β0 and β1 are BLUE and Consistent.
Gauss-Markov theorem
This theorem highlights the properties of OLS estimates where the term BLUE stands for Best Linear Unbiased Estimator.
The bias of an estimator is the difference between its expected value and the true value of the parameter being estimated and can be expressed as follows:
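Bias(β̂) = E[β̂] − β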
When we state that the estimator is unbiased what we mean is that the bias is equal to zero, which implies that the expected value of the estimator is equal to the true parameter value, that is:
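E[β̂] = β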
Unbiasedness does not guarantee that the obtained estimate with any particular sample is equal to or close to β. What it means is that if one repeatedly draws random samples from the population and then computes the estimate each time, then the average of these estimates would be equal to or very close to β.
The term Best in the Gauss-Markov theorem relates to the variance of the estimator and is referred to as efficiency. A parameter can have multiple estimators, but the one with the lowest variance is called efficient.
The term consistency goes hand in hand with the terms sample size and convergence. If the estimator converges to the true parameter as the sample size becomes very large, then this estimator is said to be consistent, that is:
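β̂ → β as N → ∞, that is, plim β̂ = β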
Under the assumption that the OLS criteria A1 — A5 are satisfied, the OLS estimators of coefficients β0 and β1 are BLUE and Consistent.
Gauss-Markov Theorem
All these properties hold for OLS estimates as summarized in the Gauss-Markov theorem. In other words, OLS estimates have the smallest variance, they are unbiased, linear in parameters, and consistent. These properties can be mathematically proven by using the OLS assumptions made earlier.
The Confidence Interval is the range that contains the true population parameter with a certain pre-specified probability, referred to as the confidence level of the experiment, and it is obtained by using the sample results and the margin of error.
The margin of error is the difference between the sample result and what the result would have been had one used the entire population.
The Confidence Level describes the level of certainty in the experimental results. For example, a 95% confidence level means that if one were to perform the same experiment repeatedly 100 times, then 95 of those 100 trials would lead to similar results. Note that the confidence level is defined before the start of the experiment because it will affect how big the margin of error will be at the end of the experiment.
As was mentioned earlier, the OLS estimates of the Simple Linear Regression, the estimates for intercept β0 and slope coefficient β1, are subject to sampling uncertainty. However, we can construct CI’s for these parameters, which will contain the true value of these parameters in 95% of all samples. That is, 95% confidence interval for β can be interpreted as follows:
95% confidence interval of OLS estimates can be constructed as follows:
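CI(95%) = [β̂ − 1.96·SE(β̂), β̂ + 1.96·SE(β̂)]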
which is based on the parameter estimate, the standard error of that estimate, and the value 1.96, the critical value corresponding to the 5% rejection rule (multiplying it by the standard error gives the margin of error). This value is determined using the Normal Distribution table, which will be discussed later on in this article. Meanwhile, the following figure illustrates the idea of 95% CI:
Image Source: Wikipedia
Note that the confidence interval depends on the sample size as well, given that it is calculated using the standard error, which is based on sample size.
The confidence level is defined before the start of the experiment because it will affect how big the margin of error will be at the end of the experiment.
Testing a hypothesis in Statistics is a way to test the results of an experiment or survey to determine how meaningful the results are. Basically, one is testing whether the obtained results are valid by figuring out the odds that the results have occurred by chance. If it is the latter, then the results are not reliable, and neither is the experiment. Hypothesis Testing is part of Statistical Inference.
Firstly, you need to determine the thesis you wish to test, then you need to formulate the Null Hypothesis and the Alternative Hypothesis. The test can have two possible outcomes, and based on the statistical results, you can either reject the stated hypothesis or accept it. As a rule of thumb, statisticians tend to put the version or formulation of the hypothesis that needs to be rejected under the Null Hypothesis, whereas the acceptable and desired version is stated under the Alternative Hypothesis.
Let’s look at the earlier mentioned example where the Linear Regression model was used to investigate whether a penguin’s Flipper Length, the independent variable, has an impact on Body Mass, the dependent variable. We can formulate this model with the following statistical expression:
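Body Mass = β0 + β1·Flipper Length + u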
Then, once the OLS estimates of the coefficients are estimated, we can formulate the following Null and Alternative Hypotheses to test whether the Flipper Length has a statistically significant impact on the Body Mass:
where H0 and H1 represent the Null Hypothesis and Alternative Hypothesis, respectively. Rejecting the Null Hypothesis would mean that a one-unit increase in Flipper Length has a direct impact on the Body Mass, given that the parameter estimate of β1 describes this impact of the independent variable, Flipper Length, on the dependent variable, Body Mass. This hypothesis can be reformulated as follows:
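H0: β1 = 0
H1: β1 ≠ 0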
where H0 states that the parameter estimate of β1 is equal to 0, that is, the Flipper Length effect on Body Mass is statistically insignificant, whereas H1 states that the parameter estimate of β1 is not equal to 0, suggesting that the Flipper Length effect on Body Mass is statistically significant.
When performing Statistical Hypothesis Testing, one needs to consider two conceptual types of errors: Type I error and Type II error. The Type I error occurs when the Null is wrongly rejected, whereas the Type II error occurs when the Null Hypothesis is wrongly not rejected. A confusion matrix can help to visualize the severity of these two types of errors clearly.
As a rule of thumb, statisticians tend to put the version of the hypothesis that needs to be rejected under the Null Hypothesis, whereas the acceptable and desired version is stated under the Alternative Hypothesis.
Once the Null and the Alternative Hypotheses are stated, and the test assumptions are defined, the next step is to determine which statistical test is appropriate and to calculate the test statistic. Whether or not to reject the Null can be determined by comparing the test statistic with the critical value. This comparison shows whether or not the observed test statistic is more extreme than the defined critical value, and it can have two possible results:
The critical value is based on a prespecified significance level α (usually chosen to be equal to 5%) and the type of probability distribution the test statistic follows. The critical value divides the area under this probability distribution curve into the rejection region(s) and non-rejection region. There are numerous statistical tests used to test various hypotheses. Examples of Statistical tests are Student’s t-test, F-test, Chi-squared test, Durbin-Wu-Hausman Endogeneity test, and White Heteroskedasticity test. In this article, we will look at two of these statistical tests.
The Type I error occurs when the Null is wrongly rejected whereas the Type II error occurs when the Null Hypothesis is wrongly not rejected.
One of the simplest and most popular statistical tests is the Student’s t-test, which can be used for testing various hypotheses, especially when dealing with a hypothesis where the main area of interest is to find evidence for the statistically significant effect of a single variable. The test statistics of the t-test follow [Student’s t distribution](en.wikipedia.org/wiki/Student%27s_t-distrib..) and can be determined as follows:
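t = (β̂ − h0) / SE(β̂)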
where h0 in the numerator is the value against which the parameter estimate is being tested. So, the t-test statistic is equal to the parameter estimate minus the hypothesized value, divided by the standard error of the coefficient estimate. In the earlier stated hypothesis, where we wanted to test whether Flipper Length has a statistically significant impact on Body Mass or not, this test can be performed using a t-test, and h0 is in that case equal to 0 since the slope coefficient estimate is tested against the value 0.
There are two versions of the t-test: a two-sided t-test and a one-sided t-test. Whether you need the former or the latter version of the test depends entirely on the hypothesis that you want to test.
The two-sided or two-tailed t-test can be used when the hypothesis is testing an equal versus not equal relationship under the Null and Alternative Hypotheses that is similar to the following example:
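H0: β = h0
H1: β ≠ h0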
The two-sided t-test has two rejection regions as visualized in the figure below:
In this version of the t-test, the Null is rejected if the calculated t-statistic is either too small or too large.
Here, the test statistics are compared to the critical values based on the sample size and the chosen significance level. To determine the exact value of the cutoff point, the two-sided t-distribution table can be used.
The one-sided or one-tailed t-test can be used when the hypothesis is testing a positive/negative versus negative/positive relationship under the Null and Alternative Hypotheses that is similar to the following examples:
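H0: β ≥ h0 versus H1: β < h0 (left-tailed), or H0: β ≤ h0 versus H1: β > h0 (right-tailed)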
The one-sided t-test has a single rejection region, and depending on the hypothesis side, the rejection region is either on the left-hand side or the right-hand side, as visualized in the figure below:
In this version of the t-test, the Null is rejected if the calculated t-statistic is smaller/larger than the critical value.
The F-test is another very popular statistical test, often used to test hypotheses about the joint statistical significance of multiple variables. This is the case when you want to test whether multiple independent variables have a statistically significant impact on a dependent variable. Following is an example of a statistical hypothesis that can be tested using the F-test:
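H0: β1 = β2 = β3 = 0
H1: at least one of β1, β2, β3 is not equal to 0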
where the Null states that the three variables corresponding to these coefficients are jointly statistically insignificant and the Alternative states that these three variables are jointly statistically significant. The test statistics of the F-test follow F distribution and can be determined as follows:
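F = ((SSRrestricted − SSRunrestricted) / q) / (SSRunrestricted / (N − k − 1))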
where the SSRrestricted is the sum of squared residuals of the restricted model, which is the same model excluding from the data the target variables stated as insignificant under the Null, the SSRunrestricted is the sum of squared residuals of the unrestricted model, which is the model that includes all variables, the q represents the number of variables that are being jointly tested for the insignificance under the Null, N is the sample size, and the k is the total number of variables in the unrestricted model. SSR values are provided next to the parameter estimates after running the OLS regression, and the same holds for the F-statistics as well. Following is an example of MLR model output where the SSR and F-statistics values are marked.
Image Source: Stock and Watson
F-test has a single rejection region as visualized below:
Image Source: U of Michigan
If the calculated F-statistic is bigger than the critical value, then the Null can be rejected, which suggests that the independent variables are jointly statistically significant. The rejection rule can be expressed as follows:
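Reject H0 if F-statistic > critical value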
Another quick way to determine whether to reject or to support the Null Hypothesis is by using p-values. The p-value is the probability of the observed result occurring under the Null Hypothesis. Stated differently, the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic. The smaller the p-value, the stronger is the evidence against the Null Hypothesis, suggesting that it can be rejected.
The interpretation of a p-value is dependent on the chosen significance level. Most often, 1%, 5%, or 10% significance levels are used to interpret the p-value. So, instead of using the t-test and the F-test, the p-values of these test statistics can be used to test the same hypotheses.
The following figure shows a sample output of an OLS regression with two independent variables. In this table, the p-value of the t-test, testing the statistical significance of the class_size variable’s parameter estimate, and the p-value of the F-test, testing the joint statistical significance of the class_size and el_pct variables’ parameter estimates, are underlined.
Image Source: Stock and Watson
The p-value corresponding to the class_size variable is 0.011, and when comparing this value to the significance levels of 1% (0.01), 5% (0.05), and 10% (0.1), the following conclusions can be made:
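- 0.011 > 0.01, so the Null cannot be rejected at the 1% significance level
- 0.011 < 0.05, so the Null can be rejected at the 5% significance level
- 0.011 < 0.1, so the Null can be rejected at the 10% significance level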
So, this p-value suggests that the coefficient of the class_size variable is statistically significant at the 5% and 10% significance levels. The p-value corresponding to the F-test is 0.0000, and since 0 is smaller than all three cutoff values (0.01, 0.05, 0.10), we can conclude that the Null of the F-test can be rejected in all three cases. This suggests that the coefficients of the class_size and el_pct variables are jointly statistically significant at the 1%, 5%, and 10% significance levels.
import numpy as np
from scipy.stats import t

def runOLS(Y, X):
    # OLS estimation Y = Xb + e --> beta_hat = (X'X)^-1(X'Y)
    beta_hat = np.dot(np.linalg.inv(np.dot(np.transpose(X), X)), np.dot(np.transpose(X), Y))

    # OLS prediction
    Y_hat = np.dot(X, beta_hat)
    residuals = Y - Y_hat
    RSS = np.sum(np.square(residuals))
    N = len(Y)
    sigma_squared_hat = RSS/(N-2)            # residual variance estimate (N-2 degrees of freedom)
    TSS = np.sum(np.square(Y - Y.mean()))    # total sum of squares
    MSE = sigma_squared_hat
    RMSE = np.sqrt(MSE)
    R_squared = (TSS-RSS)/TSS

    # Standard error of estimates: square root of estimate's variance
    var_beta_hat = np.linalg.inv(np.dot(np.transpose(X), X))*sigma_squared_hat

    SE = []
    t_stats = []
    p_values = []
    CI_s = []
    for i in range(len(beta_hat)):
        # standard errors
        SE_i = np.sqrt(var_beta_hat[i, i])
        SE.append(np.round(SE_i, 3))

        # t-statistics
        t_stat = np.round(beta_hat[i, 0]/SE_i, 3)
        t_stats.append(t_stat)

        # p-value of t-stat: p[|t_stat| >= t-threshold], two-sided
        p_value = t.sf(np.abs(t_stat), N-2)*2
        p_values.append(np.round(p_value, 3))

        # Confidence intervals = beta_hat -+ margin_of_error
        t_critical = t.ppf(q=1-0.05/2, df=N-2)
        margin_of_error = t_critical*SE_i
        CI = [np.round(beta_hat[i, 0]-margin_of_error, 3),
              np.round(beta_hat[i, 0]+margin_of_error, 3)]
        CI_s.append(CI)

    return (beta_hat, SE, t_stats, p_values, CI_s, MSE, RMSE, R_squared)
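For a quick sanity check, below is a minimal, hypothetical usage sketch (not part of the original example): it simulates a single regressor, prepends a column of ones for the intercept, and passes the resulting matrices to runOLS. The simulated data and variable names are illustrative only.

import numpy as np

np.random.seed(0)
N = 100
x = np.random.normal(size=(N, 1))                 # simulated independent variable
X = np.hstack([np.ones((N, 1)), x])               # add intercept column
Y = 2 + 3 * x + np.random.normal(size=(N, 1))     # simulated dependent variable

beta_hat, SE, t_stats, p_values, CI_s, MSE, RMSE, R_squared = runOLS(Y, X)
print("Coefficient estimates:", beta_hat.flatten())
print("Standard errors:", SE)
print("p-values:", p_values)
print("R-squared:", round(R_squared, 3))

Note that Y is passed as an N×1 column vector, since the function indexes the coefficient estimates as beta_hat[i, 0].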
Hello, fellow enthusiasts! I am Tatev, the person behind the insights and information shared in this blog. My journey in the vibrant world of Data Science and AI has been nothing short of incredible, and it’s a privilege to be able to share this wealth of knowledge with all of you.
Connect and Learn More:
Feel free to connect; whether it’s to discuss the latest trends, seek career advice, or just to share your own exciting journey in this field. I believe in fostering a community where knowledge meets passion, and I’m always here to support and guide aspiring individuals in this vibrant industry.
Want to learn everything about Data Science and how to land a Data Science job? Download this FREE Data Science and AI Career Handbook
I encourage you to join Medium today to have complete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.
Happy learning!