Fundamentals Of Machine Learning For Data Scientists and Machine Learning Engineers

Key Machine Learning Algorithms and Data Science concepts for Data Scientists, Machine Learning Researchers and Engineers with Examples and Python Implementations

- Linear Regression and Ordinary Least Squares(OLS)
- Logistic Regression and Maxiumum Likelihood Estimation(MLE)
- Linear Discriminant Analysis(LDA)
- Logistic Regression vs LDA
- Naive Bayes
- Naive Bayes vs Logistic Regression
- Decision Trees
- Bagging
- Random Forest
- Boosting
- Boosting: AdaBoost
- Boosting: Gradient Boosting Model (GBM)
- Boosting: Extreme Gradient Boosting (XGBoost)
- Feature Selection
- Feature Selection: Step-Wise (Backward/Forward)
- Cross Validation
- Cross Validation: Validation Set, LOOCV, K-Fold CV
- Optimal k in K-Fold CV
- Resampling Technique: Bootstrapping
- Optimization Techniques
- Optimization Techniques: Batch Gradient Descent (GD)
- Optimization Techniques: Stochastic Gradient Descent (SGD)
- Optimization Techniques: SGD with Momentum
- Optimization Techniques: Adam Optimizer

If you have no prior Statistical knowledge and you want to learn or refresh your knowledge in the essential statistical concepts you can check this article: Fundamentals of statistics for Data Scientists and Data Analysts.

This is a higher-level summary of the majority of essential ML concepts. If you want a complete guide to all ML essentials → ML course

Linear Regression

When the relationship between two variables is linear, then Linear Regression is a statistical method that can help to model the impact of a unit change in a variable, the independent variable on the values of another variable, the dependent variable.

Dependent variables are often referred to as response variables or explained variables, whereas independent variables are often referred to as regressors or explanatory variables. When the Linear Regression model is based on a single independent variable, then the model is called Simple Linear Regression and when the model is based on multiple independent variables, it’s referred to as Multiple Linear Regression. Simple Linear Regression can be described by the following expression:

where Y is the dependent variable, X is the independent variable which is part of the data, β0 is the intercept which is unknown and constant, β1 is the slope coefficient or a parameter corresponding to the variable X which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values. The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired ( X, Y ) data. One example of the Linear Regression application is modeling the impact of Flipper Length on penguins’ Body Mass, which is visualized below.

Multiple Linear Regression with three independent variables can be described by the following expression:

Ordinary Least Squares

The ordinary least squares (OLS) is a method for estimating the unknown parameters such as β0 and β1 in a linear regression model. The model is based on the principle of least squares that minimizes the sum of squares of the differences between the observed dependent variable and its values predicted by the linear function of the independent variable, often referred to as fitted values. This difference between the real and predicted values of dependent variable Y is referred to as residual and what OLS does, is minimize the sum of squared residuals. This optimization problem results in the following OLS estimates for the unknown parameters β0 and β1 which are also known as coefficient estimates.

Once these parameters of the Simple Linear Regression model are estimated, the fitted values of the response variable can be computed as follows:

Standard Error

The residuals or the estimated error terms can be determined as follows:

It is important to keep in mind the difference between the error terms and residuals. Error terms are never observed, while the residuals are calculated from the data. The OLS estimates the error terms for each observation but not the actual error term. So, the true error variance is still unknown. Moreover, these estimates are subject to sampling uncertainty. What this means is that we will never be able to determine the exact estimate, the true value, of these parameters from sample data in an empirical application. However, we can estimate it by calculating the sample residual variance

OLS Assumptions

OLS estimation method makes the following assumption which needs to be satisfied to get reliable prediction results:

A1: Linearity assumption states that the model is linear in parameters.

A2: Random Sample assumption states that all observations in the sample are randomly selected.

A3: Exogeneity assumption states that independent variables are uncorrelated with the error terms.

A4: Homoskedasticity assumption states that the variance of all error terms is constant.

A5: No Perfect Multi-Collinearity assumption states that none of the independent variables is constant and there are no exact linear relationships between the independent variables.

Note that the above description for Linear Regression is from my article named [Complete Guide to Linear Regression](pub.towardsai.net/complete-guide-to-linear-.. "pub.towardsai.net/complete-guide-to-linear-..")

[Complete Guide to Linear Regression
Everything you need to know about the simplest yet the most popular Machine Learning regression modelpub.towardsai.net](https://pub.towardsai.net/complete-guide-to-linear-regression-86c5eddb7eda "pub.towardsai.net/complete-guide-to-linear-..")

When the relationship between two variables is linear, then Linear Regression is a statistical method that can help to model the impact of a unit change in a variable, the independent variable on the values of another variable, the dependent variable.

Image Source: The Author

Logistic Regression

Another very popular Machine Learning technique is the Logistic Regression which, though named regression, is actually a supervised classification technique. Logistic regression is a Machine Learning method that models conditional probability of an event occurring or observation belonging to a certain class, based on a given dataset of independent variables.

When the relationship between two variables is linear, the dependent variable is a categorical variable, and you want to predict a variable in the form of a probability (number between 0 and 1) then Logistic Regression comes in handy. This is because during the prediction process in Logistic Regression, the classifier predicts the probability (a value between 0 and 1) of each observation belonging to the certain class, usually to one of the 2 classes of dependent variable.

For instance, if you want to predict the probability or likelihood of a candidate to be elected or not during the election given the candidates popularity score, past successes and other descriptive variables about a candidate, then Logistic Regression can be used to model this probability.

So, rather than predicting the response variable, Logistic Regression models the probability that Y belongs to a particular category. So, similar to Linear Regression with a difference that instead of Y it predicts the log odds. In statistical terminology, we model the conditional distribution of the response Y, given the predictor(s) X. Therefore, LR helps to predict the probability of Y belonging to certain class (0 and 1) given the features P(Y|X=x).

Logistic regression is a Machine Learning method that predicts the probability of an event occurring or observation belonging to a certain class, based on a given dataset of independent variables.

The name Logistic in Logistic Regression comes from the function this approach is based upon, which is Logistic Function. Logistic Function makes sure that for too large and too small values the corresponding probability is still within the [0,1 bounds].

where the P(X) stands for the probability of Y belonging to certain class (0 and 1) given the features P(Y|X=x). X stands for the independent variable, β0 is the intercept which is unknown and constant, β1 is the slope coefficient or a parameter corresponding to the variable X which is unknown and constant as well similar to Linear Regression. e stands for exp() function.

So, rather than predicting the response variable, Logistic Regression models the probability that Y belongs to a particular category. So, similar to Linear Regression with a difference that instead of Y it predicts the log odds.

Image Source: LunarTech.ai

Odds and Log Odds

Logistic Regression and its estimation technique MLE is based on the terms Odds and Log Odds. Where Odds is defined as follows:

Odds ratio

and Log Odds is defined as follows:

Maximum Likelihood Estimation (MLE)

While for Linear Regression, we use OLS (Ordinary Least Squares) or LS (Least Squares) as estimation technique, for Logistic Regression another estimation technique should be used. The reason why we can’t use LS in Logistic Regression to find the best fitting line (to perform estimation) is because the errors can then become very large or very small even negative while in case of Logistic Regression we aim for predicted value in [0,1].

Therefore, for Logistic Regression we use MLE technique, where the likelihood function calculates the probability of observing the outcome given the input data and the model. This function is then optimised to find the set of parameters that results in the largest sum likelihood over the training dataset.

Image Source: The Author

The logistic function will always produce an S-shaped curve of the form like above, regardless of the value of independent variable X resulting in sensible estimation most of the time.

Image Source: LunarTech.ai

Logistic Regression Likelihood Function(s)

The Likelihood function can be expressed as follows:

Hence, the Log Likelihood function can be expressed as follows:

or, after transformation from multipliers to summation, we get:

Maximum Likelihood Estimation (MLE)

Then the idea behind the MLE is to find set of estimates that would maximize this likelihood function.

Step 1: Project data points into candidate line that produces sample log(odds) value.
Step 2: Transform sample log(odds) to sample probabilities by using the following formula:

Step 3: Obtain the overall likelihood or overall log likelihood.
Step 4: Rotate the log(odds) line again and again, until finding optimal log(odds) maximizing the overall likelihood

Cut off Value in Logistic Regression

If you plan to use Logistic Regression to at the end get a binary {0,1} value, then you need a cut-off point to transform the estimated values per observation in from the range of [0,1] to a value 0 or 1. Depending on your individual case, corresponding cut off point can be chosen, but a popular cut-ff point is the 0.5: that is, all observations with predicted value smaller than 0.5, will be assigned to class 0 and observations with predicted value larger or equal than 0.5, will be assigned to class 1.

Performance Metrics in Logistic Regression

Since Logistic Regression is a classification method, common classification metrics such as recall, precision, F-1 measure can all be used. However, there is also a metrics that is also commonly used for assessing the performance of the Logistic Regression model, called Deviance.

Logistic Regression: Python Implementation

import numpy as np
import statsmodels.api as sm
import pandas as pd

# simulating data
N = 1000
X = np.random.binomial(30,0.5,size = [N,1])
Y = np.random.randint(2,size = [N,1])
data = pd.DataFrame(np.append(X,Y,axis = 1))
data.columns = ["X","Y"]

# splitting the data to train and test
data_train = data.sample(int(N*0.8))
data_test = data.drop(index = data_train.index)
X_train = data_train["X"]
Y_train = data_train["Y"]
X_test = data_test["X"]
Y_test = data_test["Y"]

# Logistic regression model
logit_model = sm.Logit(Y_train,X_train)
fitted_model = logit_model.fit()
print(fitted_model.summary())
# predicted log odds
logodds_pred = fitted_model.predict(X_test)
# using the logistic function to transform log odds to probabilities
prob_pred = np.exp(logodds_pred)/(1+np.exp(logodds_pred))

# function to use the cut-off 0.5 for classification
def get_y_pred(p_x):
if p_x>=0.5:
y_pred = 1
else:
y_pred = 0
return(y_pred)

# function for calculating the deviance
def get_deviance(Y,Y_pred):
sum = 0
Y= np.array(Y)
Y_pred = np.array(Y_pred)
for i in range(len(Y)):
if Y[i] == Y_pred[i]:
sum += 1
return(sum/len(Y))
Y_pred = prob_pred.apply(get_y_pred)

# using the logistic function to transform log odds to probabilities
p_pred = np.exp(logodds_pred)/(1+np.exp(logodds_pred))
def get_y_pred(p_x):
if p_x>=0.5:
y_pred = 1
else:
y_pred = 0
return(y_pred)

Y_pred = pd.Series(np.array(p_pred.apply(get_y_pred)))

Image Source: LunarTech.ai

Linear Discriminant Analysis (LDA)

Another classification technique, highly related to Logistic Regression is the Linear Discriminant Analytics (LDA). Where Logistic Regression is usually used to model the probability of observation belonging to a class of the outcome variable with 2 categories, LDA is usually used to model the probability of observation belonging to a class of the outcome variable with 3 and more categories.

LDA offers an alternative approach to model the conditional likelihood of the outcome variable given that set of predictors that addresses the issues of Logistic regression. It models the distribution of the predictors X separately in each of the response classes (i.e. given Y ), and then uses Bayes’ theorem to flip these two around into estimates for Pr(Y = k|X = x).

Note that in case of LDA these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.

where π_k represents the overall prior probability that a randomly chosen observation comes from the kth class. f_k(x) , which is equal to Pr(X = x|Y = k), represents the posterior probability, is the density function of X for an observation that comes from the kth class (density function of the predictors).

This is the probability of X=x given the observation is from certain class. Stated differently, it is the probability that the observation belongs to the kth class, given the predictor value for that observation. Assuming that f_k(x) is Normal or Gaussian, the normal density takes the form (this is the one- normal dimensional setting).

where μ_k and σ_k² are the mean and variance parameters for the kth class. Assuming that σ_¹² = · · · = σ_K² (there is a shared variance term across all K classes, which we denote by σ2).

Then the LDA approximates the Bayes classifier by using following estimates for πk, μk, and σ2:

Where Logistic Regression is usually used to model the probability of observation belonging to a class of the outcome variable with 2 categories, LDA is usually used to model the probability of observation belonging to a class of the outcome variable with 3 and more categories

Image Source: LunarTech.ai

Logistic Regression vs LDA

Though Logistic regression is a popular approach for performing classification when there are two classes , but when the classes are well-separated or the number of classes exceeds 2, the parameter estimates for the logistic regression model are surprisingly unstable.

Unlike Logistic Regression, LDA does not suffer from instability problem when the number of classes is more than 2. If n is small and the distribution of the predictors X is approximately normal in each of the classes, LDA is again more stable than the Logistic Regression model.

5: Naive Bayes

Another classification method that relies on Bayes Rule, like LDA, is Naive Bayes Classification approach. For more about Bayes Theorem, Bayes Rule and a corresponding example, read here. Like Logistic Regression, Naive Bayes approach can be used to classify observation to one of the two classes (0 or 1).

The idea behind this method is to calculate the probability of observation belong to a class given prior probability for that class and conditional probability of each feature value given for given class. That is:

where Y stands for the class of observation, k is the kth class and x1, …, xn stands for feature 1 till feature n, respectively. f_k(x) = Pr(X = x|Y = k), represents the posterior probability, which like in case of LDA is the density function of X for an observation that comes from the kth class (density function of the predictors).

If you compare the above expression with teh one you sow for LDA, you will see some similarities. In LDA, we make a very important and strong assumption for simplification purpose. Namely, we assume that f_k is the density function for a multivariate normal random variable with class-specific mean μ_k, and shared covariance matrix Sigma Σ. This assumtion helps to replace the very challenging problem of estimating K p-dimensional density functions with the much simpler problem, which is to estimate K p-dimensional mean vectors and one (p × p)-dimensional covariance matrices.

Image Source: LunarTech.ai

In case of Naive Bayes Classifier, uses a different approach for estimating f_1 (x), . . . , f_K(x). Instead of making assumption that these functions belong to a particular family of distributions (e.g. normal or multivariate normal), we instead make a single assumption: within the kth class, the p predictors are independent. That is:

That is, Bayes classifier assumes that the value of a particular variable or feature is independent of the value of any other variables (uncorrelated), given the class/label variable. For instance, a fruit may be considered to be a banana if it is yellow, oval shaped, and about 5–10 cm long. So, Naive Bayes classifier considers each of these various features of fruit contribute independently to the probability that this fruit is a banana, independent of any possible correlation between the colour, shape, and length features.

Naive Bayes Estimation

Like Logistic Regression, in case of Naive Bayes classification approach we use Maximum Likelihood Estimation (MLE) as estimation technique. There is a great article providing detailed, coincise summary for thos approach with corresponding example which you can find here.

Naive Bayes vs Logistic Regression

Naive Bayes Classifier has proven to be faster, have a higher bias and lower variance while Logistic regression has low bias and higher variance. Depending on your individual case, and the bias variance trade-off, you can pick the corresponding approach.

Decision Trees

Decision Trees are a supervised and non-parametric Machine Learning learning method used for both classification and regression purposes. The idea is to create a model that predicts the value of a target variable by learning simple decision rules from the data predictors.

Unlike Linear Regression, or Logistic Regression, Decision Trees are simple and useful model alternatives when the relationship between independent variables and dependent variable is suspected to be non-linear.

Tree-based methods stratify or segment the predictor space into smaller regions. The idea behind building Decision Trees is to divide the predictor space into distinct and mutually exclusive regions X1,X2,….. ,Xp → R_1,R_2, …,R_N where the regions are in the form of boxes or rectangles. These regions are found by recursive binary splitting since minimizing the RSS is not feasible. This approach is often referred to as greedy approach.

Decision trees are build by top-down splitting. So, in the beginning, all observations belong to a single region. Then, the model successively splits the predictor space; each split is indicated via two new branches further down on the tree. This approach is sometimes called greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

Stopping Criteria

There are usually two common stopping criteria used when building Decision Trees.

Minimum number of observations in the leaf.
Minimum number of samples for a node split.
Maximum depth of tree (vertical depth).
Maximum number of terminal nodes.
Maximum features to consider for the split.

For example, repeat this splitting process until no region contains more than 100 observations.

When building a decision tree, especially when dealing with large number of features, the tree can become too big with too many leaves, which will effect the interpretability of the model, and might potentially result in overfitting problem. Therefore, picking a good stopping criteria is essential for the interpretability and for the performance of the model.

Image Source: LunarTech.ai

RSS/Gini Index/Entropy/Node Purity

When building the tree we use RSS(for Regression Trees) and GINI Index/Entropy (for Classification Trees) for picking the predictor and value for splitting the regions. Both Gini Index and Entropy are often called Node Purity measures because they describe how pure the leaf of the trees are.

Gini Index

Gini index measures the total variance across K classes. It takes small value when all class error rates are either 1 or 0 which is also the reason why it’s called a measure for node purity because Gini index takes small values when the nodes of the tree contain predominantly observations from the same class. Gini index is defined as follows:

where pˆmk represents the proportion of training observations in the mth region that are from the kth class.

Entropy

Entropy is another node purity measure, and like the Gini index, the entropy will take on a small value if the mth node is pure. In fact, the Gini index and the entropy are quite similar numerical and can be expressed as follows:

Decision Tree Classification Example

Let’s look at an example where we have three features describing consumers past behaviour:

Recency (How recent was the customer’s last purchase?)
Monetary (How much money did the customer spend in a given period?)
Frequency (How often did this customer make a purchase in a given period?)

We will use the classification version of Decision Tree to classify customers to 1 of the 3 classes (Good: 1, Better: 2 and Best: 3), given the features describing the customer behaviour. In the following tree where we use Gini Index as a purity measure, we see that the first features that sems to be the most important one is the Recency. We can interpret the following decision tree as follows.

Customers who have recency 202 or larger (last time has made a purchase > 202 days ago) then the chance of this observation to be assigned to class 1 is 93% (basically, we can label those customers as Good Class customers). For customers with Recency smaller than 202 (they made a purchase recently), we look at their Monetary value and if iit smaller than 1394, then we loook at their Frequency. If the Frequency is then smaller than 44, we can then label this customers’ class as Better or (class 2).

Image Source: The Author

Decision Trees Python Implementation

# simulating data
num_rows = 1000
num_columns = 3
X1 = pd.Series(np.ones(num_rows))
X2 = pd.DataFrame(np.random.normal(mean,scale, size = [num_rows, num_columns]))
Y = pd.Series(np.random.choice([1,2,3],num_rows))

# merging/append datasets or columns next to each other using pd.concat
df1 = pd.concat([X1,X2,Y], axis = 1)
df1.columns = ["X1","X2","X3","X4","Y"]

# inner join of 2 datasets
df2 = pd.DataFrame({"Y":[1,2,3], "X5": [3,6,9]})
df = df1.merge(df2, on = 'Y', how = 'inner')

X1 X2 X3 X4 Y X5
0 1.0 14.135206 12.563534 15.018688 3 9
1 1.0 12.594167 3.070420 6.884480 3 9
2 1.0 1.868650 5.727845 20.188167 3 9
3 1.0 18.841369 -0.428457 14.008933 3 9
4 1.0 28.635733 14.432839 2.205097 3 9
.. ... ... ... ... .. ..
995 1.0 10.232617 9.435407 4.640731 1 3
996 1.0 5.959808 7.841094 6.163875 1 3
997 1.0 3.969548 17.685926 9.017210 1 3
998 1.0 11.010778 10.337888 6.393183 1 3
999 1.0 16.734656 7.925074 10.346089 1 3

from sklearn import tree
from sklearn.model_selection import train_test_split
X = df[["X1","X2","X3","X4","X5"]]
Y = df["Y"]
X_train,X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2)

clf = tree.DecisionTreeClassifier()
clf_fitted = clf.fit(X_train,Y_train)
clf_pred = clf_fitted.predict(X_test)
# or get probabilities of belonging to each of the classes
clf_pred_prob = clf_fitted.predict_proba(X_test)
# lot the tree
tree.plot_tree(clf_fitted, filled=True)

Image Source: The Author

Unlike Linear Regression, or Logistic Regression, Decision Trees are simple and useful model alternatives when the relationship between independent variables and dependent variable is suspected to be non-linear.

Image Source: LunarTech.ai

Bagging

One of the biggest disadvantages of Decision Trees is its high variance. You might end up with a model and predictions that is easy to explain but misleading. Which would result in making wrong conclusions and business decisions. For that purpose, to reduce the variance of the Decision trees, a method called Bagging can be used.

To understand what Bagging is, there are two terms you need to know:

Bootstrapping
Central Limit Theorem (CLT)

You can find more about Boostrapping, which is a resampling technique, later in this article. For now, you can think of Bootstrappng as a technique that performs sampling from the original data with replacement, which creates a copy of the data very similar but not exactly the same as the original data.

Bagging is also based on the same ideas as the CLT which is one of the most important if noot the most important theorem in Statistics. You can read in more detail about CLT here. But the idea that is also used in Bagging is that if you take the average of many samples then the variance is significantly reduced compared to the variance of each of the individual sample based models. So, given a set of n independent observations Z1,…,Zn, each with variance σ2, the variance of the mean Z ̄ of the observations is given by σ2/n. Therefore, averaging a set of observations reduces variance.

For more Statistical details, check out the following blog.

[Fundamentals Of Statistics For Data Scientists and Data Analysts
Key statistical concepts for your data science or data analytics journeytowardsdatascience.com](https://towardsdatascience.com/fundamentals-of-statistics-for-data-scientists-and-data-analysts-69d93a05aae7 "towardsdatascience.com/fundamentals-of-stat..")

Bagging is basically a Bootstrap aggregation that builts B trees using B Bootrsapped samples. Bagging can be used to improve the precision (lower the variance of many approaches) by taking repeated samples from a single training data.

So, in Bagging, we generate B bootstrapped training samples, based on which B similar trees (correlated trees) are built that end up being aggregaated to calculate the predictions, so taking the average of these predictions for these B-samples. Notably, each tree is built on a bootstrap data set, independent of the other trees.

So, in case of Bagging in each tree split all p features are considered which results in similar trees wince every time the strongest predictors are at the top and weak ones at the bottom resulting all of the bagged trees will look quite similar to each other.

Bagging in Regression Trees

To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep, and are not pruned. Hence each individual tree has high variance, but low bias. Averaging these B trees reduces the variance.

Bagging in Classification Trees

For a given test observation, we can record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring majority class among the B predictions.

OOB Out-of-Bag Error Estimation

In case Bagging is applied to decision trees there is no longer a need to apply Cross Validation to estimate the test error rate. In bagging we repeatedly fit the trees to Bootstrapped samples and it’s known that on average only 2/3 of these observation are used and 1/3 is not used during the training process: these are called Out-of-bag observations.

So there are in total B/3 prediction per ith observation not used in training, so we can take the average of response values for these cases (or majority class). So per observation the OOB error and average of these forms the test error rate.

Bagging Python Implementation

from sklearn.ensemble import BaggingClassifier
bagging_model = BaggingClassifier(n_estimators=10, random_state=0)
begging_fit = bagging_model.fit(X_train,Y_train)
begging_pred = begging_fit.predict(X_test)

Bagging is basically a Bootstrap aggregation that builts B trees using B Bootrsapped samples. Bagging can be used to improve the precision (lower the variance of many approaches) by taking repeated samples from a single training data.

Image Source: LunarTech.ai

Random Forest

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors.

The split is allowed to use only one of those m predictors. A fresh and random sample of m predictors is taken at each split, and typically we choose m ≈ √p — that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.This is also the reason why Random Forest is called “random”.

The main difference between bagging and random forests is the choice of predictor subset size m decorrelates the trees.

Using a small value of m in building a random forest will typically be helpful when we have a large number of correlated predictors. So, if you have a problem of Multicollearity, RF is a good method to fix that problem.

So, unlike in Bagging, in case of Random Forest, in each tree split not all p preedictors are considered but only randomly selected m predictors from it, resulting in not similar trees decorrelateed trees**. And due to the fact that averaging decorrelated trees results in smaller variance, Random Forest is more accurate than Bagging.

Random Forest Python Implementation

For implementing Random Forest in Python we will use the same labelled data as in case of Bagging with the same X_train, Y_train, and X_test

from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=10, random_state=0)

#fitting the model on the training data
rf_fit = rf_model.fit(X_train, Y_train)

#using fitted model to get the predictions for test data
rf_pred = rf_fit.predict(X_test)

Unlike in Bagging, in case of Random Forest, in each tree split not all p preedictors are considered but only randomly selected m predictors from it, resulting in not similar trees decorrelateed trees. And due to the fact that averaging decorrelateeed trees results in smaller variance, Random Forest is more accurate than Bagging.

Image Source: LunarTech.ai

Boosting or Ensemble Models

Like Bagging (averaging correlated Decision Trees) and Random Forest (averaging uncorrelated Decision Trees), Boosting aims to improve the predictions resulting from a decision tree. Boosting is a supervised Machine Learning model that can be used for both regression and classification problems.

Unlike Bagging or Random Forest, where the trees are built independently from each other using one of the B bootstrapped samples (copy of the initial training date), in Boosting, the trees are built sequentially and dependent of each other; each tree is grown using information from previously grown trees.

Boosting does not involve bootstrap sampling; instead, each tree fits on a modified version of the original data set. It’s a method of converting weak learners into strong learners. In boosting, each new tree is a fit on a modified version of the original data set. So, unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly.

Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm.

Like Bagging (averaging correlated Decision Trees) and Random Forest (averaging uncorrelated Decision Trees), Boosting aims to improve the predictions resulting from a decision tree.

Image Source: LunarTech.ai

Boosting Algorithm 1: AdaBoost

First Ensemble algorithm we will look into today is AdaBoost. Like in all boosting techniques in the case of AdaBoost also the trees are built using the information from the previous tree and more specifically part of the tree which didn’t perform well, called the weak learner (Decision Stump). This Decision Stump is built using only a single predictor and not all predictors to perform the prediction.

So, AdaBoost combines weak learners to make classifications and each stump is made by using the previous stump’s errors. Here is the step-by-step plan for building an AdaBoost model:

Step 1: Initial Weight Assignment: assign equal weight to all observations in the sample where this weight represents the importance of the observations being correctly classified: 1/N (all samples are equally important at this stage).
Step 2: Optimal Predictor Selection: The first stamp is built by obtaining the RSS(in case of regression) or GINI Index/Entropy (in case of classification) for each predictor. Picking the stump that does the best job in terms of prediction accuracy: the stump with the smallest RSS or GINI/Entropy is selected as the next tree.
Step 3: Computing Stumps Weight based on Stumps Total Error: The importance of this stump in the final tree is then determined using the total error that this stump is making. Where a stump that is not better than random flip of a coin with total error equal to 0.5 gets weight 0. Weight = 0.5*log(1-Total Error/Total Error)
Step 4: Updating Observation Weights: We increase the weight of the observations which have been incorrectly predicted and decreasing the remaining observations which had higher accuracy or have been correctly classified, s.t. the next stump will have higher importance of correctly predicted the value f this observation.
Step 5: Building the next Stump based on updated weights: Using Weighted Gini index to chose the next stump.
Step 6: Combining B stumps: Then all the stumps are combined while taking into account their importance, weighted sum.

Simple AdaBoost Python Implementation

from sklearn.ensemble import AdaBoostClassifier
adb_model = AdaBoostClassifier()
adb_fit = adb_model.fit(X_train, Y_train)
adb_pred = adb_fit.predict(X_test)

So, AdaBoost combines weak learners to make classifications and each stump is made by using the previous stump’s errors.

Image Source: LunarTech.ai

Boosting Algorithm 2: Gradient Boosting Model (GBM)

AdaBoost and Gradient Boosting are very similar to each other. However,compared to the AdaBoost which starts the process by selecting a stump and continuing to build it by using the weak learners from the previous stump, Gradient Boosting starts with a single leaf instead of a tree of a stump.

The outcome corresponding to this chosen leaf is then an initial guess for the outcome variable. Like in the case of AdaBoost, Gradient Boosting uses the previous stump’s errors to build the tree, but unlike the AdaBoost, the trees that Gradient Boost builds are larger than a stump. That’s a parameter we set a max number of leaves.

To make sure the tree is not overfitting the Gradient Boosting is using Learning Rate to scale the gradient contributions. Gradient Boosting is based on the idea that taking lots of small steps in the right direction (gradients) will result in lower variance (for testing data).

The major difference between AdaBoost and Gradient Boosting Algorithm is how the two algorithms identifies the shortcomings of weak learners (eg. decision trees). While the AdaBoost model identifies the shortcomings by using high weight data points, gradient boosting performs the same by using gradients in the loss function (y=ax+b+e , e needs a special mention as it is the error term).

The loss function is a measure indicating how good are model’s coefficients are at fitting the underlying data. A logical understanding of loss function would depend on what we are trying to optimise.

Like in the case of AdaBoost, Gradient Boosting uses the previous stump’s errors to build the tree, but unlike the AdaBoost, the trees that Gradient Boost builds are larger than a stump.

Early Stopping

The special process of tuning the number of iterations for an algorithm such as GBM and Random Forest is called “Early Stopping”, a phenomenon we touched upon when discussing the Decision Trees. Early Stopping performs model optimisation by monitoring the model’s performance on a separate test data set and stopping the training procedure once the performance on the test data stops improving beyond a certain number of iterations.

To make sure the tree is not overfitting the Gradient Boosting is using Learning Rate to scale the gradient contributions. Gradient Boosting is based on the idea that taking lots of small steps in the right direction (gradients) will result in lower variance (for testing data).

It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.

In the context of GBM, early stopping can be based either on an out of bag sample set (“OOB”) or cross-validation (“CV”). Like mentioned earlier, the ideal time to stop training the model is when the validation error has decreased and started to stabilise before it starts increasing due to overfitting.

To build GBM following step-by-step process can be used:

Step 1: Train the model on the existing data, to predict the outcome variable
Step 2: Compute error rate using the predictions and the real values (Pseudo Residual)
Step 3: Use the existing features and the Pseudo Residual as the outcome variable to predict the residuals again
Step 4: We use the predicted residuals to update the predictions from the Step 1, while scaling this contribution to the tree with a learning rate (hyper parameter)
Step 5: Repeat steps 1–4, the process of updating the pseudo residuals and the tree while scaling with the learning rate, to move slowly in the right direction until there is no longer an improvement or our stopping rule

The idea is that each time we add new scaled tree to the model, the residuals should get smaller.

At any m step Gradient Boosting model produces a model that is an ensemble of the previous step F(m-1) and learning rate eta multiplied with the negative derivative of the loss function w.r.t the output of the model at step m-1: (weak learner at step m-1).

Simple GBM Python Implementation

from sklearn.ensemble import GradientBoostingClassifier
gbm_model = GradientBoostingClassifier()
gbm_fit = gbm_model.fit(X_train, Y_train)
gbm_pred = gbm_fit.predict(X_test)

Image Source: LunarTech.ai

While the AdaBoost model identifies the shortcomings by using high weight data points, Gradient Boosting performs the same by using gradients in the loss function.

Boosting Algorithm 3: XGBoost

One of the most popular Boosting or Ensemble algorithms is Extreme Gradient Boosting (XGBoost). The difference between the GBM and XGBoost is that in case of XGBoost the second-order derivatives are calculated (second-order gradients) which provides more information about the direction of gradients and how to get to the minimum of the loss function. Rememeber that this is needed to identify the weak learner and improve the model by improving the weak learners.

The idea behind the XGBoost is that the 2nd order derivative tend to be more precise in terms of finding the accurate diraction. Like the AdaBoost, XGBoost applies advanced regularization in the form of L1 or L2 norms to address overfitting.

Unlike the AdaBoost, XGBoost is parallelizable due to its special cashing mechanism, making it convenient to handle large and complex datasets. Also, to speed up the training, XGBoost uses Approximate Greedy Algorithm to consider only limited amount of tresholds for splitting the nodes of the trees. To build XGBoost model following step-by-step process can be used:

Step 1: Fit a Single Decision Tree: In this step Loss function is calculated, for example NDCG to evaluate the model.
Step 2: Add the Second Tree: This is done such that when this second tree is added to the model it lowers the Loss function based on 1st and 2nd order derivatives compared to the previous tree where we also used learning rate eta.
Step 3: Finding the Direction of the Next Move: Using the first degree and second-degree derivatives we can find the direction in which the Loss function decreases the largest which is basically the gradient of the Loss function w.r.t to the output of the previous model.
Step 4: Splitting the nodes: To split the observations XGBoost is uses Approximate Greedy Algorithm (about 3 approximate weighted quantiles usually) quantiles that have a similar sum of weights. For finding the split value of the nodes, is not considering all the candidate thresholds but instead it uses the quantiles of that predictor only.

Optimal Learning Rate can be determined by using Cross Validation & Grid Search.

Unlike the AdaBoost, XGBoost is parallelizable due to its special cashing mechanism, making it convenient to handle large and complex datasets. Also, to speed up the training, XGBoost uses Approximate Greedy Algorithm to consider only limited amount of tresholds for splitting the nodes of the trees.

from xgboost import XGBRegressor
xgboost_model = XGBRegressor()
xgboost_fit = xgboost_model.fit(X_train, Y_train)
xgboost_pred = xgboost_fit.predict(X_test)

Image Source: LunarTech.ai

Feature Selection

In Machine Learning, in many cases we are dealing with large amount of features and not all of them are usually important and informative for the model. Including such irrelevant variables in the model leads to unnecessary complexity in the Machine Learning model and effects the model interpretability but also the performance of the model. By removing these unimportant variables, and selecting only relatively informative features we can get a model which can be easier to interpret and possibly with better precision.

Let’s look at a specific example of Machine Learning model for simplicity sake. Let’s assume that we are looking at a Multiple Linear Regression model (multiple independent variables and single response/dependent variable)with very large number of features. This model is likely to be complex when it comes to interpreting it and on the top of that, it might be result in inaccurate predictions since some of those features might be unimportant and not helping to explain the response variable.

The process of selecting important variables in the model is called feature selection or variable selection. This process involves identifying a subset of the p variables that we believe to be related to the dependent or the response variable. For this, we need to run the regression for all possible combinations of independent variables and select one that results in best performing model or the worst performing model.

There are various approaches that can be used for Features Selection, usually categories in the following 3 categories :

Subset Selection (Best Subset Selection, Step-Wise Feature Selection)
Regularisation Techniques (L1 Lasso)
Dimensionality Reduction Techniques (PCA)

Image Source: LunarTech.ai

Step-Wise Feature Selection

One of the popular ones is the Step-Wise Feature Selection Technique. Let’s look at two different step-wise feature selection methods:

Forward Step-wise Selection
Backward Step-wise Selection

1: Forward Step-Wise Selection

What Forward Step-Wise Feature Selection technique does is it starts with an empty Null model with only an intercept. We then run a set of simple regressions and picks the variable which has a model with the smallest RSS (Residual Sum of Squares). Then the same is done with 2 variable regressions and continues until it’s completed.

So, Forward Step-Wise Selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

Forward Step-Wise Selection can be summarized as follows:

Step 1: Let M_0 be the null model, containing no features

Step 2: For K = 0,…., p-1;

Consider all (p-k) models that contain the variables in M_k with one additional feature or predictor
Choose the best model among these p-k models, and define it M_(k+1) by using performance metrics such as RSS/R-squared

Step 3: Select the single model with the best performance among these M_0,….M_p models (one with smallest Cross Validation Error, C_p, AIC (Akaike Information Criterion), BIC (Bayesian Information Criteria)or adjusted R-squared is your best model M*.)

So, the idea behind this Selection is to start simple and increase the number of predictors in the model. Per number of predictors consider all possible combination of variables and select a single best model: M_k. Then compare all these models with different number of predictors (best M_ks ) and the one best performing one can be selected.

When n < p, so when number of observations is larger than number of predictors in Linear Regression, this approach can be used to select features in the model in order for LR to work in the first place.

2: Backward Step-wise Feature Selection

Unlike in Forward Step-wise Selection, in case of Backward Step-wise Selection the feature selection algorithm starts with the full model containing all p predictors. Then the best model with p predictorss is selected. Consequently, the model removes one by one the variable with the largest p-value and again best model is selected.

Each time the model is fitted again to identify the least statistically significant variable until the stopping rule is reached. (e.g. all p- values need to be smaller then 5%). Then we compare all these models with different number of predictors (best M_ks) and select the single model with the best performance among these M_0,….M_p models (one with smallest Cross Validation Error, C_p, AIC (Akaike Information Criterion), BIC (Bayesian Information Criteria)or adjusted R-squared is your best model M*).

Backward Step-Wise Feature Selection can be summarized as follows:

Step 1: Let M_p be the full model, containing all features

Step 2: For k= p, p-1 ….,1;

Consider all k models that contain all variables except of one of the predictors in M_k model, for k − 1 features
Choose the best model among these k models, and define it M_(k-1) by using performance metrics such as RSS/R-squared

Like Forward Step-wise Selection, the Backward Step-Wise Feature Selection technique searches through only 1+p(p+1)/2 models, making it possible to apply in settings where p is too large to apply other selection techniques.

Additionally, Forward Step-wise Selection, Backward Step-Wise Feature Selection is not guaranteed to yield the best model containing a subset of the p predictors. It requires that the number of observations or data points n to be larger than the number of model variables p whereas Forward Step-Wise Selection can be used even when n < p.

Forward Step-Wise Feature Selection technique does not require the number of observations or data points n to be larger than the number of model variables p, therefore it can be used even when n < p.

Image Source: LunarTech.ai

Resampling Techniques

When we have only training data and we want to make judgments about the performance of the model on unseen data, we can use Resampling Techniques to create artificial test data. Resampling Techniques which are often divided into two categories: Cross-Validation and Bootstrapping are usually used for the following three purposes:

Model Assessment: evaluate the model performance (to compute test error rate)
Model Variance: compute the variance of the model to check the generalizable your model is
Model Selection: select model flexibility

For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ.

Cross-Validation

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to perform

Model assessment: to evaluate its performance by calc test error rate
Model Selection: to select the appropriate level of flexibility.

Holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations. CV is usually divided in the following three categories:

Validation Set Approach
K-fold Cross Validation (K-ford CV)
Leave One Out Cross Validation (LOOCV)

Validation Set Approach

A simple approach to randomly split the data into training and validation sets. This is the simplest approach often used to split the data into train and test sets, usually using Sklearn’s train_test_split() function.

The model is then trained on the training data (usually 80% of the data) and use it to predict the values for the hold-out or Validation Set (usually 20% of the data) which is the test error rate.

Conceptually Simple
Easy to interpret

Validation Set Approach Cons

High variance: Test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set given the randomness of the split
High bias: The trained/fitted model might be less accurate given that smaller training data is used, hence the validation error might be overestimated

A simple approach to randomly split the data into training and validation sets. This is the simplest approach often used to split the data into train and test sets, usually using Sklearn’s train_test_split() function.

Leave One Out Cross Validation (LOOCV)

LOOCV is similar to the Validation set approach but what it does differently is that each time it leaves one observation out of the training set and uses the remaining n-1 to train the model and calculates the MSE for that one prediction.So, in case of LOOCV the Model has to be fit n times. Where n is the number of observations in the model.

Then this process is repeated for all observations and n times MSEs are calculated. The mean of the MSEs is the Cross-Validation error rate and can be expressed as follows:

LOOCV is similar to the Validation set approach but what it does differently is that each time it leaves one observation out of the training set and uses the remaining n-1 to train the model and calculates the MSE for that one prediction.So, in case of LOOCV the Model has to be fit n times.

K-fold Cross Validation (K-ford CV)

K-Fold CV is the silver lining between the Validation Set approach( high variance and high bias but is computationally efficient) versus the LOOCV (low bias and low variance but is computationally inefficient).

In K-Fold CV the data is randomly sampled into K equally sized samples (K- folds) and then each time 1 is used as validation and the rest as training and the model is fit K times. The mean of K MSEs form the Cross validation test error rate. Note that the LOOCV is a special case of K-fold CV where K = N, and can be expressed as follows:

K-Fold CV is the silver lining between the Validation Set approach( high variance and high bias but computationally efficient) versus the LOOCV (lower bias and lower variance bur computationally inefficient).

Image Source: LunarTech.ai

Selecting Optimal k in K-fold CV

The choice of k in K-fold is a matter of Bias-Variance Trade-Off and efficiency of the model. Usually, K-Fold CV and LOOCV provide similar results and their performance can be evaluated using simulated data.

[Bias-Variance Trade-Off, Overfitting and Regularization in Machine Learning
Introduction to bias-variance trade-off, overfitting & how to solve overfitting using regularization: Ridge and Lasso…towardsdatascience.com](https://towardsdatascience.com/bias-variance-trade-off-overfitting-regularization-in-machine-learning-d79c6d8f20b4 "towardsdatascience.com/bias-variance-trade-..")

However, LOOCV has lower bias (unbiased) compared to K-fold CV because LOOCV uses more training data than K-fold CV does. But LOOCV has higher variance than K-fold does because LOOCV is fitting the model on almost identical data for each item and the outcomes are highly correlated than the outcomes of K-Fold which are less correlated. Since the mean of highly correlated outcomes has higher variance than the one of less correlated the LOOCV variance is higher.

K = N (LOOCV) , larger the K→ higher variance and lower bias
K = 1, smaller the K → lower variance and higher bias

While taking this information into account, what we can do is to calculate the performance of the model for various Ks lets say K = 3,5,6,7…,10 or the Type I, Type II, and total model classification error in case of classification model. Then the best performing models’ K can be the optimal K using the idea of ROC curve(classification case) or the Elbow method(regression case).

The choice of k in K-fold is a matter of Bias-Variance Trade-Off and efficiency of the model. Usually, K-Fold CV and LOOCV provide similar results and their performance can be evaluated using simulated data.

Image Source: LunarTech.ai

Bootstrapping

Bootstrapping is another very popular resampling technique that is used for various purposes one of which is to effectively estimate the variability of the estimates/models or to create artificial samples from an existing sample and improve model performance like in case of Bagging or Random Forest.

It is used in many situations in which it is hard or even impossible to directly compute the standard deviation of a quantity of interest.

A very useful way to quantify the uncertainty associated with the statistical learning method. To obtain the standard errors/measure of variability
Not useful for Linear Regression since the standard R/Python provides these results (SE of coefficients)

Boostrapping is Extremely handy for other methods as well where variability is more difficult to quantify. The bootstrap sampling is performed with replacement, which means that the same observation can occur more than once in the bootstrap data set.

So, what Bootstrap does is take the original training sample and resamples from it by replacement, resulting in B different samples. Then for each of these simulated samples, the coefficient estimate is computed, then by taking the mean of these coefficient estimates and using the common formula for SE we calculate the Standard Error of the Bootstrapped model.

Optimization Techniques

Knowing the fundamentals of the Machine Learning models and how to train the model in a programming language of your choice such as Python is definitely big part of becoming technical Data Scientist. However, that’s only a part of the job.

In order to use the Machine Learning model to solve a business problem, you need to optimize it after you have established its baseline. That is, you need to optimize the set of hyper parameters in your Machine Learning model, in order to find the set of optimal parameters that result in the best performing model all things equal.

So, to optimize or to tune your Machine Learning model, you need too perform hyperparameter optimization. By finding the optimal combination of hyper parameter values, we can decrease the error the model produces and build the most accurate model.

A model hyperparameter is a constant in the model that is external to the it and whose value cannot be estimated from data but should be specified in advanced before the model is trained. For instance, k in k-Nearest Neighbors (kNN)or the number of hidden layers in Neural Networks. Hyperparameter optimization methods are usually chategorized into:

Exhaustive Search or Brute Force Approach (e.g. Grid Search )
Gradient Descent (Batch GD, SGD, SDG with Momentum, Adam)
Genetic Algorithms

In this article, I will discuss only the first two types of optimisation techniques.

Brute Force Approach (Grid Search)

The Exhaustive Search often referred as Grid Search or Brute Force Approach is the process of looking for the most optimal hyperparameters by checking each of the candidates for the hyperparameters and compute the model error rate.

Once we create the list of possible values for each of the hyperparameters, for every possible combination of hyper parameter values, we calculate the model error rate and compare it to the current optimal model (one with minimum error rate). During each iteration, the optimal model is updated if the new parameter values result in lower error rate.

The optimisation method is simple. For instance, if you are working with a K-means clustering algorithm, you can manually search for the right number of clusters. However, if there are hundreds or thousands of possible combination of hyperparameter values that you have to consider, the model can take hours or days to train and it becomes incredibly heavy and slow. Therefore, most of the time, brute-force search is inefficient.

So, to optimize or to tune your Machine Learning model, you need too perform hyperparameter optimization. By finding the optimal combination of hyper parameter values, we can decrease the error the model produces and build the most accurate model.

When it comes to Gradient Descent type of optimisation techniques, then its variants such as Batch Gradient Descent, Stochastic Gradient Descent, etc differ in terms of the amount of data used to compute the gradient of the Loss or Cost function. Lets define this Loss Function by J(θ) where θ (theta) represents the parameter we want to optimize.

The amount of data usage is about a trade-off between the accuracy of the parameter update and the time it takes to perform such update. Namely, the larger data sample we use we can expect more accurate adjustment of a parameter but the process will be then much slower. The opposite holds true as well. The smaller is the data sample, less accurate will be the adjustments in the parameter but much faster will be the process.

Gradient Descent Optimization (GD)

Batch Gradient Descent algorithm or often refers just as Gradient Descent (GD), computes the gradient of the Loss Function J(θ) with respect to the target parameter using the entire training data.This is being done by first predicting the values for all observations in each iteration, and comparing it to its given value in the training data. These two values are used to calculate the prediction error term per observation which is then used to update the model parameters. This process continues until the model converges.

The gradient or the first order derivative of the loss function can be expressed as follows:

Then, this gradient is used to update the previous iterations’ value of the target parameter. That is:

where η (eta) is the learning rate of the GD Optimization process and ccan be [0,1] but is is usually a number between (0.001 and 0.04). There are two major disadvantages to GD which make this optimization technique not so popular especially when dealing with large and complex datasets. Since in each iteration the entire training data should be used and stored, the computation time can be very large resulting in incredibly slow process. n the top of that, storing that large amount of data results in memory issues making GD computationally heavy and slow.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) method, also known as Incremental Gradient Descent, is an iterative approach for solving optimisation problems with a differential objective function, exactly like GD.

But unlike GD, SGD doesn’t use the entire batch of training data to update the parameter value in each iteration. The SGD method is often referred as the stochastic approximation of the gradient descent which aims to find the extreme or zero points of the stochastic model containing parameters that cannot be directly estimated.

What SGD does is minimising this cost function by sweeping through data in the training dataset and updating in every iteration the values of parameters. In SGD all model parameter are improved in each iteration step with only one training sample. So, instead of going through all training samples at once to modify model parameters, SGD algorithm improves parameters by looking at a single and randomly sampled training set (hence the name Stochastic). That is:

This single-step improves the speed of the process of finding the global minima of the optimization problem and this is what differentiate SGD from GD. So, SGD consistently adjusts the parameters with an attempt to move in the direction of the global minimum of the objective function.

SGD addresses the issue of GD of slow computation time, because it scales well with both big data and with a size of the model. However, even though SGD method itself is simple and fast, it is known as a “bad optimizer” because it is prone to finding local optimum instead of a global optimum.

Image Source: LunarTech.ai

In SGD all model parameter are improved in each iteration step with only one training sample. So, instead of going through all training samples at once to modify model parameters, SGD algorithm improves parameters by looking at a single training sample. This single-step improves the speed of the process of finding the global minima of the optimization problem and this is what differentiate SGD from GD.

SGD Momentum

When the error function is complex and non-convex, instead of finding the global optimum, the SGD algorithm mistakenly moves in the direction of numerous local minima, resulting in higher computation time. In order to address this issue and further improve the SGD algorithm, various methods have been introduced. One popular way of escaping a local minimum and moving right in direction of global minimum is SGD with Momentum has been intrduced.

The goal of the SGD method with momentum is to accelerate gradient vectors in the direction of the global minimum resulting in faster convergence. The idea behind the momentum is that the model parameters are learned by using the directions and values of previous parameter adjustments. Moreover, the adjustment values are calculated in such a way that more recent adjustments are weighted heavier (they get larger weights) compared to the very early adjustments (they get smaller weights).

The reason for this difference is that with the SGD method we do not determine the exact derivative of the loss function, but we estimate it on a small batch. Since the gradient is noisy, it is likely that it will not always move in the optimal direction.

The momentum helps then to estimate those derivatives more accurately, resulting in better direction choices when moving towards the global minimum. Another reason for the difference in the performance of classical SGD and SGD with momentum lies in the area referred as Pathological Curvature, also called ravine area*.*

Pathological Curvature or Ravine Area can be represented by the following graph where the orange line represents the path taken by the method based on the gradient while the dark blue line represents the ideal path in towards the direction of nding the global optimum.

Image Source: The author

To visualise the difference between the SGD and SGD Momentum lets look at the following figure. In the left hand-side is SGD method without Momentum.

Image Source: The author

In the right hand-side is the SGD with Momentum. The orange pattern represents the path of the gradient in a search of the global minimum.

The idea behind the momentum is that the model parameters are learned by using the directions and values of previous parameter adjustments. Moreover, the adjustment values are calculated in such a way that more recent adjustments are weighted heavier (they get larger weights) compared to the very early adjustments (they get smaller weights).

Image Source: LunarTech.ai

Adam Optimizer

Another popular technique for enhancing SGD optimization procedure is the Adaptive Moment Estimation (Adam) introduced by Kingma and Ba (2015). Adam is the extended version of the SGD with momentum method.

The main difference of it compared to the SGD with momentum, which uses a single learning rate for all parameter updates, Adam algorithm defines different learning rates for different parameters. The algorithm calculates the individual adaptive learning rates for each parameter based on the estimates of the first two moments of the gradients (first and the second order derivative f the Loss function).

So, each parameter has a unique learning rate, which is being updated using the exponential decaying average of the rst moments (the mean) and second moments (the variance) of the gradients.

Image Source: LunarTech.ai

About the Author — That’s Me!

Hello, fellow enthusiasts! I am Tatev, the person behind the insights and information shared in this blog. My journey in the vibrant world of Data Science and AI has been nothing short of incredible, and it’s a privilege to be able to share this wealth of knowledge with all of you.

Connect and Learn More:

[Get to know me better on LinkedIn](https://www.linkedin.com/in/tatev-karen-aslanyan/)
[Visit my personal website](https://tatevaslanyan.com/)
[Schdule a 1-on-1 with me](https://tatevaslanyan.com/contact/)

Feel free to connect; whether it’s to discuss the latest trends, seek career advice, or just to share your own exciting journey in this field. I believe in fostering a community where knowledge meets passion, and I’m always here to support and guide aspiring individuals in this vibrant industry.

Want to learn everything about Data Science and how to land a Data Science job? Download this FREE Data Science and AI Career Handbook

Meet Tatev, Blog Author

Behind this blog and LunarTech’s learning journey stands a passionate expert and industry veteran — [Your Name]. With a rich background in Data Science and AI, [Your Name] embodies a fusion of deep knowledge and a keen desire to facilitate meaningful learning experiences.

Connect with [Your Name]:

LinkedIn: [Your Profile Link]
Personal Website: [Your Website Link]

Get to know the mind behind LunarTech, where the mission is not just to educate, but to foster a community of aspiring professionals ready to make their mark in the tech industry. Feel free to reach out and connect to discuss industry trends, seek guidance, or simply to say hello!

Take Your Machine Learning Journey to the Next Level with LunarTech!

If this blog ignited a spark in you, imagine getting more value, insigths and pracwith LunarTech’s complete Machine Learning Course that includes complete ML theory, hands-on practice material, case studies, interview preparation and more. Embark on an enriching journey where your curiosity meets our treasure trove of resources and hands-on materials, meticulously crafted to transform you into a seasoned professional in the Data Science and AI sector.

Why Choose LunarTech?

Comprehensive Learning: Grasp the nuances of various machine learning algorithms, discussed extensively in this blog, through a structured learning path.
Career Readiness: Beyond technical skills, we equip you with invaluable insights to ace machine learning interviews and land your dream job.
Rich Resource Database: Access one of the largest databases of practical interview questions, career guides, and interview tips, fostering a seamless learning and job-hunting experience.
Practical Approach: Delve into hands-on material that allows you to practically implement and hone your machine learning skills, making you a sought-after candidate in the industry.
Global Recognition: LunarTechs commitment to fostering cutting- edge learning and its’ Ultimate Data Science Bootcamp have been recognized by prominent global platforms for its excellence in Data Science and advancing Machine Learning and AI domains. We are proud to have been recognized by Yahoo Finance, Bloomberg, Business Insider, MarketWatch, Benzinga, TechTimes and Others.

Dare to step boldly into the world of machine learning. Your path to becoming a proficient Machine Learning practitioner is just a click away. Discover your potential and redefine your career trajectory with LunarTech.

Ready to leap forward? Explore LunarTech’s The Ultimate Data Science Bootcamp and become a part of the revolution.