A Comprehensive Guide to Clustering in Python
November 18, 2024

Learn key Machine Learning clustering algorithms and topics in one place: K-Means, Hierarchical, and DBSCAN clustering, the Elbow Method, and t-SNE, with examples, code, and visualisations.


Want to uncover the hidden patterns within your data? Clustering, an essential technique in Unsupervised Machine Learning, holds the key to discovering valuable insights that can revolutionize your understanding of complex datasets.

In this comprehensive guide to clustering in Python, we will cover all the must-know clustering algorithms and techniques: the theory behind each, combined with examples, Python implementations, and visualizations.

Here’s what you can expect to find in this blog:

1. Introduction to Unsupervised Learning
2. Supervised vs. Unsupervised Learning
3. Important Terminology
4. Preparing Data for Unsupervised Learning
5. Clustering Explained
6. K-Means Clustering
7. K-Means Clustering: Python Implementation
8. K-Means Clustering: Visualization
9. Elbow Method for Optimal Number of Clusters (K)
10. Hierarchical Clustering
11. Hierarchical Clustering: Python Implementation
12. Hierarchical Clustering: Visualization
13. DBSCAN Clustering
14. DBSCAN Clustering: Python Implementation
15. DBSCAN Clustering: Visualization
16. t-SNE advanced visualization of Clusters

Introduction to Unsupervised Learning

Unsupervised learning is a powerful technique in machine learning that allows us to uncover hidden patterns and structures within data without any predefined labels or target variables. Unlike supervised learning, which relies on labeled data for training, unsupervised learning provides us with the ability to explore and understand the inherent structure within unlabeled datasets.

One key application of unsupervised learning is clustering. Clustering is the process of grouping similar data points together based on their intrinsic characteristics and similarities. By identifying patterns and relationships within datasets, clustering helps us gain valuable insights and make sense of complex data.

Clustering finds its significance in various domains, including customer segmentation, anomaly detection, image recognition, and recommendation systems. It enables us to identify distinct groups within data, classify data into meaningful categories, and understand the underlying trends driving datasets.

In the next sections, we will delve deeper into different clustering algorithms, such as K-Means, hierarchical clustering, and DBSCAN, exploring their theories, implementations, and visualizations. By the end of this tutorial, you will have a comprehensive understanding of unsupervised learning and be equipped with the knowledge and skills to apply various clustering techniques to your own data analysis tasks.

Remember, clustering is just one aspect of unsupervised learning, which offers a range of other techniques and applications. So, let’s dive in and discover the exciting world of unsupervised learning and the power it holds for extracting insights from unlabeled data.


Supervised vs. Unsupervised Learning

When it comes to machine learning, there are two primary approaches: supervised learning and unsupervised learning. Understanding the differences between these two approaches is crucial in selecting the right technique for your data analysis needs.

Supervised learning, as the name suggests, involves training a machine learning model on labeled data. In this approach, the input data consists of features (also known as attributes or variables) and corresponding target values or labels. The model learns from this labeled data and makes predictions or classifications based on new, unseen data.

On the other hand, unsupervised learning is all about exploring unlabeled data. With unsupervised learning, the data does not come with predefined labels or target values. Instead, the algorithm searches for patterns, structures, and relationships within the data on its own. The goal is to discover hidden insights and gain a deeper understanding of the underlying structure of the data.

One of the key advantages of unsupervised learning is its ability to uncover previously unknown patterns and relationships. Without the constraints of labeled data, unsupervised algorithms can reveal valuable insights that may not be apparent through other analytical methods. This makes unsupervised learning particularly useful in exploratory data analysis, anomaly detection, and clustering.

In supervised learning, the target variable serves as a guiding force for the learning process, enabling the model to make accurate predictions or classifications. However, this reliance on labeled data can also limit the model’s capabilities, as it may struggle with unrepresented or novel patterns that were not present in the training data.

In contrast, unsupervised learning allows for a more flexible and adaptable approach. It can capture the underlying structure and relationships within the data, even when explicit labels are unavailable. By leveraging clustering algorithms and dimensionality reduction techniques, unsupervised learning offers powerful tools to unravel complex datasets.

In summary, supervised learning is well-suited for tasks where labeled data is available and the goal is to make precise predictions or classifications. Unsupervised learning, on the other hand, is valuable when exploring data for hidden patterns and relationships, especially in cases where labeled data is scarce or non-existent. By understanding the differences between these two approaches, you can effectively choose the right technique to unleash the full potential of your data analysis efforts.

Important Terminology

To fully understand unsupervised learning and clustering, it’s crucial to be familiar with key terms associated with these concepts. Here are some important terminologies you should know:

Data Point

A data point refers to an individual observation or instance within a dataset. Each data point contains various features or attributes that describe a specific object or event.

Number of Clusters

The number of clusters represents the desired or estimated amount of distinct groups in which the data will be partitioned during the clustering process. It is an essential parameter that determines the structure of the resulting clusters.

Unsupervised Algorithm

An unsupervised algorithm is a mathematical procedure used to identify patterns or relationships in data without the need for labeled or pre-categorized examples. These algorithms explore the inherent structure and complexity of datasets to uncover hidden insights.

Understanding and utilizing these terminologies will lay a strong foundation for your journey into unsupervised learning and clustering. In the following sections, we will delve deeper into the practical aspects and implementation of clustering techniques in Python.


Preparing Data for Unsupervised Learning

Before implementing unsupervised learning algorithms, it is crucial to ensure that the data is properly prepared. This involves taking certain steps to optimize the input data, making it suitable for analysis using clustering techniques. The following are important considerations when preparing data for unsupervised learning:

Data Normalization

One key aspect of data preparation is normalization, where all features are scaled to a consistent range. This is necessary because variables in the dataset may have different units or scales. Normalization helps avoid bias towards any particular feature during the clustering process. Common methods for normalization include min-max scaling and standardization.
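
For instance, here is a minimal sketch of both approaches using scikit-learn's MinMaxScaler and StandardScaler (the toy feature matrix is made up purely for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy feature matrix with very different scales per column
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])

# min-max scaling: rescales each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# standardization: zero mean and unit variance per feature
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)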

Handling Missing Values

Dealing with missing values is another critical step. It is important to identify and address any missing values in the dataset before applying clustering algorithms. There are various techniques for handling missing values, such as imputation, where missing values are replaced with estimated values based on statistical methods or algorithms.
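
As a small sketch, mean imputation with scikit-learn's SimpleImputer might look like this (the data and the choice of the mean strategy are illustrative assumptions):

import numpy as np
from sklearn.impute import SimpleImputer

# toy data with a missing value encoded as np.nan
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# replace missing entries with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)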

Outlier Detection and Treatment

Outliers can significantly impact clustering results, as they can influence the determination of cluster boundaries. Therefore, it is essential to detect and handle outliers appropriately. This can involve techniques like Z-score or interquartile range (IQR) analysis to identify and treat outliers.
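
As an illustrative sketch, the IQR rule can be applied with plain NumPy (the 1.5 multiplier is the usual convention, and the toy values are made up):

import numpy as np

values = np.array([8, 9, 10, 10, 11, 12, 95])  # 95 is an obvious outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [95]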

Dimensionality Reduction

In some cases, the dataset might have a high dimensionality, meaning it contains a large number of features. High-dimensional data can be challenging to visualize and analyze effectively. Dimensionality reduction techniques, such as Principal Component Analysis (PCA), can be employed to reduce the number of features while retaining the most informative aspects of the data.
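
A minimal PCA sketch with scikit-learn might look like this (the synthetic data and the choice of 2 components are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

# synthetic data: 100 observations, 4 features
X = np.random.rand(100, 4)

# keep the 2 components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component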

By carefully preparing the data, normalizing variables, handling missing values, addressing outliers, and reducing dimensionality when necessary, you can optimize the quality of input data for unsupervised learning algorithms. This ensures accurate and meaningful clustering results, leading to valuable insights and patterns within the data.

Remember, data preparation is a crucial step in the unsupervised learning process, setting the foundation for successful clustering analysis.


Clustering Explained

Clustering is a fundamental technique in unsupervised learning that plays a crucial role in uncovering hidden patterns within data. It involves grouping data points based on their similarity, allowing us to identify distinct subsets or clusters within a dataset. By analyzing the structure of these clusters, we can gain valuable insights and make data-driven decisions.

Concept of Clustering

At its core, clustering aims to find similarities or relationships between data points without any predefined labels or target variables. The goal is to maximize the similarity within each cluster while maximizing the dissimilarity between different clusters. This process enables us to identify patterns and inherent structures within the data.

Clusters can be defined by various factors such as distance, connectivity, or density. Each data point within a cluster shares more similarities with other points in the same cluster than with points in other clusters. This grouping allows us to segment the data, which can be immensely useful in various domains such as customer segmentation, anomaly detection, and image recognition.

Types of Clustering Algorithms

There are several clustering algorithms available, each with its own approach to partitioning data into clusters. Some popular ones include K-Means Clustering, Hierarchical Clustering, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

K-Means Clustering

K-Means Clustering is a widely used algorithm that aims to partition data into K distinct clusters. It iteratively assigns each data point to the nearest cluster centroid and then recomputes the centroids. This process continues until convergence, resulting in well-defined clusters.

Hierarchical Clustering

Hierarchical Clustering creates a hierarchy of clusters by recursively dividing or merging them based on certain criteria. This approach can be represented as a dendrogram, which provides valuable insights into the hierarchy and relationships between clusters.

DBSCAN Clustering

DBSCAN is a density-based algorithm that groups data points based on their density and connectivity. It is particularly effective in identifying clusters of arbitrary shapes and handling noisy data.

These are just a few examples of clustering algorithms, each with its own strengths and suitability for specific scenarios. It is important to select the most appropriate algorithm based on the data characteristics and problem domain.

In the next sections, we will delve deeper into the theories, implementation, and visualization of these clustering algorithms to provide you with a comprehensive understanding of how they work and when to use them.

Remember, clustering is a powerful technique that allows us to unlock the hidden structures within our data, leading to valuable insights and informed decision-making. Let’s dive into the world of clustering and discover the potential it holds.


K-Means Clustering

K-Means clustering is a popular unsupervised learning algorithm used to partition data points into distinct groups based on similarity. In this section, we will dive into the theory behind K-Means clustering and explore its implementation in Python using the scikit-learn library.

In Data Science and Data Analytics, we often want to categorize observations into a set of segments or clusters for different purposes. For instance, a company might want to cluster its customers into 3–5 groups based on their transaction history or frequency of purchases. This is usually an Unsupervised Learning approach where the labels (groups/segments/clusters) are unknown.

One of the most popular approaches for clustering observations into groups is the unsupervised clustering algorithm K-Means. The following conditions hold for K-Means clustering:

  • the number of clusters K needs to be specified in advance
  • every observation needs to belong to at least one cluster
  • every observation needs to belong to exactly one cluster (the clusters are non-overlapping), so no observation can belong to more than one cluster

The idea behind K-Means is to minimize the within-cluster variation and maximize the between-cluster variation. In other words, K-Means partitions the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible. The motivation is to cluster observations so that observations assigned to the same group are as similar as possible, while observations from different groups are as different as possible.

Mathematically, the within-cluster variation is defined based on the choice of distance measure, which you can choose yourself. For instance, you can use Euclidean distance, Manhattan distance, etc. as the distance measure.

K-Means clustering is optimal when the within-cluster variation is the smallest. The within-cluster variation of cluster C_k is a measure W(C_k) of the amount by which the observations in the cluster differ from each other. Therefore, the following optimization problem should be solved:

minimize over C_1, …, C_K:  Σ_{k=1}^{K} W(C_k)

where the within-cluster variation using Euclidean distance can be expressed as:

W(C_k) = (1 / |C_k|) Σ_{i, i′ ∈ C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²

The number of observations in the kth cluster is denoted by |C_k|, and p is the number of variables. Thus, the optimization problem for K-Means can be described as:

minimize over C_1, …, C_K:  Σ_{k=1}^{K} (1 / |C_k|) Σ_{i, i′ ∈ C_k} Σ_{j=1}^{p} (x_{ij} − x_{i′j})²

K-Means Algorithm

The pseudocode of the K-means Algorithm can be described as follows:

Image Source: The Author

K-Means is a non-deterministic approach, and its randomness comes in Step 1, where all observations are randomly assigned to one of the K classes.

In the second step, for each cluster, the cluster centroids are calculated by computing the mean values of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in that cluster, where p is the number of variables.

Then, in the next step, the cluster assignments are updated such that each observation is assigned to the cluster whose centroid is closest, iteratively minimizing the total within-cluster sum of squares. That is, we iterate steps 2 and 3 until the cluster centroids no longer change or the maximum number of iterations is reached.

K-Means Clustering: Python Implementation

Let us look at an example where we aim to cluster observations into 4 classes. The raw data looks like this:

Image Source: The Author

from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

# Function for performing K-Means clustering
def KMeans_Algorithm(df, K):
    # fit K-Means with K clusters, k-means++ initialization and a fixed random state
    KMeans_model = KMeans(n_clusters=K, init='k-means++', max_iter=300,
                          random_state=2021)
    KMeans_model.fit(df)
    # storing the centroids
    centroids = KMeans_model.cluster_centers_
    # getting the labels/classes of each observation
    labels = KMeans_model.labels_
    df = pd.DataFrame(df)
    df["labels"] = labels
    return df, centroids

# creating data for K-Means Clustering
X1 = np.random.randint(0, 4, size=[300, 1])
X2 = np.random.uniform(0, 10, size=[300, 1])
df = np.append(X1, X2, axis=1)

# cluster the observations into K = 4 groups
df, centroids = KMeans_Algorithm(df=df, K=4)

K-Means Clustering: Visualization

One of the key advantages of K-Means is its simplicity and efficiency in handling large datasets. In this section, we will explore how to implement K-Means clustering in Python and visualize the results.

Understanding the K-Means Algorithm

Before we dive into the implementation, let’s briefly understand how the K-Means algorithm works. The algorithm follows these steps:

1. Step 1: Initialization: Randomly select K centroids, where K represents the desired number of clusters.

2. Step 2: Assignment: Assign each data point to the nearest centroid based on the Euclidean distance.

3. Step 3: Update: Recalculate the centroids by taking the mean of all data points assigned to each cluster.

4. Step 4: Repeat: Repeat steps 2 and 3 until convergence criteria are met (e.g., minimal centroid movement).

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 6))

# plot the observations of each cluster (columns 0 and 1) in a different color
plt.scatter(df[df["labels"] == 0][0], df[df["labels"] == 0][1],
            c='black', label='cluster 1')
plt.scatter(df[df["labels"] == 1][0], df[df["labels"] == 1][1],
            c='green', label='cluster 2')
plt.scatter(df[df["labels"] == 2][0], df[df["labels"] == 2][1],
            c='red', label='cluster 3')
plt.scatter(df[df["labels"] == 3][0], df[df["labels"] == 3][1],
            c='y', label='cluster 4')
# mark the cluster centroids returned by KMeans_Algorithm
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300, c='black', label='centroid')
plt.legend()
plt.xlim([-2, 6])
plt.ylim([0, 10])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Visualization of clustered data')
ax.set_aspect('equal')
plt.show()

Image Source: The Author

In the figure, K-Means has clustered these observations into 4 groups, and as you can see from the visualisation, the resulting grouping looks natural and makes sense.

Elbow Method for Optimal Number of Clusters (K)

One of the biggest challenges in using K-Means is the choice of the number of clusters. Sometimes this is a business decision, but most of the time we want to pick a K that is optimal and makes sense. One of the most popular methods to determine this optimal value of K, or number of clusters, is the Elbow Method.

To use this approach you need to know what Inertia is. Inertia is the sum of squared distances of samples to their closest cluster centre. So, the Inertia, or within-cluster sum of squares, gives an indication of how coherent or pure the different clusters are. Inertia can be described as follows:

Inertia = Σ_{i=1}^{N} ||x_i − C_{k(i)}||²

where N is the number of samples in the data set, x_i is the ith sample, and C_{k(i)} is the centre of the cluster to which x_i is assigned. So, the Inertia simply computes the squared distance of each sample in a cluster to its cluster centre and sums them up.

Then we can calculate the Inertia for different numbers of clusters K, and plot it as in the following figure, where we consider K = 1, 2, …, 10. From the graph we can select the K at which the elbow occurs; in this case the elbow happens at K = 3.

Image Source: The Author

def Elbow_Method(df):
    inertia = []
    # considering K = 1, 2, ..., 10 clusters
    K = range(1, 11)
    for k in K:
        KMeans_Model = KMeans(n_clusters=k, random_state=2022)
        KMeans_Model.fit(df)
        inertia.append(KMeans_Model.inertia_)
    return inertia

K = range(1, 11)
# use only the feature columns, not the labels column added earlier
inertia = Elbow_Method(df[[0, 1]])
plt.figure(figsize=(17, 8))
plt.plot(K, inertia, 'bx-')
plt.xlabel("K: number of clusters")
plt.ylabel("Inertia")
plt.title("K-Means: Elbow Method")
plt.show()


In conclusion, K-Means clustering offers an efficient and effective approach to grouping data points based on similarity. With the scikit-learn library and matplotlib, you can easily apply K-Means to your own datasets, visualize the resulting clusters, and gain valuable insights from your data.


Hierarchical Clustering Theory

Another popular clustering technique is Hierarchical Clustering, an unsupervised learning technique that also groups observations into segments. However, unlike K-Means, Hierarchical Clustering starts by treating each observation as a separate cluster.

Agglomerative vs. Divisive Clustering

There are two main types of hierarchical clustering: agglomerative and divisive.

Agglomerative clustering starts by assigning each data point to its own cluster. Then, it iteratively merges the most similar clusters based on a chosen distance metric until a single cluster containing all data points is formed. This bottom-up approach creates a binary tree-like structure, also known as a dendrogram, where the height of each node represents the dissimilarity between the clusters being merged.

On the other hand, divisive clustering begins with a single cluster containing all data points. It then recursively divides the cluster into smaller subclusters until each data point is in its own cluster. This top-down approach generates a dendrogram that provides insights into the hierarchy of clusters.

Distance Metrics for Hierarchical Clustering

To determine the similarity between clusters or data points, various distance metrics can be used. Commonly employed distance measures include Euclidean distance, Manhattan distance, and cosine similarity. These metrics quantify the dissimilarity or similarity between pairs of data points and guide the clustering process.
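
To make these metrics concrete, here is a small sketch using SciPy's pdist, which computes all pairwise distances under a chosen metric (note that SciPy calls Manhattan distance 'cityblock'; the points are made up):

import numpy as np
from scipy.spatial.distance import pdist

points = np.array([[1.0, 1.0], [4.0, 5.0], [7.0, 9.0]])

# pairwise distances between all points under different metrics
print(pdist(points, metric='euclidean'))  # [ 5. 10.  5.]
print(pdist(points, metric='cityblock'))  # [ 7. 14.  7.]
print(pdist(points, metric='cosine'))     # small values: the points lie in similar directions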

In this technique, initially each data point is considered an individual cluster. At each iteration, the most similar (least dissimilar) clusters are merged into one cluster, and this process continues until only a single cluster remains. So, the algorithm repeatedly performs the following two steps:

  • 1: identify the two clusters that are closest together
  • 2: merge the two most similar clusters.

It continues this iterative process until all the clusters are merged together. How the dissimilarity (or similarity) of two clusters is calculated depends on the linkage type one assumes. There are 5 popular linkage options (a short sketch comparing them follows the list):

  • Complete Linkage: maximum intercluster dissimilarity; compute all pairwise dissimilarities between the observations in cluster K1 and the observations in cluster K2, then pick the largest of these dissimilarities.
  • Single Linkage: minimum intercluster dissimilarity; compute all pairwise dissimilarities between the observations in cluster K1 and the observations in cluster K2, then pick the smallest of these dissimilarities.
  • Average Linkage: mean intercluster dissimilarity; compute all pairwise dissimilarities between the observations in cluster K1 and the observations in cluster K2, then take the average of these dissimilarities.
  • Centroid Linkage: the dissimilarity between the centroid of cluster K1 and the centroid of cluster K2 (this is usually the least desirable choice of linkage, since it might result in a lot of overlap).
  • Ward’s method: decides which observations to merge based on reducing the sum of squared distances of each observation from the average observation in its cluster.
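
As a sketch of how the linkage choice is passed to SciPy (the random data is illustrative, and the cophenetic correlation is just one quick way to compare linkages, not the only one):

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(50, 2)

# build the hierarchy under different linkage criteria
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(X, method=method)
    # cophenetic correlation: how faithfully the dendrogram preserves pairwise distances
    c, _ = cophenet(Z, pdist(X))
    print(f"{method:>9} linkage, cophenetic correlation: {c:.3f}")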

Hierarchical Clustering Python Implementation

Hierarchical clustering is a powerful unsupervised learning technique that allows you to group data points into clusters based on their similarity. In this section, we will explore the implementation of hierarchical clustering using Python.

Here is an example of how to implement hierarchical clustering using Python:

import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
import numpy as np
import pandas as pd

# creating data for Hierarchical Clustering
X1 = np.random.randint(0, 4, size=[300, 1])
X2 = np.random.uniform(0, 10, size=[300, 1])
df = np.append(X1, X2, axis=1)

# linkage matrix (Ward's method), used later to draw the dendrogram
hierCl = sch.linkage(df, method='ward')

# agglomerative clustering with 7 clusters and Ward linkage
# note: in scikit-learn >= 1.2 the 'affinity' parameter is called 'metric'
Hcl = AgglomerativeClustering(n_clusters=7, affinity='euclidean', linkage='ward')
Hcl_fitted = Hcl.fit_predict(df)
df = pd.DataFrame(df)
df["labels"] = Hcl_fitted

Hierarchical Clustering: Visualization

Hierarchical clustering is a powerful technique in unsupervised learning that allows us to uncover underlying patterns in our data by creating clusters based on similarity. One of the key advantages of hierarchical clustering is its ability to create a hierarchical structure of clusters, which can provide valuable insights into the relationships between data points.

To visualize hierarchical clustering in Python, we can use various libraries such as Scikit-learn, SciPy, and Matplotlib. These libraries offer easy-to-use functions and tools that facilitate the visualization process.

So, after performing hierarchical clustering, it is often helpful to visualize the clusters. We can use various techniques for visualization, such as dendrograms or heatmaps.

A dendrogram is a tree-like diagram that shows the hierarchical relationships between clusters. It can be generated using the scipy library in Python.

Here is an example of how to visualize the dendrogram and the clustered points in Python:

# plotting the dendrogram to pick the right number of clusters
dendrogram = sch.dendrogram(hierCl)
plt.title('Dendrogram')
plt.xlabel("Observations")
plt.ylabel('Euclidean distances')
plt.show()

# Visualizing the clustered data
plt.scatter(df[df["labels"] == 0][0], df[df["labels"] == 0][1],
            c='black', label='cluster 1')
plt.scatter(df[df["labels"] == 1][0], df[df["labels"] == 1][1],
            c='green', label='cluster 2')
plt.scatter(df[df["labels"] == 2][0], df[df["labels"] == 2][1],
            c='red', label='cluster 3')
plt.scatter(df[df["labels"] == 3][0], df[df["labels"] == 3][1],
            c='magenta', label='cluster 4')
plt.scatter(df[df["labels"] == 4][0], df[df["labels"] == 4][1],
            c='purple', label='cluster 5')
plt.scatter(df[df["labels"] == 5][0], df[df["labels"] == 5][1],
            c='y', label='cluster 6')
plt.scatter(df[df["labels"] == 6][0], df[df["labels"] == 6][1],
            c='orange', label='cluster 7')  # distinct color so cluster 7 is not confused with cluster 1
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Hierarchical Clustering')
plt.show()

Here is a step-by-step guide to visualizing hierarchical clustering in Python:

Step 1: Preprocess the data

Before visualizing hierarchical clustering, it is important to preprocess the data by scaling or normalizing it. This ensures that all features have a similar range and prevents any bias towards specific features.

Step 2: Perform hierarchical clustering

Next, we perform hierarchical clustering using the chosen algorithm, such as AgglomerativeClustering from Scikit-learn. This algorithm calculates the similarity between data points and merges them into clusters based on a specific linkage criterion.

Step 3: Create a dendrogram

A dendrogram is a tree-like diagram that displays the hierarchical structure of the clusters. We can use the dendrogram function from the SciPy library to create this visualization. The dendrogram allows us to visualize the distances and relationships between clusters.

Step 4: Plot the clusters

Finally, we can plot the clusters using a scatter plot or another suitable visualization technique. This helps us visualize the data points within each cluster and gain insights into the characteristics of each cluster.

Image Source: The Author

This dendrogram can then help us decide how many clusters to use. It seems that, in this case, 7 clusters is a reasonable choice.

Image Source: The Author

By visualizing hierarchical clustering in Python, we can gain a better understanding of the structure and relationships within our data. This visualization technique is particularly useful when dealing with complex datasets and can assist in decision-making processes and pattern discovery.

Remember to adjust the specific parameters and settings based on your dataset and objective. Experimenting with different visualizations and techniques can lead to even deeper insights into your data.


DBSCAN Clustering Theory

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised learning algorithm used for clustering analysis. It is particularly effective in identifying clusters of arbitrary shape and handling noisy data. Unlike K-Means or Hierarchical clustering, DBSCAN does not require specifying the number of clusters in advance. Instead, it defines clusters based on density and connectivity within the data.

How DBSCAN Works:

1. Density-Based Clustering: DBSCAN groups data points together that are in close proximity to each other and have a sufficient number of nearby neighbors. It identifies dense regions of data points as clusters and separates sparse regions as noise.

2. Core Points, Border Points, and Noise Points: DBSCAN categorizes data points into three types: Core Points, Border Points, and Noise Points.

- Core Points: Data points with a minimum number of neighboring points (defined by the `min_samples` parameter) within a specified distance (defined by the `eps` parameter).

- Border Points: Data points that are within the `eps` distance of a Core Point but do not have enough neighboring points to be considered Core Points.

- Noise Points: Data points that are neither Core Points nor Border Points.

3. Reachability and Connectivity: DBSCAN uses the notions of reachability and connectivity to define clusters. A data point is considered reachable from another data point if there is a path of Core Points that connects them. If two data points are reachable, they belong to the same cluster.

4. Cluster Growth: DBSCAN starts with an arbitrary data point and expands the cluster by examining its neighbors and their neighbors, forming a connected group of data points.

Benefits of DBSCAN Clustering:

- Ability to Detect Complex Structures: DBSCAN can discover clusters of various shapes and sizes, making it well-suited for datasets with non-linear relationships or irregular patterns.

- Robust to Noise: DBSCAN handles noisy data effectively by categorizing noise points separately from clusters.

- Automatic Determination of Cluster Numbers: DBSCAN does not require specifying the number of clusters in advance, making it more convenient and adaptable to different datasets.

- Scaling to Large Datasets: With an appropriate spatial index, DBSCAN’s time complexity is relatively low compared to some other clustering algorithms, allowing it to scale well to large datasets.

In the next section, we will delve into the implementation of the DBSCAN algorithm in Python, providing step-by-step guidance and examples.

DBSCAN Clustering: Python Implementation

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that is particularly effective in identifying clusters of arbitrary shapes in dense data sets. This section will guide you through the implementation of DBSCAN using Python.

Key Steps for DBSCAN Clustering

1. Data Preparation: Before applying DBSCAN, it is important to preprocess your data. This includes handling missing values, normalizing features, and selecting the appropriate distance metric.

2. Defining Parameters: DBSCAN requires two main parameters: epsilon (ε) and minimum points (MinPts). Epsilon determines the maximum distance between two points to consider them as neighbors, and MinPts specifies the minimum number of points required to form a dense region.

3. Density-Based Clustering: DBSCAN starts by randomly selecting a data point and identifying its neighbors within the specified epsilon distance. If the number of neighbors exceeds the MinPts threshold, a new cluster is formed. The algorithm expands this cluster by iteratively adding new points until no more points can be reached.

4. Noise Detection: Points that do not belong to any cluster are considered as noise or outliers. These points are not assigned to any cluster and can be critical in identifying anomalies within the data.

To perform DBSCAN clustering in Python, we can use the scikit-learn library. The first step is to import the necessary libraries and load the dataset we want to cluster. Then, we can create an instance of the DBSCAN class and set the epsilon (eps) and minimum number of samples (min_samples) parameters.

Here is a sample code snippet to get you started:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Generate some sample data
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Apply DBSCAN
db = DBSCAN(eps=0.3, min_samples=5, metric='euclidean')
y_db = db.fit_predict(X)

Remember to replace `X` with your actual data set. You can adjust the eps and min_samples parameters to get different clustering results. The eps parameter is the maximum distance between two samples for one to be considered as in the neighborhood of the other. The min_samples is the number of samples (or total weight) in a neighborhood for a point to be considered as a core point.

DBSCAN offers several advantages over other clustering algorithms. It does not require the number of clusters to be predefined, making it suitable for data sets with an unknown number of clusters. DBSCAN is also capable of identifying clusters of varying shapes and sizes, making it more flexible in capturing complex structures.

However, DBSCAN may struggle with varying densities in data sets and can be sensitive to the choice of epsilon and minimum points parameters. It is crucial to fine-tune these parameters to obtain optimal clustering results.
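
One common heuristic for choosing eps, sketched below on the same make_moons data as above, is the k-distance plot: compute each point's distance to its k-th nearest neighbor (with k equal to min_samples), sort these distances, and look for the "knee" in the curve, which suggests a reasonable eps value.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

min_samples = 5
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = neighbors.kneighbors(X)

# distance of each point to the farthest of its min_samples nearest neighbors
# (the point itself counts as one), sorted in ascending order
k_distances = np.sort(distances[:, -1])

plt.plot(k_distances)
plt.xlabel("Points sorted by distance")
plt.ylabel("k-th nearest neighbor distance")
plt.title("k-distance plot for choosing eps")
plt.show()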

By implementing DBSCAN in Python, you can leverage this powerful clustering algorithm to uncover meaningful patterns and structures in your data.

eps=0.3: Specifies how close points should be to each other to be considered a part of a cluster. Points that lie within the eps radius of each other are considered neighbors.

min_samples=5: The minimum number of points required to form a dense region.

The DBSCAN algorithm groups together closely packed points (or a cluster) and marks low-density regions as outliers (or noise). In our code, y_db contains the cluster labels assigned by DBSCAN. Here, if two points belong to the same cluster, they will have the same label in y_db.

  • Points in cluster 1 are marked with light blue color and circle markers (o).
  • Points in cluster 2 are marked with red color and square markers (s).

plt.scatter(X[y_db == 0, 0], X[y_db == 0, 1],
            c='lightblue', marker='o', s=40,
            edgecolor='black',
            label='cluster 1')
plt.scatter(X[y_db == 1, 0], X[y_db == 1, 1],
            c='red', marker='s', s=40,
            edgecolor='black',
            label='cluster 2')
plt.legend()
plt.show()

Image Source: The Author

The resulting plot will show two moon-shaped clusters in light blue and red, demonstrating that DBSCAN successfully identified and separated the two interleaved half circles.
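
DBSCAN marks noise points with the label -1. With the low-noise moons data above this run will typically produce little or no noise, but a quick check of the label counts (continuing from the snippet above) is straightforward:

import numpy as np

# count how many points fall into each cluster; -1 (if present) marks noise
labels, counts = np.unique(y_db, return_counts=True)
print(dict(zip(labels, counts)))
print("noise points:", np.sum(y_db == -1))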


How to evaluate the performance of a clustering algorithm?

Evaluating the performance of a clustering model can be challenging, as there are no ground truth labels available in unsupervised learning. However, there are several evaluation metrics that can provide insights into the quality of the clustering results.

- Silhouette coefficient: Measures how well each data point fits into its assigned cluster compared to other clusters. A higher silhouette coefficient indicates better clustering.

- Davies-Bouldin index: Measures the average similarity between each cluster and its most similar cluster, while considering the separation between clusters. Lower values indicate better clustering.

- Calinski-Harabasz index: Evaluates the ratio of between-cluster dispersion to within-cluster dispersion. Higher values indicate better-defined clusters.

- Visual assessment: Inspecting visual representations of the clustering results, such as scatter plots or dendrograms, can also provide valuable insights into the quality and meaningfulness of the clusters.

I would also recommend employing a combination of evaluation metrics and visual assessments to comprehensively assess the performance of a clustering model; a minimal sketch of the metric calculations follows.
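
As a minimal sketch (the synthetic data and the use of K-Means here are purely illustrative), all three scores are available in sklearn.metrics:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# synthetic data with 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)

print("Silhouette coefficient: ", silhouette_score(X, labels))        # higher is better
print("Davies-Bouldin index:   ", davies_bouldin_score(X, labels))    # lower is better
print("Calinski-Harabasz index:", calinski_harabasz_score(X, labels)) # higher is better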

Difference between K-Means and Hierarchical clustering

K-Means and Hierarchical clustering are two popular techniques used in unsupervised learning for clustering data. While both approaches aim to group similar data points together, they differ in their approach and characteristics. Understanding the differences between K-Means and Hierarchical clustering is crucial for choosing the most suitable technique for your data analysis needs.

K-Means Clustering

K-Means clustering is a centroid-based algorithm. It aims to partition data points into a predetermined number of clusters (K) based on their similarity. The algorithm starts by randomly initializing K centroids and then iteratively assigns each data point to the nearest centroid. Once all data points are assigned, the centroids are recalculated based on the mean of the points within each cluster. This process continues until convergence is reached.

Strengths of K-Means Clustering:

- Efficient and scalable for large datasets.

- Well-suited for numerical data.

- Intuitive and easy to implement.

- Converges to a solution in a finite number of steps.

Weaknesses of K-Means Clustering:

- Requires the number of clusters (K) to be predefined.

- Sensitive to initial centroid positions, which affects the final results.

- Non-ideal for clustering irregularly shaped or overlapping clusters.

- Does not handle noise or outliers well.

Hierarchical Clustering

Hierarchical clustering, as the name suggests, creates a hierarchy of clusters. It does not require a predefined number of clusters. Instead, it starts with each data point as an individual cluster and progressively merges similar clusters until a desired number or a single cluster remains. The result is a dendrogram, a tree-like structure that visualizes the nested clusters.

Strengths of Hierarchical Clustering:

- Does not require predefining the number of clusters.

- Can handle various types of data, including categorical and numerical.

- Provides a visual representation of cluster relationships through the dendrogram.

- Can capture complex cluster structures, including irregularly shaped clusters.

Weaknesses of Hierarchical Clustering:

- Less scalable and computationally expensive for large datasets.

- Can be challenging to select the appropriate number of clusters from the dendrogram.

- Prone to producing imbalanced clusters when dealing with unevenly distributed data.

- Sensitive to noise and outliers, which can affect the clustering hierarchy.

In summary, while K-Means clustering is efficient for large datasets and well-suited for numerical data, Hierarchical clustering offers flexibility with no predefined number of clusters and can capture complex structures. Consider the characteristics of your data and the objectives of your analysis when selecting the appropriate clustering technique.


t-SNE for visualization of Clusters with Python

t-SNE, or t-Distributed Stochastic Neighbor Embedding, is a popular and powerful dimensionality reduction technique used in unsupervised learning. It is particularly effective for visualizing high-dimensional data in a lower-dimensional space. In this section, we will explore the theory behind t-SNE and its implementation in Python.

Understanding t-SNE

t-SNE was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 as a method to visualize complex data structures. It aims to represent high-dimensional data points in a lower-dimensional space while preserving the local structure and pairwise similarities among the data points. t-SNE achieves this by modeling the similarity between data points in the high-dimensional space and the low-dimensional space.

The t-SNE Algorithm

The t-SNE algorithm proceeds in the following steps:

1. Compute pairwise similarities between data points in the high-dimensional space. This is typically done using a Gaussian kernel to measure the similarity based on the Euclidean distances between data points.

2. Initialize the low-dimensional embedding randomly.

3. Define a cost function that represents the similarity between data points in the high-dimensional space and the low-dimensional space.

4. Optimize the cost function using gradient descent to minimize the divergence between the high-dimensional and low-dimensional similarities.

5. Repeat the gradient descent updates in step 4 until the cost function converges.

Implementing t-SNE in Python is relatively straightforward with the help of libraries such as scikit-learn. The scikit-learn library provides a user-friendly API for applying t-SNE to your data. By following the scikit-learn documentation and examples, you can easily incorporate t-SNE into your machine learning pipeline.

2D t-SNE Visualisation

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE

# Load dataset
digits = datasets.load_digits()
X, y = digits.data, digits.target

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)

# Visualize the results on 2D plane
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, edgecolor='none', alpha=0.7, cmap=plt.cm.get_cmap('jet', 10))
plt.colorbar(scatter)
plt.title("t-SNE of Digits Dataset")
plt.show()

Image Source: The Author

In this example:

  1. We load the digits dataset.
  2. We apply t-SNE to reduce the data from 64 dimensions (since each image is 8x8) to 2 dimensions.
  3. We then plot the transformed data, coloring each point by its true digit label.

The resulting visualization will show clusters, each corresponding to one of the digits (0 through 9). This helps to understand how well-separated the different digits are in the original high-dimensional space.

Visualizing High-Dimensional Data

One of the main advantages of t-SNE is its ability to visualize high-dimensional data in a lower-dimensional space. By reducing the dimensionality of the data, t-SNE enables us to identify clusters and patterns that may not be apparent in the original high-dimensional space. The resulting visualization can provide valuable insights into the structure of the data and aid in decision-making processes.

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.manifold import TSNE
from mpl_toolkits.mplot3d import Axes3D

# Load dataset
digits = datasets.load_digits()
X, y = digits.data, digits.target

# Apply t-SNE
tsne = TSNE(n_components=3, random_state=0)
X_tsne = tsne.fit_transform(X)

# Visualize the results on 3D plane
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c=y, edgecolor='none', alpha=0.7, cmap=plt.cm.get_cmap('jet', 10))
plt.colorbar(scatter)
plt.title("3D t-SNE of Digits Dataset")
plt.show()

In this revised code:

  1. We set n_components=3 for t-SNE to get a 3D transformation.
  2. We use mpl_toolkits.mplot3d.Axes3D to create a 3D scatter plot.

After executing this code, you’ll see a 3D scatter plot where points are positioned based on their t-SNE coordinates, and they’re colored based on their true digit label.

Rotating the 3D visualization can help in understanding the spatial distribution of the data points better.

Image Source: The Author

In conclusion, t-SNE is a powerful tool for dimensionality reduction and visualization of high-dimensional data. By leveraging its capabilities, you can gain a deeper understanding of complex datasets and uncover hidden patterns that may not be immediately obvious. With its Python implementation and ease of use, t-SNE is a valuable asset for any data scientist or machine learning practitioner.


More Unsupervised Learning Techniques

In addition to the discussed clustering techniques, there are several other important unsupervised learning techniques worth exploring. While we won’t delve into them in detail here, let’s briefly mention two such techniques: mixture models and topic modeling.

Mixture Models

Mixture models are probabilistic models used for modeling complex data distributions. They assume that the overall dataset can be described as a combination of multiple underlying subpopulations or components, each described by its own probability distribution. Mixture models can be particularly useful in situations where data points do not clearly belong to distinct clusters and may exhibit overlapping characteristics.
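
As a brief sketch of the idea, using scikit-learn's GaussianMixture on synthetic data (the two-component setup is an assumption for this toy example):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# data drawn from two overlapping groups
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=2.0, random_state=0)

# fit a mixture of 2 Gaussian components
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# unlike hard clustering, each point gets a probability of belonging to each component
probabilities = gmm.predict_proba(X)
print(probabilities[:5])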

Topic Modeling

Topic modeling is a technique used to extract underlying themes or topics from a collection of documents. It allows for the exploration and discovery of latent semantic patterns in text data. By analyzing the co-occurrence of words across documents and identifying common themes, topic modeling enables automatic categorization and summarization of large textual datasets. This technique has applications in fields like natural language processing, information retrieval, and content recommendation systems.
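
A minimal sketch using scikit-learn's CountVectorizer and LatentDirichletAllocation on a few made-up documents (real topic modeling needs far more text than this):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the stock market and interest rates moved today",
    "investors watched the market react to rate changes",
    "the team won the match after a late goal",
    "fans celebrated the goal and the championship win",
]

# bag-of-words counts, then a 2-topic LDA model
counts = CountVectorizer(stop_words='english').fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# topic mixture of each document (rows sum to 1)
print(lda.transform(counts))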

While these techniques warrant further exploration beyond the scope of this tutorial, they are valuable tools to consider for uncovering hidden patterns and gaining insights from your data.

Remember, mastering unsupervised learning involves continuous learning and practice. By familiarizing yourself with different techniques like the ones mentioned above, you’ll be well-equipped to tackle a wide range of data analysis problems across various domains.

Now that we have covered a comprehensive overview of unsupervised learning techniques, it’s time to dive deeper into frequently asked questions to gain further clarity and expand your knowledge.

FAQs

Q: What is the difference between supervised and unsupervised learning?

Supervised learning involves training a model on labeled data, where the inputs are paired with corresponding outputs. The goal is to predict the output for new, unseen inputs. In contrast, unsupervised learning deals with unlabeled data, where the goal is to discover patterns, structures, or clusters within the data without any predefined output. Essentially, supervised learning aims to learn a mapping function, while unsupervised learning focuses on uncovering hidden relationships or groupings in the data.

Q: Which clustering algorithm is best for my data?

The suitability of a clustering algorithm depends on various factors, such as the nature of the data, the desired number of clusters, and the specific problem you are trying to solve. Commonly used clustering algorithms include K-means, hierarchical clustering, and DBSCAN.

- K-means is a popular algorithm that aims to partition the data into K clusters, with each data point assigned to the nearest centroid. It works well for evenly distributed, spherical clusters and requires the number of clusters to be specified in advance.

- Hierarchical clustering builds a hierarchy of clusters by iteratively merging or splitting them. It provides a dendrogram to visualize the clustering process and can handle different shapes and sizes of clusters.

- DBSCAN is a density-based algorithm that groups together data points that are close to each other and separates outliers. It can discover clusters of arbitrary shape and does not require the number of clusters to be known beforehand.

To determine the best algorithm, it is recommended to experiment with different techniques and assess their performance based on metrics like cluster quality, computational efficiency, and interpretability.

Q: Can unsupervised learning be used for predictive analytics?

While unsupervised learning primarily focuses on discovering patterns and relationships within data without specific output labels, it can indirectly support predictive analytics. By uncovering hidden structures and clusters within the data, unsupervised learning can provide insights that enable better feature engineering, anomaly detection, or segmentation, which can subsequently enhance the performance of predictive models.

Unsupervised learning techniques like clustering can help identify distinct groups or patterns in the data, which can be used as input features for predictive models or serve as a basis for generating new predictive variables. Therefore, unsupervised learning plays a valuable role in predictive analytics by facilitating a deeper understanding of the data and enhancing the accuracy and effectiveness of predictive models.


FREE Data Science and AI Resources

Want to discover everything about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job? Download this FREE Data Science and AI Career Handbook

FREE Data Science and AI Career Handbook

Want to learn Machine Learning from scratch, or refresh your memory? Download this FREE Machine Learning Fundamentals Handbook to get all Machine Learning fundamentals combined with examples in Python in one place.

FREE Machine Learning Fundamentals Handbook

Want to learn Java Programming from scratch, or refresh your memory? Download this FREE Java Programming Fundamentals Book to get all Java fundamentals combined with interview preparation and code examples.

FREE Java Programming Fundamentals Book

About the Author — That’s Me!

I am Tatev, Senior Machine Learning and AI Researcher. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands.

With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelor's and Master's, along with over 5 years of hands-on experience in the Data Science industry, in Machine Learning and AI including NLP, LLMs and GenAI, I've gathered this knowledge to share with you.

Become a Job-Ready Data Scientist with LunarTech

After gaining so much from this guide, if you’re keen to dive even deeper and structured learning is your style, consider joining us at LunarTech. Become a job-ready data scientist with The Ultimate Data Science Bootcamp, which has earned recognition as one of the Best Data Science Bootcamps of 2023 and has been featured in esteemed publications like Forbes, Yahoo, Entrepreneur and more. This is your chance to be a part of a community that thrives on innovation and knowledge. [Enroll to The Ultimate Data Science Bootcamp at LunarTech]

Enroll Here

[Not Just For Tech Giants: Here’s How LunarTech Revolutionizes Data Science and AI Learning
In the digital age, where the world is in constant flux, Tatev Aslanyan and Vahe Aslanyan have united to redefine AI…forbes.com.au](https://www.forbes.com.au/brand-voice/uncategorized/not-just-for-tech-giants-heres-how-lunartech-revolutionizes-data-science-and-ai-learning/ "forbes.com.au/brand-voice/uncategorized/not..")

[Outpacing Competition: How LunarTech is Redefining the Future of AI and Machine Learning |…
Opinions expressed by Entrepreneur contributors are their own. You’re reading Entrepreneur Georgia, an international…entrepreneur.com](https://www.entrepreneur.com/ka/business-news/outpacing-competition-how-lunartech-is-redefining-the/463038 "entrepreneur.com/ka/business-news/outpacing..")

[LunarTech Launches a Game Changing Data Science Education Bootcamp, Making Advanced AI and Machine…
Austin, Texas — (Newsfile Corp. — August 25, 2023) — LunarTech, an innovative online tech education platform, is…finance.yahoo.com](https://finance.yahoo.com/news/lunartech-launches-game-changing-data-115200373.html "finance.yahoo.com/news/lunartech-launches-g..")

Connect with Me:

[The Data Science and AI Newsletter | Tatev Karen | Substack
Where businesses meet breakthroughs, and enthusiasts transform to experts! From creator of 2023 top-rated Data Science…tatevaslanyan.substack.com](https://tatevaslanyan.substack.com/ "tatevaslanyan.substack.com")

Want to learn Machine Learning from scratch, or refresh your memory? Download this FREE Machine Learning Fundamentals Handbook

Want to discover everything about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job? Download this FREE Data Science and AI Career Handbook.

Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!
