Data Sampling Methods in Python
November 18, 2024

Ready-to-run code for different data sampling techniques that create a random and representative sample in Python

Image Source:Pexels/christian-diokno

Data sampling is an essential part of most research, scientific, and data experiments. It is one of the most important factors determining the accuracy of a research or survey result: if the sample has not been drawn properly, the final results and conclusions can be significantly affected. Many sampling techniques can be used to gather a data sample, and the right one depends on the need and the situation. In this blog post, I will cover the following:

- Terminology: Population and Sampling
- Random Sampling
- Systematic Sampling
- Cluster Sampling
- Weighted Sampling
- Stratified Sampling


Introduction to Population and Sample

To start with, let’s have a look at some basic terminology. It is important to learn the concepts of population and sample. The population is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.

Image Source: The Author

Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.

Random Sampling

The simplest data sampling technique that creates a random sample from the original population is Random Sampling. In this approach, every observation in the population has the same probability of being selected during the sample generation process. Random Sampling is usually used when we don’t have any prior information about the target population.

For example, consider the random selection of 3 individuals from a population of 10 individuals. On the first draw each individual is picked with probability 1/10, and every individual has the same overall chance, 3/10, of ending up in the sample.

Image Source: The Author

Random Sampling: Python Implementation

First, we generate random data that will serve as population data. To do so, we randomly sample 10K data points from a Normal distribution with mean mu = 10 and standard deviation std = 2. After this, we create a Python function called random_sampling() that takes the population data and the desired sample size and produces as output a random sample.
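The steps just described could be sketched roughly as follows. This is a minimal illustration, not the original code: the seed, the sample size of 500, and the use of NumPy's Generator API are assumptions.

```python
import numpy as np

# Seed chosen arbitrarily so the example is reproducible (an assumption).
rng = np.random.default_rng(seed=2020)

# Population: 10K draws from a Normal distribution with mean 10 and std 2.
mu, std, N = 10, 2, 10_000
population = rng.normal(loc=mu, scale=std, size=N)

def random_sampling(population, n):
    """Return n observations drawn uniformly at random, without replacement."""
    return rng.choice(population, size=n, replace=False)

sample = random_sampling(population, n=500)
print(sample.mean(), sample.std())
```

With a sample of this size, the printed mean and standard deviation should land close to the population values of 10 and 2.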

Systematic Sampling

Systematic sampling is a probability sampling approach in which elements of the target population are selected starting from a random point and then at a fixed sampling interval.

Stated differently, systematic sampling is an extension of probability sampling in which members of the population are selected at regular intervals to form a sample. We calculate the sampling interval by dividing the population size by the desired sample size.

Note that Systematic Sampling usually produces a random sample but does not address bias in the created sample.

Image Source: The Author

Systematic Sampling: Python Implementation

We generate data that serve as population data as in the previous case. We then create a Python function called systematic_sample() that takes population data and interval for the sampling and produces as output a systematic sample.
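A function along these lines might look like the sketch below. The seed, the random starting offset drawn inside the function, and the target sample size of 500 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2020)  # arbitrary seed (assumption)

# Same kind of population as before: 10K draws from Normal(10, 2).
population = rng.normal(loc=10, scale=2, size=10_000)

def systematic_sample(population, step):
    """Pick a random start in [0, step), then take every step-th observation."""
    start = rng.integers(0, step)
    return population[start::step]

# Sampling interval = population size / desired sample size.
n = 500
step = len(population) // n
sample = systematic_sample(population, step)
print(step, len(sample))
```

Because 10,000 is an exact multiple of the interval here, the sample has exactly 500 observations whatever the starting point.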


Cluster Sampling

Cluster sampling is a probability sampling technique in which we divide the population into multiple clusters (groups) based on certain clustering criteria and then select one or more clusters at random, using simple random or systematic sampling.

For example, suppose you want to conduct an experiment evaluating the performance of sophomores in business education across Europe. It is impossible to conduct an experiment that involves a student from every university across the EU. Instead, by using Cluster Sampling, we can group the universities of each country into one cluster. Together, these clusters cover the whole sophomore student population in the EU. Next, you can use simple random sampling or systematic sampling to select cluster(s) for your research study.


Image Source: The Author

Cluster Sampling: Python Implementation

First, we generate data that will serve as population data with 10K observations, and this data consists of the following 4 variables:

  • price: generated using Uniform distribution,
  • id
  • event_type: a categorical variable with 3 possible values {type1, type2, type3}
  • click: binary variable taking values {0: no click, 1: click}

            id     price event_type  click
    0        0  1.767794      type2      0
    1        1  2.974360      type2      0
    2        2  2.903518      type2      0
    3        3  3.699454      type2      1
    4        4  1.416739      type1      0
    ...    ...       ...        ...    ...
    9995  9995  3.689656      type2      1
    9996  9996  1.929186      type3      0
    9997  9997  2.393509      type3      1
    9998  9998  1.276473      type2      1
    9999  9999  3.959585      type2      1

    [10000 rows x 4 columns]
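A data set with this shape could be generated as in the sketch below. The Uniform range of 1 to 4 is only inferred from the printed prices, and the equal probabilities for event_type and click are assumptions, so the values will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2020)  # arbitrary seed (assumption)
N = 10_000

data = pd.DataFrame({
    "id": np.arange(N),
    # Uniform prices; the [1, 4) range is an assumption.
    "price": rng.uniform(1, 4, size=N),
    # Three event types, drawn with equal probability (assumption).
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    # Binary click indicator, 0 or 1 with equal probability (assumption).
    "click": rng.integers(0, 2, size=N),
})

print(data.head())
```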

Then the function get_clustered_Sample() takes as inputs the original data, the number of observations per cluster, and the number of clusters you want to select, and produces as output a clustered sample.

            id     price event_type  click  cluster
    4847  4847  3.813680      type3      0       17
    567    567  1.642347      type2      0       17
    8982  8982  3.741744      type3      1       17
    2321  2321  2.192724      type3      0       17
    5045  5045  3.645671      type2      0       17
    ...    ...       ...        ...    ...      ...
    5681  5681  3.175308      type1      0       90
    882    882  2.676477      type2      1       90
    2090  2090  3.861775      type3      1       90
    907    907  1.947100      type3      0       90
    2723  2723  2.557626      type1      0       90

    [200 rows x 5 columns]
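One way such a function could work is sketched below. It assumes that rows are assigned to clusters at random (the printed output, with shuffled ids inside each cluster, is consistent with that, but the original clustering rule is not shown), and the stand-in population uses assumed distribution parameters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2020)  # arbitrary seed (assumption)

# Stand-in population with the four columns described above (parameters assumed).
N = 10_000
data = pd.DataFrame({
    "id": np.arange(N),
    "price": rng.uniform(1, 4, size=N),
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    "click": rng.integers(0, 2, size=N),
})

def get_clustered_Sample(df, n_per_cluster, n_select_clusters):
    """Group rows at random into clusters of n_per_cluster rows, then keep n_select_clusters whole clusters."""
    # Shuffle the rows so that cluster membership is random.
    shuffled = df.iloc[rng.permutation(len(df))].copy()
    # Each consecutive block of n_per_cluster shuffled rows forms one cluster.
    shuffled["cluster"] = np.arange(len(shuffled)) // n_per_cluster
    # Select whole clusters by simple random sampling.
    cluster_ids = shuffled["cluster"].unique()
    chosen = rng.choice(cluster_ids, size=n_select_clusters, replace=False)
    return shuffled[shuffled["cluster"].isin(chosen)]

sample = get_clustered_Sample(data, n_per_cluster=100, n_select_clusters=2)
print(sample.shape)  # (200, 5)
```

With 100 observations per cluster, the 10K rows form 100 clusters, and selecting 2 of them gives the 200 rows seen in the output above.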

Note that Cluster Sampling usually produces a random sample but does not address bias in the created sample.

Weighted Sampling

In some experiments, you might need each item's sampling probability to follow a weight associated with that item, that is, the proportions of the different types of observations should be taken into account. For example, you might need a sample of search engine queries in which each query is weighted by the number of times it has been performed, so that the sample can be analyzed for overall impact on the user experience. In this case, Weighted Sampling is preferred over Random Sampling or Systematic Sampling.
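To make the search query example concrete, here is a small sketch using the weights argument of pandas' DataFrame.sample. The queries and their counts are made up for illustration.

```python
import pandas as pd

# Hypothetical query log: each query with the number of times it was performed.
queries = pd.DataFrame({
    "query": ["weather", "news", "python sampling", "flights", "recipes"],
    "times_performed": [5000, 3000, 50, 1500, 450],
})

# Draw 1,000 queries with replacement; the probability of picking a query
# is proportional to its times_performed value.
weighted_sample = queries.sample(
    n=1000, replace=True, weights="times_performed", random_state=2020
)

print(weighted_sample["query"].value_counts())
```

Frequent queries dominate the sample, which is the point: the sample mirrors what users actually searched for rather than the list of distinct queries.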

Weighted Sampling is a data sampling method that uses weights to compensate for the selection of specific observations with unequal probabilities (oversampling), non-coverage, non-response, and other types of bias. If a biased data set is not adjusted and a simple random sampling approach is used instead, then the population descriptors (e.g., mean, median) will be skewed and will fail to correctly represent the population.

Weighted Sampling addresses the bias in the sample by creating a sample that takes into account the proportions of the types of observations in the population. Hence, Weighted Sampling usually produces a random and unbiased sample.

Image Source: The Author


Weighted Sampling: Python Implementation

The function get_weighted_sample() takes as inputs the original data and the desired sample size, and produces as output a weighted sample. Note that the proportions, in this case, are defined based on the click event. That is, we compute the proportion of data points with click = 1 (let’s say X%) and click = 0 (Y%, where Y% = 100 - X%), then we generate a random sample that also contains X% observations with click = 1 and Y% observations with click = 0.

                     id     price event_type  click
    event_type
    type1      0   6780  1.200188      type1      1
               1   8830  2.990630      type1      1
               2   8997  3.483728      type1      0
               3   7541  2.402993      type1      1
               4   4460  2.959203      type1      0
    ...             ...       ...        ...    ...
    type3      29  5058  3.426289      type3      1
               30  5855  3.852197      type3      0
               31  6295  2.679898      type3      0
               32  8978  1.115072      type3      1
               33  7730  1.208441      type3      1

    [100 rows x 4 columns]
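Following the description of get_weighted_sample() given in the text, a proportional sample by click could be sketched as below. The by parameter, the seed, and the stand-in population are assumptions; passing by="event_type" would group by event type instead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2020)  # arbitrary seed (assumption)

# Stand-in population with the four columns described earlier (parameters assumed).
N = 10_000
data = pd.DataFrame({
    "id": np.arange(N),
    "price": rng.uniform(1, 4, size=N),
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    "click": rng.integers(0, 2, size=N),
})

def get_weighted_sample(df, n, by="click", seed=2020):
    """Draw about n rows so that each value of `by` keeps its population share."""
    frac = n / len(df)
    parts = [
        # Sample each group in proportion to its size in the population.
        group.sample(n=int(round(frac * len(group))), random_state=seed)
        for _, group in df.groupby(by)
    ]
    return pd.concat(parts)

sample = get_weighted_sample(data, n=100)
print(data["click"].mean(), sample["click"].mean())
```

The two printed click rates should be nearly identical. Because each group size is rounded, the sample can differ from the requested size by a row.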

Weighted Sampling usually produces a random and unbiased sample.

Stratified Sampling

Stratified Sampling is a data sampling approach, where we divide a population into homogeneous subpopulations called strata based on specific characteristics (e.g., age, race, gender identity, location, event type etc.).

Every member of the population studied should be in exactly one stratum. Each stratum is then sampled with a probability sampling method (the implementation in this post uses Cluster Sampling), allowing data scientists to estimate statistical measures for each sub-population. We rely on Stratified Sampling when the population’s characteristics are diverse and we want to ensure that every characteristic is properly represented in the sample.

So, Stratified Sampling is simply the combination of Clustered Sampling and Weighted Sampling.

Image Source: The Author

Stratified Sampling: Python Implementation

The function get_stratified_sample() takes as inputs the original data, the desired sample size, and the number of clusters needed, and produces as output a stratified sample. Note that this function first performs weighted sampling using the click event and then performs clustered sampling using the event_type.

           id     price event_type  click  cluster
    0    5131  2.707995      type1      0       45
    1    5102  1.677190      type1      0       45
    2    7370  1.893061      type1      0       45
    3    4207  2.491246      type1      0       45
    4    8909  3.252655      type1      1       45
    ..    ...       ...        ...    ...      ...
    96   3254  2.637625      type3      0       85
    97   1555  1.196040      type3      1       85
    98   7627  3.240507      type3      1       85
    99   6405  1.607379      type3      0       85
    100  1075  2.471806      type3      0       85

    [202 rows x 5 columns]
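For comparison, here is a minimal sketch of a conventional stratified sample. It is not a reconstruction of get_stratified_sample(): it skips the cluster step and instead draws a simple random sample from each stratum, in proportion to the stratum's size. The function name stratified_sample_sketch, its parameters, and the stand-in population are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2020)  # arbitrary seed (assumption)

# Stand-in population with the four columns described earlier (parameters assumed).
N = 10_000
data = pd.DataFrame({
    "id": np.arange(N),
    "price": rng.uniform(1, 4, size=N),
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    "click": rng.integers(0, 2, size=N),
})

def stratified_sample_sketch(df, n, strata=("event_type", "click"), seed=2020):
    """Sample each stratum by simple random sampling, proportionally to its size."""
    frac = n / len(df)
    parts = [
        group.sample(n=int(round(frac * len(group))), random_state=seed)
        for _, group in df.groupby(list(strata))
    ]
    return pd.concat(parts)

sample = stratified_sample_sketch(data, n=200)
# Every (event_type, click) combination appears with roughly its population share.
print(sample.groupby(["event_type", "click"]).size())
```

Here each combination of event_type and click is one stratum, so every member of the population belongs to exactly one stratum and every stratum is represented in the sample.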

Stratified Sampling is basically the combination of Clustered Sampling and Weighted Sampling.


About the Author — That’s Me!

I am Tatev, Senior Machine Learning and AI Researcher. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands. With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelors & Masters, along with over 5 years of hands-on experience in the Data Science Industry, in Machine Learning and AI, I’ve gathered this high-level summary of ML topics to share with you.


Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!
