How To Randomly Sample Data Points (Uniform Distribution)
Pseudo Random Number Generators
Previously, we saw that C and Java can both generate random numbers.
In C, the standard library header stdlib.h provides the function rand(), which returns uniformly distributed pseudo-random integers between 0 and RAND_MAX.
Implementation In Python
Basically, random.random() returns a value between 0 and 1, picked uniformly at random.
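As a minimal sketch of that call, the snippet below draws a few values with Python's standard `random` module. The fixed seed and the name `values` are choices made here for illustration only.

```python
import random

random.seed(42)  # assumption: fixed seed so the run is repeatable

# random.random() returns a float in the half-open interval [0.0, 1.0)
values = [random.random() for _ in range(5)]

for v in values:
    print(v)
```

Every printed value is at least 0.0 and strictly below 1.0.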
The random numbers it returns are uniformly distributed.
We can also generate random numbers that are not uniformly distributed, but unless a distribution is explicitly called out, a random number generator is usually assumed to be uniform.
If we plot a histogram of many such values, every interval of equal width gets roughly the same count.
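One rough way to see this uniformity without a plotting library is to count how many draws land in each of ten equal-width bins. This is only a sketch: the seed, the number of draws, the bin count, and the name `bin_counts` are assumptions made here for illustration.

```python
import random

random.seed(0)  # assumption: fixed seed for repeatability

num_draws = 100_000  # illustrative sample size
num_bins = 10        # ten equal-width bins covering [0, 1)
bin_counts = [0] * num_bins

for _ in range(num_draws):
    u = random.random()
    # Map u in [0, 1) to a bin index 0..9; min() guards the top edge.
    index = min(int(u * num_bins), num_bins - 1)
    bin_counts[index] += 1

for b, count in enumerate(bin_counts):
    lo = b / num_bins
    hi = (b + 1) / num_bins
    print(f"[{lo:.1f}, {hi:.1f}): {count}")
```

With a uniform generator, each bin should hold close to one tenth of the draws.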
Problem: Let’s say I have a dataset of n data points and I want to uniformly sample m points from it.
What does “uniformly sample” mean?
It means that, given my initial dataset
D = n data points,
each point should have an equal chance of belonging to my new dataset D’.
Suppose n is 150.
When I sample 30 points from it, each of the 150 points should have an equal chance of belonging to my new dataset D’.
Example: Let’s see this with the Iris dataset
Here n = 150 points (the Iris dataset), and the data is 4-dimensional: petal length, petal width, sepal length, and sepal width.
Now, imagine I want to sample 30 points randomly.
Let’s understand what’s happening
D = {x1, x2, …, x150}
Sampling the dataset gives
D’ = {x1', x2', …, x30'}
I have 150 points and I want to generate a dataset with 30 points.
So the probability of each point belonging to my dataset D’ is
30/150 = 0.2
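The same arithmetic can be written out in a few lines. This is a small sketch in which the variable names `n`, `m`, `p`, and `expected_size` are chosen here for illustration.

```python
n = 150  # points in the original dataset D
m = 30   # points wanted in the sampled dataset D'

p = m / n              # chance that any single point is kept
expected_size = n * p  # average size of D' if each point is kept with probability p

print(p)
print(expected_size)
```

Keeping each point with probability p gives a sample whose average size is m.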
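So, for each data point, random.random() generates a number between 0 and 1, and the if condition checks whether that number is below 0.2.
sampled_data.append(d[i,:]) → this appends the data point to the final list; points whose random number is greater than 0.2 are skipped. Because each point is kept independently, the sample contains about 30 points on average, not always exactly 30.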
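The loop described above can be sketched as follows. To keep the sketch free of third-party packages, `d` is a made-up stand-in for the Iris feature matrix (150 rows of 4 random numbers), not the real data; with a NumPy array, `d[i]` would be written `d[i, :]`. The seed is also an assumption made for repeatability.

```python
import random

random.seed(1)  # assumption: fixed seed so the run is repeatable

n, m = 150, 30
p = m / n  # chance of keeping each point

# Stand-in for the Iris feature matrix: 150 rows of 4 made-up numbers.
d = [[random.random() for _ in range(4)] for _ in range(n)]

sampled_data = []
for i in range(n):
    # Draw a fresh number in [0, 1) for row i; keep the row if it is below p.
    if random.random() < p:
        sampled_data.append(d[i])

print(len(sampled_data))  # near 30 on average, not always exactly 30
```

If exactly m points are needed, the standard library's `random.sample(d, m)` draws m distinct rows, each row being equally likely to be chosen.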