What Is a Quantile-Quantile (Q-Q) Plot?
Given a random variable X with a set of observations x1, x2, x3, ..., x500:
Is X Gaussian distributed?
This is the question a Q-Q plot helps us answer.
Statistical tests such as the KS (Kolmogorov-Smirnov) test and the AD (Anderson-Darling) test are also available, but a Q-Q plot lets us answer the question graphically.
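For comparison with the graphical approach, here is a minimal sketch of the KS test using `scipy.stats.kstest`. The seed, the sample sizes, and the helper name `ks_against_fitted_normal` are my own choices for illustration. Note that fitting the mean and standard deviation from the same sample makes the reported p-value only approximate, so treat it as a rough indicator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility


def ks_against_fitted_normal(sample):
    """KS test of `sample` against a normal whose mean/std are estimated from it.

    Hypothetical helper for illustration; estimating the parameters from the
    same data makes the p-value approximate.
    """
    mean = np.mean(sample)
    std = np.std(sample, ddof=1)
    return stats.kstest(sample, "norm", args=(mean, std))


normal_sample = rng.normal(loc=20, scale=5, size=5000)
uniform_sample = rng.uniform(low=-1, high=1, size=5000)

normal_result = ks_against_fitted_normal(normal_sample)
uniform_result = ks_against_fitted_normal(uniform_sample)

print("normal :", normal_result.statistic, normal_result.pvalue)
print("uniform:", uniform_result.statistic, uniform_result.pvalue)
```

A larger KS statistic (and a smaller p-value) points away from the Gaussian hypothesis, which is the same conclusion a Q-Q plot conveys visually.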
How To Plot (Theoretically)?
1. Sort the xi’s and compute percentiles
Start with the observations x1, x2, ..., x500.
Sort them in ascending order, so that they become
x’1, x’2, ..., x’500 (such that x’1 ≤ x’2 ≤ ... ≤ x’500)
Now compute the percentiles.
With 500 sorted observations, every 5th value marks one percentile:
x’5 -> 1st percentile
x’10 -> 2nd percentile
...
x’500 -> 100th percentile
2. Consider a random variable Y that has a Gaussian distribution. Draw 1000 samples from it and, as above, sort them and compute their percentiles.
3. Plot the percentiles of X on the y-axis against the percentiles of Y on the x-axis. The result is the Quantile-Quantile plot.
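The steps above can be sketched by hand with NumPy. This is only an illustration under my own assumptions: X is simulated here as 500 normal draws (in practice it would be your data), and the names `qq_points`, `plot_qq`, and `PERCENTILES` are hypothetical helpers, not part of any library.

```python
import numpy as np

PERCENTILES = np.arange(1, 101)  # 1st to 100th percentile


def qq_points(data, reference):
    """Return (reference percentiles, data percentiles) for a Q-Q plot.

    np.percentile sorts internally, so no explicit sort is needed.
    """
    data_percentiles = np.percentile(data, PERCENTILES)
    reference_percentiles = np.percentile(reference, PERCENTILES)
    return reference_percentiles, data_percentiles


def plot_qq(reference_percentiles, data_percentiles):
    """Scatter the percentile pairs; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    plt.scatter(reference_percentiles, data_percentiles, s=10)
    plt.xlabel("Percentiles of Y (Gaussian reference)")
    plt.ylabel("Percentiles of X (data)")
    plt.title("Hand-built Q-Q plot")
    plt.show()


rng = np.random.default_rng(42)  # fixed seed for reproducibility
x = rng.normal(loc=20, scale=5, size=500)   # stand-in for the 500 observations of X
y = rng.normal(loc=0, scale=1, size=1000)   # 1000 samples of the Gaussian Y

ref_p, data_p = qq_points(x, y)

# A correlation close to 1 means the points fall roughly on a straight line.
r = np.corrcoef(ref_p, data_p)[0, 1]
print("correlation between percentile pairs:", r)
```

Calling `plot_qq(ref_p, data_p)` draws the plot. The function is left uncalled here so the sketch also runs where no display is available.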
Practical Implementation Using Python
import numpy as np
import pylab
import scipy.stats as stats
# Draw 1000 samples from the standard normal N(0, 1)
std_normal = np.random.normal(loc=0, scale=1, size=1000)
# Print the 0th to 100th percentiles of the standard normal sample
for i in range(0, 101):
    print(i, np.percentile(std_normal, i))
# Generate 100 samples from a normal with mean 20 and standard deviation 5
measurements = np.random.normal(loc=20, scale=5, size=100)
# Q-Q plot of the measurements against the normal distribution
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
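If the points (x_i, y_i), for i = 1 to 100, lie roughly on a straight line, then X and Y come from the same family of distributions. The line does not have to be y = x: its slope and intercept reflect the scale and location of X.
Note: When the distributions match, the points lie closer and closer to this line as the number of samples increases.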
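To put numbers on this, `stats.probplot` also returns the least-squares line it fits through the points, along with the correlation coefficient r. The sketch below (the seed and sample sizes are my own choices) draws from a normal with mean 20 and standard deviation 5 at several sample sizes. The slope should land near 5, the intercept near 20, and r should approach 1 as the sample grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed for reproducibility

fits = {}
for n in (100, 1000, 10000):
    sample = rng.normal(loc=20, scale=5, size=n)
    # Without a `plot` argument, probplot only computes the quantile pairs
    # and the fitted line: (slope, intercept, r).
    (theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(
        sample, dist="norm"
    )
    fits[n] = (slope, intercept, r)
    print(f"n={n:>5}  slope={slope:.3f}  intercept={intercept:.3f}  r={r:.5f}")
```

Since the reference is the standard normal, the slope estimates the standard deviation of the sample and the intercept estimates its mean.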
Next, we generate 100 samples from a uniform distribution and draw a Q-Q plot against Y, which is Gaussian.
# Generate 100 samples from the uniform distribution on [-1, 1)
measurements = np.random.uniform(low=-1, high=1, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
In the above figure, the two axes represent two different distributions (x-axis: normal, y-axis: uniform).
Conclusion: In the above plot, the points do not lie on the line. They move further away from it toward the ends of the graph, and the divergence is largest at the extremes.
The fun part is that if we increase the sample size, the departure from the line becomes much clearer.
# Generate 6000 samples from the uniform distribution on [-1, 1)
measurements = np.random.uniform(low=-1, high=1, size=6000)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
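The divergence at the ends can also be measured rather than just eyeballed. The sketch below is one rough way to do it, with my own choice of seed and cut-offs (outer 1% on each side versus the central 50%). It compares how far the uniform sample sits from the fitted line in the tails and in the middle.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # fixed seed for reproducibility
n = 6000
uniform_sample = rng.uniform(low=-1, high=1, size=n)

# Quantile pairs and the fitted line, without drawing anything
(theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(
    uniform_sample, dist="norm"
)

# Vertical distance of each point from the fitted line
residuals = np.abs(ordered_sample - (slope * theoretical_q + intercept))

k = n // 100  # number of points in the outer 1% on each side
tail_deviation = np.mean(np.concatenate([residuals[:k], residuals[-k:]]))
center_deviation = np.mean(residuals[n // 4 : 3 * n // 4])

print("r for uniform data        :", r)
print("mean deviation in tails   :", tail_deviation)
print("mean deviation in center  :", center_deviation)
```

With uniform data, the tail deviation should come out clearly larger than the central one, which matches the S-shaped bend seen in the plot.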
FINAL CONCLUSION
If most of the points lie on a straight line, the distributions on the x-axis and the y-axis belong to the same family. If they do not, the random variable X follows a different distribution from the one we are comparing it with.