Statistical significance

To understand statistical significance, we need to understand the difference between a population and a sample. A sample is an individual or a group drawn from a population, ideally in a way that makes it representative of that population. Statistical significance concerns how likely it is that an effect observed in a sample, or a statistic derived from it, reflects a phenomenon occurring in the population from which the sample was chosen rather than chance. [1] It is quantified by the p-value: the probability of obtaining an effect at least as extreme as the one observed if there were in fact no effect in the population.

We use the p-value to decide whether to reject a baseline model, the null hypothesis (H0). The null hypothesis states that there is no effect, or that any effect observed occurred purely by chance. The significance level (α) is the threshold for the p-value: if the p-value is less than α, we reject the null hypothesis (H0). This approach is called null hypothesis significance testing. [0] A common choice is α = 0.05.
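As a rough illustration of the decision rule, here is a minimal sketch that turns a test statistic into a two-sided p-value and compares it with α. It assumes the statistic follows a standard normal distribution under H0; the helper `two_sided_p_value` and the value `z_observed = 2.3` are made up for the example and do not come from any dataset.

```python
from scipy.stats import norm

def two_sided_p_value(z):
    """Probability, under a standard normal null, of a statistic at least as extreme as z."""
    # sf is the survival function (1 - cdf); doubling it counts both tails.
    return 2 * norm.sf(abs(z))

alpha = 0.05
z_observed = 2.3  # hypothetical test statistic, for illustration only

p_value = two_sided_p_value(z_observed)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

Here the p-value falls below α, so under these assumptions we would reject H0.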

In [1]:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
from scipy import stats

plt.style.use('ggplot')
In [2]:
mean = 0
sigma = 1
alpha = 0.05
confidence_level = 1-alpha
In [3]:
x = np.linspace(mean - 5*sigma, mean + 5*sigma, 1000)
iq = stats.norm(mean, sigma)

plt.plot(
    x,
    iq.pdf(x),
    color = 'black')

# central interval holding (1 - alpha) of the probability mass, i.e. the non-rejection region
conf_interval_a, conf_interval_b = norm.interval(
    confidence_level,
    loc = mean,
    scale = sigma)

px = x[np.logical_and(x >= conf_interval_a, x <= conf_interval_b)]

plt.fill_between(
    px,
    iq.pdf(px),
    color = '#d48181')

plt.show()
Fig 1. Normal distribution with μ=0 and σ=1. In a two-tailed test at a significance level of 0.05, the rejection region (not colored) makes up 5% of the area under the curve in total, 2.5% in each tail.
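To check the areas in the figure numerically, the sketch below recomputes the interval boundaries and the probability in each tail. It uses the same μ, σ and α as the notebook; the variable names `lower`, `upper`, `left_tail` and `right_tail` are introduced here only for illustration.

```python
from scipy.stats import norm

alpha = 0.05
mean, sigma = 0, 1

# Boundaries of the central region that holds (1 - alpha) of the probability.
lower, upper = norm.interval(1 - alpha, loc=mean, scale=sigma)

# Probability mass in each tail outside that region.
left_tail = norm.cdf(lower, loc=mean, scale=sigma)
right_tail = norm.sf(upper, loc=mean, scale=sigma)

print(f"critical values: {lower:.3f}, {upper:.3f}")
print(f"left tail: {left_tail:.4f}, right tail: {right_tail:.4f}")
```

The two tail areas should add up to α, with the boundaries at roughly ±1.96.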

If we conduct a statistical test on a sample of data from a population and the p-value is less than α, we reject the null hypothesis and can be reasonably confident that the effect we observed in the sample data is a genuine phenomenon in the population. We then say that the test statistic is statistically significant at level α. If the p-value is greater than or equal to α, we fail to reject the null hypothesis: the observed effect is consistent with chance and/or sampling error, although this does not prove that the null hypothesis is true. When the assumptions of the test hold, this procedure keeps our type I (false positive) error rate at or below α.
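The claim about the type I error rate can be checked with a small simulation. This is a sketch under stated assumptions: every simulated sample is drawn from a normal population in which H0 (mean 0) is true, and a one-sample t-test is used. The number of experiments, the sample size and the seed are arbitrary choices for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 10_000
sample_size = 30

# Each row is one experiment drawn from a population where H0 (mean = 0) is true.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# Two-sided one-sample t-test per row against the true mean of 0.
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Every rejection here is a false positive, because H0 holds by construction.
false_positive_rate = np.mean(p_values < alpha)

print(f"Fraction of experiments rejecting a true H0: {false_positive_rate:.3f}")
```

The printed fraction should land close to α, which is what controlling the type I error rate means in practice.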

References:
[0] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012, p. 215.
[1] Urdan, Timothy C. Statistics in Plain English. Routledge, 2011, p. 61.
