Shapiro-Wilk Test using R


Normality matters in parametric hypothesis testing because many parametric tests, such as the t-test and ANOVA, assume that the data (or the model errors) are normally distributed. The central limit theorem states that, for a sufficiently large sample, the sampling distribution of the sample mean is approximately normal, whatever the distribution of the underlying data (provided its variance is finite). It does not make the data themselves normal, and in practice an analysis often has to be performed on a small sample, where this approximation cannot be relied on. As a result, checking for normality is a necessary step.
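The central limit theorem is a statement about sample means, and a small simulation can make that concrete. The sketch below is only an illustration: the exponential distribution, the seed, and the sample sizes are assumptions chosen for the example and do not come from any real data set.

```r
# Illustrative simulation; distribution, seed, and sizes are assumptions.
set.seed(1)
n_reps <- 2000   # number of simulated samples
n_obs  <- 30     # observations in each sample

# Each column of the matrix is one sample of skewed (exponential) data
samples <- matrix(rexp(n_reps * n_obs, rate = 1), nrow = n_obs)
sample_means <- colMeans(samples)

# A single raw sample stays skewed; the sample means look far more symmetric
par(mfrow = c(1, 2))
hist(samples[, 1], main = "One raw sample", xlab = "value")
hist(sample_means, main = "Means of 2000 samples", xlab = "sample mean")
par(mfrow = c(1, 1))
```

With only a handful of observations per sample, the histogram of means would typically still show some skew, which is why small samples need an explicit normality check.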

In 1965, Samuel Sanford Shapiro and Martin Wilk proposed a test for checking the normality of data, which came to be known as the Shapiro-Wilk test. The original procedure is appropriate when the sample size is less than 50; for larger samples, Shapiro and Francia developed a modified procedure in 1972. R's shapiro.test() function is less restrictive and accepts between 3 and 5000 non-missing observations. For now, we will stick to the Shapiro-Wilk test.

Shapiro-Wilk Test using R

Suppose we have data on the ages of 18 randomly selected students in a primary school. We need to test whether the data follow a normal distribution.

Data -

age <- c(9, 12, 11, 10, 12, 9, 7, 8, 8, 7, 10, 10, 8, 7, 9, 10, 11, 12)
age
#  [1]  9 12 11 10 12  9  7  8  8  7 10 10  8  7  9 10 11 12

Graphically representing the data in a histogram -

hist(age)

From the visualization, it is not clear whether the data follows a normal distribution or not because the sample size is small.
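A normal Q-Q plot is another common visual check and is often easier to read than a histogram when the sample is small. The following is a minimal sketch using base R's qqnorm() and qqline(); the plot title is an arbitrary choice for this example.

```r
# Same age data as above
age <- c(9, 12, 11, 10, 12, 9, 7, 8, 8, 7, 10, 10, 8, 7, 9, 10, 11, 12)

# Plot sample quantiles against theoretical normal quantiles;
# qqnorm() also returns the plotted coordinates invisibly
qq <- qqnorm(age, main = "Normal Q-Q plot of age")

# Reference line through the first and third quartiles
qqline(age)
```

Points lying close to the reference line are consistent with normality. Because the ages are whole numbers, the points form horizontal steps, so the plot is still only a rough guide.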

Hypothesis formulation -
$H_0$: The data follows a normal distribution with unknown mean and variance.
$H_1$: The data does not follow a normal distribution.
We want to test the hypothesis at the 5% significance level.
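For reference, the test statistic is usually written as follows, where the notation ($n$, $x_{(i)}$, $a_i$, $\bar{x}$) is introduced here for illustration:

$$W = \frac{\left(\sum_{i=1}^{n} a_i \, x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$

Here $x_{(i)}$ is the $i$-th smallest observation, $\bar{x}$ is the sample mean, and the $a_i$ are constants derived from the expected values and covariances of the order statistics of a standard normal sample. $W$ lies between 0 and 1, and small values of $W$ are evidence against $H_0$.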

Hypothesis testing -

shapiro.test(x = age)
# 
# 	Shapiro-Wilk normality test
# 
# data:  age
# W = 0.92157, p-value = 0.1378

Since the p-value (0.1378) is greater than the significance level (0.05), we fail to reject the null hypothesis. That is, at the 5% level of significance, the data do not provide enough evidence to conclude that the ages deviate from a normal distribution. Failing to reject is not proof of normality, particularly with only 18 observations.
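The decision can also be made programmatically, since shapiro.test() returns an object whose statistic and p.value components can be extracted. This is a minimal sketch; the variable names alpha, result, w_stat, and p_val are introduced here only for illustration.

```r
# Same age data as above
age <- c(9, 12, 11, 10, 12, 9, 7, 8, 8, 7, 10, 10, 8, 7, 9, 10, 11, 12)

alpha  <- 0.05                      # significance level
result <- shapiro.test(age)         # object of class "htest"

w_stat <- unname(result$statistic)  # the W statistic
p_val  <- result$p.value            # the p-value

# Compare the p-value with the significance level
if (p_val > alpha) {
  message("Fail to reject H0 at the 5% level")
} else {
  message("Reject H0 at the 5% level")
}
```

Storing the result this way is convenient when the same check has to be repeated over several variables.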

Md Ahsanul Islam
Freelance Data Analysis and R Programmer

Statistics graduate student currently researching econometrics