How do we check the normality of a data set or a feature?

Visually, we can check normality using plots such as histograms and Q-Q plots. There are also several formal normality tests, including:

  • Shapiro-Wilk W Test
  • Anderson-Darling Test
  • Martinez-Iglewicz Test
  • Kolmogorov-Smirnov Test
  • D’Agostino Skewness Test
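As a rough illustration, two of the tests listed above are available in SciPy: `scipy.stats.anderson` (Anderson-Darling) and `scipy.stats.skewtest` (D’Agostino skewness). The sketch below is only one possible way to call them. The sample data, the seed, the 5% level, and the helper names `skewness_test_pvalue` and `anderson_rejects_at_5pct` are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Illustrative data (an assumption): one normal sample, one clearly skewed sample.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)


def skewness_test_pvalue(x):
    """D'Agostino skewness test: H0 is that the skewness matches a normal distribution."""
    return float(stats.skewtest(x).pvalue)


def anderson_rejects_at_5pct(x):
    """Anderson-Darling test for normality, decided at the 5% significance level."""
    result = stats.anderson(x, dist="norm")
    # For dist="norm", critical values are tabulated at 15, 10, 5, 2.5 and 1 percent.
    idx = list(result.significance_level).index(5.0)
    return bool(result.statistic > result.critical_values[idx])


for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    print(name,
          "skewtest p-value:", skewness_test_pvalue(sample),
          "Anderson-Darling rejects at 5%:", anderson_rejects_at_5pct(sample))
```

Unlike the other tests, the Anderson-Darling result is read here by comparing the statistic with a tabulated critical value rather than from a p-value.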

To check the normality of a dataset or a feature, you can use several methods:

  1. Visual Inspection:
    • Histogram: Plotting a histogram of the data and visually inspecting whether it resembles a bell-shaped curve, which is characteristic of a normal distribution.
    • Q-Q Plot (Quantile-Quantile Plot): Comparing the quantiles of the data against the quantiles of a theoretical normal distribution. If the points lie approximately on a straight line, it suggests normality.
  2. Statistical Tests:
    • Shapiro-Wilk Test: Tests the null hypothesis that the data come from a normal distribution. A p-value below the chosen significance level (e.g., 0.05) is evidence against normality; a larger p-value means only that the test found no significant departure, not that the data are proven normal.
    • Kolmogorov-Smirnov Test: Compares the empirical cumulative distribution function (CDF) of the data with the CDF of a reference normal distribution, using the largest gap between the two as the test statistic. When the mean and standard deviation are estimated from the same data, the standard p-value is too lenient, and the Lilliefors correction is the usual remedy.
  3. Descriptive Statistics:
    • Skewness and Kurtosis: Skewness measures the asymmetry of the distribution, while kurtosis measures the heaviness of the tails. For a normal distribution, skewness should be close to 0 and kurtosis close to 3 (equivalently, excess kurtosis close to 0; many libraries report excess kurtosis by default).
  4. Normal Probability Plot:
    • This is the Q-Q plot specialised to the normal distribution: the ordered data values are plotted against the values expected from a normally distributed variable. If the data follow a normal distribution, the plot should be approximately linear.
  5. Jarque-Bera Test:
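    • This is another statistical test for normality, based on the sample skewness and kurtosis together. It provides a test statistic and a p-value; a small p-value is evidence against normality, while a large p-value means no significant departure was detected. The test relies on a large-sample approximation, so it is best suited to larger datasets.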
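The checks in the list above can be sketched together in Python with SciPy. This is a minimal illustration rather than a definitive recipe. The helper name `normality_report`, the sample data, and the seed are assumptions made for the example, and the Q-Q plot is summarised by the correlation coefficient that `scipy.stats.probplot` returns instead of being drawn.

```python
import numpy as np
from scipy import stats


def normality_report(x):
    """Collect several normality checks for a 1-D sample (hypothetical helper)."""
    x = np.asarray(x, dtype=float)

    # Kolmogorov-Smirnov against a normal with mean/std estimated from the data.
    # The p-value is only approximate here because the parameters are estimated.
    ks = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

    # probplot returns the Q-Q points and a least-squares fit; r near 1 means
    # the points lie close to a straight line.
    (_, _), (_, _, r) = stats.probplot(x, dist="norm")

    return {
        "shapiro_p": float(stats.shapiro(x).pvalue),
        "ks_p": float(ks.pvalue),
        "jarque_bera_p": float(stats.jarque_bera(x).pvalue),
        "skewness": float(stats.skew(x)),
        # fisher=False gives Pearson kurtosis, for which a normal distribution is 3.
        "kurtosis": float(stats.kurtosis(x, fisher=False)),
        "qq_correlation": float(r),
    }


# Illustrative data (an assumption): one normal sample, one clearly skewed sample.
rng = np.random.default_rng(42)
normal_report = normality_report(rng.normal(size=500))
skewed_report = normality_report(rng.exponential(size=500))

for name, report in [("normal", normal_report), ("skewed", skewed_report)]:
    print(name)
    for key, value in report.items():
        print(f"  {key}: {value:.4f}")
```

For the skewed sample, the p-values should be small and the skewness well above 0, while the normal sample should show skewness near 0 and kurtosis near 3. To view the plots themselves, the same data can be passed to a plotting library such as Matplotlib.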

It’s important to note that no single method can definitively confirm normality. Formal tests have little power on small samples and flag even trivial departures on very large ones, so it’s recommended to combine them with plots and descriptive statistics to get a fuller picture of the distribution. The choice of method also depends on the characteristics of the dataset and the requirements of the analysis.