Normality can be checked visually using plots. There are also several formal normality tests, including:
- Shapiro-Wilk W Test
- Anderson-Darling Test
- Martinez-Iglewicz Test
- Kolmogorov-Smirnov Test
- D’Agostino Skewness Test
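Two of the tests listed above can be tried with SciPy. The following is a minimal sketch, assuming `scipy.stats.anderson` (Anderson-Darling, with its classic critical-value output) and `scipy.stats.skewtest` (D'Agostino's skewness test) are available. The helper names, the simulated samples, and the 5% level are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy import stats


def anderson_rejects_normality(x, level_percent=5.0):
    """Anderson-Darling test: compare the statistic with the critical value
    tabulated for the requested significance level (given in percent)."""
    result = stats.anderson(x, dist="norm")
    levels = np.asarray(result.significance_level, dtype=float)
    # Pick the tabulated level closest to the one requested (assumed 5%).
    idx = int(np.argmin(np.abs(levels - level_percent)))
    critical = float(result.critical_values[idx])
    statistic = float(result.statistic)
    return statistic, critical, bool(statistic > critical)


def skewness_rejects_normality(x, alpha=0.05):
    """D'Agostino skewness test: a small p-value is evidence that the
    skewness differs from that of a normal distribution."""
    statistic, p_value = stats.skewtest(x)
    return float(statistic), float(p_value), bool(p_value < alpha)


# Simulated data, for illustration only.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    print(name, "Anderson-Darling:", anderson_rejects_normality(sample))
    print(name, "skewness test:   ", skewness_rejects_normality(sample))
```

In both helpers, the last element of the returned tuple says whether normality is rejected at the chosen level. A value of `False` only means the test found no evidence against normality.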
To check the normality of a dataset or a feature, you can use several methods:
- Visual Inspection:
  - Histogram: Plot a histogram of the data and check whether it resembles the symmetric, bell-shaped curve characteristic of a normal distribution.
  - Q-Q Plot (Quantile-Quantile Plot): Compare the quantiles of the data against the quantiles of a theoretical normal distribution. If the points lie approximately on a straight line, the data are consistent with normality.
- Statistical Tests:
  - Shapiro-Wilk Test: Tests the null hypothesis that the data come from a normal distribution. A p-value below the chosen significance level (e.g., 0.05) is evidence against normality. A p-value above it means the test found no evidence against normality; it does not prove that the data are normally distributed.
  - Kolmogorov-Smirnov Test: Compares the empirical cumulative distribution function (CDF) of the data with the CDF of a normal distribution. The p-value is interpreted the same way as for Shapiro-Wilk. When the mean and standard deviation are estimated from the same data, the standard p-value is too conservative, and the Lilliefors correction is the usual remedy.
- Descriptive Statistics:
  - Skewness and Kurtosis: Skewness measures the asymmetry of the distribution, while kurtosis measures the heaviness of the tails. For a normal distribution, skewness should be close to 0 and kurtosis close to 3 (equivalently, excess kurtosis close to 0).
- Normal Probability Plot:
  - Plot the ordered values of the data against the values expected from a normally distributed variable. This is a Q-Q plot with the normal distribution as the reference. If the data follow a normal distribution, the plot should be approximately linear.
- Jarque-Bera Test:
  - Another statistical test for normality, which combines the sample skewness and kurtosis into a single test statistic. A low p-value is evidence against normality, while a high p-value means the test did not detect a departure from it.
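The numerical checks in the list above can be gathered into a single report. The following is a minimal sketch, assuming SciPy is available. The function name `normality_report`, the simulated samples, and the choice to standardize the data before the Kolmogorov-Smirnov test are illustrative assumptions. Because the mean and standard deviation are estimated from the sample, the Kolmogorov-Smirnov p-value here is only approximate.

```python
import numpy as np
from scipy import stats


def normality_report(x):
    """Collect descriptive statistics and test p-values for one sample."""
    x = np.asarray(x, dtype=float)

    # Shapiro-Wilk: null hypothesis is that the data are normal.
    _, shapiro_p = stats.shapiro(x)

    # Kolmogorov-Smirnov against N(0, 1) after standardizing.
    # Estimating the parameters from the data makes this p-value approximate.
    z = (x - x.mean()) / x.std(ddof=1)
    _, ks_p = stats.kstest(z, "norm")

    # Jarque-Bera: based on sample skewness and kurtosis.
    _, jb_p = stats.jarque_bera(x)

    # Normal probability plot, reduced to the correlation r of the fitted
    # straight line (values near 1 mean the points are close to linear).
    (_, _), (_, _, r) = stats.probplot(x, dist="norm")

    return {
        "skewness": float(stats.skew(x)),
        # fisher=False gives the convention where a normal distribution has 3.
        "kurtosis": float(stats.kurtosis(x, fisher=False)),
        "qq_correlation": float(r),
        "shapiro_p": float(shapiro_p),
        "ks_p": float(ks_p),
        "jarque_bera_p": float(jb_p),
    }


# Simulated data, for illustration only.
rng = np.random.default_rng(42)
normal_report = normality_report(rng.normal(size=500))
skewed_report = normality_report(rng.exponential(size=500))

for name, report in [("normal", normal_report), ("skewed", skewed_report)]:
    print(name)
    for key, value in report.items():
        print(f"  {key}: {value:.4f}")
```

Reading the values side by side makes it easier to see whether the descriptive statistics, the probability-plot correlation, and the tests agree with each other.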
It’s important to note that no single method alone can definitively confirm normality. Instead, it’s recommended to use a combination of these methods to get a more comprehensive understanding of the distribution of the data. Additionally, the choice of method may depend on the specific characteristics of the dataset and the requirements of the analysis.