Box plot method: a value is considered an outlier if it lies more than 1.5 × IQR (interquartile range) above the upper quartile (Q3) or below the lower quartile (Q1).
Standard deviation method: a value is considered an outlier if it lies more than 3 standard deviations from the mean, that is, outside mean ± (3 × standard deviation).
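
Both rules can be sketched in a few lines of Python using only the standard library. The function names, the sample data, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile conventions give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (box plot rule)."""
    # "inclusive" treats the data as the whole population when interpolating
    # quartiles; this is one of several common conventions (an assumption here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def sigma_outliers(values, k=3.0):
    """Return values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical sample: eleven values near 11-12 and one extreme reading.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

print(iqr_outliers(data))    # [95]
print(sigma_outliers(data))  # [95]
```

In this sample the extreme value only just clears the 3-standard-deviation cutoff, because it inflates the mean and standard deviation it is measured against. The quartile-based rule is less sensitive to that effect.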
The two main methods to detect outliers in data analytics are:
- Statistical Methods: These methods involve using statistical techniques to identify observations that deviate significantly from the rest of the dataset. Common statistical methods for outlier detection include:
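  - Z-score: This method measures how many standard deviations an observation lies from the mean of the dataset. Observations whose absolute Z-score exceeds a chosen threshold (e.g., 2 or 3) are considered outliers.
  - Modified Z-score: Similar to the standard Z-score, but computed from the median and the median absolute deviation (MAD) instead of the mean and standard deviation, so the score is not distorted by the very outliers it is meant to detect.
  - Box plot: This graphical method uses quartiles to identify potential outliers. Observations lying beyond the whiskers of the box plot, which conventionally extend at most 1.5 × IQR past the quartiles, are considered outliers.
  - Grubbs’ test: A statistical hypothesis test that detects a single outlier in a univariate dataset, under the assumption that the data are approximately normally distributed.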
- Machine Learning Methods: These methods involve using machine learning algorithms to learn patterns in the data and identify observations that do not conform to these patterns. Some machine learning-based outlier detection methods include:
  - Isolation Forest: This algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Outliers tend to be separated from the rest of the data in fewer splits, so a short average path length across the trees indicates an outlier.
  - Local Outlier Factor (LOF): LOF computes the local density deviation of a data point with respect to its neighbors. Outliers are identified as data points with significantly lower density than their neighbors.
  - One-Class SVM: This algorithm learns a decision boundary around the majority of the data points and identifies observations lying outside this boundary as outliers.
  - DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a clustering algorithm that can also be used for outlier detection. Points in low-density regions that are not assigned to any cluster are labeled as noise and treated as outliers.
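
As a rough illustration of the machine learning methods listed above, the sketch below runs all four on a small synthetic dataset with one planted outlier. It assumes scikit-learn and NumPy are installed; the source does not name a library, and the data and every parameter value (`contamination`, `n_neighbors`, `nu`, `eps`, `min_samples`) are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic data (an assumption for this sketch): 200 points around the
# origin plus one planted outlier at (8, 8), stored as the last row.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([inliers, [[8.0, 8.0]]])

# Each estimator labels outliers as -1; DBSCAN uses -1 for noise points.
labels = {
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0).fit_predict(X),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20).fit_predict(X),
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale").fit(X).predict(X),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5).fit_predict(X),
}

# Did each method flag the planted outlier (the last row)?
flags = {name: bool(lab[-1] == -1) for name, lab in labels.items()}
for name, flagged in flags.items():
    print(f"{name}: planted outlier flagged = {flagged}")
```

Each method may also flag some points on the edge of the main cluster, and how many depends heavily on the parameters. That sensitivity is one of the practical trade-offs to weigh against the statistical methods.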
When answering an interview question about outlier detection methods, it’s essential to provide examples and discuss the strengths and weaknesses of each method, as well as considerations such as computational efficiency and suitability for different types of data distributions.
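
One weakness that is easy to demonstrate is masking: extreme values inflate the mean and standard deviation, which can hide them from the ordinary Z-score. The sketch below compares the Z-score with the modified Z-score, which is based on the median and the median absolute deviation (MAD). The constant 0.6745 and the cutoff 3.5 are the conventional Iglewicz and Hoaglin choices, not values taken from this answer, and the function names and data are hypothetical.

```python
import statistics

def z_scores(values):
    """Ordinary Z-scores using the sample mean and standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def modified_z_scores(values):
    """Modified Z-scores using the median and median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # for normally distributed data (conventional constant, an assumption here).
    return [0.6745 * (v - med) / mad for v in values]

def flag(values, scores, threshold):
    """Return the values whose absolute score exceeds the threshold."""
    return [v for v, s in zip(values, scores) if abs(s) > threshold]

# Hypothetical sample with two extreme readings.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 60, 95]

print(flag(data, z_scores(data), 3.0))           # []
print(flag(data, modified_z_scores(data), 3.5))  # [60, 95]
```

With two extreme readings in a small sample, the ordinary Z-score flags nothing at a threshold of 3, while the median-based score flags both.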