Why is KNN used to determine missing numbers?

KNN is used for imputing missing values under the assumption that a point's value can be approximated by the values of the points closest to it, based on the other variables.

KNN (K-Nearest Neighbors) is best known for classification and regression tasks in supervised learning, but it is also a well-established technique for imputing missing values in a dataset.

To impute a missing entry, KNN treats it like a prediction problem: it finds the 'nearest neighbors' of the data point with the missing value, measuring distance on the features that are observed, and fills in the gap from those neighbors' values, typically their mean for numeric variables or their mode for categorical ones. The underlying assumption is that similar data points tend to have similar values, so borrowing values from similar neighbors is a reasonable way to fill the gap.
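
As a minimal sketch of this idea, here is KNN imputation using scikit-learn's `KNNImputer` (assuming scikit-learn is installed; the toy matrix `X` is made up for the example):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: rows are samples, columns are features, np.nan marks missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k nearest rows, with distances computed on the observed features only.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Setting `weights="distance"` instead would weight closer neighbors more heavily, which can help when neighbors vary widely in similarity.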

But it’s important to note that KNN for imputation has limitations, such as:

  1. Curse of Dimensionality: KNN’s performance can degrade significantly as the number of features (dimensions) increases.
  2. Sensitivity to Noise and Outliers: Outliers or noisy data points can affect the performance of KNN significantly.
  3. Computational Cost: For large datasets, calculating distances to find nearest neighbors can be computationally expensive.

In practice, alternative imputation techniques are often preferred over KNN for handling missing values: simple methods such as mean or median imputation when speed and scalability matter, and more sophisticated approaches such as regression imputation or matrix factorization (e.g., Singular Value Decomposition) when relationships between variables need to be captured. The right choice depends on the size of the dataset, its dimensionality, and the pattern of missingness.
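
For comparison, a minimal sketch of simple median imputation with scikit-learn's `SimpleImputer` (again assuming scikit-learn; `X` is the same kind of made-up example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Replace each missing entry with the median of its column;
# strategy could also be "mean" or "most_frequent".
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

This ignores relationships between features, which is exactly the information KNN imputation tries to exploit, so it is cheaper but can be less accurate when features are correlated.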