The following distance metrics can be used in KNN:
- Manhattan
- Minkowski
- Tanimoto
- Jaccard
- Mahalanobis
In K-Nearest Neighbors (KNN), various distance metrics can be used to measure the similarity or dissimilarity between data points. The choice of metric depends on the nature of the data and the problem at hand. Some commonly used distance metrics in KNN are described below; a short code sketch implementing each one follows the list.
- Euclidean Distance:
- Formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- It measures the straight-line distance between two points in Euclidean space.
- Manhattan Distance (L1 Norm):
- Formula: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- It is the sum of the absolute differences between the coordinates.
- Minkowski Distance:
- A generalization of both Euclidean and Manhattan distances.
- Formula: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
- When $p = 2$, it becomes Euclidean distance, and when $p = 1$, it becomes Manhattan distance.
- Chebyshev Distance (L∞ Norm):
- Formula: $d(x, y) = \max_i \, |x_i - y_i|$
- It measures the maximum absolute difference between the coordinates.
- Hamming Distance:
- Used for categorical or binary data.
- It counts the number of positions at which the corresponding symbols are different.
- Cosine Similarity:
- Often preferred for text data and other high-dimensional, sparse data.
- Measures the cosine of the angle between two vectors.
- Formula: $\text{cosine\_similarity}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
- Jaccard Similarity / Jaccard Distance:
- Used for comparing the similarity and diversity of sample sets.
- For Jaccard Similarity: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
- For Jaccard Distance: $d_J(A, B) = 1 - J(A, B)$
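
To make these definitions concrete, here is a minimal NumPy sketch computing each metric on small toy inputs; the vectors, category lists, and sets are hypothetical examples chosen purely for illustration.

```python
import numpy as np

# Hypothetical feature vectors for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance (L2 norm of the difference).
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance (L1 norm): sum of absolute coordinate differences.
manhattan = np.sum(np.abs(x - y))

# Minkowski distance: generalizes both (p=2 -> Euclidean, p=1 -> Manhattan).
def minkowski(u, v, p):
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# Chebyshev distance (L-infinity norm): maximum absolute coordinate difference.
chebyshev = np.max(np.abs(x - y))

# Hamming distance: number of positions where categorical values differ.
a = ["red", "green", "blue"]
b = ["red", "blue", "blue"]
hamming = sum(ai != bi for ai, bi in zip(a, b))

# Cosine similarity: cosine of the angle between the two vectors.
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Jaccard similarity and distance on sets.
A, B = {1, 2, 3}, {2, 3, 4}
jaccard_sim = len(A & B) / len(A | B)
jaccard_dist = 1.0 - jaccard_sim

print(f"Euclidean:      {euclidean:.3f}")
print(f"Manhattan:      {manhattan:.3f}")
print(f"Minkowski p=3:  {minkowski(x, y, 3):.3f}")
print(f"Chebyshev:      {chebyshev:.3f}")
print(f"Hamming:        {hamming}")
print(f"Cosine sim:     {cosine_sim:.3f}")
print(f"Jaccard sim:    {jaccard_sim:.3f}  distance: {jaccard_dist:.3f}")
```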
When facing an interview question on distance metrics in KNN, it’s essential to discuss the characteristics of the data and the specific requirements of the problem to justify the choice of a particular distance metric. Each distance metric has its strengths and weaknesses, and the appropriate one depends on the context of the application.
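
In practice, the metric is typically just a parameter of the KNN implementation rather than something you compute by hand. As a rough sketch of how the choice plays out, the snippet below compares a few metrics with scikit-learn's KNeighborsClassifier; the Iris dataset and the particular metrics compared are assumptions made only for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset purely for demonstration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the same KNN model with different distance metrics and compare accuracy.
for metric in ["euclidean", "manhattan", "chebyshev"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    print(f"{metric:10s} accuracy: {knn.score(X_test, y_test):.3f}")
```

Because KNN relies directly on these distances, feature scaling (e.g., standardization) usually matters as much as the choice of metric itself.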