K mean is a famous partitioning method. Objects are classified as belonging to one of K groups, k chosen a priori.
In K-mean algorithm,
- The clusters are spherical: the data points in a cluster are centered around that cluster
- The variance/spread of the clusters is similar: Each data point belongs to the closest cluster
The K-means algorithm is a popular unsupervised machine learning technique used for clustering data into groups. Here’s a concise explanation:
- Initialization: The algorithm starts by randomly selecting K points from the dataset as the initial centroids. These centroids serve as the centers of the clusters.
- Assignment: Each data point is assigned to the nearest centroid based on a distance metric, typically Euclidean distance. This step partitions the data into K clusters.
- Update centroids: After the assignment step, the centroids are recalculated as the mean of all data points assigned to each cluster. This step moves the centroids to better represent the center of their respective clusters.
- Repeat: Steps 2 and 3 are repeated iteratively until either the centroids converge (i.e., stop changing significantly) or a predefined number of iterations is reached.
- Convergence: The algorithm converges when the centroids no longer change significantly between iterations or when a specified number of iterations has been completed.
- Final Result: Once the algorithm converges, each data point will be assigned to the cluster corresponding to its nearest centroid, and the centroids will represent the final cluster centers.
K-means aims to minimize the within-cluster sum of squares (WCSS), also known as inertia or distortion, which measures the compactness of the clusters. However, the algorithm is sensitive to the initial placement of centroids and can converge to local optima. Therefore, it’s common practice to run the algorithm multiple times with different initializations and choose the clustering with the lowest WCSS. Additionally, the choice of the number of clusters, K, is crucial and often requires domain knowledge or techniques such as the elbow method or silhouette analysis to determine the optimal value.