K-Means Clustering in Machine Learning

Nihan Dinçer
5 min read · May 10, 2021

There are many different types of machine learning (supervised learning, unsupervised learning, reinforcement learning, etc.). The k-means method we will cover in this article is a type of unsupervised learning.

What is Unsupervised Learning?

In some pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. No labels are given to the learning algorithm, which is left on its own to find structure in its input. Clustering can be considered the most important unsupervised learning problem: like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. Because the model is self-learning, this is harder than supervised learning. The goal is to create subsets, or groups, of data points that are truly similar to each other: the clusters obtained should be homogeneous within themselves and heterogeneous among themselves.

What is K-means?

K-means clustering is an unsupervised learning algorithm used to group an unlabeled dataset into different clusters/subsets. It determines the positions of the k center points, or centroids, through an iterative process: each data point is assigned to its closest center, and a cluster is formed by the data points near a particular center.

So what does this clustering process depend on?

The person building the model specifies a hyperparameter "k", and all data points are clustered according to their distance from the specified cluster centers. Notice that the value of "k" is chosen by a person: non-hierarchical clustering methods such as k-means can be used when the researcher decides on the number of clusters themselves. How do we calculate the distance for clustering? The common metrics are listed below, with a short code sketch after the list.

. Minkowski Distance

. Manhattan City-Block Distance

. Euclidean Distance

. Mahalanobis Distance
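
For illustration, here is a minimal sketch of these four metrics using SciPy's scipy.spatial.distance module; the two sample points and the small dataset used to estimate the Mahalanobis covariance are made up:

```python
# A minimal sketch of the four distance metrics using SciPy.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(distance.minkowski(a, b, p=3))   # Minkowski distance (here with p=3)
print(distance.cityblock(a, b))        # Manhattan / city-block distance
print(distance.euclidean(a, b))        # Euclidean distance

# Mahalanobis distance needs the inverse covariance matrix of the data;
# here it is estimated from a small made-up sample.
X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 6.0], [5.0, 8.0]])
VI = np.linalg.inv(np.cov(X.T))
print(distance.mahalanobis(a, b, VI))
```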

How do we choose the value of k? There are several techniques for selecting this "k" value. None has been proven to be the most suitable, but the most popular one is the elbow method.

The elbow method is the most popular way of finding an optimal number of clusters. It uses WCSS (Within-Cluster Sum of Squares), which accounts for the total variation within the clusters.

WCSS = ∑_{Pi ∈ Cluster1} distance(Pi, C1)² + ∑_{Pi ∈ Cluster2} distance(Pi, C2)² + ∑_{Pi ∈ Cluster3} distance(Pi, C3)²

In the formula above, ∑_{Pi ∈ Cluster1} distance(Pi, C1)² is the sum of the squared distances between each data point Pi in cluster 1 and that cluster's centroid C1; the other two terms are defined in the same way.
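
To make the formula concrete, here is a minimal NumPy sketch that computes WCSS by hand; the data points and the two-cluster assignment are made up for illustration:

```python
# Computing WCSS by hand for a made-up 2-cluster assignment,
# following the formula above.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])          # assumed cluster assignment

wcss = 0.0
for k in np.unique(labels):
    cluster = X[labels == k]
    centroid = cluster.mean(axis=0)      # Ck in the formula
    # Sum of squared distances between each point Pi and its centroid Ck.
    wcss += np.sum((cluster - centroid) ** 2)

print(wcss)
```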

When WCSS is plotted against k, the point where the curve bends most sharply resembles an elbow, and that point is called the elbow point. In the example shown in the figure, the elbow is at k = 4, so our number of clusters should be 4.
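
In practice, the elbow curve can be drawn with scikit-learn, whose KMeans exposes WCSS as the inertia_ attribute; here is a sketch on made-up blob data:

```python
# A sketch of the elbow method with scikit-learn: fit k-means for
# k = 1..10 and plot WCSS (KMeans.inertia_) against k.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)             # WCSS for this k

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()                               # the bend in the curve is the elbow
```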

Now let's see how to implement it step by step.

Step 1: The number of clusters, k, is determined.

Step 2: k random centers are chosen.

Step 3: For each observation, the distances to the k centers are calculated.

Step 4: Each observation is assigned to the cluster whose center it is closest to.

Step 5: The centers are recalculated for the clusters formed after the assignment.

Step 6: The operations are repeated for a certain number of iterations, and the clustering with the minimum total within-cluster variation is selected.
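
Here is a minimal NumPy sketch of these steps (Lloyd's algorithm); it is meant to mirror the steps above, not to replace a library implementation:

```python
# A minimal NumPy sketch of the k-means steps above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: k is given; pick k random observations as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):                 # Step 6: repeat
        # Step 3: distance from every observation to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 4: assign each observation to its closest center.
        labels = dists.argmin(axis=1)
        # Step 5: recompute each cluster's center (keep the old center
        # if a cluster happens to become empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # stop early on convergence
            break
        centers = new_centers
    return labels, centers
```

In practice one would run this several times with different random initializations and keep the run with the lowest WCSS, which is exactly what Step 6 describes.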

Note: Clusters are homogeneous within themselves and heterogeneous among themselves.

Where is k-means used?

It is used wherever there is "segmentation" (document classification, customer segmentation, fraud detection, etc.). Beyond this generalization, the k-means algorithm can be used in many other applications, such as image recognition.

Note: If you don't have domain knowledge of the business problem, do not rely on this method too much.

Note: On some datasets, even when provided with the correct number of clusters, k-means clearly fails to group the data into useful clusters. So what should be done in this case? Use HDBSCAN.

Let's look at DBSCAN first. DBSCAN is a density-based algorithm: it assumes clusters are dense regions of points. It has two parameters, ε (the radius of a point's neighborhood) and n (the minimum number of points required to form a dense region). We start from a point, collect all points within distance ε of it, and call this set cluster 1. All other points that can be reached through a chain of ε-neighborhoods are also included in cluster 1, and the cluster keeps expanding until no more samples can be added. Points not reachable from any other point are outliers, or noise points. By repeating the same process, cluster 2 is created, and so on.
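
In code, this looks as follows with scikit-learn's DBSCAN, where eps is the ε radius and min_samples plays the role of n; the parameter values here are made up for illustration:

```python
# A sketch of DBSCAN on synthetic two-moons data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks noise/outlier points
```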

The advantages: DBSCAN creates clusters of arbitrary shape; unlike k-means, it does not require specifying the number of clusters a priori; and it has a notion of noise, making it robust to outliers. The disadvantages: it has two hyperparameters to tune; it cannot deal effectively with clusters of varying densities; and if the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.

HDBSCAN is a more recent algorithm developed by some of the same people who wrote the original DBSCAN paper. Their goal was to allow clusters of varying density, and the result is a method that overcomes the disadvantages of k-means and DBSCAN described above. HDBSCAN uses a density-based approach that makes few implicit assumptions about the clusters, so it can handle clusters of different densities; the good news is that we no longer have to tune two parameters. Rather than looking for clusters with a particular shape, it looks for regions of the data that are denser than the surrounding space.
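
A minimal sketch with the hdbscan package (installable via pip install hdbscan); its main parameter is the minimum cluster size, and no ε radius needs to be chosen:

```python
# A sketch of HDBSCAN on the same two-moons data.
import hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(X)
print(set(labels))   # -1 again marks noise points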

Note: Always try HDBSCAN first on your data.

Now let's compare the three algorithms.
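
Since the original comparison figure is not reproduced here, the following sketch (assuming scikit-learn and the hdbscan package) generates a similar side-by-side comparison on synthetic two-moons data:

```python
# Comparing the three algorithms on two-moons data: k-means cuts the
# moons apart, while the density-based methods can follow their shape.
import hdbscan
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

results = {
    "k-means (k=2)": KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5).fit_predict(X),
    "HDBSCAN": hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, labels) in zip(axes, results.items()):
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=10)
    ax.set_title(name)
plt.show()
```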

What were the issues we covered?

. What is Unsupervised Learning?

. What is k-means? What does the clustering process depend on?

. Where is k-means used?

. What do DBSCAN and HDBSCAN mean?


