January 16, 2021
Following were the improvements made in K-means:
- Since initial centroids are arbitrarily chosen, the results for earlier versions were not exactly replicable. K-Means++ proposed a new method of assigning centroids where the initial points were furthest from each other i.e. fixed and replicable locations.
- Earlier versions were computationally expensive as distance was to be calculated from each centroid to all data points and then compared for finding the nearest centroid. Leveraging the triangle inequality theorem, i.e. sum of two sides of a triangle is always greater than third side, the amount of computation was reduced.
- Earlier versions showed out of memory error for large datasets. This was solved by using mini batches. The algorithm is run for each mini batch and the centroid move incrementally with each new batch. This increases the speed of computation by around 3 times.
by : Monis Khan
Quick Summary:
Following were the improvements made in K-means: Since initial centroids are arbitrarily chosen, the results for earlier versions were not exactly replicable. K-Means++ proposed a new method of assigning centroids where the initial points were furthest from each other i.e. fixed and replicable locations. Earlier versions were computationally expensive as distance was to be calculated […]