Because K-Means is a distance-based algorithm, outliers can degrade the quality of the clusters it produces in several ways. The objective of K-Means is to minimize the within-cluster sum of squares (WCSS), that is, the sum of squared distances from each observation to its cluster’s centroid. Because the distances are squared, an outlier that lies far from every centroid contributes disproportionately to this objective: the minimum that can be achieved is higher than it would be without the outlier, and centroids are pulled toward the outlier and away from the bulk of the data. A small number of outliers can also end up forming clusters that contain only a few observations, which obscures the practical interpretation of what the clusters represent. This reinforces the importance of scaling the data before fitting a clustering algorithm, but even after scaling, noticeable outliers should be investigated further.
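To make both effects concrete, here is a minimal sketch using scikit-learn's `KMeans`. The data is synthetic and purely illustrative: two made-up Gaussian groups and one hand-placed extreme point at (100, 100). The helper name `fit_kmeans` and all parameter values are assumptions chosen for the demonstration, not something prescribed by the discussion above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic groups of 50 points each (illustrative data only).
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
clean = np.vstack([group_a, group_b])

# The same data with a single extreme observation appended.
outlier = np.array([[100.0, 100.0]])
with_outlier = np.vstack([clean, outlier])


def fit_kmeans(X, k=2):
    """Hypothetical helper: fit K-Means with a fixed seed for repeatability."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)


km_clean = fit_kmeans(clean)
km_out = fit_kmeans(with_outlier)

# inertia_ is scikit-learn's name for the within-cluster sum of squares.
print("WCSS without outlier:", km_clean.inertia_)
print("WCSS with outlier:   ", km_out.inertia_)

# Cluster sizes show whether a cluster has collapsed to a handful of points.
sizes_clean = np.bincount(km_clean.labels_, minlength=2)
sizes_out = np.bincount(km_out.labels_, minlength=2)
print("Cluster sizes without outlier:", sizes_clean)
print("Cluster sizes with outlier:   ", sizes_out)
```

In this setup the single extreme point should claim a cluster of its own, which forces the two real groups to share the remaining centroid and inflates the WCSS. The exact numbers depend on the random draw, so treat the output as an illustration rather than a benchmark.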
How do outliers affect the clusters formed in K-Means?