- K-Modes: A modification of K-Means for datasets in which every feature is categorical. Instead of numerical distance, it measures dissimilarity by counting the features on which two observations mismatch, and it represents each cluster by its mode (the most frequent category for each feature) rather than its mean. Cluster assignment and iteration otherwise proceed as in K-Means.
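- K-Medoids (PAM Clustering): PAM stands for Partitioning Around Medoids. Each cluster is represented by a medoid, which is an actual observation from the dataset, and the algorithm needs only pairwise dissimilarities, so it can handle mixed data types when paired with a suitable measure. A common choice is the Gower Distance, which computes a partial dissimilarity for each feature according to its data type (range-scaled absolute difference for numeric features, match/mismatch for categorical ones) and averages them. PAM clustering is more robust to outliers than K-Means but can be computationally expensive on large datasets.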
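
The two dissimilarity measures described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a full clustering implementation: the function names (`matching_dissimilarity`, `mode_of_cluster`, `gower_distance`) and the toy records are hypothetical, and the Gower variant shown assumes equal feature weights, no missing values, and numeric ranges supplied by the caller.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """K-Modes style dissimilarity: number of features on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of_cluster(records):
    """Most frequent category per feature: the K-Modes analogue of a centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def gower_distance(a, b, is_numeric, ranges):
    """Average of per-feature partial dissimilarities, each in [0, 1].

    Assumption: numeric features are scaled by a caller-supplied range
    (max - min over the dataset); categorical features score 0 on a match
    and 1 on a mismatch. All features are weighted equally.
    """
    parts = []
    for x, y, numeric, feature_range in zip(a, b, is_numeric, ranges):
        if numeric:
            # A zero range means the feature is constant, so it contributes 0.
            parts.append(abs(x - y) / feature_range if feature_range else 0.0)
        else:
            parts.append(0.0 if x == y else 1.0)
    return sum(parts) / len(parts)


# Toy categorical records (hypothetical data): colour, size, shape.
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
]
print(matching_dissimilarity(records[0], records[1]))  # 1 (only size differs)
print(mode_of_cluster(records))  # ('red', 'small', 'round')

# Toy mixed records (hypothetical data): age (numeric), subscribed, region.
person_a = (30, "yes", "urban")
person_b = (50, "no", "urban")
print(gower_distance(person_a, person_b,
                     is_numeric=(True, False, False),
                     ranges=(40, None, None)))  # 0.5
```

In a K-Modes loop, `matching_dissimilarity` would stand in for Euclidean distance in the assignment step and `mode_of_cluster` for the mean in the update step. For PAM, a matrix of pairwise `gower_distance` values would be passed to the medoid search in place of raw feature vectors.
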
What are some options for clustering on categorical data? What if the dataset contains a combination of numeric and categorical features?