What is Principal Component Analysis (PCA), and how does it differ from clustering?

Principal Component Analysis (PCA) is a dimension reduction technique that explains the variability across multiple dimensions of data through linear combinations of the original features. Each of these linear combinations is referred to as a principal component, and the components are mutually orthogonal, and therefore uncorrelated, with one another.
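As a quick illustration of these two properties, the sketch below fits scikit-learn's PCA to a small synthetic dataset (the data and library choice are assumptions for demonstration only). Each row of `components_` holds the weights of one linear combination of the original features, and the pairwise dot products of the components come out as the identity matrix, confirming orthogonality.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 observations of 5 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA().fit(X)

# Each row of components_ is the weight vector of one linear
# combination of the original 5 features (one principal component).
print(pca.components_.shape)  # (5, 5)

# The components are mutually orthogonal: the matrix of pairwise
# dot products is (numerically) the identity matrix.
print(np.round(pca.components_ @ pca.components_.T, 6))
```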

The first principal component always explains the largest share of the variability among the features, and each subsequent component explains progressively less. If there are k original features, up to k principal components can be created, but because PCA is a reduction technique, a much smaller number of components is usually retained. That number can be chosen with a heuristic similar to the elbow plot used in k-means clustering; in the case of PCA, the plot is based on the cumulative proportion of variance explained. A sketch of this selection process follows below.

The main difference between clustering and PCA is that clustering attempts to find groupings among the observations (rows), whereas PCA performs reduction among the features (columns). The two methods are nevertheless similar in several respects, most notably that both are unsupervised learning methods whose results require user interpretation to derive practical meaning.
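The following is a minimal sketch of the cumulative-variance heuristic using scikit-learn; the Iris dataset and the 95% variance threshold are illustrative assumptions, not part of the method itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small example dataset (4 features) and standardize it,
# since PCA is sensitive to the scale of the original features.
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with as many components as features, then inspect how much
# variance each component explains.
pca = PCA(n_components=X_scaled.shape[1]).fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Variance explained per component:", pca.explained_variance_ratio_)
print("Cumulative variance explained:   ", cumulative)

# Keep the smallest number of components that reaches an acceptable
# cumulative proportion of variance explained (here, 95%).
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"Components retained for 95% variance: {n_keep}")

# Project the data onto the retained components to obtain the
# dimension-reduced dataset.
X_reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
print("Reduced shape:", X_reduced.shape)
```

In practice the threshold (90%, 95%, etc.) is a judgment call, much like reading the "elbow" in a k-means plot, which is why PCA results still require user interpretation.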
