Among the common machine learning algorithms, which require feature scaling, and which do not?

As a general rule of thumb, if any component of an algorithm's objective function involves a distance measure, whether between observations or to a central location, the data should be scaled before training. If the algorithm is rule-based, such as a decision tree, scaling is not necessary. Even when there is no explicit need, scaling the data is rarely wrong, but the change of scale should be kept in mind when interpreting the fitted model. Using this heuristic, the following is a (non-exhaustive) mapping of where some of the most common algorithms fall.
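
To see why distance-based objectives are sensitive to scale, here is a minimal numeric sketch (the feature values are made up purely for illustration):

```python
import numpy as np

# Toy example: two people described by income (dollars) and age (years).
# Without scaling, the Euclidean distance is driven almost entirely by
# income, simply because its numeric range is so much larger.
a = np.array([50_000.0, 25.0])
b = np.array([52_000.0, 60.0])

print(np.linalg.norm(a - b))          # ~2000: the 35-year age gap barely registers

# After z-scoring each feature (mean 0, standard deviation 1),
# both features contribute comparably to the distance.
X = np.array([a, b])
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(Xz[0] - Xz[1]))
```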

Scaling is Necessary

  • Neural Networks (mainly to aid convergence of the gradient-descent optimizer)
  • Regularized Regression (Ridge, LASSO, Elastic Net, etc.)
  • Support Vector Machine
  • K-Nearest Neighbors (illustrated in the sketch after this list)
  • K-Means
  • Dimensionality Reduction (PCA, Factor Analysis)
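
As an illustration for this distance-based group, a common pattern is to standardize features before a K-Nearest Neighbors classifier. The sketch below assumes scikit-learn is available and uses its built-in wine dataset purely for illustration; on this dataset the unscaled score is typically noticeably lower.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# The wine features span very different ranges (e.g. proline vs. hue),
# so an unscaled KNN is dominated by the large-magnitude features.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("KNN without scaling:", unscaled.score(X_test, y_test))
print("KNN with scaling:   ", scaled.score(X_test, y_test))
```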

Scaling is Not Necessary

  • Ordinary Regression (plain linear or GLM regression without regularization)
    • However, if optimization is done via gradient descent, scaling the data helps with convergence.
  • Decision Tree Methods (CART, Random Forest, GBM, etc.); their scale-invariance is illustrated in the sketch after this list
  • Naive Bayes
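
To illustrate the tree-based case, the sketch below (again assuming scikit-learn, with the wine dataset used only as an example) fits a decision tree on raw and on standardized versions of the same data. Because a tree splits each feature at a threshold, a monotonic per-feature rescaling moves the thresholds but produces the same partitions, so the predictions come out identical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Same tree-growing procedure on raw and standardized features.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

print("Identical predictions with and without scaling:",
      np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))
```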
