Why are coefficients estimated through Maximum Likelihood Estimation (MLE) instead of Least Squares?

The least squares cost function is non-convex in a binary classification setting: when squared error is computed on the sigmoid's outputs, the loss surface has flat regions and local minima, so a gradient-based optimizer can get stuck in a local rather than the global minimum and fail to find the best coefficients. Applying the logit transformation to the targets does not rescue least squares either: the log-odds of 0 and 1 are negative and positive infinity, so the transformed actual values, and therefore the residuals between actual and predicted log-odds, would be infinite, since the actual values are only ever 0 or 1. Maximizing the Bernoulli log-likelihood avoids both problems, yielding a convex optimization with a well-defined optimum.
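As a minimal sketch of the MLE approach, the snippet below fits a logistic regression with plain gradient ascent on the Bernoulli log-likelihood, using NumPy on a toy dataset. The function name `fit_mle` and the toy data are illustrative assumptions, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_mle(X, y, lr=0.5, n_iter=2000):
    """Gradient ascent on the Bernoulli log-likelihood
    sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        # Gradient of the log-likelihood w.r.t. w is X^T (y - p)
        grad = X.T @ (y - p)
        w += lr * grad / len(y)
    return w

# Toy data: the label is 1 exactly when the feature is positive
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = (x > 0).astype(float)

w = fit_mle(X, y)
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
```

Note that the objective here is concave in `w`, so gradient ascent cannot get trapped in a local optimum, which is precisely the advantage over squared error on sigmoid outputs.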