What is the problem with using a generic list of stop words? 

The generic list of English stop words may not be appropriate if the set of documents are all related to a specific domain. For example, if the documents are all about finance, words like “money” and “economy” might occur very frequently and provide little differentiation between documents. If that is the case, it is a good preprocessing step to append these niche-specific words to the list of stopwords to prune from the vocabulary before performing vectorization.

Author

Help us improve this post by suggesting in comments below:

– modifications to the text, and infographics
– video resources that offer clear explanations for this question
– code snippets and case studies relevant to this concept
– online blogs, and research publications that are a “must read” on this topic

Leave the first comment

Partner Ad
Find out all the ways that you can
Contribute