What is Vector Normalization? How is that useful?
If the documents in the corpus vary in size, the larger documents tend to have higher word counts across the vocabulary simply because they contain more words. In that case, normalization can scale the word counts to a comparable level across all documents. Common techniques for normalizing text vectors include L2 normalization and dividing each count by the number of tokens in the document, which roughly corresponds to the rate of occurrence of a given token in that document.
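As a quick illustration, here is a minimal sketch in NumPy using a made-up document-term count matrix (the counts and vocabulary are assumptions for the example). It shows that both approaches map a short document and a five-times-longer document with the same word mix onto the same normalized vector:

```python
import numpy as np

# Hypothetical document-term count matrix: rows are documents, columns are
# vocabulary terms. The second document is much longer, so its raw counts
# are larger across the board even though the word mix is the same.
counts = np.array([
    [2.0, 1.0, 0.0, 1.0],    # short document (4 tokens)
    [10.0, 5.0, 0.0, 5.0],   # long document (20 tokens), same relative mix
])

# Option 1: L2 normalization -- scale each row to unit Euclidean length.
l2_norms = np.linalg.norm(counts, axis=1, keepdims=True)
l2_normalized = counts / l2_norms

# Option 2: divide by the number of tokens in each document, giving the
# rate of occurrence (term frequency) of each token within the document.
token_totals = counts.sum(axis=1, keepdims=True)
tf_normalized = counts / token_totals

print(l2_normalized)   # both rows become the same unit vector
print(tf_normalized)   # both rows become the same frequency vector
```

After either normalization, the two documents produce identical vectors, so differences in document length no longer dominate comparisons between them.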