What is meant by Corpus and Vocabulary in Natural Language Processing?

A corpus is the entire collection of documents under analysis. What counts as a document in Natural Language Processing depends on the task: it could be an entire journal article or a short movie review. Even a single sentence stored in one row of a DataFrame can be treated as a document. The vocabulary is the set of all distinct words that appear anywhere in the corpus. For example, in the following corpus

  1. It is cold outside today.
  2. I love the beach.
  3. Pizza is for lunch today.

The vocabulary would be {It, is, cold, outside, today, I, love, the, beach, Pizza, for, lunch}. The words "is" and "today" each occur in two documents but appear in the vocabulary only once.
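As a rough illustration, the vocabulary for the three-sentence corpus can be built in a few lines of Python. This is only a minimal sketch: the helper names `tokenize` and `build_vocabulary` are made up for this example, and the tokenizer simply assumes that a word is a run of letters, that punctuation is dropped, and that case is preserved (so "It" and "it" would count as different words). Real pipelines often lowercase the text and use a more careful tokenizer.

```python
import re

# The example corpus: each string is one document.
corpus = [
    "It is cold outside today.",
    "I love the beach.",
    "Pizza is for lunch today.",
]


def tokenize(document):
    # Assumption for this sketch: a word is a run of letters or apostrophes.
    # Punctuation such as the final period is discarded; case is kept as-is.
    return re.findall(r"[A-Za-z']+", document)


def build_vocabulary(documents):
    # Collect each distinct word once, in order of first appearance.
    vocabulary = []
    seen = set()
    for document in documents:
        for token in tokenize(document):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary


vocabulary = build_vocabulary(corpus)
print(vocabulary)
# ['It', 'is', 'cold', 'outside', 'today', 'I', 'love', 'the', 'beach', 'Pizza', 'for', 'lunch']
print(len(vocabulary))
# 12
```

The corpus contains 14 word occurrences in total, but the vocabulary has only 12 entries because repeated words are stored once.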
