What happens to new words that appear in the test dataset but are not present in the training data?

Updated: October 3, 2023

As with other preprocessing techniques, it is considered best practice to fit the vectorizer on the training dataset only and then transform the test dataset using the vocabulary and parameters learned from the training data. If a word appears in the test dataset that was not seen when the vectorizer was fit, it is out of vocabulary: it has no corresponding feature column, so it is simply ignored. One workaround is to create a rule that assigns the rarest training tokens to a single umbrella token, and to map unseen test words to that same token, similar to creating an “Other” category when performing binning or discretization. Ideally, the train/test split is performed in such a way that this situation does not occur.
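The behaviour described above can be illustrated with a small, pure-Python bag-of-words sketch. It is a toy stand-in for a library vectorizer, not any particular library's implementation: the function names `fit_vocabulary` and `transform`, the `min_count` threshold, and the `<unk>` token name are all assumptions made for this example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical umbrella token; the name is an assumption


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def fit_vocabulary(train_docs, min_count=1, use_unk=False):
    """Learn a token -> column index mapping from the training documents only."""
    counts = Counter(tok for doc in train_docs for tok in tokenize(doc))
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    if use_unk:
        kept.append(UNK)  # rare training tokens and unseen tokens share this column
    return {tok: i for i, tok in enumerate(kept)}


def transform(docs, vocab):
    """Turn documents into count vectors using a previously fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:
                row[vocab[tok]] += 1
            elif UNK in vocab:
                row[vocab[UNK]] += 1  # fold out-of-vocabulary tokens into the umbrella column
            # otherwise the token is silently dropped
        rows.append(row)
    return rows


train = ["the cat sat", "the dog sat", "the cat ran"]
test = ["the zebra sat"]

# Plain vocabulary: "zebra" was never seen in training, so it contributes nothing.
plain_vocab = fit_vocabulary(train)
plain_test = transform(test, plain_vocab)

# Vocabulary with an umbrella token: tokens seen fewer than 2 times in training
# ("dog", "ran") and unseen test tokens ("zebra") all map to UNK.
unk_vocab = fit_vocabulary(train, min_count=2, use_unk=True)
unk_train = transform(train, unk_vocab)
unk_test = transform(test, unk_vocab)
```

In the plain case the test vector only counts "the" and "sat". With the umbrella token, the rare training words already populate the `<unk>` column during training, so a downstream model can learn a weight for it and "zebra" is no longer lost at test time.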

Author

AIML.com

