Data leakage occurs when information from outside the training data is used while building the model. This can introduce unintended, hidden bias that often goes undiscovered until the model underperforms in production. The best safeguard is a robust validation procedure that ensures no portion of the validation data is used anywhere in the training process, including preprocessing steps such as scaling or feature selection.
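To make the safeguard concrete, here is a minimal sketch of one common leakage path: computing preprocessing statistics before splitting the data. The dataset, the function names (`train_validation_split`, `fit_scaler`, `apply_scaler`), and the 25% validation fraction are all hypothetical choices for illustration, not part of any particular library.

```python
import random
import statistics


def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def fit_scaler(values):
    """Learn standardization parameters (mean, std) from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def apply_scaler(values, params):
    """Standardize values using previously learned parameters."""
    mean, std = params
    return [(v - mean) / std for v in values]


# Hypothetical single-feature dataset, for illustration only.
data = [float(x) for x in range(1, 21)]

# Split first, so validation rows are set aside before anything is fitted.
train, validation = train_validation_split(data)

# Leaky: the parameters are computed on all rows, so they encode
# information about the validation rows.
leaky_params = fit_scaler(data)

# Safe: the parameters come from the training rows only and are then
# reused, unchanged, on the validation rows.
safe_params = fit_scaler(train)
train_scaled = apply_scaler(train, safe_params)
validation_scaled = apply_scaler(validation, safe_params)
```

The same rule applies to any step that learns from data (imputation, encoding, feature selection): fit it on the training split only, then apply it to the validation split.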
What is Data Leakage?