You are working on a classification problem. For validation purposes, you've randomly split the training dataset into train and validation sets. You are confident that your model will work incredibly well on unseen data since your validation accuracy is high. However, you are shocked when the test accuracy turns out to be poor. What went wrong?

For a classification problem, prefer stratified sampling over plain random sampling. Random sampling does not take the proportions of the target classes into account, so a split can end up with a class balance that differs from the full dataset, especially when the classes are imbalanced. Stratified sampling, in contrast, preserves the distribution of the target variable in each of the resulting samples, as the sketch below illustrates.
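For example, scikit-learn's train_test_split accepts a stratify argument for exactly this purpose. A minimal sketch on synthetic, imbalanced data (the 95/5 class ratio is just for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: roughly 95% class 0, 5% class 1.
X, y = make_classification(
    n_samples=1_000, weights=[0.95, 0.05], random_state=42
)

# stratify=y preserves the 95/5 class ratio in both splits;
# a plain random split can distort it badly for the rare class.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train positive rate:", np.mean(y_train))  # ~0.05
print("val positive rate:  ", np.mean(y_val))    # ~0.05
```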

If your model performs well on the validation set but poorly on the test set, it usually means the model has overfit to the validation set, or that there is a mismatch between the validation and test distributions.

Here are some potential reasons for this discrepancy:

  1. Overfitting to the validation set: Your model might have learned to perform well on the specific samples in the validation set, but it fails to generalize to new, unseen data. This commonly happens when you repeatedly tune hyperparameters or model choices against the same validation set; each tweak leaks a little information about that set into the model.
  2. Data leakage: Information that should be unavailable at training time may have leaked into your training or validation process, for example, preprocessing statistics computed on the full dataset before splitting, or duplicate records shared across splits, leading to overly optimistic validation results.
  3. Different distributions: The validation set might not be representative of the test set or the real-world data distribution. This can happen if the random split for validation and test sets does not adequately capture the variability present in the overall dataset.
  4. Small validation set: If your validation set is too small, it may not effectively capture the variability present in the data, leading to unreliable estimates of model performance.
  5. Inconsistent preprocessing: There might be inconsistencies or errors in the preprocessing applied to the validation and test sets. For example, if the validation set was normalized with different statistics than the test set, performance will diverge (see the pipeline sketch after this list).
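Items 2 and 5 often share a root cause: preprocessing fitted on the wrong data. One common safeguard, sketched below with scikit-learn (the scaler and classifier are illustrative choices), is to wrap preprocessing and the model in a single pipeline so that transforms are fitted on the training fold only and applied identically everywhere else:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only, then applies
# the *same* fitted transform to anything scored later. This rules out
# leakage from fitting preprocessing on the full dataset (item 2) and
# inconsistent preprocessing between splits (item 5).
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```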

To address this issue, you can:

  • Re-evaluate your validation process to ensure there’s no data leakage and that it properly represents the test set.
  • Increase the size of your validation set, or use cross-validation for a more robust estimate of performance (a stratified k-fold sketch follows this list).
  • Ensure consistent preprocessing steps are applied to both the validation and test sets.
  • Investigate potential sources of overfitting and consider regularization techniques to mitigate it.
  • Analyze the differences between the validation and test sets to understand the reasons behind the performance gap.
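For the cross-validation suggestion above, stratified k-fold combines both ideas from this answer: every sample is validated exactly once, and each fold keeps the class proportions. A brief sketch, again with a placeholder classifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=1_000, weights=[0.9, 0.1], random_state=0
)

# Stratified 5-fold CV: each fold preserves the class ratio, and the
# mean and spread of the fold scores give a far more reliable picture
# than a single small validation split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```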