Ideally, scaling should be done after the train-test split. If the data is tightly clustered, scaling before or after the split may make little practical difference, but splitting first remains the safe default.
The correct approach is to perform scaling after the train-test split. Here’s why:
- Information Leakage Prevention: Scaling before splitting may lead to information leakage from the test set to the training set, which can result in overly optimistic performance estimates. For instance, if you scale the entire dataset before splitting, the scaling parameters (mean, standard deviation, etc.) are influenced by both the training and test data. This can introduce bias and compromise the integrity of your evaluation.
- Simulation of Real-world Scenarios: Splitting the data before scaling mimics the real-world scenario more accurately. In practice, when deploying a machine learning model, you won’t have access to the entire dataset at once. Instead, you’ll train your model on historical data and then apply it to new, unseen data, which must be scaled using the parameters learned from the training data.
- Model Generalization: The goal of splitting the data into train and test sets is to evaluate how well the model generalizes to unseen data. If scaling is performed before splitting, the model may perform well on the test set because it has seen information from it during training (due to scaling based on the entire dataset), rather than because it can generalize well to truly unseen data.
- Cross-validation: If you further utilize techniques like k-fold cross-validation for model evaluation, scaling before creating the folds leads to the same leakage issue within each fold, potentially inflating performance metrics and distorting model selection.
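The points above boil down to one pattern: fit the scaler on the training data only, then reuse those training-derived parameters to transform both subsets. A minimal sketch, assuming scikit-learn's `StandardScaler` and synthetic data for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (values are illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Split FIRST, so the test rows never influence the scaling parameters
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit the scaler on the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply those same training-derived parameters to both subsets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The scaler's parameters come from X_train alone, not the full dataset
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

The common mistake this avoids is calling `fit_transform` on the full dataset before splitting: doing so bakes the test rows' statistics into the scaling parameters.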
Therefore, the recommended practice is to first split your dataset into training and testing subsets, fit the scaler on the training data only, and then use those training-derived parameters to transform both subsets. This approach ensures a more reliable evaluation of model performance and a truer estimate of how well the model generalizes to unseen data.
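For the cross-validation case, the usual way to keep the split-then-scale discipline inside every fold is to wrap the scaler and model together. A sketch assuming scikit-learn's `Pipeline` utilities and a synthetic classification dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inside cross_val_score, the pipeline refits the scaler on each fold's
# training portion only, so the held-out fold never leaks into the
# scaling parameters.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.shape)  # (5,)
```

Scaling the full dataset once and then cross-validating on the pre-scaled matrix would reintroduce exactly the leakage described above, one fold at a time.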