You are given a data set with many variables, some of which you know to be highly correlated. Your manager has asked you to run PCA. Would you remove the correlated variables first? Why?

You might be tempted to say that correlated variables make no difference to PCA, but that would be incorrect.
Correlated variables have a substantial effect on the result: when several variables carry the same signal, the variance explained by the component that captures that signal is inflated, so discarding them changes the outcome noticeably.
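
The inflation effect can be illustrated with a small NumPy sketch. The data here is synthetic and the helper name explained_variance_ratio is made up for this example; it standardizes the columns and computes the eigenvalues of the resulting covariance matrix, which is one common way to run PCA, not the only one.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component.

    Columns are standardized first, so this is PCA on the correlation matrix.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))
    eigvals = eigvals[::-1]  # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
n = 2000

# Three independent variables (synthetic, for illustration only)
base = rng.normal(size=(n, 3))

# Two noisy near-copies of the first variable
copies = base[:, [0]] + 0.1 * rng.normal(size=(n, 2))
with_copies = np.hstack([base, copies])

ratio_independent = explained_variance_ratio(base)
ratio_correlated = explained_variance_ratio(with_copies)

print(f"PC1 share, independent variables:  {ratio_independent[0]:.3f}")
print(f"PC1 share, with correlated copies: {ratio_correlated[0]:.3f}")
```

With three independent standardized variables, each component should carry roughly a third of the variance. Once the two near-copies are added, three of the five variables move together, so the first component should carry close to three fifths.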

That said, PCA (Principal Component Analysis) does not require you to remove correlated variables beforehand. PCA handles multicollinearity (high correlation among variables) by design: it works by transforming the original variables into a new set of uncorrelated variables called principal components.

Here’s why you wouldn’t necessarily need to remove correlated variables before applying PCA:

  1. Orthogonal Transformation: PCA projects the data onto a new set of orthogonal axes, and the resulting component scores are uncorrelated. Even if the original variables are highly correlated, PCA identifies the directions of maximum variance in the data, so a group of correlated variables ends up summarized by a shared component.
  2. Dimensionality Reduction: PCA aims to reduce the dimensionality of the data by identifying the most important patterns or directions of variation. It does this by retaining the principal components that capture the most variance in the data. Removing correlated variables before PCA might discard potentially useful information for capturing this variance.
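
The two points above can be checked on a toy example. The following is a minimal sketch, assuming synthetic data and PCA computed through a singular value decomposition of the centered data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic data: x2 is strongly correlated with x1, x3 is independent
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# PCA via SVD of the centered data; rows of Vt are the principal directions
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # data expressed in principal-component coordinates

# Correlation among the inputs vs. covariance among the component scores
input_corr = np.corrcoef(X, rowvar=False)
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

print("corr(x1, x2):", input_corr[0, 1])
print("largest off-diagonal score covariance:", np.abs(off_diag).max())
```

The inputs x1 and x2 are highly correlated, yet the off-diagonal entries of the score covariance should be zero up to floating-point error, which is what "uncorrelated components" means in practice.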

However, there are a few considerations to keep in mind:

  • Numerical Stability: Highly correlated variables make the covariance matrix ill-conditioned (nearly singular), which can cause numerical instability when computing eigenvalues and eigenvectors. In practice, many modern PCA implementations work from a singular value decomposition of the centered data and handle this well.
  • Interpretability: If you need to interpret the principal components, you will look at which original variables contribute most to each one (the loadings). Highly correlated variables share that contribution among themselves, so removing or combining them can simplify the interpretation.
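
Both considerations can be made concrete with a short sketch. It assumes synthetic data in which x2 is an almost exact copy of x1, then compares the condition number of the covariance matrix with and without the copy and inspects the loadings of the first component. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Synthetic data: x2 is an almost exact copy of x1, x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
Xc = X - X.mean(axis=0)

# Numerical side: the near-duplicate makes the covariance matrix nearly singular
cond_with_copy = np.linalg.cond(np.cov(Xc, rowvar=False))
cond_without_copy = np.linalg.cond(np.cov(Xc[:, [0, 2]], rowvar=False))

# Interpretability side: loadings of the first principal component
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_loadings = Vt[0]

print(f"condition number with the copy:    {cond_with_copy:.3e}")
print(f"condition number without the copy: {cond_without_copy:.3e}")
print("PC1 loadings (x1, x2, x3):", pc1_loadings)
```

The condition number should be several orders of magnitude larger when the near-copy is present. The first component should load almost equally on x1 and x2, so neither variable stands out on its own when reading the loadings.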

In summary, removing correlated variables before applying PCA is not necessary, but the choice involves trade-offs in numerical stability, interpretability, and how the explained variance is distributed across components. Weigh these against the specific goals of the analysis.