What is the difference between covariance and correlation?

Correlation is the standardized form of covariance. Covariances are difficult to compare because their magnitude depends on the units of the variables. For example, the covariance of salary ($) and age (years) changes as soon as either variable is measured on a different scale, so covariances computed on unequal scales can't be compared. To handle this, we calculate correlation, which gives a value between -1 and 1, irrespective of …
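The scale dependence can be checked with a small sketch. The age and salary figures below are made up for illustration, and the helper names are not from the original answer; the only claim is that rescaling salary changes the covariance but leaves the correlation alone.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Made-up illustrative data: age in years, salary in dollars.
age = [25, 32, 41, 48, 55]
salary_usd = [40000, 52000, 61000, 75000, 83000]
salary_kusd = [s / 1000 for s in salary_usd]  # same salaries, in thousands

cov_usd = covariance(age, salary_usd)
cov_kusd = covariance(age, salary_kusd)
corr_usd = correlation(age, salary_usd)
corr_kusd = correlation(age, salary_kusd)

print(cov_usd, cov_kusd)    # covariance differs by the rescaling factor
print(corr_usd, corr_kusd)  # correlation is the same on both scales
```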

While working on a data set, how do you select important variables? Explain your methods.

The following methods of variable selection can be used:
1. Remove the correlated variables prior to selecting important variables
2. Use linear regression and select variables based on p values
3. Use Forward Selection, Backward Selection, Stepwise Selection
4. Use Random Forest, Xgboost and plot variable importance chart
5. Use Lasso Regression
6. Measure information gain for the available set …
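As one illustration of these methods, information gain for categorical features can be sketched in a few lines. This is a minimal version assuming a categorical target and categorical features; the toy data and the feature names "informative" and "noisy" are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: one feature separates the target perfectly, one does not.
target = ["yes", "yes", "no", "no", "yes", "no"]
features = {
    "informative": ["a", "a", "b", "b", "a", "b"],
    "noisy": ["x", "y", "x", "y", "x", "y"],
}

gains = {name: information_gain(col, target) for name, col in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)  # highest gain first
print(ranked, gains)
```

Features with higher gain tell us more about the target, so they are the candidates to keep.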

A rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean that the decrease in the number of pirates caused the climate change?

After reading this question, you should have recognized it as a classic case of “causation vs. correlation”. No, we can’t conclude that the decrease in the number of pirates caused the climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon. Therefore, there might be a correlation between global average temperature …
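A small simulation can make the role of a lurking variable concrete. This is a sketch under stated assumptions: the variables z, x and y are synthetic, x and y are each built from z plus independent noise, and neither depends on the other.

```python
import random
from math import sqrt

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]   # lurking variable
x = [zi + random.gauss(0, 0.5) for zi in z]  # driven by z only
y = [zi + random.gauss(0, 0.5) for zi in z]  # driven by z only

r_xy = correlation(x, y)

# Remove z's contribution (its coefficient is 1 by construction here).
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = correlation(x_resid, y_resid)

print(r_xy)     # strong correlation, although x does not cause y
print(r_resid)  # close to zero once z is accounted for
```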

When is Ridge regression favorable over Lasso regression?

You can quote ISLR’s authors Hastie and Tibshirani, who assert that in the presence of a few variables with medium / large sized effects, lasso regression should be used, and in the presence of many variables with small / medium sized effects, ridge regression should be used. Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression …
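The contrast between the two penalties can be illustrated in a simplified setting. The sketch below assumes an orthonormal design matrix, the lasso objective ½‖y − Xβ‖² + λ‖β‖₁ and the ridge objective ½‖y − Xβ‖² + (λ/2)‖β‖²; under those assumptions each coefficient has a closed form in terms of its least-squares estimate. The function names and the coefficient values are illustrative only.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding at lam."""
    magnitude = abs(beta_ols) - lam
    if magnitude <= 0:
        return 0.0  # small effects are removed entirely
    return magnitude if beta_ols > 0 else -magnitude

# Hypothetical least-squares estimates: two large effects, three small ones.
ols = [3.0, -2.0, 0.4, -0.3, 0.1]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print(ridge)  # every coefficient shrinks but stays nonzero
print(lasso)  # coefficients smaller than lam in magnitude become exactly zero
```

In this setting lasso drops the small effects while ridge keeps all of them at reduced size, which is consistent with preferring ridge when many small or medium effects matter.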

After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check if he’s right? Without losing any information, can you still build a better model?

To check multicollinearity, we can create a correlation matrix to identify and remove variables with correlation above 75% (deciding a threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value >= 10 implies serious multicollinearity. Also, we can use tolerance as an indicator of …
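The VIF check can be sketched without any libraries by using the fact that the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The data and the names x1, x2, x3 are hypothetical, tolerance is taken here as 1 / VIF, and the small matrix inverse is written by hand only to keep the example self-contained.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[correlation(columns[a], columns[b]) for b in names] for a in names]
    inv = invert(corr)
    return {name: inv[i][i] for i, name in enumerate(names)}

# Hypothetical predictors: x2 is almost a copy of x1, x3 is not.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1],
    "x3": [2.0, 7.0, 4.0, 5.0, 5.0, 4.0, 7.0, 2.0],
}

vifs = vif(data)
tolerance = {name: 1.0 / v for name, v in vifs.items()}
flagged = [name for name, v in vifs.items() if v >= 10]  # rule of thumb from the text
print(vifs, tolerance, flagged)
```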