You’re asked to build a random forest model with 10,000 trees. During training, you get a training error of 0.00, but on the validation set the error is 34.23. What is going on? Haven’t you trained your model perfectly?

The model is overfitting the data. A training error of 0.00 means the classifier has memorized the patterns in the training data, noise included. When the same classifier is run on unseen samples, those memorized patterns are not there to exploit, so it returns predictions with far more errors. In Random Forest, this usually happens …
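
As a quick illustration of this gap, here is a minimal sketch (assuming scikit-learn and a synthetic, deliberately noisy dataset; all parameter values are illustrative, not the ones from the question) in which fully grown trees drive the training error toward zero while the validation error stays much higher:

```python
# Minimal sketch (scikit-learn assumed): an over-grown forest can memorize
# the training set while generalizing poorly to held-out data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small, noisy dataset: memorization is easy, generalization is hard.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# Fully grown trees (no depth limit) fit the training data almost perfectly.
forest = RandomForestClassifier(n_estimators=1000, max_depth=None, random_state=0)
forest.fit(X_train, y_train)

print("training error:  ", 1 - forest.score(X_train, y_train))  # close to 0.00
print("validation error:", 1 - forest.score(X_val, y_val))      # noticeably higher
```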

You are asked to build a multiple regression model, but your model’s R² isn’t as good as you wanted. To improve it, you remove the intercept term, and your R² jumps from 0.3 to 0.8. Is this possible? How?

Yes, it is possible. The intercept term represents the model’s prediction without any independent variable, in other words the mean prediction, and with an intercept R² is computed as R² = 1 – ∑(Y – Y´)² / ∑(Y – Ymean)², where Y´ is the predicted value. In the presence of the intercept term, R² therefore evaluates your model against the mean model. When the intercept is dropped, most packages replace the denominator with ∑Y² (the total sum of squares about zero), so R² is measured against a model that always predicts zero, and it can jump even though the fit has not actually improved. …
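
A minimal NumPy sketch of this effect, using made-up data with a large offset and a weak slope (all values purely illustrative), shows how dropping the intercept switches the baseline from the mean model to the zero model and inflates R²:

```python
# Minimal sketch (NumPy assumed): R² reported with vs. without an intercept.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 50 + 0.5 * x + rng.normal(scale=3, size=100)   # large offset, weak slope

# With intercept: R² compares the fit against the mean model.
X1 = np.column_stack([np.ones_like(x), x])
beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid1 = y - X1 @ beta1
r2_with = 1 - np.sum(resid1**2) / np.sum((y - y.mean())**2)

# Without intercept: packages report the "uncentred" R², which compares
# against a model that always predicts zero.
X0 = x.reshape(-1, 1)
beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
resid0 = y - X0 @ beta0
r2_without = 1 - np.sum(resid0**2) / np.sum(y**2)

print("R² with intercept   :", round(r2_with, 3))     # modest, honest fit
print("R² without intercept:", round(r2_without, 3))  # inflated, despite a worse fit
```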

You are given a data set containing many variables, some of which you know to be highly correlated. Your manager has asked you to run PCA. Would you remove the correlated variables first? Why?

You might be tempted to say no, but that would be incorrect. Correlated variables have a substantial effect on PCA: the variance explained by the component that captures them gets inflated, making that component look more important than it really is. Strictly speaking, PCA does not require you to remove correlated variables beforehand, but you need to account for this inflation when interpreting the result. …
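
A small sketch (assuming scikit-learn and synthetic data; the duplicated variable below is purely illustrative) shows the inflation directly: adding a near-copy of one variable makes the first component’s explained-variance ratio jump.

```python
# Minimal sketch (NumPy + scikit-learn assumed): correlated variables inflate
# the variance explained by the component that captures them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)

# Uncorrelated case: two independent variables.
X_uncorr = np.column_stack([a, b])

# Correlated case: the first variable appears again, with a little noise.
X_corr = np.column_stack([a, a + 0.05 * rng.normal(size=500), b])

for name, X in [("uncorrelated", X_uncorr), ("correlated", X_corr)]:
    X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
    pca = PCA().fit(X_std)
    print(name, "explained variance ratio:",
          np.round(pca.explained_variance_ratio_, 2))
```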

Suppose you find that your model suffers from low bias and high variance. Which algorithm do you think could tackle this situation, and why?

How do you tackle high variance? Low bias occurs when the model’s predicted values are close to the actual values. In this case, we can use a bagging algorithm (e.g. Random Forest) to tackle the high-variance problem. A bagging algorithm divides the data set into subsets drawn by repeated randomized (bootstrap) sampling. Once divided, these samples …
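
For illustration, here is a minimal scikit-learn sketch (synthetic data, illustrative settings; note that BaggingClassifier’s default base learner is already a decision tree) comparing a single deep tree with a bagged ensemble of trees:

```python
# Minimal sketch (scikit-learn assumed): bagging trains many trees on
# bootstrap samples and averages their votes, which lowers variance.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# A single deep tree: low bias, high variance.
single_tree = DecisionTreeClassifier(random_state=0)

# Bagging: each of the 100 trees sees a bootstrap (sampled-with-replacement)
# subset of the rows; predictions are averaged across trees.
bagged_trees = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=0)

for name, model in [("single tree", single_tree), ("bagged trees", bagged_trees)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:12s} mean CV accuracy: {scores.mean():.3f}")
```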

You are working on a time series data set. Your manager has asked you to build a high-accuracy model. You start with a decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?

Time series data often has an underlying linear structure, whereas a decision tree algorithm is known to work best at detecting non-linear interactions. Here the decision tree fails to provide robust predictions. Why? Because it cannot map the linear relationship as well as a regression model can. We also know that a linear regression model can …
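
A minimal sketch (scikit-learn, synthetic data; the series below is a made-up linear trend) makes the point concrete: a tree predicts a constant once it leaves the range of time indices it was trained on, while the regression follows the trend into the future.

```python
# Minimal sketch (NumPy + scikit-learn assumed): on a linearly trending series,
# a decision tree cannot extrapolate the trend, while linear regression can.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
t = np.arange(120).reshape(-1, 1)                      # time index as the only feature
y = 2.0 * t.ravel() + rng.normal(scale=5, size=120)    # linear trend plus noise

# Train on the first 100 points, test on the last 20 (future values).
t_train, y_train = t[:100], y[:100]
t_test, y_test = t[100:], y[100:]

tree = DecisionTreeRegressor(random_state=0).fit(t_train, y_train)
lin = LinearRegression().fit(t_train, y_train)

def rmse(model):
    pred = model.predict(t_test)
    return np.sqrt(np.mean((y_test - pred) ** 2))

# The tree predicts a constant beyond the training range (its last leaf value),
# so its error grows with the trend; the regression tracks the trend.
print("decision tree RMSE on future points:", round(rmse(tree), 1))
print("linear model  RMSE on future points:", round(rmse(lin), 1))
```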