Executing a binary classification tree algorithm is a simple task. But how does tree splitting actually take place? How does the tree decide which variable to split on at the root node and which at its child nodes?

The Gini index and node entropy guide the binary classification tree's splitting decisions. At each node, the tree algorithm picks the feature (and split point) that partitions the data into the purest possible child nodes. The Gini index captures this notion of purity: if we arbitrarily pick a pair of objects from a group, a pure group is one in which they are almost certain to be of identical class …
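To make the idea concrete, here is a minimal sketch of computing Gini impurity for a node's labels; `gini` is a hypothetical helper written for illustration, not part of any particular library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: the probability that two items drawn at random
    (with replacement) from `labels` belong to *different* classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 mix of two classes has impurity 0.5.
print(gini(["a", "a", "a", "a"]))  # 0.0
print(gini(["a", "a", "b", "b"]))  # 0.5
```

A split is chosen to minimize the weighted Gini impurity of the resulting child nodes, which is why a pure child (impurity 0) is the ideal outcome.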

Suppose you found that your model is suffering from high variance. Which algorithm do you think could handle this situation, and why?

Handling High Variance

To handle issues of high variance, we should use a bagging algorithm. Bagging splits the data into sub-groups by repeatedly sampling random rows with replacement (bootstrap sampling). Once the data is split, we train a model on each sub-group using the same base training algorithm. After that, we combine the predictions by voting (for classification) or averaging (for regression) …

Both being tree-based algorithms, how is Random Forest different from the Gradient Boosting Machine (GBM)?

The main difference between a random forest and GBM is the ensemble technique each one uses. A random forest improves predictions using a technique called 'bagging'. On the other hand, GBM improves predictions with the help of a technique called 'boosting'. Bagging: in bagging, we apply random sampling with replacement and divide the dataset into N sub-samples. After that, we …
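To contrast with bagging's parallel, independently trained models, here is a minimal sketch of the boosting side: each stage fits the residuals left over by the ensemble so far. The regression-stump learner and the names `fit_stump` / `gradient_boost` are illustrative assumptions, not a real GBM implementation:

```python
def fit_stump(xs, ys):
    """Regression stump: a threshold with the mean target on each side."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right) if right else lm
        sse = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_stages=50, lr=0.3):
    """Boosting: stages are trained *sequentially*, each one fitting the
    residuals (errors) of the ensemble built so far."""
    f0 = sum(ys) / len(ys)          # start from the mean prediction
    preds = [f0] * len(xs)
    stages = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stages.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stages)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 10, 10, 10]
model = gradient_boost(xs, ys)
```

The key design contrast: bagging trains members independently and averages them to cut variance, while boosting trains members in sequence, each correcting its predecessors, to cut bias.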

Why do we need a validation set and a test set?

We split the data into three different sets while creating a model. Training set: we use the training set to build the model and tune its parameters. However, we cannot rely on the correctness of a model evaluated on the same training set it was built on; the model might still give incorrect outputs when fed new inputs. …
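A common way to produce the three sets is to shuffle once and carve off the validation and test portions; `three_way_split` below is a hypothetical helper, and the 70/15/15 fractions are just one conventional choice:

```python
import random

def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the data once, then slice off test and validation portions.
    The remaining rows become the training set."""
    rng = random.Random(seed)
    rows = rows[:]                  # copy so the caller's list is untouched
    rng.shuffle(rows)
    n = len(rows)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

The validation set steers choices made during development (hyperparameters, early stopping), while the test set is touched only once, at the end, for an unbiased estimate of generalization.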

How can you avoid overfitting?

Overfitting happens when a model learns a limited dataset too closely, noise included, so the risk of overfitting grows as the amount of data shrinks. For small datasets, we can reduce overfitting with the cross-validation method. In this approach, we divide the dataset into two sections comprising the testing and …
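The rotation of held-out sections that cross-validation relies on can be sketched as a k-fold index generator; `k_fold_indices` is an illustrative helper, analogous in spirit to library utilities such as scikit-learn's `KFold`:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each of the k folds serves as the held-out section exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx)  # every index is held out exactly once across the folds
```

Averaging the score over all k held-out folds gives a far more stable estimate of generalization than a single split, which is why cross-validation is the standard remedy when data is scarce.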