What is Cluster Sampling?

Cluster sampling is the process of randomly selecting intact groups (clusters) from a defined population, where the groups share similar characteristics. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. For example, if you are sampling managers across a set of companies, then managers (samples) will represent …
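A minimal sketch of one-stage cluster sampling, using only the Python standard library. It assumes that companies act as the clusters and managers as the elements inside them; the company names, manager IDs, and the `cluster_sample` helper are hypothetical and exist only for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical population: each company (cluster) holds its managers (elements).
companies = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
    "D": ["d1"],
}

sample = cluster_sample(companies, n_clusters=2, seed=0)

# Every manager of each selected company ends up in the sample.
managers = [m for group in sample.values() for m in group]
```

The randomness is applied to the groups, not to the individuals, which is what separates this from simple random sampling.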

What are collinearity and multicollinearity?

Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression are strongly linearly correlated. Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated. In the context of machine learning and statistics, collinearity and multicollinearity refer to the presence of strong correlations between predictor variables in a …
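One rough way to spot collinearity is to compute the pairwise Pearson correlation between predictors. The sketch below uses made-up values for x1, x2, and x3, and the `pearson` helper is hypothetical. For the two-predictor case it also derives the variance inflation factor as 1 / (1 - r^2); a VIF above roughly 10 is a common rule of thumb for a problem, not a strict threshold.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up predictors for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly 2 * x1, so nearly collinear with x1
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]   # only weakly related to x1

r12 = pearson(x1, x2)
r13 = pearson(x1, x3)

# With exactly two predictors, the VIF reduces to 1 / (1 - r^2).
vif_12 = 1.0 / (1.0 - r12 ** 2)
```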

What is Overfitting? And how do you ensure you’re not overfitting with a model?

Overfitting occurs when a model learns the training data so closely that its performance on new data suffers. This means that the noise in the training data is picked up and learned as concepts by the model. But the problem here is that these concepts do not apply to …
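A common check for overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below assumes a toy 1-nearest-neighbour classifier as the "memorizing" model and a synthetic dataset with 20% label noise; both are illustrative choices, not something taken from the post.

```python
import random

def make_data(n, rng, noise=0.2):
    """Label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def predict_1nn(train, x):
    """Return the label of the closest training point (memorizes the noise too)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(train, data):
    return sum(predict_1nn(train, x) == y for x, y in data) / len(data)

rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(train, test)    # the memorized noise does not carry over
```

A large gap between `train_acc` and `test_acc` is the warning sign; held-out validation or cross-validation makes that gap visible before the model is used on new data.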

What is the difference between Entropy and Information Gain?

Entropy is an indicator of how messy (impure) your data is. It generally decreases as you move closer to the leaf nodes. Information Gain is the decrease in entropy after a dataset is split on an attribute. Measured against the root, the total gain keeps increasing as you move closer to the leaf nodes. Entropy and Information Gain are …
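As a small worked sketch, entropy can be computed as -sum(p * log2(p)) over the class proportions, and Information Gain as the parent's entropy minus the size-weighted entropy of the child nodes. The labels and the split below are made up for illustration, and the helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Made-up split: a perfectly mixed parent divided into two mostly pure children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

parent_entropy = entropy(parent)               # 1.0 bit for a 50/50 mix
gain = information_gain(parent, [left, right])  # about 0.278 bits
```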

What is the difference between Gini Impurity and Entropy in a Decision Tree?

Gini Impurity and Entropy are the metrics used for deciding how to split a Decision Tree. Gini Impurity is the probability of a random sample being classified incorrectly if you randomly pick a label according to the distribution in the branch. Entropy measures the lack of information (the uncertainty) in a branch. You calculate the Information …
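To make the comparison concrete, here is a minimal sketch that computes both metrics for the same branch. Gini Impurity is taken as 1 - sum(p^2) and entropy as -sum(p * log2(p)) over the class proportions; the label lists are made up and the function names are hypothetical.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Chance of mislabeling a random sample when labels follow the branch distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Made-up branches for illustration.
mixed = ["yes"] * 5 + ["no"] * 5    # maximally impure for two classes
skewed = ["yes"] * 4 + ["no"] * 1   # mostly one class
pure = ["yes"] * 4                  # a single class

gini_mixed, gini_skewed, gini_pure = (gini_impurity(b) for b in (mixed, skewed, pure))
ent_mixed, ent_skewed, ent_pure = (entropy(b) for b in (mixed, skewed, pure))
```

Both metrics are zero for a pure branch and peak for an even mix; for two classes Gini tops out at 0.5 while entropy tops out at 1 bit.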