What is the difference between Entropy and Information Gain?

Information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is essentially a search for the attribute that returns the highest information gain (i.e., the most homogeneous branches). The first step is to calculate the entropy of the target, as in the worked example below.
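
As a quick illustration of that first step, using the entropy formula defined in the next section: for a hypothetical set S of 14 examples with 9 positive and 5 negative class labels (the counts here are illustrative only),

```latex
H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940 \text{ bits}
```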

In machine learning, entropy and information gain are concepts used in constructing decision trees, particularly in the ID3 (Iterative Dichotomiser 3) algorithm.

  1. Entropy:
    • Definition: Entropy is a measure of impurity or disorder in a set of examples. In decision trees, it quantifies the uncertainty associated with a random variable (here, the class label).
    • Calculation: For a set S with n examples and k distinct classes, the entropy H(S) is calculated as H(S) = -\sum_{i=1}^{k} p_i \log_2(p_i), where p_i is the proportion of examples in S that belong to class i.
  2. Information Gain:
    • Definition: Information gain measures the effectiveness of an attribute in reducing uncertainty (entropy) about the class label. It helps in deciding the order of attributes in the construction of the decision tree.
    • Calculation: The information gain IG(S, A) for an attribute A with respect to a set S is the difference between the entropy of the set before and after the split on attribute A: IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \cdot H(S_v), where values(A) is the set of possible values for attribute A, |S_v| is the number of examples in S for which attribute A takes value v, and H(S_v) is the entropy of the subset S_v of S for which A equals v. (A short code sketch of both calculations follows this list.)
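
Here is a minimal Python sketch of both calculations, assuming each example is a dict mapping attribute names to values and the class labels form a parallel list (the function names `entropy` and `information_gain` are illustrative, not from any particular library):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p_i * log2(p_i), with p_i the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """IG(S, A) = H(S) - sum over values v of A of |S_v|/|S| * H(S_v)."""
    # Partition the labels by the value the attribute takes in each example.
    subsets = {}
    for example, label in zip(examples, labels):
        subsets.setdefault(example[attribute], []).append(label)
    n = len(labels)
    weighted = sum((len(sub) / n) * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted
```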

In summary:

  • Entropy is a measure of impurity or disorder in a set.
  • Information Gain measures how effective an attribute is at reducing uncertainty about the class label.
  • Information Gain is calculated by comparing the entropy before and after a split based on an attribute.

In a decision tree, the attribute with the highest information gain is chosen at each node to split the dataset, as it yields the greatest reduction in uncertainty about the class label.
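
Continuing the sketch above with a tiny hypothetical dataset, that greedy choice reduces to taking the attribute with the maximum gain:

```python
examples = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

best = max(["outlook", "windy"],
           key=lambda a: information_gain(examples, labels, a))
print(best)  # "outlook": both of its branches are pure (IG = 1 bit), while "windy" gives IG = 0
```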