What is Entropy? How is it calculated?

Entropy is a measure of randomness in data; here, randomness signifies the heterogeneity of the labels. Decision trees split the data in a manner that decreases entropy: they aim to divide data with heterogeneous labels into subsets/sub-regions with homogeneous labels, so with each split the level of homogeneity increases and the entropy decreases. In fact, entropy serves as the cost function that decision trees use as the basis for splitting the data: if a split leads to a decrease in entropy it is carried out, otherwise it is not.

For a dataset with 4 labels – say a, b, c, d – with probabilities of occurrence p, q, r and s respectively, the entropy of the data is given by:

Entropy = -(p log2 p + q log2 q + r log2 r + s log2 s)

For a dataset with n classes, the formula generalizes to:

Entropy = -Σ pi log2(pi), summed over i = 1 to n

where pi is the probability of occurrence of the i-th class.
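As a quick sanity check of this formula, here is a minimal Python sketch (using NumPy; the `entropy` helper is purely illustrative, not any library's built-in) that computes the entropy of a list of labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # probability of each class
    return -np.sum(p * np.log2(p))     # -sum(p_i * log2(p_i))

# A perfectly homogeneous node has entropy 0;
# a 50/50 mix of two classes has entropy 1 bit.
print(entropy(["a", "a", "a", "a"]))   # 0.0
print(entropy(["a", "b", "a", "b"]))   # 1.0
print(entropy(["a", "b", "c", "d"]))   # 2.0 (four equally likely labels)
```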

What’s the difference between pre pruning and post pruning?

Decision trees are notorious for overfitting. Pruning is a regularization method that penalizes the size (depth) of the tree, i.e. it adds to the value of the cost function so that larger trees are discouraged.

Pruning is of two types:

  1. Post Pruning (Backward Pruning): The full tree is generated first and then the non-significant branches are pruned/removed. Cross-validation is used at each step to check whether keeping a branch improves accuracy; if it does not, the branch is converted into a leaf node.
  2. Pre Pruning (Forward Pruning): This approach stops non-significant branches from being generated in the first place. It terminates the growth of a new branch based on a given stopping condition (for example, a maximum depth or a minimum number of samples per node). A scikit-learn sketch of both approaches follows below.
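As a rough illustration, the sketch below uses scikit-learn's DecisionTreeClassifier: constraints such as max_depth and min_samples_leaf act as pre-pruning, while cost-complexity pruning via ccp_alpha is one common form of post-pruning (not the exact cross-validation procedure described above). The dataset and parameter values are arbitrary choices for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop non-significant branches from being generated at all
# by constraining the tree while it grows.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                                    random_state=0).fit(X_train, y_train)

# Post-pruning (cost-complexity pruning): grow the full tree first,
# then cut back branches whose contribution does not justify their complexity.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# A fairly aggressive alpha, chosen here only for illustration.
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2],
                                     random_state=0).fit(X_train, y_train)

print("full tree depth:  ", full_tree.get_depth())
print("pre-pruned depth: ", pre_pruned.get_depth())
print("post-pruned depth:", post_pruned.get_depth())
```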

What is Pruning?

Decision trees are notorious for overfitting. Pruning is a regularization method that penalizes the size (depth) of the tree, i.e. it adds to the value of the cost function so that larger trees are discouraged.

Pruning is of two types:

  1. Post Pruning (Backward Pruning): The full tree is generated first and then the non-significant branches are pruned/removed.
  2. Pre Pruning (Forward Pruning): This approach stops non-significant branches from being generated in the first place.

Why is Recursive Binary Splitting called a Greedy Approach?

What is greed, and why is there a negative connotation attached to it?

Wikipedia defines greed as: "Greed (or avarice) is an uncontrolled longing for increase in the acquisition or use: of material gain (be it food, money, land, or animate/inanimate possessions); or social value, such as status, or power. Greed has been identified as undesirable throughout known human history because it creates behavior-conflict between personal and social goals."

But other terms – like ambition, savings, self-preservation, achievement, etc. – which are also associated with an increase in personal worth or status, are considered virtues. So a longing for personal gain is not bad in itself; it must be other traits that make greed a vice.

Following are the factors which make greed undesirable:

  1. Short-sightedness: Prioritizing instant gratification over long-term gains.
  2. Negative externalities: Greed not only hurts your own long-term interests but also adversely affects others.


How do Decision Trees work?

Decision trees start by selecting a root node based on a given condition and split the data into non-overlapping subsets. The same process is then repeated on the newly formed branches, and it continues until we reach the desired result, i.e. an answer to our business question. The splitting condition is decided by the value of the cost function: the condition for which the cost function, at that stage, is minimum is chosen. "At that stage" means that the overall value of the cost function is not taken into account, only the value of the cost function at that level (instant gratification). This approach may produce a tree whose overall cost is higher than that of a tree built by optimizing the overall cost function, so its accuracy may be lower than that of such a tree (negative externalities). For these reasons the approach is termed a greedy approach.
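To make the "instant gratification" point concrete, here is a minimal sketch of a single greedy split on one numeric feature. The helper names (entropy, best_split) and the toy data are hypothetical; the point is that the split is chosen purely by the cost at the current level, with no look-ahead:

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    """Greedily pick the threshold on feature x that minimises the
    weighted entropy of the two child nodes at this level only."""
    best_t, best_cost = None, np.inf
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        cost = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if cost < best_cost:             # only this split's cost matters,
            best_t, best_cost = t, cost  # not the cost of the whole tree
    return best_t, best_cost

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # splits at 3.0 with weighted entropy 0.0
```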


Why is this approach chosen if it results in a suboptimal tree?

If the overall cost function were considered at every step, the process would be highly computationally intensive. By using the greedy approach, decision trees save both computation time and hardware cost.

What is AUC, and when is it used?

For selecting the right model for a given business problem, we take the following steps:

  1. Plot the ROC curve for a given algorithm at different thresholds.
  2. Do the same for all the other candidate algorithms.
  3. Select the algorithm that has the largest area under the Receiver Operating Characteristic (ROC) curve, i.e. the largest AUC.

This is the general approach, but you need not always consider the whole ROC curve; you can restrict it up to the threshold suited to your business problem.
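A minimal sketch of this selection process, assuming scikit-learn and two arbitrary candidate models (the dataset and models are purely illustrative), might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}

# roc_auc_score integrates the ROC curve over all thresholds,
# so the comparison does not depend on any single cut-off.
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    print(name, "AUC =", round(roc_auc_score(y_test, scores), 3))
```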

Explain the significance of ROC.

The Receiver Operating Characteristic (ROC) curve is used for finding the optimum threshold for sensitivity. Sensitivity, or recall, is a measure of the 'True Positive' rate. The curve plots sensitivity against the 'False Positive' rate, i.e. against (1 - Specificity). Typically, as the threshold is relaxed, both the 'True Positive' rate and the 'False Positive' rate increase, but beyond a certain point the 'True Positive' rate flattens out, i.e. it becomes nearly indifferent to further increases in the 'False Positive' rate.

Thus identifying this threshold becomes imperative for selecting the best operating point (and hyperparameters) for a given algorithm. In some business problems you want to minimize 'False Positives', even at the expense of some capability to identify 'True Positives'. In such cases the ROC curve helps you cap the 'False Positive' rate at the desired value.
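For example, assuming scikit-learn's roc_curve and an illustrative cap of 0.05 on the 'False Positive' rate (both the toy scores and the cap are arbitrary choices), the threshold could be picked like this:

```python
import numpy as np
from sklearn.metrics import roc_curve

# y_true and y_score would come from your validation set; these values
# and the 0.05 cap are purely illustrative.
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.45, 0.7, 0.4, 0.6, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

max_fpr = 0.05                    # highest false positive rate we tolerate
ok = fpr <= max_fpr               # keep only acceptable operating points
best = np.argmax(tpr[ok])         # best true positive rate among them
print("threshold:", thresholds[ok][best],
      "TPR:", tpr[ok][best], "FPR:", fpr[ok][best])
```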

What is specificity?

The Honorable Madras High Court said: "1000 culprits can escape, but one innocent should not be punished." Legal systems around the world more or less follow this principle.

The above motto shows that accuracy is not always the metric being sought. If you were to design a model to assist the court in making decisions, that model should incorporate this dictum: its primary goal would be to ensure that no innocent person is predicted guilty.

Further, in the case of imbalanced datasets the accuracy value can be misleading. Let me illustrate with an example: say you have to build a model that predicts whether an accused person is guilty based on a given set of evidence. If the training data is unbalanced, with an overwhelming number of guilty cases (Crime Positive) and only a few innocents (Crime Negative), your model will incorporate this bias. Suppose the test data has 1000 rows, of which 950 are guilty (Crime Positive) cases and 50 are innocent (Crime Negative). If your model predicts 990 guilty (Crime Positive) and only 10 innocent (Crime Negative) cases, then it fails horribly at its objective: it goes against the established legal edict by incarcerating at least 80% of the innocents, even though its overall accuracy can still be as high as 96%.

Therefore you need a metric that measures how reliably the innocents (the true Crime Negatives) are identified, i.e. the True Negative rate. That is exactly what specificity measures: Specificity = TN / (TN + FP), the proportion of actual negatives that are correctly classified as negative. Specificity is used to check the reliability of the model on the negative class.
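To tie this back to the court example above, the following sketch (using scikit-learn's confusion_matrix, with the best-case assumption that the 10 predicted-innocent cases really are innocent) shows how accuracy can look excellent while specificity exposes the failure:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Reconstruct the court example: 950 guilty (1) and 50 innocent (0) cases,
# with a model that issues 990 guilty and 10 innocent verdicts.
y_true = np.array([1] * 950 + [0] * 50)
y_pred = np.array([1] * 950 + [1] * 40 + [0] * 10)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / len(y_true)   # looks excellent: 0.96
specificity = tn / (tn + fp)            # how many innocents are vindicated: 0.20
print(f"accuracy={accuracy:.2f}  specificity={specificity:.2f}")
```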

How does the tradeoff between recall and precision work?

We know that for unbalanced datasets we cannot rely on accuracy; instead, we measure model performance using Precision and Recall. The ideal scenario is one where both Precision and Recall are high, but in practice the two metrics are somewhat inversely related. I say somewhat because Precision does tend to decrease as Recall increases, but beyond a certain value of Recall, Precision becomes roughly constant, i.e. indifferent to further changes in Recall. Plotting a precision-recall curve makes this relationship easy to see; a sketch of how to compute one follows below.
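A rough sketch of how such a curve is obtained, assuming scikit-learn and a synthetic imbalanced dataset (both purely illustrative), is shown here; each threshold yields one precision/recall pair, and sweeping the threshold traces out the trade-off:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = (LogisticRegression(max_iter=1000)
          .fit(X_train, y_train)
          .predict_proba(X_test)[:, 1])

# Each threshold gives one (precision, recall) pair; printing a few of them
# shows precision falling as recall rises, then flattening out.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
for p, r, t in zip(precision[::50], recall[::50], thresholds[::50]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```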