Explain the idea behind ensemble techniques.

Ensemble techniques involve gathering information from several sources before making a decision. We often employ this idea in daily life: before buying a product we read several customer reviews, and even our electoral system is an ensemble method of sorts, where the choice (vote) of every elector is considered before selecting a candidate for office.

Similarly, when employing an ensemble technique in machine learning, instead of relying on a single model we combine the outputs of several models to make the final prediction. How the outputs are combined depends on the type of ensemble technique employed, as the sketch below illustrates.
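
A minimal sketch of this idea, assuming a scikit-learn workflow (the dataset and the choice of base models below are purely illustrative): three different models are trained and their individual predictions are combined by majority vote to produce the final prediction.

```python
# Combine the outputs of several models by majority ("hard") voting.
# Dataset and base models are illustrative, not prescriptive.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",  # each model gets one vote; the majority class wins
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```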

Following are the advantages of using ensemble techniques:

  1. Model Selection: There are times when several algorithms are equally suitable for solving the problem and we’re not sure which one to choose. By employing ensemble learning we hedge our bets.
  2. Too Much or Too Little Data: If the dataset is large and has huge variance, a single model won’t be able to identify and capture all the patterns present in the data. By dividing it into subsets and using a separate model suited to each subset we can arrive at much better results. On the other hand, if the dataset is too small, ensemble techniques like bagging that rely on bootstrap resampling can be of immense help.
  3. Complex Patterns: If the patterns present in the dataset are too complex to be modelled by a single equation, we can divide the dataset in a manner that leaves each subset with a simpler pattern. An appropriate algorithm can then be used to model each subset’s pattern.
  4. Multiple Sources: In many real-world scenarios we receive data from multiple sources that provide complementary information. By building a model for each source and combining their outputs we get a fuller view of the actual picture than by relying on a model built from a single source.
  5. Increased Confidence: If multiple models make the same prediction, especially in classification problems, then we’re much more certain about the accuracy of that prediction.

Following are the types of ensemble techniques:

  1. Bagging
  2. Boosting
  3. Blending
  4. Stacking
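
The families listed above can be instantiated roughly as follows in scikit-learn (a sketch, not a recipe: blending has no built-in class and is usually coded by hand on a hold-out set, and the toy dataset here is only for illustration).

```python
# Rough instantiation of the main ensemble families (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boosting = GradientBoostingClassifier(n_estimators=100, random_state=0)
stacking = StackingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, model.fit(X, y).score(X, y))
```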

How do bias and variance vary for each CV method?

Cross validation is of three types:

  1. Hold Out: Here you split your original data into training and test (hold-out) sets. The model is trained on the training set and then checked for overfitting on the test set. The disadvantage of this method is that, with smaller datasets, a biased random test set will pass that bias into the evaluation. The overfitting estimates provided by this method therefore have high variance, depending on the manner in which the data was split (see the sketch after this list), and the bias is also relatively high because the model is trained on only part of the data.
  2. Leave One Out: This takes the K Fold method to its extreme, with k equal to the number of observations. Because every model is trained on almost the entire dataset, the bias of the estimate is very low, but for the same reason the trained models are nearly identical and highly correlated, so the variance of the estimate is high. The method also requires huge computation power.
  3. K Fold: This is performed in addition to the ‘Hold Out’ method. Here the training set is divided into k subsets; the model is trained on (k-1) subsets and tested on the remaining one. This process is repeated k times, i.e. once per subset, and the final value is the average of the k iterations. This is essentially repeating the ‘Hold Out’ method k times on the training set. Since the training sets overlap, this method has a moderate bias, and since the k models are less correlated than in LOO the variance is lower than for the LOO method.
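
The variance of the hold-out estimate versus the K Fold estimate can be seen directly with a small experiment (a sketch assuming scikit-learn; the synthetic dataset and logistic regression model are arbitrary choices):

```python
# The hold-out score changes with the random split; the K-Fold average is steadier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hold-out: the estimate depends heavily on which rows end up in the test set.
holdout = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    holdout.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print("hold-out scores:", np.round(holdout, 2), "std:", np.std(holdout).round(3))

# 5-Fold: averaging over k different validation folds gives a lower-variance estimate.
print("5-fold mean score:", cross_val_score(model, X, y, cv=5).mean().round(3))
```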

What are the different types of CV methods?

Cross validation is of three types:

  1. Hold Out: Here you split your original data into training and test (hold-out) sets. The model is trained on the training set and then checked for overfitting on the test set. The advantages of this method are that it is inexpensive in terms of computation time and that it can be applied across the board. The disadvantage is that, with smaller datasets, a biased random test set will pass that bias into the evaluation, so the overfitting estimates it provides have high variance depending on the manner in which the data was split.
  2. K Fold: This is performed in addition to the ‘Hold Out’ method. Here the training set is divided into k subsets; the model is trained on (k-1) subsets and tested on the remaining one. This process is repeated k times, i.e. once per subset, and the final value is the average of the k iterations. This is essentially repeating the ‘Hold Out’ method k times on the training set. The disadvantage of this method is that the algorithm has to be run k times, which makes it computationally intensive.
  3. Leave One Out: This takes the K Fold method to its extreme, with k equal to the number of observations. The overfitting estimate provided by this method is good, but it requires huge computation power. All three methods are sketched below.
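
A sketch of the three methods side by side, assuming scikit-learn (the iris dataset and decision tree model are only for illustration; Leave One Out is shown on a small dataset because it is expensive on larger ones):

```python
# The three validation strategies side by side.
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 1. Hold Out: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("Hold-out:", model.fit(X_tr, y_tr).score(X_te, y_te))

# 2. K Fold: k train/validate rounds, the score is their average.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold  :", cross_val_score(model, X, y, cv=cv).mean())

# 3. Leave One Out: k equals the number of observations.
print("LOO     :", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```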

What is Cross validation?

Cross validation essentially means checking the prediction accuracy of your model on unseen data, i.e. data the model wasn’t trained on. Cross validation is performed to ensure that the model doesn’t overfit. If the prediction accuracy of your model on the training data and on unseen data is similar, then your model is good to go. If the prediction accuracy on unseen data is significantly lower than that on the training data, then your model is overfitting (see the sketch after the list below).
Cross validation is of three types:

  1. Hold Out: Here you split your original data into training and test (hold-out) sets. The model is trained on the training set and then checked for overfitting on the test set. The disadvantage of this method is that, with smaller datasets, a biased random test set will pass that bias into the evaluation, so the overfitting estimates it provides have high variance depending on the manner in which the data was split.
  2. K Fold: This is performed in addition to the ‘Hold Out’ method. Here the training set is divided into k subsets; the model is trained on (k-1) subsets and tested on the remaining one. This process is repeated k times, i.e. once per subset, and the final value is the average of the k iterations. This is essentially repeating the ‘Hold Out’ method k times on the training set. The disadvantage of this method is that the algorithm has to be run k times, which makes it computationally intensive.
  3. Leave One Out: This takes the K Fold method to its extreme, with k equal to the number of observations. The overfitting estimate provided by this method is good, but it requires huge computation power.
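
A minimal sketch of the overfitting check described above, assuming scikit-learn (the synthetic dataset and unpruned decision tree are chosen only to make the train/unseen gap obvious):

```python
# Compare accuracy on training data with accuracy on unseen (held-out) data;
# a large gap between the two signals overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
print("Train accuracy :", model.score(X_tr, y_tr))   # typically ~1.0
print("Unseen accuracy:", model.score(X_te, y_te))   # noticeably lower => overfitting
```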

What are the disadvantages and advantages of using a Decision Tree?

Following are the advantages of Decision Trees:

  1. They are able to identify and model complex patterns.
  2. They work well with both classification and regression problems.
  3. They are largely unaffected by outliers.
  4. They are easier to explain to non-technical stakeholders; even complex Decision Trees can be explained simply by creating their visual representations.
  5. Feature scaling and normalization are not needed.

Following are the disadvantages of Decision Trees:

  1. They are prone to overfitting (see the sketch after this list).
  2. A small change in the data can cause instability in the model, owing to the use of recursive binary splitting.
  3. They are computationally more intensive and take longer to train than many other classification algorithms.
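
A quick illustration of disadvantage 1 (scikit-learn assumed; the noisy synthetic dataset and depth values are arbitrary choices): an unconstrained tree memorises the training data, while limiting its depth trades training accuracy for better generalisation.

```python
# An unconstrained tree overfits; restricting max_depth reduces the train/test gap.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, flip_y=0.1, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

for depth in (None, 3):   # None = grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=2).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```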

What are different algorithms available for Decision Tree?

Following are the algorithms that can be used to construct a Decision Tree:

  1. Iterative Dichotomiser 3 (ID3): This algorithm uses Information Gain as the basis for selecting the attribute to split on at each node. It is used for classification problems and works only with categorical data.
  2. C4.5: This is an improved version of the ID3 algorithm, as it works with both categorical and continuous variables. It is used for classification problems.
  3. Classification and Regression Trees (CART): This is the most popular algorithm for constructing Decision Trees. It uses Gini Impurity by default for selecting and splitting nodes, but it also works with Information Gain. It can be used for both regression and classification problems; a scikit-learn sketch follows.
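
For reference, scikit-learn's DecisionTreeClassifier implements a CART-style tree whose split criterion can be switched between Gini Impurity (the default) and entropy/Information Gain; the dataset below is only illustrative.

```python
# CART-style trees with the two common split criteria.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

cart_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)        # default criterion
cart_entropy = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # information gain
print(cart_gini.score(X, y), cart_entropy.score(X, y))
```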

How does node selection take place while building a tree?

Decision Trees select a root node based on a given condition and split the data into non-overlapping subsets. The same process is then repeated on the newly formed branches, and it continues until we reach the desired result, i.e. the answer to our business question. The split is decided by the value of a cost function: at each stage, the node whose split would lead to the maximum reduction in the cost function is chosen.

Here the cost function can be either entropy or Gini Impurity. Performance-wise the two are similar, but Gini Impurity is usually chosen when dealing with large datasets, owing to it being less computationally intensive (it avoids computing logarithms). A hand-rolled sketch of this selection follows.
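
A hand-rolled sketch of this node-selection step (the best_split helper and the tiny binary-feature dataset are hypothetical, for illustration only): for each candidate split we compute the weighted Gini Impurity of the children and keep the split that reduces the cost function the most.

```python
# Pick the split that gives the largest reduction in Gini Impurity.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    parent = gini(y)
    best = (None, 0.0)                       # (feature index, impurity reduction)
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        if len(left) == 0 or len(right) == 0:
            continue
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - weighted > best[1]:
            best = (j, parent - weighted)
    return best

X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))                      # feature 0 gives the larger reduction
```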

What do you understand by Information Gain? How does it help in tree building?

Information Gain is equal to the reduction in entropy. It measures how much the randomness/entropy decreases after the decision tree has made a split, i.e. it is the difference between the entropies before and after the split:

Information Gain(T, X) = Entropy(T) − Entropy(T, X)

where T is the parent node before the split, X is the split performed on T, and Entropy(T, X) is the weighted average entropy of the resulting child nodes.

Information Gain is the cost function that decision trees employ as the basis for splitting the data: if a candidate split leads to an increase in Information Gain it is carried out, otherwise it is not. A minimal numeric sketch follows.
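
A minimal numeric sketch of the formula above (the entropy helper and the toy labels are hypothetical): the entropy of the parent node minus the weighted entropy of the children gives the Information Gain of the split.

```python
# Information Gain = Entropy(parent) - weighted Entropy(children).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 1, 1, 1])                    # mixed labels, entropy = 1.0
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])   # children after the split

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted
print(info_gain)                                         # 1.0: the split removes all randomness
```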

What is Gini Impurity?

According to Wikipedia, ‘Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it was randomly labelled according to the distribution of labels in the subset.’

Like entropy, Gini Impurity is a measure of the randomness of the data, where randomness signifies the heterogeneity of the labels. Decision trees split the data in a manner that decreases Gini Impurity: they aim to divide data with heterogeneous labels into subsets/sub-regions with homogeneous labels, so with each split the level of homogeneity increases and the Gini Impurity decreases. In fact, Gini Impurity is the cost function that decision trees employ as the basis for splitting the data: if a candidate split leads to a decrease in Gini Impurity it is carried out, otherwise it is not.

For a dataset with n classes, the formula for Gini Impurity is:

Gini = 1 − Σ (p_i)², with the sum running over the n classes,

where p_i is the probability of occurrence of class i.
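
A direct sketch of the formula (the gini_impurity helper and the label lists are hypothetical examples): a pure node gives 0, while an evenly mixed node gives the maximum impurity for its number of classes.

```python
# Gini = 1 - sum(p_i^2) over the n classes present in the node.
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([1, 1, 1, 1]))   # 0.0    - perfectly homogeneous node
print(gini_impurity([0, 1, 0, 1]))   # 0.5    - two classes, evenly mixed
print(gini_impurity([0, 1, 2]))      # ~0.667 - three classes, evenly mixed
```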