How is prediction made in a Random Forest?

Random Forest works similarly to bagging, except that not all features (independent variables) are available to every subset and Random Forest works only with Decision Trees. In bagging, the subsets differ from the original data only in terms of the number of rows, but in Random Forest the subsets differ from the original data in terms of both the number of rows and the number of columns.

A tree model is constructed for each subset and their results are aggregated. The method of aggregation depends on the type of problem at hand, as illustrated in the sketch after the list:

  1. For classification problems, voting is used for aggregation.
  2. For regression problems, the mean/average is used for aggregation.
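
As a minimal illustration, scikit-learn's Random Forest estimators perform exactly this aggregation. The synthetic dataset, number of trees and other settings below are assumptions made for the example, not part of the answer:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the trees' votes are aggregated
# (scikit-learn averages the trees' class probabilities and picks the top class)
X_clf, y_clf = make_classification(n_samples=500, n_features=10, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_clf, y_clf)
print(clf.predict(X_clf[:5]))

# Regression: the predictions of all trees are averaged
X_reg, y_reg = make_regression(n_samples=500, n_features=10, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_reg, y_reg)
print(reg.predict(X_reg[:5]))
```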

What is feature sampling?

Feature sampling means creating subsets of the dataset based on columns, i.e. each subset contains only some of the columns of the original dataset. The number of rows may or may not be the same as in the original dataset, but usually feature sampling is done in conjunction with instance sampling, so the subsets have both fewer columns and fewer rows.
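
A rough NumPy sketch of feature sampling combined with instance sampling (the array shape and subset sizes are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # original data: 1000 rows, 20 columns

n_rows, n_cols = 800, 5           # assumed subset sizes
row_idx = rng.choice(X.shape[0], size=n_rows, replace=True)   # instance sampling (with replacement)
col_idx = rng.choice(X.shape[1], size=n_cols, replace=False)  # feature sampling (without replacement)

subset = X[np.ix_(row_idx, col_idx)]  # fewer rows and fewer columns than the original
print(subset.shape)                   # (800, 5)
```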

What is the difference between Bagging and Random Forest? Why do we use Random Forest more commonly than Bagging?

Random Forest works similarly to bagging, except that not all features (independent variables) are available to every subset. Secondly, Random Forest works only with Decision Trees, whereas in bagging any algorithm can be used. In bagging, the subsets differ from the original data only in terms of the number of rows, but in Random Forest the subsets differ from the original data in terms of both the number of rows and the number of columns. This makes it an even more optimized way of using Decision Trees than bagging.

Since all the subsets in bagging share the same columns, the models created from them are also correlated. To elaborate: say you have a dataset with p independent variables and one of them, say P5, is quite strong. In the bagging approach this variable will be present in all the subsets and will therefore dominate the learning process of all the models, leading to the creation of correlated models. Random Forest checks this tendency by randomly selecting a subset of columns, so P5 is present in only a few subsets, and during aggregation the dominance caused by it is averaged out. This facilitates better learning and more accurate predictions.

Therefore Random Forest is preferred over bagging when it comes to using an ensemble technique on Decision Trees.
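
To make the comparison concrete, here is a possible scikit-learn sketch (the synthetic dataset and hyperparameters are assumptions for illustration): bagging gives every tree all the columns, while the Random Forest restricts the columns considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, random_state=0)

# Bagging: every tree sees all 25 columns, only the rows differ
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random Forest: each split considers only a random subset of columns, decorrelating the trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("Bagging      :", cross_val_score(bagging, X, y, cv=5).mean())
print("Random Forest:", cross_val_score(forest, X, y, cv=5).mean())
```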

How does a Random Forest model work?

Random Forest works similarly to bagging, except that not all features (independent variables) are available to every subset and Random Forest works only with Decision Trees. In bagging, the subsets differ from the original data only in terms of the number of rows, but in Random Forest the subsets differ from the original data in terms of both the number of rows and the number of columns. This makes it an even more optimized way of using Decision Trees than bagging.

Since all the subsets in bagging share the same columns, the models created from them are also correlated. To elaborate: say you have a dataset with p independent variables and one of them, say P5, is quite strong. In the bagging approach this variable will be present in all the subsets and will therefore dominate the learning process of all the models, leading to the creation of correlated models. Random Forest checks this tendency by randomly selecting a subset of columns, so P5 is present in only a few subsets, and during aggregation the dominance caused by it is averaged out.

The number of columns that a given subset should have in the Random Forest algorithm is a hyperparameter and depends on the use case at hand. Research has shown that √p (the square root of the total number of features) is usually a good choice.
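
In scikit-learn this column budget is exposed as the max_features hyperparameter; here is a hedged sketch of tuning it, where the parameter grid and dataset are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# "sqrt" corresponds to the √p rule of thumb; other values can be tried per use case
param_grid = {"max_features": ["sqrt", "log2", 0.5, None]}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```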

What is Out Of Bag evaluation?

A bootstrap sample typically contains only about 63% of the unique instances of the original dataset, so roughly 37% of the instances do not form part of a given subset and the corresponding model is never trained on them. This makes these instances ideal for validation. Such instances are called Out of Bag instances. When a model is evaluated on its Out of Bag instances, it is termed Out of Bag evaluation.
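
A minimal sketch of Out of Bag evaluation with scikit-learn (the dataset and settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True evaluates each instance only with the trees that never saw it during training
forest = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True, random_state=0)
forest.fit(X, y)
print("Out of Bag accuracy:", forest.oob_score_)
```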

What is pasting? How is it different from bagging?

Pasting is an ensemble technique similar to bagging, except that in pasting sampling is done without replacement, i.e. an observation can appear at most once within a given subset. Since pasting limits the diversity of the models, its performance is suboptimal compared to bagging, particularly in the case of small datasets. However, pasting is preferred over bagging for very large datasets, owing to its computational efficiency.
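
In scikit-learn, pasting corresponds to turning off bootstrap sampling in BaggingClassifier; the dataset and the max_samples value below are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# bootstrap=False -> each subset is drawn without replacement, i.e. pasting
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.5, bootstrap=False, random_state=0)
pasting.fit(X, y)
```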

How does an ensemble technique solve the high variance issue with Decision Trees?

Decision Trees are notorious for overfitting. Bagging exploits this weakness of Decision Trees to its advantage. Each tree is fed a different subset of data, and the trees make predictions which differ significantly from one another. By aggregating their outputs, bagging averages out these polar tendencies and thus significantly improves generalization.
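
One way to see this variance reduction is to compare a single unpruned tree against a bagged ensemble of the same trees; a sketch under assumed data and settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=1)

# A single unpruned tree tends to overfit; bagging many such trees averages out their errors
single_tree = DecisionTreeClassifier(random_state=1)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)

print("Single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```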

How is prediction made in Bagging?

Bagging makes predictions using the following steps:

  1. The bootstrap method is used to create different subsets of the data.
  2. A model is created for each subset.
  3. The output of each model is aggregated to make the final prediction. In classification problems, voting is used as the aggregation method, while in regression problems the mean is used.

It should be noted that in bagging only one algorithm is used to create all the models. The model outputs differ from one another owing to differences in the data fed to them, not because of differences in the algorithm used.
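
The three steps can also be written out by hand; a minimal sketch (the dataset, the number of models and the subset size are assumptions) using decision trees as the single chosen algorithm:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Steps 1 & 2: bootstrap a subset and fit one model (same algorithm) per subset
models = []
for _ in range(25):
    idx = rng.choice(len(X), size=len(X), replace=True)
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: aggregate by voting (for regression, np.mean over the predictions would be used)
all_preds = np.array([m.predict(X[:5]) for m in models])               # (25 models, 5 samples)
final = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print(final)
```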

 

What is bagging?

Bagging is a combination of two words: bootstrap and aggregation. It leverages the benefits of both: bootstrapping creates different subsets of the data, a model is created for each subset, and then the outputs of the models are aggregated to make the final prediction. It should be noted that in bagging only one algorithm is used to create all the models. The model outputs differ from one another owing to differences in the data fed to them, not because of differences in the algorithm used.

In classification problems, voting is used as the aggregation method, while in regression problems the mean is used.
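
For completeness, a regression counterpart in scikit-learn where the mean is the aggregation method (the dataset and settings are assumptions for the example):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)

# The final prediction is the mean of the individual trees' predictions
bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=0).fit(X, y)
print(bag_reg.predict(X[:3]))
```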

What is Bootstrapping? How is sampling done in bootstrapping?

Bootstrapping is a resampling technique in which different subsets of the data are sampled with replacement, i.e. an observation can be present in more than one subset. It is particularly useful for small datasets, where resampling virtually increases the amount of available data.

Sampling in bootstrapping is based on two principles, illustrated in the sketch after the list:

  1. Random selection: each subset is chosen at random. The level of randomness is such that, more often than not, some rows do not make it into a given subset.
  2. Sampling with replacement: an observation can be sampled more than once, i.e. it can appear multiple times within the same subset and across different subsets.
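
A quick NumPy check of both principles (the dataset size is an arbitrary assumption): drawing n indices with replacement leaves roughly a third of the rows out of any single subset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
indices = rng.choice(n, size=n, replace=True)   # one bootstrap subset: random selection with replacement

sampled = np.unique(indices)
print("Rows appearing at least once:", len(sampled))        # roughly 63% of n on average
print("Rows left out of this subset :", n - len(sampled))   # roughly 37% of n on average
```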