January 10, 2021
Random Forest works similarly to bagging, except that not all features (independent variables) are selected in a subset, and that Random Forest works only with decision trees. In bagging the subsets differ from the original data only in the number of rows, but in Random Forest the subsets differ from the original data in both the number of rows and the number of columns. This makes it an even more optimized way of using decision trees than bagging.
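To make the difference concrete, here is a minimal NumPy sketch of the two subsetting schemes; the function names and toy data are mine, purely for illustration. (One caveat: common implementations such as scikit-learn re-draw the column subset at every split within a tree, rather than once per subset as simplified here.)

```python
import numpy as np

rng = np.random.default_rng(42)

def bagging_subset(X, y):
    # Bagging: bootstrap the rows (sampled with replacement), keep every column.
    n = X.shape[0]
    rows = rng.integers(0, n, size=n)
    return X[rows], y[rows]

def random_forest_subset(X, y, n_cols):
    # Random Forest: bootstrap the rows AND draw a random subset of columns.
    n, p = X.shape
    rows = rng.integers(0, n, size=n)
    cols = rng.choice(p, size=n_cols, replace=False)
    return X[np.ix_(rows, cols)], y[rows], cols

X = rng.normal(size=(100, 9))          # toy data: 100 rows, p = 9 features
y = rng.integers(0, 2, size=100)

Xb, yb = bagging_subset(X, y)                        # shape (100, 9): all columns kept
Xr, yr, cols = random_forest_subset(X, y, n_cols=3)  # shape (100, 3): 3 random columns
print(Xb.shape, Xr.shape, cols)
```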
Since all the subsets in bagging share the same columns, the models created from them are also correlated. Let me elaborate: say you have a dataset with p independent variables, and one of them, say P5, is quite strong. In the bagging approach this variable will be present in all the subsets and hence dominate the learning process of all the models, leading to the creation of correlated models. Random Forest checks this tendency by randomly selecting a subset of columns, so P5 is present in only a few of the subsets, and during aggregation its dominating influence is averaged out.
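One way to see this averaging-out is to compare the mean pairwise correlation between the trees of a plain bagging ensemble and those of a Random Forest on data dominated by a single strong feature. A rough scikit-learn sketch, assuming a synthetic dataset; the mean_tree_correlation helper is mine, not a library function:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# One highly informative feature among ten, playing the role of P5.
X, y = make_classification(n_samples=600, n_features=10, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        random_state=0).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, max_features=3,
                            random_state=0).fit(X, y)

def mean_tree_correlation(trees, X):
    # Average pairwise correlation of the individual trees' predictions.
    preds = np.array([t.predict(X) for t in trees])
    c = np.corrcoef(preds)
    return c[np.triu_indices_from(c, k=1)].mean()

print("bagging:      ", mean_tree_correlation(bag.estimators_, X))
print("random forest:", mean_tree_correlation(rf.estimators_, X))
```

With the column restriction (max_features=3 out of 10), the Random Forest trees should come out noticeably less correlated than the bagged trees, which are all free to lean on the strong feature.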
The number of columns that a given subset should have in the Random Forest algorithm is a hyperparameter and depends on the use case at hand. Research has shown that √p is usually the optimum number.
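In scikit-learn, for instance, this hyperparameter is exposed as max_features; setting it to "sqrt" makes each split consider only √p randomly chosen columns. A small usage sketch on a toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# max_features="sqrt" -> each split draws sqrt(16) = 4 candidate columns.
model = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                               random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())
```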
by: Monis Khan