Sunday, October 2, 2022

AI/ML. Advantages of random forest

Random forests are one of the state-of-the-art supervised machine learning methods and achieve good performance in high-dimensional settings where p, the number of predictors, is much larger than n, the number of observations.

Random forests cope well with high-dimensional data because each tree is grown on a bootstrap sample of the observations and each split considers only a random subset of the predictors.
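
As a quick sketch of the p >> n setting on synthetic data (the sizes and parameter values below are illustrative, not from the referenced study):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# p = 1,000 predictors, n = 100 observations (p >> n; sizes are illustrative)
X, y = make_classification(n_samples=100, n_features=1000, n_informative=10, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # mean cross-validated accuracy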


Quick Prediction/Training Speed: Training is fast because each tree considers only a random subset of the features at each split, so the method scales comfortably to hundreds of predictors. Prediction speed is significantly faster than training speed, because the generated forest can be saved and reused for future predictions without retraining.
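
A rough sketch of that training/prediction asymmetry, using scikit-learn on a synthetic dataset (the dataset sizes, n_estimators, and printed timings are illustrative, not benchmarks):

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=100, random_state=0)

# max_features="sqrt" makes each split consider only a random subset of features
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

t0 = time.perf_counter()
clf.fit(X, y)  # the expensive step: the forest is built once
print(f"train:   {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
clf.predict(X)  # reuses the saved forest, so this is far cheaper
print(f"predict: {time.perf_counter() - t0:.2f} s")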


Robust to Outliers and Non-linear Data: Because trees split on thresholds, extreme predictor values simply fall into the outermost regions of the trees, so outliers are effectively binned rather than pulling the fit the way they would in a linear model. For the same reason, non-linear relationships are captured without any feature transformation.
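
A small sketch of the non-linearity point, fitting a synthetic sine-wave target with no feature engineering (all names and values here are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))                    # one predictor
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=500)  # non-linear target

# No transformation of X is needed: threshold splits approximate the curve
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(reg.predict([[np.pi / 2]]))  # close to sin(pi/2) = 1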


Handles Unbalanced Data: Random forest has methods for balancing error across classes when the class populations are unbalanced. By default it tries to minimize the overall error rate, so on an unbalanced data set the larger class gets a low error rate while the smaller class gets a larger one; class weighting during training can compensate, as sketched below.
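
In scikit-learn, one such method is the class_weight parameter; a minimal sketch on a synthetic unbalanced dataset (the 95/5 split is illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Roughly a 95%/5% class split (illustrative imbalance)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# "balanced_subsample" reweights the classes inside each tree's bootstrap
# sample, raising the cost of misclassifying the minority class
clf = RandomForestClassifier(class_weight="balanced_subsample", random_state=0)
clf.fit(X, y)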


Low Bias, Moderate Variance

Each fully grown decision tree has high variance but low bias. Because the random forest averages many de-correlated trees, the bias stays low while the variance shrinks, giving a low-bias, moderate-variance model.
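
A standard way to make the averaging argument precise (this is the usual ensemble-variance identity, not something specific to this post): if each tree's prediction has variance σ² and any two trees have pairwise correlation ρ, then averaging B trees gives

Var(forest) = ρσ² + ((1 − ρ)/B)·σ²

The second term vanishes as B grows, leaving ρσ², so the random feature subsets help precisely because they reduce ρ.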


Parallelizable

Random forests are parallelizable: the trees are grown independently, so the work can be split across multiple cores or machines, which shortens computation time. Boosted models, in contrast, build their trees sequentially and take longer to compute.


Side note: Specifically, in Python's scikit-learn, to parallelize training provide the parameter n_jobs=-1; the -1 is an indication to use all available processor cores. See the scikit-learn documentation for further details.
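
A minimal sketch of that parameter in use (the dataset and the other parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# n_jobs=-1 builds the trees in parallel on all available CPU cores
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))  # prediction is parallelized as well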


References:

https://journals.sagepub.com/doi/full/10.1177/0962280220946080
