Random Forest is one of the most popular and most powerful supervised machine learning algorithms, and an important tool for prediction tasks.
Advantages of Random Forest
- It works on both classification and regression tasks
- It handles missing values well and maintains accuracy even when a large portion of the data is missing
- It is far less prone to overfitting than a single decision tree
- It handles large datasets with many features and delivers high accuracy
Random Forest Pseudocode
- Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement (a bootstrap sample).
- If there are M input variables (features), a number m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m variables is used to split the node. The value of m is held constant while the forest is grown.
- Each tree is grown to the largest extent possible and there is no pruning.
- Predict new data by aggregating the predictions of the n trees (majority vote for classification, average for regression), as sketched in the code below.
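The steps above can be sketched directly in R. This is a minimal illustration, not the full algorithm: it uses rpart as the base learner and, as a simplification, samples the m features once per tree rather than at every split; the helper names (grow_forest, predict_forest) are made up for this sketch.

```r
library(rpart)

# Grow a forest of unpruned trees, each on a bootstrap sample of the N cases
# and a random subset of m of the M features (sampled per tree here as a
# simplification; the real algorithm resamples m features at every split).
grow_forest <- function(X, y, n_trees = 100, m = floor(sqrt(ncol(X)))) {
  lapply(seq_len(n_trees), function(i) {
    rows  <- sample(nrow(X), replace = TRUE)   # bootstrap: N cases, with replacement
    feats <- sample(colnames(X), m)            # m variables chosen out of M
    train <- cbind(X[rows, feats, drop = FALSE], y = y[rows])
    list(tree  = rpart(y ~ ., data = train,    # grown fully, no pruning
                       control = rpart.control(cp = 0, minsplit = 2)),
         feats = feats)
  })
}

# Aggregate by majority vote (classification case).
predict_forest <- function(forest, newX) {
  votes <- sapply(forest, function(f)
    as.character(predict(f$tree, newX[, f$feats, drop = FALSE], type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

set.seed(42)
forest <- grow_forest(iris[, 1:4], iris$Species, n_trees = 50)
predict_forest(forest, iris[c(1, 51, 101), 1:4])  # one example from each species
```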
The more trees in the forest, the more robust the prediction: in a random forest classifier, a higher number of trees generally yields more accurate results. Unlike a single decision tree, which is built once on the full training data using information gain or the Gini index, the random forest approach creates a large number of decision trees, each grown on its own randomized sample as described above. A new observation is fed into all the trees, and the majority vote across them (the most common outcome) is used as the final output.
An error estimate is made for the cases that were not used while building each tree. This is called the OOB (out-of-bag) error estimate and is reported as a percentage.
The R package "randomForest" is used to create random forests.
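As a minimal example of the package in use (classification on R's built-in iris dataset; the exact OOB percentage will vary with the seed and ntree value, which are illustrative choices here):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)                            # summary includes the OOB error estimate
rf$err.rate[rf$ntree, "OOB"]         # OOB error rate after all 500 trees
predict(rf, iris[c(1, 51, 101), ])   # majority-vote predictions for new data
```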
R: Random Forest Model for Regression
A random forest for regression is a modification of bagged decision trees that builds a large collection of de-correlated trees to improve predictive performance.
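A minimal regression sketch with the same package, using R's built-in mtcars dataset (the formula and ntree value are illustrative choices, not a recommended configuration):

```r
library(randomForest)

set.seed(42)
rf_reg <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)
print(rf_reg)                    # reports OOB mean squared error and % variance explained
importance(rf_reg)               # which predictors the trees relied on most
predict(rf_reg, mtcars[1:3, ])   # predictions averaged across the 500 trees
```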