Historical Developments of Random Forest

Jari El
8 min read · Jul 23, 2020


Random Forest is a powerful, widely used algorithm for both classification and regression. While the inner workings of the algorithm are interesting and somewhat complex, they are widely discussed on the internet, in books, and in the classroom. What is not discussed nearly enough is the historical development of this great algorithm. In this blog post, I hope to shed some light on that topic.

Since this algorithm is used predominantly by people looking to derive information and answers from data, the topic will hit home with many. Oftentimes we want to analyze data that comes from various sources and summarize it into useful information, which can then be used to answer different questions depending on the field. This act is called data mining. According to Usama Fayyad, a data scientist and co-founder of the Association for Knowledge Discovery and Data Mining, data mining involves six common classes of tasks: anomaly detection, association rule learning, clustering, classification, regression, and summarization (1996). I will explore the classification task, which organizes data into classes using predetermined class labels (Fawagreh, 2014). Dr. Khaled Fawagreh, a computer engineer at Robert Gordon University (RGU), states, “Classification algorithms normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, also called a classifier. The model is then applied to predict the class labels for the unclassified objects in the testing data” (2014).
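
To make that train-then-predict workflow concrete, here is a minimal sketch in Python using scikit-learn. The synthetic dataset and the choice of a decision tree classifier are my own illustrative assumptions, not something taken from Fawagreh’s paper.

```python
# A minimal "train on labeled data, predict labels for unseen data" sketch.
# The synthetic dataset below is invented purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The classifier (the "model") is learned from the training set...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ...and then applied to predict class labels for the unclassified test objects.
print(model.predict(X_test)[:10])
```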

Further detail is needed in order to understand the historical development of this algorithm. Random Forest is an example of ensemble classification, an application of ensemble learning used to boost the accuracy of classification (Fawagreh, 2014). Ensemble learning is a machine learning paradigm in which multiple models, rather than a single model, are used to solve the same problem (Fawagreh, 2014). In ensemble classification applications like Random Forest, multiple classifiers are combined because the ensemble is more accurate than any of its individual classifiers (Fawagreh, 2014). A voting system within the Random Forest then determines the class label for unlabeled instances (Fawagreh, 2014). A widely used, effective, and simple voting system is majority voting, an idea solidified by two professors, Ching Y. Suen and Louisa Lam (1997). In majority voting, each classifier in the ensemble is asked to predict the class label of the instance under consideration (Suen & Lam, 1997). Once all the classifiers have been queried, the class that receives the most votes is returned as the final decision of the ensemble. There are other voting systems, such as the veto voting scheme, in which one classifier can veto the others. To achieve optimal results, the classifiers in the ensemble should be both accurate and diverse (Fawagreh, 2014). A classifier is accurate if its error rate is better than random guessing, and two classifiers are diverse if they make different errors on new data points (Fawagreh, 2014). The more diverse the classifiers are, the better the results tend to be; it has been shown experimentally that ensembles yield better results when there is a high level of diversity among the ensemble models, as demonstrated in Ludmila Kuncheva and Christopher Whitaker’s paper, Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy, published in the journal Machine Learning in 2003.
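
To make the majority-voting step concrete, here is a minimal sketch; the three classifiers and their votes are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label that receives the most votes from the ensemble."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from three classifiers for a single unlabeled instance.
ensemble_votes = ["yes", "no", "yes"]
print(majority_vote(ensemble_votes))  # -> "yes"
```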

Now that there is a better understanding of classification algorithms and ensemble learning methods, we can get into the development of Random Forest. Several earlier developments helped pave the way for the creation of the Random Forest algorithm. In 1995, Dr. Tin Kam Ho proposed a method to overcome a fundamental limitation on the complexity of decision tree classifiers grown with traditional methods (Fawagreh, 2014). Fawagreh states, “Such classifiers cannot grow to arbitrary complexity without sacrificing the generalization accuracy on unseen data. The proposed method uses oblique decision trees which are convenient for optimizing training set accuracy. The essence of the method is to build multiple trees in randomly selected subspaces of the feature space. The trees generalize their classification in complementary ways, and their combined classification can be monotonically improved” (2014). Next, in 1997, Professor Yali Amit of the University of Chicago and Dr. Donald Geman proposed a shape recognition approach based on the joint induction of shape features and tree classifiers in their article Shape Quantization and Recognition with Randomized Trees (Fawagreh, 2014). Because of the virtually infinite number of features, they concluded that no classifier based on the full feature set could be evaluated, as it was impossible to determine a priori which features were informative (Fawagreh, 2014). Fawagreh states, “Due to the number and nature of features, standard decision tree construction based on fixed-length feature vectors was not feasible. An alternative approach would be to entertain small random sample features at each node, constrain their complexity to increase with tree depth, and grow multiple trees. Terminal nodes contain estimates of the corresponding posterior distribution over shape classes. By sending the image down and aggregating the resulting distribution, the image can be classified.” Although this is outside the scope of this article, it will matter later in the development of Random Forests. In 1998, Dr. Ho proposed a method to address the trade-off between overfitting and achieving maximum accuracy (Fawagreh, 2014). Fawagreh summarizes Dr. Ho’s paper well: “This was done by constructing a decision-tree-based classifier that maintained the highest accuracy on training data and, at the same time, improved on generalization accuracy as it grows in complexity. The classifier consisted of multiple trees constructed systematically by pseudo-randomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. When empirically tested against publicly available data sets, the subspace method proved its superiority when compared to single-tree classifiers and other forest construction methods” (2014).
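
Ho’s random subspace idea can be sketched roughly as follows: each tree is grown using only a randomly chosen subset of the features, and the trees’ predictions are combined by majority vote. This is a simplified illustration of the general idea, not Ho’s exact formulation; the function names and the use of scikit-learn decision trees are my own assumptions.

```python
from collections import Counter
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_subspace_forest(X, y, n_trees=10, subspace_size=None):
    """Grow each tree in a randomly selected subspace of the feature space."""
    n_features = X.shape[1]
    if subspace_size is None:
        subspace_size = max(1, int(np.sqrt(n_features)))
    forest = []
    for _ in range(n_trees):
        # Randomly select the feature subspace for this tree.
        features = rng.choice(n_features, size=subspace_size, replace=False)
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, features], y)
        forest.append((tree, features))
    return forest

def predict_subspace_forest(forest, X):
    """Each tree votes using only its own subspace; the majority wins."""
    votes = np.array([tree.predict(X[:, features]) for tree, features in forest])
    return [Counter(column).most_common(1)[0][0] for column in votes.T]
```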

Now that we have a brief understanding of the research that came before Random Forest, we can examine Random Forests themselves. Again, Random Forest is an ensemble learning method used for classification and regression. It was developed by Professor Leo Breiman in 2001. His method combines his own bagging sampling approach from 1996 with the random selection of features introduced by Dr. Ho in 1995 and 1998 and by Professor Amit and Dr. Geman in 1997, all used to construct a collection of decision trees with controlled variation (Fawagreh, 2014). Using the bagging method, every decision tree in the ensemble is built from a sample drawn with replacement from the training data. Statistically speaking, about 64% of instances are likely to appear at least once in the sample; these are called in-bag instances. The remaining roughly 36% are called out-of-bag instances. Each tree in the ensemble then acts as a base classifier to determine the class label of an unlabeled instance (Fawagreh, 2014). To illustrate Random Forest as well as majority voting, consider the following example: we use a Random Forest to predict the value of the PLAY feature, which determines whether a child can play or not, given predefined values for the other features: Outlook, HWDone, and Weekend. From the training data, three trees are grown; one votes no, the kids cannot play, while two vote yes, they can, so they can play!
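
A quick sketch of the bagging step backs up the in-bag/out-of-bag figures quoted above; the printed fractions hover around 63–64% and 36–37%, with the exact values varying by random seed.

```python
# Bootstrap sampling (bagging): each tree sees a sample drawn with replacement
# from the training data. Instances that appear at least once are "in-bag";
# the rest are "out-of-bag" for that tree.
import numpy as np

rng = np.random.default_rng(0)
n_instances = 10_000

# Draw a bootstrap sample of indices, with replacement.
bootstrap_indices = rng.integers(0, n_instances, size=n_instances)

in_bag = np.unique(bootstrap_indices)
out_of_bag = np.setdiff1d(np.arange(n_instances), in_bag)

print(f"in-bag fraction:     {len(in_bag) / n_instances:.2f}")      # ~0.63
print(f"out-of-bag fraction: {len(out_of_bag) / n_instances:.2f}")  # ~0.37
```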

During the construction of the individual trees in the Random Forest, a certain degree of randomization is also applied when selecting the best node to split on. Typically, the number of features considered at each split is √F, where F is the number of features in the data set (Breiman, 2001). Breiman introduced additional randomness during the construction of the decision trees using the Classification and Regression Trees (CART) technique. Using this technique, the subset of features selected at each interior node is evaluated with the Gini index heuristic (Breiman, 2001); the feature whose split yields the greatest reduction in Gini impurity is chosen as the split feature for that node (Fawagreh, 2014). The Gini index was brought into decision tree learning by Breiman; however, it was first introduced by the Italian statistician Corrado Gini in 1912 (Fawagreh, 2014). The Gini index is a function used to measure the impurity of data; it answers the question, “How uncertain are we that an event will actually occur?” In Breiman’s 2001 paper on Random Forests, it is stated that the error rate of a Random Forest depends on two things: correlation and strength. Increasing the correlation between any two trees increases the forest error rate. A tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate (Breiman, 2001). Fawagreh states, “Such findings seem to be consistent with a study made by Bernard et al. (2010) which showed that the error rate statistically decreases by jointly maximizing the strength and minimizing the correlation.”
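
For reference, the Gini impurity of a set of labels is 1 − Σₖ pₖ², where pₖ is the fraction of instances in class k. Here is a minimal sketch; the example label sets are mine, chosen only to show pure versus mixed nodes.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions. 0 means a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["yes", "yes", "yes", "yes"]))  # 0.0   (pure node)
print(gini_impurity(["yes", "yes", "no", "no"]))    # 0.5   (maximum uncertainty for 2 classes)
print(gini_impurity(["yes", "yes", "yes", "no"]))   # 0.375
```

At each node, the tree looks at the √F randomly chosen features and picks the split that lowers this impurity the most; in scikit-learn’s RandomForestClassifier this corresponds to the default settings criterion="gini" and max_features="sqrt".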

Now that we have an understanding of Random Forest, let us examine its applications in the real world. One of the most prominent application areas is medicine. In 2008, M. Klassen et al. conducted experiments exploring several attribute selection methods with a Random Forest that was able to precisely classify cancer using a published benchmark data set. Fawagreh states, “Hu (2009) applied RF to study the prediction of pathologic complete response in breast cancer. Results showed that the feature selection scheme of RF was able to identify important genes of biological significance.” Another field that is not widely thought of is verbal autopsy. In 2011, A. D. Flaxman introduced a new computer-coded verbal autopsy method using Random Forests to predict the cause of death. This was done by training RF to distinguish between each pair of causes and then combining the results through a novel ranking technique. The new method outperformed physician-certified verbal autopsy and was recommended for analyzing past and current verbal autopsies (Fawagreh, 2014).

Random Forest algorithms are changing the world as we know it: the way medicine is practiced, the way we interact with agriculture, and numerous other aspects of daily living are rapidly changing. The need to understand this algorithm is at an all-time high.

I hope you found this article informative and fun to read.
