Given a dataset and a target variable, there are many methods for finding a mapping (a model) from the dataset to the target variable; in the discipline of machine learning this process is called ‘supervised learning’. There are two types of supervised learning: regression, when the target variable is numerical, and classification, when the target variable is categorical. There are many types of learners, and each learner has its own set of strengths and weaknesses. Some attributes we would like our learner to have are:
Qualities of a Good Learner
- Interpretability: Why did the model choose this result?
- Accuracy: Are the predictions correct?
- Generalization: Do the results generalize to new data?
- Efficiency: How long does it take to train this model?
- Data requirements: Does it require lots of data?
- Robust to Irrelevancy: Do we have to perform feature selection or reduce dimensionality to obtain a good model?
- Robust to Collinearity: Will collinear variables compromise accuracy or interpretability?
- Robust to Scaling: Will scaling or transforming the data affect the results?
Unfortunately, we cannot always get everything we want, so when choosing a learner we must ask what kind of data we have and what questions we are trying to answer.
Decision trees have many of these qualities:
- Interpretability: We can see where splits are made
- Accuracy: Decision trees are accurate on average (low bias) but highly dependent on the particular sample they were trained on (high variance), so an individual tree’s predictions cannot be trusted
- Efficiency: The greedy training algorithm is O(C n log n), where C is the number of features and n is the number of samples
- Robust to Dimensionality: Dimensionality does not compromise accuracy
- Robust to Irrelevancy: Trees perform automatic feature selection based on GINI impurity or information gain. Irrelevant features just aren’t used
- Robust to Collinearity: If two variables are collinear, a tree can split on either one; the model may then report one of the two as unimportant even when they are equally informative
- Robust to Scaling: Scaling and monotonic transformations, e.g. log(x), do not affect where the tree splits when using GINI impurity or information gain
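The scaling claim can be checked directly. Below is a minimal pure-Python sketch of a single GINI-based split finder (the function names are illustrative, not from any library): because a monotonic transform such as log preserves the order of the samples, the best split lands between the same two points either way.

```python
import math

def gini(labels):
    """GINI impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split_index(x, y):
    """Index (between sorted samples) of the split with lowest weighted GINI."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ys = [y[i] for i in order]
    n = len(ys)
    best_i, best_score = None, float("inf")
    for i in range(1, n):  # split between sorted positions i-1 and i
        left, right = ys[:i], ys[i:]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_i, best_score = i, score
    return best_i

x = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
y = [0, 0, 0, 1, 1, 1]

# log() reorders nothing, so the chosen split index is identical.
print(best_split_index(x, y))                         # 3
print(best_split_index([math.log(v) for v in x], y))  # 3
```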
Look at the upper-right target in Fig. 1 and imagine these points were obtained by looking at the predictions $X_i$ of $n$ different trees. Let’s assume that these trees are i.i.d., i.e. independent and identically distributed, with a mean accuracy $p$. This accuracy may be high (low bias), but the spread of the predictions appears to still be wide (high variance). We may calculate this variance by asking what the probability mass function (pmf) of this process looks like. Since we constrained the trees to be independent, we can easily calculate the probability of combinations of successes and failures. For example, the probability that all $n$ trees are correct is

$$P(\text{all } n \text{ correct}) = p^n,$$

while the probability that all $n$ trees are incorrect is

$$P(\text{all } n \text{ incorrect}) = (1 - p)^n;$$

in general, the probability of $k$ trees being correct is

$$P(k \text{ correct}) = \binom{n}{k} p^k (1 - p)^{n-k}.$$

This is a binomial distribution.
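The three probabilities above can be computed with nothing but the standard library; the helper name below is illustrative. Note that the all-correct and all-incorrect cases are just the $k = n$ and $k = 0$ evaluations of the binomial pmf.

```python
from math import comb

def p_k_correct(n, k, p):
    """P(exactly k of n independent trees are correct): the binomial pmf."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, 0.7
print(p_k_correct(n, n, p))  # all correct: p**n, about 0.168
print(p_k_correct(n, 0, p))  # all incorrect: (1-p)**n, about 0.0024
print(sum(p_k_correct(n, k, p) for k in range(n + 1)))  # pmf sums to 1
```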
Not satisfied with a bunch of weak predictors, we ask: what happens if we ensemble our trees by simply averaging their results? On average, for every tree that predicts too far north there should be another tree that predicts too far south, averaging out the errors and reducing our variance.
Below we plot the distribution for different numbers of trees. As can be seen in plot 1, the variance decreases with more trees. An intuitive and desirable result! More trees means we are more confident in our predictions.
We can compute the variance of these distributions outright by multiplying the probability of $k$ estimators being correct (the binomial pmf above) by the squared error when exactly $k$ estimators are correct:

$$\mathrm{Var}(Y) = \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} \left( \frac{k}{n} - p \right)^2. \quad \text{(eq. 3)}$$
Alternatively we could employ the following theorem:
Suppose $X_1, X_2, \ldots, X_n$ are $n$ independent random variables with means $\mu_1, \mu_2, \ldots, \mu_n$ and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$. Then the mean and variance of the linear combination $Y = \sum_{i=1}^{n} a_i X_i$, where $a_1, a_2, \ldots, a_n$ are real constants, are:

$$E[Y] = \sum_{i=1}^{n} a_i \mu_i, \quad \text{(eq. 4)}$$

$$\mathrm{Var}(Y) = \sum_{i=1}^{n} a_i^2 \sigma_i^2. \quad \text{(eq. 5)}$$

Substituting $a_i = 1/n$ and, because the trees are i.i.d., $\mu_i = \mu$ and $\sigma_i^2 = \sigma^2$, gives

$$E[Y] = \mu, \qquad \mathrm{Var}(Y) = \frac{\sigma^2}{n}.$$
So we see that the variance of $Y$ decreases monotonically with the number of trees, $n$. Below we plot eq. 3 and eq. 5 against $n$ and see that they appear qualitatively equivalent. As an exercise, prove that eq. 3 equals eq. 5.
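Before attempting the proof, it is reassuring to check the equality numerically. The sketch below evaluates the direct binomial sum and the closed form $p(1-p)/n$ (the Bernoulli variance $\sigma^2 = p(1-p)$ divided by $n$) for a few values of $n$; the function names are illustrative.

```python
from math import comb

def var_direct(n, p):
    """Direct sum: P(k correct) times the squared error (k/n - p)**2."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * (k / n - p) ** 2
               for k in range(n + 1))

def var_formula(n, p):
    """Closed form with a_i = 1/n and sigma**2 = p(1-p): p(1-p)/n."""
    return p * (1 - p) / n

for n in (1, 10, 100):
    print(n, var_direct(n, 0.7), var_formula(n, 0.7))
```

The two columns agree to floating-point precision, as the exercise claims they must.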
Since we are dealing with a sum of i.i.d. random variables, we can invoke the central limit theorem (CLT). It says that even though the distribution of our random variable $X$ is not normal, the distribution of sample means will be normally distributed. This allows us to compute how confident we are in our prediction:
$$\bar{X} \pm t \, \frac{s}{\sqrt{n}},$$

where $s$ is the Bessel-corrected sample standard deviation and $t$ is the t-score for a given confidence level.
So this is all well and good, but it only works if our initial assumption holds, that is, each tree is i.i.d. How can this be if all trees are grown from the same data using the same rules? It can’t! So what can be done? To de-correlate the trees, we grow each tree on a different subset of the data. Each tree sees a random subset of the samples, a technique called bagging, and a random subset of the features, the random subspace method. A rule of thumb is that the number of features used to grow each tree be equal to the square root of the total number of features. This leads to a dilemma: fewer features may result in high bias, trees that are on average not good, while selecting more features will result in highly correlated trees and high variance.
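The sampling step that de-correlates the trees can be sketched in a few lines; this is a minimal illustration, not any library's actual implementation, and the function name is made up. Each tree would be grown only on the rows and columns its view selects.

```python
import math
import random

def draw_training_view(n_samples, n_features, seed=None):
    """One tree's view of the data: a bootstrap sample of row indices
    (bagging) plus a sqrt-sized random subset of columns (random subspace)."""
    rng = random.Random(seed)
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]  # with replacement
    k = max(1, int(math.sqrt(n_features)))   # sqrt rule of thumb
    cols = rng.sample(range(n_features), k)  # features drawn without replacement
    return rows, cols

# 6 samples, 9 features -> each tree sees 6 bootstrap rows and 3 features.
rows, cols = draw_training_view(n_samples=6, n_features=9, seed=0)
print(rows, cols)
```

Because rows are drawn with replacement, each tree trains on a different multiset of samples; the feature subset then forces different trees to consider different splits, which is what breaks the correlation between them.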
In general, random forests have proven to be a great tool for data scientists. Now, with a little more knowledge about the ideas behind random forests, we can use them to greater effect. In a future blog post we will continue to explore how to tune the hyperparameters of a random forest and how to interpret or “clear-box” the results…