Ensemble Models for S.

My best mate is studying machine learning and found the differences between Random Forests and Gradient Boosting Machines a bit opaque.

I promised to explain it to him as simply as possible, so here is a short post introducing both models, as simply as possible, in code!


Run the code in Google Colab.


Hint: Try the GPU runtime! JAX can calculate the gradients on either the GPU or the CPU.

Preliminaries

We will make one abstraction: I won’t be coding up the decision trees from scratch and will instead use the scikit-learn implementations. This removes the complexities related to growing the trees and lets us focus on understanding the different ways of combining them into ensembles. In a future post, we’ll talk about how trees are grown and pruned, and how feature importance is calculated. I’ll also not cover XGBoost here, which is a gradient boosting algorithm with fancier trees and a few extra clever tricks, but we might code one up in the future.

Bagging and boosting

The essential difference between a Random Forest and a Gradient Boosting machine boils down to the way the models are trained:

  • Random Forests train by bagging. Bagging stands for bootstrap aggregation and means training a number of estimators in parallel on random subsamples of the training data drawn with replacement (that’s the bootstrap part) and then aggregating their predictions (the aggregation part). The aggregation happens by voting, which can be soft (averaging the individual predictions) or hard (the majority vote wins). Bagging reduces variance more than it reduces bias and helps against overfitting. Because the estimators are fitted independently, Random Forests can be parallelised across more than one process, node or thread, and thus trained very efficiently. Random Forests can be used for regression (with regression trees) or for classification (with classification trees). Theoretically, any estimator can be used, not just a tree, but trees are the most common choice.
  • Gradient Boosting is trained by boosting. Here, the estimators are trained sequentially. The model starts with a dummy prediction, which by default is the mean value of the labels. For regression this makes sense: you just start at the mean and improve from there. For classification it seems weird at first, but the algorithm simply starts from a probability of the positive class instead (in our code, the mean of the 0/1 labels). Then a loss function is applied (in our example, mean squared error for regression and binary cross-entropy for classification) and the next estimator is trained on the gradient of the loss with respect to the current prediction (this is where the gradient part happens). The contribution of this next estimator (and every one after it) is scaled by a factor known as the learning rate to prevent overfitting the residuals (this is also called shrinkage); the toy sketch after this list walks through a single such step. Boosting is very effective (ResNets use a similar strategy) but more prone to overfitting than bagging. That’s why a careful choice of hyperparameters is needed and why more complex models like XGBoost introduce additional regularisation parameters.
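Here is that single step in miniature: a toy numerical sketch (not the implementation we will build below), assuming a squared-error loss, a learning rate of 0.1 and a hypothetical tree that predicts the gradient perfectly:

import numpy as np

y = np.array([10.0, 20.0, 30.0])
y_pred = np.full(3, y.mean())  # dummy prediction: [20. 20. 20.]
gradient = 2 * (y_pred - y)    # gradient of the squared error: [20. 0. -20.]
# the next tree would be trained to predict this gradient; pretend it does so perfectly:
y_pred -= 0.1 * gradient       # [18. 20. 22.] -- 20% of the way to the targets

Every subsequent step shrinks the remaining error by the same fraction, which is exactly the sequential, gradually-correcting behaviour described above.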

Let’s therefore summarise quickly what the main differences between our models are:

|  | Bagging (Random Forest) | Boosting (Gradient Boosting) |
| --- | --- | --- |
| Training strategy | Parallel | Sequential |
| Estimators trained on | Bootstrapped subsamples of the dataset | Gradient of the loss of the previous estimator w.r.t. its prediction |
| All estimators equally important | Yes | No (the learning rate scales successive contributions) |
| Prediction is made by | Voting | Weighted sum of individual predictions (scaled by the learning rate) |
| Prone to overfitting | Less | More |
| Easy to parallelise | Very | Not trivially |
| Can use any estimator as base | Yes | Yes |

The ingredients

As you will see, the ingredients for creating the two algorithms are extremely simple, and we will mostly code things up from scratch. We’ll use the scikit-learn implementations as benchmarks, alongside its convenience functions for loading two datasets (one for regression, one for classification) and for splitting data, two of its metrics (accuracy for classification and mean absolute error for regression), numpy, the Counter class for counting stuff (😆), and the wonderful JAX for automatic differentiation.

Library imports

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, swap in another regression dataset such as
# sklearn.datasets.fetch_california_housing
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error
from collections import Counter
import numpy as np

np.set_printoptions(precision=3)
import jax
import jax.numpy as jnp
from typing import Tuple, Union, List

Datasets

Let’s load two datasets to play with, one for classification and one for regression:

Classification Dataset

X_class, y_class = load_breast_cancer(return_X_y=True)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_class, y_class)

Regression Dataset

X_reg, y_reg = load_boston(return_X_y=True)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg)
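As a quick sanity check of what we just loaded (the shapes assume train_test_split’s default 25% test fraction):

print(X_train_c.shape, X_test_c.shape)  # (426, 30) (143, 30)
print(X_train_r.shape, X_test_r.shape)  # (379, 13) (127, 13)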

Random Forest

Our Random Forest takes three arguments plus a catch-all: n_estimators, which is how many trees to build; subsample, the fraction of the training data used for training each tree; and regression, for switching between a regression and a classification Random Forest (internally, this switches the kind of tree). Any other keyword arguments are passed on to the DecisionTree initialiser and are not required (i.e. abstracted away here).

class RandomForest:
    def __init__(
        self,
        n_estimators: int = 100,
        subsample: float = 0.1,
        regression: bool = True,
        **kwargs
    ) -> None:
        self.n_estimators = n_estimators
        self.estimators = []  # type:ignore
        self.subsample = subsample
        self.regression = regression

        for _ in range(self.n_estimators):
            if regression:
                self.estimators.append(DecisionTreeRegressor(**kwargs))
            else:
                self.estimators.append(DecisionTreeClassifier(**kwargs))

    def bootstrap_sample(self, X: np.array, y: np.array) -> Tuple[np.array, ...]:
        # draw a random subsample of the rows *with replacement*
        n_samples = X.shape[0]
        idxs = np.random.choice(
            n_samples, size=int(self.subsample * n_samples), replace=True
        )
        return X[idxs], y[idxs]

    def fit(self, X: np.array, y: np.array) -> None:
        # every tree gets its own bootstrapped sample -- this is the bagging part
        for estimator in self.estimators:
            X_sample, y_sample = self.bootstrap_sample(X, y)
            estimator.fit(X_sample, y_sample)

    def predict(self, X: np.array) -> np.array:
        preds = np.array([estimator.predict(X) for estimator in self.estimators])
        if self.regression:
            # soft voting: average the individual trees' predictions
            return preds.mean(axis=0)
        # hard voting: the majority class across the trees wins
        preds = np.swapaxes(preds, 0, 1)
        return np.array([self._most_common_pred(pred) for pred in preds])

    def _most_common_pred(self, y: np.array) -> Union[float, int]:
        return Counter(y).most_common(1)[0][0]
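To get a feeling for what bootstrap_sample does, here is a tiny standalone illustration (the printed values are hypothetical): sampling indices with replacement typically picks some rows several times and leaves others out entirely.

idxs = np.random.choice(10, size=8, replace=True)
print(idxs)  # e.g. [3 7 7 0 9 1 3 5] -- duplicates appear, some rows never show up

This per-tree randomness is what de-correlates the estimators and makes the aggregated vote robust.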

Let’s try it out!

Random Forest: Classification Task

rfc = RandomForest(regression=False)
rfc.fit(X_train_c, y_train_c)
print(accuracy_score(rfc.predict(X_test_c), y_test_c).round(3))
> 0.937

Random Forest: Regression Task

rfr = RandomForest(regression=True)
rfr.fit(X_train_r, y_train_r)
print(mean_absolute_error(y_test_r, rfr.predict(X_test_r)).round(3))
> 3.364

For comparison, the scikit-learn implementation (without any tweaks) achieved the exact same score for classification and only a slightly better score for regression (2.44 MAE vs. 3.36; since the Boston Housing target is expressed in units of 1,000 dollars, that difference amounts to less than 1,000 dollars on the predicted price of a house).
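The benchmark was presumably something along these lines (a sketch with default hyperparameters; the exact settings behind the quoted scores are my assumption):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

skl_rfc = RandomForestClassifier()
skl_rfc.fit(X_train_c, y_train_c)
print(accuracy_score(skl_rfc.predict(X_test_c), y_test_c).round(3))

skl_rfr = RandomForestRegressor()
skl_rfr.fit(X_train_r, y_train_r)
print(mean_absolute_error(y_test_r, skl_rfr.predict(X_test_r)).round(3))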

Gradient Boosting

Our Gradient Boosting machine also takes three arguments plus a catch-all for the trees. n_estimators is the number of trees to build, and learning_rate scales the contribution of each successive tree (the default value is 0.1, i.e. each tree’s correction only counts for 10% before being added to the running prediction). The regression argument does nothing to the kind of tree that is used! As you will see, Gradient Boosting always uses regression trees, since it is fitting a regression model on the gradient. The only thing this argument does is change the loss function from MSE to binary cross-entropy and the output to a class label instead of a predicted regression value.

This model is a little bit more complex, so let’s go through it step by step.

Loss functions

We’ll first build two loss functions, MSE for regression and binary cross-entropy for classification. MSE is very straightforward (strictly speaking, we sum rather than average the squared errors, which only scales the gradient by a constant factor that the learning rate absorbs):

def MSE(y_true: jnp.array, y_pred: jnp.array):
    # sum of squared errors; the gradient w.r.t. y_pred is 2 * (y_pred - y_true),
    # i.e. proportional to the negative residual of every sample
    return jnp.sum(jnp.square(y_true - y_pred))

Cross-entropy is a little more involved, since we are using logs, which can freak out if we pass them zeros or negative values. Therefore, we use an epsilon value and clip the values of the argument passed as the probability:

def CrossEntropy(y_true: jnp.array, y_proba: jnp.array):
    # keep the probabilities strictly inside (0, 1) so the logs stay finite
    y_proba = jnp.clip(y_proba, 1e-5, 1 - 1e-5)
    return jnp.sum(-y_true * jnp.log(y_proba) - (1 - y_true) * jnp.log(1 - y_proba))
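To see why the clipping is necessary, look at what an unclipped zero would do to the log:

print(jnp.log(0.0))                            # -inf -- would poison the loss and its gradient
print(jnp.log(jnp.clip(0.0, 1e-5, 1 - 1e-5)))  # about -11.5 -- large, but finite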

Notice that in both cases we specify the operations not on a regular np.array but on a jnp.array. If we wanted, we could have imported jax.numpy as np from the start, but I wanted to show how smoothly and seamlessly JAX integrates with numpy! The gradients are calculated on the fly for us, and we don’t even need to worry about them.
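As a quick sanity check (a toy example, not part of the model), the gradient JAX hands back for our MSE is exactly the formula from the comment above, twice the negative residual:

grad_fn = jax.grad(MSE, argnums=1)  # differentiate w.r.t. the second argument, y_pred
y_true = jnp.array([3.0, 5.0])
y_pred = jnp.array([2.0, 6.0])
print(grad_fn(y_true, y_pred))      # [-2.  2.] == -2 * (y_true - y_pred)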

The engine room

The initialiser of the Gradient Boosting algorithm looks a lot like the Random Forest one: it chooses the appropriate loss function and instantiates the estimators (the full class is listed at the end of the post).

Let’s now look at the fit function in detail:

    def fit(self, X: np.array, y: np.array):
        # dummy prediction: every sample starts at the mean of the labels
        y_pred = np.full(np.shape(y), np.mean(y))
        for i, estimator in enumerate(self.estimators):
            # gradient of the loss w.r.t. the current prediction (argnums=1)
            gradient = jax.grad(self.loss, argnums=1)(
                y.astype(np.float32), y_pred.astype(np.float32)
            )
            # the next tree learns to predict this gradient ...
            self.estimators[i].fit(X, gradient)
            update = self.estimators[i].predict(X)
            # ... and its scaled contribution nudges the prediction downhill
            y_pred -= self.learning_rate * update

The first strange thing is that y_pred gets assigned before a single estimator has done anything. This is the dummy prediction phase of the algorithm, where it just picks the mean of the label vector. np.full is a handy function which simply fills up an array with a placeholder value.

Then the algorithm evaluates the loss of this dummy prediction against the true labels and takes the gradient of the loss function with respect to the dummy prediction. Again, JAX smoothly takes the gradient. For safety, we coerce the types to float32, since in the classification scenario the labels are integers, which JAX cannot differentiate with respect to.

After this, the algorithm fits the next estimator on the gradient of the loss, lets that estimator predict on the training set, and finally updates the y_pred vector with the new values (a bit like gradient descent, but on the prediction rather than a weight matrix!). Since the gradient of our MSE is just the (scaled) negative residual, each new tree effectively learns whatever error is left to correct. Observe that self.estimators[i].predict(X) is not the class predict, but the DecisionTreeRegressor.predict function!

For the predict phase, things happen in almost the same way. Each fitted estimator predicts on the test data and its contribution, scaled by the learning rate, is accumulated. Note one simplification: we accumulate from zero here, whereas fit started from the mean of the labels, so the regression predictions end up offset by that mean (more on this below):

    def predict(self, X: np.array):
        # note: we accumulate from zero here, not from the mean used in fit
        y_pred = np.zeros(X.shape[0], dtype=np.float32)
        for estimator in self.estimators:
            y_pred -= self.learning_rate * estimator.predict(X)

        if not self.regression:
            # squash through a sigmoid and threshold at 0.5 to get class labels
            return np.where(1 / (1 + np.exp(-y_pred)) > 0.5, 1, 0)
        return y_pred

Gradient Boosting: Classification Task

gbc = GradientBoosting(regression=False)
gbc.fit(X_train_c, y_train_c)
print(accuracy_score(gbc.predict(X_test_c), y_test_c).round(3))
> 0.923

Gradient Boosting: Regression Task

gbr = GradientBoosting(regression=True)
gbr.fit(X_train_r, y_train_r)
print(mean_absolute_error(gbr.predict(X_test_r), y_test_r).round(3))
> 22.78

For comparison, the scikit-learn implementation achieved about four percentage points more for classification (0.964 vs. 0.923), but was dramatically better (2.6 vs. 22.8) in regression. The regression gap is largely of our own making: since predict accumulates from zero instead of from the mean of the training labels that fit started at, every prediction is offset by roughly that mean (around 22.5 for the Boston data). On top of that, the scikit-learn implementation is simply more sophisticated than our bare-bones version.
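Again, the scikit-learn benchmark was presumably something like this (default hyperparameters assumed):

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

skl_gbc = GradientBoostingClassifier()
skl_gbc.fit(X_train_c, y_train_c)
print(accuracy_score(skl_gbc.predict(X_test_c), y_test_c).round(3))

skl_gbr = GradientBoostingRegressor()
skl_gbr.fit(X_train_r, y_train_r)
print(mean_absolute_error(y_test_r, skl_gbr.predict(X_test_r)).round(3))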

Here’s the final implementation for the Gradient Boosting model:

class GradientBoosting:
    def __init__(
        self,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
        regression: bool = True,
        **kwargs
    ):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.regression = regression
        # the loss decides what the trees are fit on; the trees themselves
        # are always regression trees
        self.loss = MSE if self.regression else CrossEntropy

        self.estimators = []  # type:ignore
        for _ in range(self.n_estimators):
            self.estimators.append(DecisionTreeRegressor(**kwargs))

    def fit(self, X: np.array, y: np.array):
        # dummy prediction: every sample starts at the mean of the labels
        y_pred = np.full(np.shape(y), np.mean(y))
        for i, estimator in enumerate(self.estimators):
            # gradient of the loss w.r.t. the current prediction (argnums=1)
            gradient = jax.grad(self.loss, argnums=1)(
                y.astype(np.float32), y_pred.astype(np.float32)
            )
            # the next tree learns to predict this gradient ...
            self.estimators[i].fit(X, gradient)
            update = self.estimators[i].predict(X)
            # ... and its scaled contribution nudges the prediction downhill
            y_pred -= self.learning_rate * update

    def predict(self, X: np.array):
        # note: we accumulate from zero here, not from the mean used in fit
        y_pred = np.zeros(X.shape[0], dtype=np.float32)
        for estimator in self.estimators:
            y_pred -= self.learning_rate * estimator.predict(X)

        if not self.regression:
            # squash through a sigmoid and threshold at 0.5 to get class labels
            return np.where(1 / (1 + np.exp(-y_pred)) > 0.5, 1, 0)
        return y_pred

Conclusion

As you can see (you know who you are!), neither Random Forests nor Gradient Boosting is very complicated once the fundamental differences become clear! We also learned about JAX and how it can be used to easily combine numpy functions with powerful automatic differentiation.

