Ensemble Models for S.
My best mate is studying machine learning and found the differences between Random Forests and Gradient Boosting Machines a bit opaque. I promised him I'd explain them as simply as possible, so here is a short post explaining both models in code!
Hint: Try the GPU runtime! JAX can calculate the gradients on the GPU as well as the CPU!
We will make one abstraction: I won't be coding up the decision trees from scratch and will instead use the scikit-learn implementation. This removes the complexities of growing the trees and lets us focus on the different ways the ensembles are built. In a future post, we'll talk about how trees are grown and pruned, and how feature importance is calculated. I'll also not explain XGBoost here, which is a gradient boosting algorithm with fancier trees and a few extra clever tricks, but we might code one up in the future.
Bagging and boosting
The essential difference between a Random Forest and a Gradient Boosting machine boils down to the way the models get trained:
- Random Forests train by bagging. Bagging stands for bootstrap aggregation: a number of estimators are trained in parallel on random subsamples of the training data drawn with replacement (that's the bootstrap part), and their predictions are then aggregated (the aggregation part). For classification, the aggregation is done by voting, which can be soft (averaging the predicted probabilities) or hard (the majority vote wins). Bagging reduces variance more than it reduces bias, and helps against overfitting. Because the estimators are fitted independently, Random Forests can be parallelised across multiple processes or threads and thus trained more efficiently. They can be used for regression (with regression trees) or for classification (with classification trees). In theory, any estimator can be used as the base, not just a tree, but trees are the most common choice.
- Gradient Boosting is trained by boosting. Here, the estimators are trained successively. First, the model makes a constant dummy prediction on the data, which by default is the mean value of the labels. For regression, this makes sense: you start at the mean and go from there. For classification, this seems weird at first, but the algorithm simply predicts a class probability (by default 0.5). Then a loss function is applied (in our example, mean squared error for regression and binary cross-entropy for classification), and the next estimator is trained on the gradient of the loss with respect to the current prediction (this is where the gradient part happens). The contribution of each estimator is scaled by a factor known as the learning rate to prevent overfitting the residuals (this is also called shrinkage). Gradient Boosting is very effective (boosting is a great technique, and ResNets use a similar residual-learning strategy), but it is more prone to overfitting than bagging. That's why careful hyperparameter choice is needed, and more complex models like XGBoost introduce additional regularisation parameters.
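To make the aggregation step of bagging concrete, here is a tiny sketch with made-up predictions from three classifiers on four samples, showing hard vs. soft voting:

```python
import numpy as np
from collections import Counter

# Hypothetical class predictions of three classifiers (rows) on four samples (columns).
hard_preds = np.array([[1, 0, 1, 1],
                       [1, 1, 0, 1],
                       [0, 0, 1, 1]])

# Hard voting: the majority class per sample (column) wins.
hard_vote = np.array([Counter(col).most_common(1)[0][0] for col in hard_preds.T])

# Soft voting: average the predicted probabilities, then threshold at 0.5.
proba_preds = np.array([[0.9, 0.4, 0.6, 0.8],
                        [0.7, 0.6, 0.3, 0.9],
                        [0.2, 0.1, 0.7, 0.6]])
soft_vote = (proba_preds.mean(axis=0) > 0.5).astype(int)

print(hard_vote)  # [1 0 1 1]
print(soft_vote)  # [1 0 1 1]
```

Here both schemes agree, but with closer probabilities they can differ: soft voting lets a very confident classifier outvote two lukewarm ones.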
Let's quickly summarise the main differences between our two models:
|  | Bagging (Random Forest) | Boosting (Gradient Boosting) |
|---|---|---|
| Estimators trained on | Bootstrapped subsample of the dataset | Gradient of the loss of the previous estimator w.r.t. its prediction |
| All estimators equally important | Yes | No (the learning rate scales successive contributions) |
| Prediction is made by | Voting | Weighted sum of individual predictions (scaled by the learning rate) |
| Prone to overfitting | Less | More |
| Easy to parallelise | Very | Not trivially |
| Can use any estimator as base | Yes | Yes |
As you will see, the ingredients for creating the two algorithms are extremely simple, and we will mostly code things up from scratch. We'll use the scikit-learn implementations as benchmarks, alongside some convenience functions: loaders for two datasets (one for regression, one for classification), a train/test splitter, two metrics (accuracy for classification and mean absolute error for regression), the Counter class for counting stuff (😆) and the wonderful JAX for automatic differentiation.
```python
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.datasets import load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error
from collections import Counter
from typing import Tuple, Union, List

import numpy as np
import jax
import jax.numpy as jnp

np.set_printoptions(precision=3)
```
Let's create two datasets to play with, one for regression and one for classification:
```python
X_class, y_class = load_breast_cancer(return_X_y=True)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_class, y_class)

# NB: load_boston has been removed from recent scikit-learn releases.
X_reg, y_reg = load_boston(return_X_y=True)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg)
```
Our Random Forest takes three named arguments: `n_estimators`, which is how many trees to build; `subsample`, the fraction of the training data used to train each tree; and `regression`, which switches between a regression and a classification Random Forest (internally switching the kind of tree). Other keyword arguments are passed to the `DecisionTree` initialiser and are not required (i.e. abstracted away here).
```python
class RandomForest:
    def __init__(
        self,
        n_estimators: int = 100,
        subsample: float = 0.1,
        regression: bool = True,
        **kwargs
    ) -> None:
        self.n_estimators = n_estimators
        self.estimators: List = []
        self.subsample = subsample
        for _ in range(self.n_estimators):
            if regression:
                self.estimators.append(DecisionTreeRegressor(**kwargs))
            else:
                self.estimators.append(DecisionTreeClassifier(**kwargs))

    def bootstrap_sample(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, ...]:
        n_samples = X.shape[0]
        idxs = np.random.choice(
            n_samples, size=int(self.subsample * n_samples), replace=True
        )
        return X[idxs], y[idxs]

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        for estimator in self.estimators:
            X_sample, y_sample = self.bootstrap_sample(X, y)
            estimator.fit(X_sample, y_sample)

    def predict(self, X: np.ndarray) -> np.ndarray:
        preds = np.array([estimator.predict(X) for estimator in self.estimators])
        preds = np.swapaxes(preds, 0, 1)  # shape: (n_samples, n_estimators)
        return np.array([self._most_common_pred(pred) for pred in preds])

    def _most_common_pred(self, y: np.ndarray) -> Union[float, int]:
        return Counter(y).most_common(1)[0][0]
```
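The `bootstrap_sample` method is where the "bootstrap" in bagging happens. A standalone sketch (with a hypothetical ten-row dataset standing in for the training set) shows what drawing with replacement does:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for ten training rows

# Draw a full-size sample with replacement, like bootstrap_sample does:
idxs = rng.choice(len(data), size=len(data), replace=True)
sample = data[idxs]

# With replacement, rows can repeat and some rows are left out entirely;
# on average only ~63% of distinct rows appear in a full-size bootstrap sample.
print(sample, np.unique(sample).size)
```

Those left-out rows are what scikit-learn calls "out-of-bag" samples, which can double as a free validation set.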
Let’s try her out!
Random Forest: Classification Task
```python
rfc = RandomForest(regression=False)
rfc.fit(X_train_c, y_train_c)
print(accuracy_score(rfc.predict(X_test_c), y_test_c).round(3))
```

> 0.937
Random Forest: Regression Task
```python
rfr = RandomForest(regression=True)
rfr.fit(X_train_r, y_train_r)
print(mean_absolute_error(y_test_r, rfr.predict(X_test_r)).round(3))
```

> 3.364
For comparison, the scikit-learn implementation (without any tweaks) achieved the exact same score for classification and only a slightly better score for regression (2.44 MAE vs. 3.36; since the Boston Housing targets are in thousands of dollars, that's a difference of less than 1000 dollars on the predicted price of a house).
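For reference, that kind of benchmark takes only a few lines. This sketch uses the breast cancer dataset with a fixed `random_state` (an assumption on my part; the post's own splits are random, so exact scores will differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The library baseline: same bagging idea, plus per-split feature subsampling.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)).round(3))
```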
Our Gradient Boosting machine also takes three named arguments (further keyword arguments are again passed to the tree initialiser). `n_estimators` is the number of trees to build, and `learning_rate` scales the contribution of each tree (the default value is 0.1, i.e. each tree only contributes 10% of its raw prediction to the running total). The `regression` argument does not change the kind of tree that is used! As you will see, Gradient Boosting always uses regression trees, since it fits a regression model to the gradient. The only things this argument changes are the loss function (from MSE to binary cross-entropy) and the output (a class prediction derived from a probability instead of a regression value).
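The effect of the learning rate can be seen with an idealised, made-up one-sample example: assume every "tree" fits the gradient of the squared error exactly; then each boosting round removes a fixed fraction of the remaining residual:

```python
# Idealised shrinkage: a single target, and trees that fit the gradient exactly.
y_true, y_pred, lr = 10.0, 0.0, 0.1

residuals = []
for _ in range(20):
    grad = -2.0 * (y_true - y_pred)  # d/dp of (y_true - p)**2
    y_pred -= lr * grad              # the boosting update
    residuals.append(y_true - y_pred)

# Each round multiplies the residual by (1 - 2*lr) = 0.8,
# so after 20 rounds it is 10 * 0.8**20 ≈ 0.115.
print(residuals[-1])
```

A smaller learning rate means slower progress per tree but a smoother, less overfit fit overall, which is why shrinkage is usually paired with more estimators.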
This model is a little bit more complex, so let’s go through it step by step.
We'll first build two loss functions, MSE for regression and binary cross-entropy for classification. MSE is very straightforward (strictly speaking we sum rather than average the squared errors, but that only scales the gradient by a constant factor):

```python
def MSE(y_true: jnp.ndarray, y_pred: jnp.ndarray):
    # Summed squared error; the sum keeps the per-sample gradients simple.
    return jnp.sum(jnp.square(y_true - y_pred))
```
Cross-entropy is a little more involved, since we are using logs, which can freak out if we pass them negative values or zeros. Therefore, we use an epsilon value and clip the values of the argument passed as the probability:
```python
def CrossEntropy(y_true: jnp.ndarray, y_proba: jnp.ndarray):
    # Clip so we never take the log of 0 (or of a negative number).
    y_proba = jnp.clip(y_proba, 1e-5, 1 - 1e-5)
    return jnp.sum(-y_true * jnp.log(y_proba) - (1 - y_true) * jnp.log(1 - y_proba))
```
Notice that in both cases we specify the operations using not a regular np.array but a jnp.array. If we wanted, we could have just imported jax.numpy as np from the start, but I wanted to show how smoothly and seamlessly JAX integrates with numpy! The gradients are calculated on the fly for us and we don't even need to worry about them.
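A minimal check of that seamlessness: take a summed squared error like the one above and let `jax.grad` differentiate it with respect to the prediction. The result matches the analytic gradient 2 * (y_pred - y_true):

```python
import jax
import jax.numpy as jnp

def sse(y_true, y_pred):
    return jnp.sum(jnp.square(y_true - y_pred))

y_true = jnp.array([1.0, 2.0, 3.0])
y_pred = jnp.array([0.5, 2.0, 4.0])

# argnums=1 differentiates w.r.t. the second argument (the prediction).
grad = jax.grad(sse, argnums=1)(y_true, y_pred)
print(grad)  # analytic: 2 * (y_pred - y_true) = [-1.  0.  2.]
```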
The engine room
The initialiser of the Gradient Boosting algorithm looks a lot like the Random Forest one. It chooses the appropriate loss function and fires off the estimators.
Let’s now look at the fit function in detail:
```python
def fit(self, X: np.ndarray, y: np.ndarray):
    y_pred = np.full(np.shape(y), np.mean(y))
    for i, estimator in enumerate(self.estimators):
        gradient = jax.grad(self.loss, argnums=1)(
            y.astype(np.float32), y_pred.astype(np.float32)
        )
        self.estimators[i].fit(X, gradient)
        update = self.estimators[i].predict(X)
        y_pred -= self.learning_rate * update
```
The first strange thing is that y_pred gets assigned before a single estimator has done anything. This is the dummy prediction phase of the algorithm, where it just picks the mean of the label vector. np.full is a handy function that simply fills an array with a placeholder value.
Then the algorithm calculates the loss of this dummy prediction with respect to the true labels, and the gradient of the loss function with respect to the dummy prediction. Again, JAX smoothly takes the gradient. For safety, we coerce the type into a float, since in the classification scenario, the labels are integers, which obviously have no gradient.
After this, the algorithm fits the next estimator on the gradient of the loss, makes a prediction on the training set with that freshly fitted estimator, and finally updates the y_pred vector with the new values (a bit like gradient descent, but on the prediction, not on a weight matrix!). Observe that self.estimators[i].predict(X) is not the class's own predict method, but the predict of the individual scikit-learn tree.
For the predict phase, things happen the same way. Each fitted estimator predicts the test data and its contribution, scaled by the learning rate, gets considered:
```python
def predict(self, X: np.ndarray):
    y_pred = np.zeros(X.shape[0], dtype=np.float32)
    for estimator in self.estimators:
        y_pred -= self.learning_rate * estimator.predict(X)
    if not self.regression:
        # Squash the raw score through a sigmoid and threshold at 0.5.
        return np.where(1 / (1 + np.exp(-y_pred)) > 0.5, 1, 0)
    # NB: the initial mean prediction from fit is not added back here.
    return y_pred
```
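That final np.where line is just a sigmoid followed by a 0.5 threshold. In isolation, with made-up raw scores:

```python
import numpy as np

scores = np.array([-2.0, -0.1, 0.3, 4.0])  # hypothetical raw boosting outputs
probs = 1 / (1 + np.exp(-scores))          # sigmoid maps scores to (0, 1)
labels = np.where(probs > 0.5, 1, 0)       # threshold at probability 0.5
print(labels)  # [0 0 1 1]
```

Since the sigmoid crosses 0.5 exactly at a raw score of 0, thresholding the probability at 0.5 is equivalent to thresholding the raw score at 0.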
Gradient Boosting: Classification Task
```python
gbc = GradientBoosting(regression=False)
gbc.fit(X_train_c, y_train_c)
print(accuracy_score(gbc.predict(X_test_c), y_test_c).round(3))
```

> 0.923
Gradient Boosting: Regression Task
```python
gbr = GradientBoosting(regression=True)
gbr.fit(X_train_r, y_train_r)
print(mean_absolute_error(gbr.predict(X_test_r), y_test_r).round(3))
```

> 22.78
For comparison, the scikit-learn implementation achieved a noticeably higher classification accuracy (0.964 vs. 0.923), and was quite a bit better in regression (2.6 vs. 22.8 MAE). That said, the scikit-learn implementation is not only more sophisticated than ours; there is also probably some overfitting going on.
Here’s the final implementation for the Gradient Boosting model:
```python
class GradientBoosting:
    def __init__(
        self,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
        regression: bool = True,
        **kwargs
    ):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.regression = regression
        self.loss = MSE if self.regression else CrossEntropy
        self.estimators: List = []
        for _ in range(self.n_estimators):
            # Always regression trees: we fit them to a continuous gradient.
            self.estimators.append(DecisionTreeRegressor(**kwargs))

    def fit(self, X: np.ndarray, y: np.ndarray):
        y_pred = np.full(np.shape(y), np.mean(y))
        for i, estimator in enumerate(self.estimators):
            gradient = jax.grad(self.loss, argnums=1)(
                y.astype(np.float32), y_pred.astype(np.float32)
            )
            self.estimators[i].fit(X, gradient)
            update = self.estimators[i].predict(X)
            y_pred -= self.learning_rate * update

    def predict(self, X: np.ndarray):
        y_pred = np.zeros(X.shape[0], dtype=np.float32)
        for estimator in self.estimators:
            y_pred -= self.learning_rate * estimator.predict(X)
        if not self.regression:
            return np.where(1 / (1 + np.exp(-y_pred)) > 0.5, 1, 0)
        # NB: the initial mean prediction from fit is not added back here.
        return y_pred
```
As you can see (you know who you are!), neither Random Forests nor Gradient Boosting is very complicated once the fundamental differences become clear!
We also learned about JAX and how it can be used to easily combine
numpy functions with powerful automatic differentiation.
References and further reading
Some code and general concepts reused from here under MIT license terms.
- Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media, 2019.