# Ensemble Models for S.

My best mate is studying machine learning and found the differences between Random Forests and Gradient Boosting Machines a bit opaque.

I promised to explain them as simply as possible, so here is a short post breaking down both models in code!

*Hint: Try the GPU runtime! JAX will calculate the gradients using the GPU or the CPU!*

## Preliminaries

We will make one abstraction: I won't be coding up the decision trees from scratch and will instead use the `scikit-learn` implementation. This removes the complexities related to growing the trees and lets us focus on understanding the different ways of creating the models. In a future post, we'll talk about how trees are grown and pruned, and how feature importance is calculated. I'll also not explain *XGBoost* here, which is a gradient boosting algorithm with fancier trees and a few extra clever tricks, but we might code one up in the future.

## Bagging and boosting

The essential difference between a Random Forest and a Gradient Boosting machine boils down to the way the models are trained:

- Random Forests train by *bagging*. Bagging stands for *bootstrap aggregation* and means training a number of estimators **in parallel** on random subsamples of the training data drawn with replacement (that's the bootstrap part) and then aggregating their predictions (the aggregation part). The predictions of the individual classifiers are aggregated by voting, which can be soft (averaging predicted probabilities) or hard (majority vote wins). Bagging reduces variance more than it reduces bias, and helps against overfitting. Because the estimators are fitted independently, Random Forests can be parallelised across multiple processes, nodes, or threads, and thus trained more efficiently. Random Forests can be used for regression (with regression trees) or for classification (with classification trees). In theory, any estimator can be used, not just a tree, but trees are most common.
- Gradient Boosting is trained by *boosting*. Here, the estimators are trained **successively**. At first, the model makes a dummy prediction on the data, which by default is the mean value of the labels. For regression, this makes sense: you just start at the mean of the values and go from there. But for classification, this seems weird! Indeed, the algorithm makes a prediction of the *probability* of the class (by default 0.5). Then, some loss function is applied (in our example we will use mean squared error for regression and binary cross-entropy for classification) and the next estimator is trained on the gradient of the loss with respect to the prediction (this is where the gradient part happens). The contribution of this next estimator (and every one after it) is scaled by a factor known as the learning rate to prevent overfitting the residuals (this is also called *shrinkage*). Gradient Boosting is very effective (boosting is a great technique, and ResNets use a similar strategy), but more prone to overfitting than bagging. That's why careful choice of the hyperparameters is needed, and more complex models like XGBoost introduce additional regularisation parameters.
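To make the two training strategies concrete, here is a deliberately tiny sketch (my own toy example, not part of either algorithm) in which each "estimator" is nothing but a constant predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# Bagging: fit many "estimators" (here: just the mean of a bootstrap
# sample) independently, then aggregate their predictions by averaging.
bagged_preds = [rng.choice(y, size=y.size, replace=True).mean() for _ in range(200)]
bagged = np.mean(bagged_preds)

# Boosting: fit "estimators" sequentially, each on the residual left by
# the current prediction, scaling each contribution by a learning rate.
lr, boosted = 0.5, 0.0
for _ in range(20):
    residual = (y - boosted).mean()  # negative gradient of squared error
    boosted += lr * residual

print(bagged, boosted)  # both end up close to y.mean() == 2.8
```

With constant predictors both strategies simply recover the mean, but the mechanics differ exactly as described above: the bagged estimators never see each other, while each boosted step depends on the prediction so far.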

Let’s therefore summarise quickly what the main differences between our models are:

| | Bagging (Random Forest) | Boosting (Gradient Boosting) |
|---|---|---|
| Training strategy | Parallel | Sequential |
| Estimators trained on | Bootstrapped subsample of the dataset | Gradient of the loss of the previous prediction w.r.t. that prediction |
| All estimators equally important | Yes | No (learning rate scales successive contributions) |
| Prediction is made by | Voting | Weighted sum of individual predictions (scaled by learning rate) |
| Prone to overfitting | Less | More |
| Easy to parallelise | Very | Not trivially |
| Can use any estimator as base | Yes | Yes |

## The ingredients

As you will see, the ingredients for creating the two algorithms are extremely simple, and we will mostly code things up from scratch. We'll use the `scikit-learn` implementations as benchmarks, alongside some convenience functions for loading two datasets (one for regression, one for classification), splitting data, two metrics (accuracy for classification and mean absolute error for regression), `numpy`, the `Counter` class for counting stuff (😆) and the wonderful JAX for automatic differentiation.
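As a tiny illustration (my own snippet, not part of the model yet), `Counter.most_common` is all we need to implement a hard majority vote over tree predictions:

```python
from collections import Counter

# Predictions from five hypothetical trees for a single sample:
votes = [1, 0, 1, 1, 0]

# most_common(1) returns a list holding the (value, count) pair of the winner.
majority = Counter(votes).most_common(1)[0][0]
print(majority)  # 1 (three of the five trees voted for class 1)
```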

## Library imports

```
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.datasets import load_boston, load_breast_cancer  # load_boston requires scikit-learn < 1.2
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_absolute_error
from collections import Counter
import numpy as np
np.set_printoptions(precision=3)
import jax
import jax.numpy as jnp
from typing import Tuple, Union, List
```

## Datasets

Let's create two datasets to play with, one for regression and one for classification:

### Classification Dataset

```
X_class, y_class = load_breast_cancer(return_X_y=True)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_class, y_class)
```

### Regression Dataset

```
X_reg, y_reg = load_boston(return_X_y=True)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg)
```

## Random Forest

Our Random Forest takes four arguments: `n_estimators`, which is how many trees to build, `subsample` for the fraction of the training data to use for training each tree, and `regression` for switching between a regression and a classification Random Forest (internally, this switches the kind of tree). Any other keyword arguments are passed to the `DecisionTree` initialiser and are not required (i.e. abstracted away here).

```
class RandomForest:
    def __init__(
        self,
        n_estimators: int = 100,
        subsample: float = 0.1,
        regression: bool = True,
        **kwargs
    ) -> None:
        self.n_estimators = n_estimators
        self.subsample = subsample
        self.regression = regression
        self.estimators: List = []
        for _ in range(self.n_estimators):
            if regression:
                self.estimators.append(DecisionTreeRegressor(**kwargs))
            else:
                self.estimators.append(DecisionTreeClassifier(**kwargs))

    def bootstrap_sample(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, ...]:
        # Draw a random subsample *with replacement* (the bootstrap part).
        n_samples = X.shape[0]
        idxs = np.random.choice(
            n_samples, size=int(self.subsample * n_samples), replace=True
        )
        return X[idxs], y[idxs]

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # Every tree is fitted independently on its own bootstrap sample.
        for estimator in self.estimators:
            X_sample, y_sample = self.bootstrap_sample(X, y)
            estimator.fit(X_sample, y_sample)

    def predict(self, X: np.ndarray) -> np.ndarray:
        preds = np.array([estimator.predict(X) for estimator in self.estimators])
        preds = np.swapaxes(preds, 0, 1)  # shape: (n_samples, n_estimators)
        if self.regression:
            # Soft aggregation for regression: average the trees' predictions.
            return preds.mean(axis=1)
        # Hard aggregation for classification: majority vote.
        return np.array([self._most_common_pred(pred) for pred in preds])

    def _most_common_pred(self, y: np.ndarray) -> Union[float, int]:
        return Counter(y).most_common(1)[0][0]
```

Let’s try her out!

### Random Forest: Classification Task

```
rfc = RandomForest(regression=False)
rfc.fit(X_train_c, y_train_c)
print(accuracy_score(rfc.predict(X_test_c), y_test_c).round(3))
> 0.937
```

### Random Forest: Regression Task

```
rfr = RandomForest(regression=True)
rfr.fit(X_train_r, y_train_r)
print(mean_absolute_error(y_test_r, rfr.predict(X_test_r)).round(3))
> 3.364
```

For comparison, the `scikit-learn` implementation (without any tweaks) achieved the exact same score for classification and only a slightly better score for regression (2.44 MAE vs. 3.36; since the Boston Housing targets are given in units of 1000 dollars, that's a difference of less than 1000 dollars on the predicted price of a house).
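For reference, the classification benchmark above can be reproduced along these lines (the exact score depends on the random train/test split):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings, matching the "without any tweaks" comparison above.
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # typically around 0.95
```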

## Gradient Boosting

Our Gradient Boosting machine also takes four arguments. `n_estimators` is the number of trees to build, and `learning_rate` scales the contribution of each tree (the default value is 0.1, i.e. only 10% of each tree's raw prediction is added to the running total). The `regression` argument does nothing to the kind of tree that is used! As you will see, Gradient Boosting *always* uses regression trees, since it is fitting a regression model on the gradient. The only thing this argument does is change the loss function from MSE to binary cross-entropy and the output to a class prediction instead of a regression value.

This model is a little bit more complex, so let’s go through it step by step.

### Loss functions

We’ll first build two loss functions, MSE for regression and binary cross-entropy for classification. MSE is very straightforward:

```
def MSE(y_true: jnp.ndarray, y_pred: jnp.ndarray):
    # Sum of squared errors; the outer mean of an already-scalar sum was
    # redundant. The per-sample gradient is then 2 * (y_pred - y_true).
    return jnp.sum(jnp.square(y_true - y_pred))
```

Cross-entropy is a little more involved, since we are using logs, which can blow up if we pass them zeros or negative values. Therefore, we use an epsilon value and clip the values of the argument passed as the probability:

```
def CrossEntropy(y_true: jnp.ndarray, y_proba: jnp.ndarray):
    # Clip the probabilities away from 0 and 1 to keep the logs finite.
    y_proba = jnp.clip(y_proba, 1e-5, 1 - 1e-5)
    return jnp.sum(-y_true * jnp.log(y_proba) - (1 - y_true) * jnp.log(1 - y_proba))
```
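Here is a quick standalone demonstration (my own snippet) of why the clipping is necessary:

```python
import jax.numpy as jnp

p = jnp.array([0.0, 1.0])  # probabilities exactly at the boundary
print(jnp.log(p))          # log(0) is -inf, which would poison the loss

p_safe = jnp.clip(p, 1e-5, 1 - 1e-5)
print(jnp.log(p_safe))     # finite values, so the gradient stays usable
```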

Notice that in both cases we are specifying the operations using not a regular `np.array` but a `jnp.array`. If we wanted, we could have just imported `jax.numpy as np` from the start, but I wanted to show how smoothly and seamlessly JAX integrates with `numpy`! The gradients are calculated on the fly for us, and we don't even need to worry about them.
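To see what that gives us in isolation, here is a standalone snippet (the `argnums=1` below mirrors how the model will call `jax.grad`): for a sum-of-squares loss, the gradient with respect to the prediction is simply `2 * (y_pred - y_true)`:

```python
import jax
import jax.numpy as jnp

def sse(y_true: jnp.ndarray, y_pred: jnp.ndarray):
    return jnp.sum(jnp.square(y_true - y_pred))

y_true = jnp.array([1.0, 2.0, 3.0])
y_pred = jnp.array([2.0, 2.0, 2.0])

# argnums=1 differentiates w.r.t. the second argument (the prediction).
grad = jax.grad(sse, argnums=1)(y_true, y_pred)
print(grad)  # 2 * (y_pred - y_true), i.e. [2, 0, -2]
```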

### The engine room

The initialiser of the Gradient Boosting algorithm looks a lot like the Random Forest one: it chooses the appropriate loss function and instantiates the estimators.

Let’s now look at the fit function in detail:

```
def fit(self, X: np.ndarray, y: np.ndarray):
    # Dummy prediction: start every sample at the mean of the labels,
    # and remember that base value so predict can start from it too.
    self.base_pred = np.mean(y)
    y_pred = np.full(np.shape(y), self.base_pred)
    for i, estimator in enumerate(self.estimators):
        # Gradient of the loss w.r.t. the current prediction (argnums=1).
        gradient = jax.grad(self.loss, argnums=1)(
            y.astype(np.float32), y_pred.astype(np.float32)
        )
        self.estimators[i].fit(X, gradient)
        update = self.estimators[i].predict(X)
        y_pred -= self.learning_rate * update
```

The first strange thing is that `y_pred` gets assigned before a single estimator has done anything. This is the *dummy prediction* phase of the algorithm, where it just picks the mean of the label vector. `np.full` is a handy function that fills up an array with a placeholder value.

Then the algorithm calculates the loss of this dummy prediction against the true labels, and the gradient of that loss with respect to the dummy prediction. Again, JAX smoothly takes the gradient. For safety, we coerce the arrays to floats, since in the classification scenario the labels are integers, and JAX cannot differentiate with respect to integer types.

After this, the algorithm fits the next estimator on the gradient of the loss, performs an actual estimator prediction on the training set, and finally updates the `y_pred` vector with the new values (a bit like gradient descent, but on the prediction rather than a weight matrix!). Observe that `self.estimators[i].predict(X)` is not the class's own `predict`, but the `DecisionTreeRegressor.predict` function!

For the predict phase, things work the same way: each fitted estimator predicts on the test data, and its contribution, scaled by the learning rate, gets accumulated:

```
def predict(self, X: np.ndarray):
    # Start from the base prediction (the label mean) stored during fit,
    # so inference mirrors the starting point used during training.
    y_pred = np.full(X.shape[0], self.base_pred, dtype=np.float32)
    for estimator in self.estimators:
        y_pred -= self.learning_rate * estimator.predict(X)
    if not self.regression:
        # y_pred approximates a class probability, so threshold it at 0.5.
        return np.where(y_pred > 0.5, 1, 0)
    return y_pred
```

### Gradient Boosting: Classification Task

```
gbc = GradientBoosting(regression=False)
gbc.fit(X_train_c, y_train_c)
print(accuracy_score(gbc.predict(X_test_c), y_test_c).round(3))
> 0.923
```

### Gradient Boosting: Regression Task

```
gbr = GradientBoosting(regression=True)
gbr.fit(X_train_r, y_train_r)
print(mean_absolute_error(gbr.predict(X_test_r), y_test_r).round(3))
> 22.78
```

For comparison, the `scikit-learn` implementation scored around 2% higher for classification (0.964), and was quite a bit better in regression (2.6 vs. 22.8 MAE). That gap is less a sign of overfitting than of implementation detail: the `scikit-learn` version is considerably more sophisticated, for example in how it initialises and updates its predictions.

Here’s the final implementation for the Gradient Boosting model:

```
class GradientBoosting:
    def __init__(
        self,
        n_estimators: int = 100,
        learning_rate: float = 0.1,
        regression: bool = True,
        **kwargs
    ):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.regression = regression
        # Regression trees are used in both cases; only the loss differs.
        self.loss = MSE if self.regression else CrossEntropy
        self.estimators: List = []
        for _ in range(self.n_estimators):
            self.estimators.append(DecisionTreeRegressor(**kwargs))

    def fit(self, X: np.ndarray, y: np.ndarray):
        # Dummy prediction: start every sample at the mean of the labels,
        # and remember that base value so predict can start from it too.
        self.base_pred = np.mean(y)
        y_pred = np.full(np.shape(y), self.base_pred)
        for i, estimator in enumerate(self.estimators):
            gradient = jax.grad(self.loss, argnums=1)(
                y.astype(np.float32), y_pred.astype(np.float32)
            )
            self.estimators[i].fit(X, gradient)
            update = self.estimators[i].predict(X)
            y_pred -= self.learning_rate * update

    def predict(self, X: np.ndarray):
        # Start from the base prediction stored during fit.
        y_pred = np.full(X.shape[0], self.base_pred, dtype=np.float32)
        for estimator in self.estimators:
            y_pred -= self.learning_rate * estimator.predict(X)
        if not self.regression:
            # y_pred approximates a class probability; threshold it at 0.5.
            return np.where(y_pred > 0.5, 1, 0)
        return y_pred
```

## Conclusion

As you can see (you know who you are!), neither Random Forests nor Gradient Boosting is very complicated once the fundamental differences become clear!
We also learned about JAX and how it can be used to easily combine `numpy` functions with powerful automatic differentiation.

## References and further reading

- https://scikit-learn.org/stable/modules/tree.html#tree
- https://github.com/eriklindernoren/ML-From-Scratch/blob/master/mlfromscratch/supervised_learning/gradient_boosting.py (some code and general concepts reused from here under MIT license terms)
- https://github.com/python-engineer/MLfromscratch/tree/master/mlfromscratch (some code and general concepts reused from here under MIT license terms)
- https://jax.readthedocs.io/en/latest/notebooks/autodiff_cookbook.html
- Géron, Aurélien. *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.* O'Reilly Media, 2019.