# Lecture 13: Boosting

### Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

# Part 1: Boosting and Ensembling

We are now going to look at ways in which multiple machine learning models can be combined.

In particular, we will look at a way of combining models called boosting.

# Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$\underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model}$$

# Review: Overfitting

Overfitting is one of the most common failure modes of machine learning.

• A very expressive model (e.g., a high-degree polynomial) fits the training dataset perfectly.
• The model also makes wildly incorrect predictions outside this dataset and doesn't generalize.

# Review: Bagging

The idea of bagging is to reduce overfitting by averaging many models trained on random subsets of the data.

ensemble = []
for i in range(n_models):
    # collect a bootstrap sample and fit a model on it
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = average_prediction(ensemble, X_test)


The data samples are taken with replacement and known as bootstrap samples.
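This procedure can be run end-to-end with scikit-learn's `BaggingClassifier`, which fits each base model on a bootstrap sample (a minimal sketch, not the lecture's exact code; the synthetic dataset is an illustrative assumption):

```python
# Bagging sketch: 50 decision trees, each fit on a bootstrap sample;
# the ensemble averages their predictions at test time.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base estimator is a decision tree; bootstrap=True
# draws each tree's training set with replacement.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```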

# Review: Underfitting

Underfitting is another common problem in machine learning.

• The model is too simple to fit the data well (e.g., approximating a high degree polynomial with linear regression).
• As a result, the model is not accurate on training data and is not accurate on new data.

# Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

• As in bagging, we combine many models $g_t$ into one ensemble $f$.
• Unlike bagging, the $g_t$ are small and tend to underfit.
• Each $g_t$ fits the points where the previous models made errors.

# Weak Learners

A key ingredient of a boosting algorithm is a weak learner.

• Intuitively, this is a model that is slightly better than random.
• Examples of weak learners include: small linear models, small decision trees.

# Structure of a Boosting Algorithm

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

1. Fit a weak learner $g_0$ on dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f=g_0$.
1. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.
1. Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$.
1. Set $f \gets f + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

In Python-like pseudocode this looks as follows:

weights = np.ones(n_data,)
ensemble = []
for i in range(n_models):
    model = SimpleBaseModel().fit(X, y, weights)
    predictions = model.predict(X)
    weights = update_weights(weights, predictions)
    ensemble.append(model)

# output consensus prediction at test time:
y_test = consensus_prediction(ensemble, X_test)


# Origins of Boosting

Boosting algorithms were initially developed in the 90s within theoretical machine learning.

• Originally, boosting addressed a theoretical question of whether weak learners with >50% accuracy can be combined to form a strong learner.
• Eventually, this research led to a practical algorithm called Adaboost.

Today, there exist many algorithms that are considered types of boosting, even though they were not derived from a theoretical angle.

One of the first practical boosting algorithms was Adaboost.

• Type: Supervised learning (classification).
• Model family: Ensembles of weak learners (often decision trees).
• Objective function: Exponential loss.
• Optimizer: Forward stagewise additive model building.

We start with uniform $w^{(i)} = 1/n$ and $f = 0$. Then for $t=1,2,...,T$:

1. Fit weak learner $g_t$ on $\mathcal{D}$ with weights $w^{(i)}$.
1. Compute the misclassification error $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}}{\sum_{i=1}^n w^{(i)}}$.
1. Compute the model weight $\alpha_t = \log[(1-e_t)/e_t]$. Set $f \gets f + \alpha_t g_t$.
1. Compute new data weights $w^{(i)} \gets w^{(i)}\exp[\alpha_t \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}]$.
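The four steps above can be sketched from scratch in NumPy, using depth-1 scikit-learn trees (decision stumps) as the weak learners; labels are assumed to be encoded as $\{-1,+1\}$:

```python
# From-scratch Adaboost sketch with decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    n = len(y)
    w = np.ones(n) / n                      # start with uniform weights
    models, alphas = [], []
    for t in range(T):
        # 1. Fit a weak learner g_t with the current data weights.
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = g.predict(X) != y
        # 2. Weighted misclassification error e_t (clipped for stability).
        e = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
        # 3. Model weight alpha_t.
        alpha = np.log((1 - e) / e)
        # 4. Increase the weights of misclassified points.
        w = w * np.exp(alpha * miss)
        models.append(g)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Prediction is the sign of the weighted vote f = sum_t alpha_t g_t.
    scores = sum(a * g.predict(X) for g, a in zip(models, alphas))
    return np.sign(scores)
```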

Let's implement Adaboost on a simple dataset to see what it can do.

Let's start by creating a classification dataset.

We can visualize this dataset using matplotlib.

Let's now train Adaboost on this dataset.

Visualizing the output of the algorithm, we see that it can learn a highly non-linear decision boundary to separate the two classes.
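The notebook's code cells are omitted above; a minimal reconstruction of the experiment might look as follows (the `make_circles` dataset is an assumption, chosen because its classes are not linearly separable):

```python
# Adaboost with decision stumps on a 2D dataset whose classes
# cannot be separated by a single linear boundary.
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# The default weak learner is a depth-1 decision tree (stump).
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Even though each stump is a simple axis-aligned split, the boosted ensemble carves out the curved boundary between the two circles.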

# Ensembling

Boosting and bagging are special cases of ensembling.

The idea of ensembling is to combine many models into one. Bagging and boosting are ensembling techniques that reduce overfitting and underfitting, respectively.

• In stacking, we train $m$ independent models $g_j(x)$ (possibly from different model classes) and then train another model $f(x)$ to predict $y$ from the outputs of the $g_j$.
• The Bayesian approach can also be seen as a form of ensembling, $$P(y\mid x) = \int_\theta P(y\mid x,\theta) P(\theta \mid \mathcal{D}) d\theta,$$ where we average models $P(y\mid x,\theta)$ using weights $P(\theta \mid \mathcal{D})$.
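For instance, stacking can be sketched with scikit-learn's `StackingClassifier`, which trains a final logistic regression on the cross-validated outputs of the base models (the particular base models here are illustrative assumptions):

```python
# Stacking: two independently trained base models g_j, combined by
# a logistic regression f trained on their outputs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("forest", RandomForestClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```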

# Pros and Cons of Ensembling

Ensembling is a useful technique in machine learning.

• It often helps squeeze out additional performance out of ML algorithms.
• Many algorithms (like Adaboost) are forms of ensembling.

• It can be computationally expensive to train and use ensembles.

Next, we are going to see another perspective on boosting and derive new boosting algorithms.

# The Components of A Supervised Machine Learning Algorithm

We can define the high-level structure of a supervised learning algorithm as consisting of three components:

• A model class: the set of possible models we consider.
• An objective function, which defines how good a model is.
• An optimizer, which finds the best predictive model in the model class according to the objective function.

# Review: Underfitting

Underfitting is another common problem in machine learning.

• The model is too simple to fit the data well (e.g., approximating a high degree polynomial with linear regression).
• As a result, the model is not accurate on training data and is not accurate on new data.

# Review: Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

• As in bagging, we combine many models $g_t$ into one ensemble $f$.
• Unlike bagging, the $g_t$ are small and tend to underfit.
• Each $g_t$ fits the points where the previous models made errors.

Boosting can be seen as a way of fitting an additive model: $$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

• The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.
• The parameters are the $\alpha_t$ plus the parameters $\phi_t$ of each $g$.

This is more general than a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

# Example: Boosting Algorithms

Boosting is one way of training additive models.

1. Fit a weak learner $g_0$ on dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f=g_0$.
1. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.
1. Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$.
1. Set $f \gets f + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

A general way to fit additive models is the forward stagewise approach.

• Suppose we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$.
• Start with $f_0 = g(\cdot\,; \phi_0)$, where $\phi_0 = \arg \min_\phi \sum_{i=1}^n L(y^{(i)}, g(x^{(i)}; \phi))$.
• At each iteration $t$ we fit the best addition to the current model. $$\alpha_t, \phi_t = \arg\min_{\alpha, \phi} \sum_{i=1}^n L(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))$$

# Practical Considerations

• Popular choices of $g$ include cubic splines, decision trees and kernelized models.
• We may use a fixed number of iterations $T$ or early stopping when the error on a hold-out set no longer improves.
• An important design choice is the loss $L$.

# Exponential Loss

Given a binary classification problem with labels $\mathcal{Y} = \{-1, +1\}$, the exponential loss is defined as

$$L(y, f) = \exp(-y \cdot f).$$

• When $y=1$, $L$ is small when $f \to \infty$.
• When $y=-1$, $L$ is small when $f \to -\infty$.

Let's visualize the exponential loss and compare it to other losses.
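The plotting cell itself is not shown here; evaluating the losses on a grid of model outputs makes the comparison concrete (a sketch, taking $y = +1$ so that the margin is just $f$):

```python
# Exponential, logistic, and squared losses as a function of the
# model output f, for a positive example y = +1.
import numpy as np

f = np.linspace(-2, 2, 101)
exp_loss = np.exp(-f)
log_loss = np.log(1 + np.exp(-2 * f))
sq_loss = (1 - f) ** 2

# The exponential loss grows much faster than the logistic loss
# for badly misclassified points (large negative f).
print(exp_loss[0], log_loss[0], sq_loss[0])
```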

Adaboost is an instance of forward stagewise additive modeling with the exponential loss.

At each step $t$ we minimize $$L_t = \sum_{i=1}^n e^{-y^{(i)}(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))} = \sum_{i=1}^n w^{(i)} \exp\left(-y^{(i)}\alpha g(x^{(i)}; \phi)\right)$$ with $w^{(i)} = \exp(-y^{(i)}f_{t-1}(x^{(i)}))$.

We can derive the Adaboost update rules from this equation.

Suppose that $g(x; \phi) \in \{-1,1\}$. With a bit of algebra, we get that: \begin{align*} L_t & = e^{\alpha} \sum_{y^{(i)} \neq g(x^{(i)})} w^{(i)} + e^{-\alpha} \sum_{y^{(i)} = g(x^{(i)})} w^{(i)} \\ & = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^n w^{(i)} \mathbb{I}\{{y^{(i)} \neq g(x^{(i)})}\} + e^{-\alpha} \sum_{i=1}^n w^{(i)},\\ \end{align*} where $\mathbb{I}\{\cdot\}$ is the indicator function.

From there, we get that: \begin{align*} \phi_t & = \arg\min_{\phi} \sum_{i=1}^n w^{(i)} \mathbb{I}\{{y^{(i)} \neq g(x^{(i)}; \phi)}\} \\ \alpha_t & = \log[(1-e_t)/e_t] \end{align*} where $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi_t)\}}{\sum_{i=1}^n w^{(i)}}$.

These are the update rules for Adaboost, and it's not hard to show that the update rule for $w^{(i)}$ is the same as well.

# Squared Loss

Another popular choice of loss is the squared loss. $$L(y, f) = (y-f)^2.$$

The resulting algorithm is often called L2Boost. At step $t$ we minimize $$\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \phi))^2,$$ where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual of the model at step $t-1$.
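A from-scratch sketch of this procedure, using small regression trees as $g$ (the shrinkage factor `nu` is an added assumption, not part of the derivation above):

```python
# L2Boost sketch: each new weak learner is fit to the residuals
# of the current additive model f.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost_fit(X, y, T=50, nu=0.5):
    f = np.zeros(len(y))              # start from f_0 = 0
    models = []
    for t in range(T):
        residuals = y - f             # r_t = y - f_{t-1}(x)
        g = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        f = f + nu * g.predict(X)     # f_t = f_{t-1} + nu * g
        models.append(g)
    return models

def l2boost_predict(models, X, nu=0.5):
    # The ensemble prediction is the (shrunk) sum of the weak learners.
    return nu * sum(g.predict(X) for g in models)
```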

# Logistic Loss

Another common loss is the log-loss. When $\mathcal{Y}=\{-1,1\}$ it is defined as:

$$L(y, f) = \log(1+\exp(-2\cdot y\cdot f)).$$

This looks like the log of the exponential loss; it is less sensitive to outliers since it doesn't penalize large errors as much.

In the context of boosting, we minimize $$J(\alpha, \phi) = \sum_{i=1}^n \log\left(1+\exp\left(-2y^{(i)}\left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)\right)\right).$$

This gives a different weight update compared to Adaboost. The resulting algorithm is called LogitBoost.

# Pros and Cons of Boosting

The boosting algorithms derived in this part improve over Adaboost.

Pros:

• They optimize a wide range of objectives.
• Thus, they are more robust to outliers and extend beyond classification.

Cons:

• Computational time is still an issue.
• Optimizing greedily over each $\phi_t$ can take time.
• Each loss requires specialized derivations.

# Summary

• Additive models have the form $$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$
• These models can be fit using the forward stagewise additive approach.
• This reproduces Adaboost and can be used to derive new boosting-type algorithms.

We are now going to see another way of deriving boosting algorithms that is inspired by gradient descent.

# Review: Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

• As in bagging, we combine many models $g_t$ into one ensemble $f$.
• Unlike bagging, the $g_t$ are small and tend to underfit.
• Each $g_t$ fits the points where the previous models made errors.

Boosting can be seen as a way of fitting an additive model: $$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

• The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.
• The parameters are the $\alpha_t$ plus the parameters $\phi_t$ of each $g$.

This is not a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

# Review: Forward Stagewise Additive Modeling

A general way to fit additive models is the forward stagewise approach.

• Suppose we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$.
• Start with $f_0 = g(\cdot\,; \phi_0)$, where $\phi_0 = \arg \min_\phi \sum_{i=1}^n L(y^{(i)}, g(x^{(i)}; \phi))$.
• At each iteration $t$ we fit the best addition to the current model. $$\alpha_t, \phi_t = \arg\min_{\alpha, \phi} \sum_{i=1}^n L(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))$$

We have seen several losses that can be used with the forward stagewise additive approach.

• The exponential loss $L(y,f) = \exp(-yf)$ gives us Adaboost.
• The log-loss $L(y,f) = \log(1+\exp(-2yf))$ is more robust to outliers.
• The squared loss $L(y,f) = (y-f)^2$ can be used for regression.

# Limitations of Forward Stagewise Additive Modeling

Forward stagewise additive modeling is not without limitations.

• There may exist other losses for which it is complex to derive boosting-type weight update rules.
• At each step, we may need to solve a costly optimization problem over $\phi_t$.
• Optimizing each $\phi_t$ greedily may cause us to overfit.

# Functional Optimization

Functional optimization offers a different angle on boosting algorithms and a recipe for new algorithms.

• Consider optimizing a loss over arbitrary functions $f: \mathcal{X} \to \mathcal{Y}$.
• Functional optimization consists in solving the problem $$\min_f \sum_{i=1}^n L(y^{(i)}, f(x^{(i)}))$$ over the space of all possible $f$.
• It's easiest to think about $f$ as an infinite dimensional vector indexed by $x \in \mathcal{X}$.

To simplify our explanations, we will assume that there exists a true deterministic mapping $$f^* : \mathcal{X} \to \mathcal{Y}$$ between $\mathcal{X}$ and $\mathcal{Y}$, but the algorithm shown here also works without this assumption.

Consider solving this optimization problem using gradient descent, where the objective is $$J(f) = \sum_{i=1}^n L(y^{(i)}, f(x^{(i)})).$$
We may define the functional gradient of this loss at $f_0$ as a function $\nabla J(f_0) : \mathcal{X} \to \mathbb{R}$ $$\nabla J(f_0)(x) = \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_0(x), \text{y} = f^*(x)}.$$
Let's make a few observations about the functional gradient $$\nabla J(f_0)(x) = \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_0(x), \text{y} = f^*(x)}.$$
• It's an object indexed by $x \in \mathcal{X}$.
• At each $x \in \mathcal{X}$, $\nabla J(f_0)(x)$ tells us how to modify $f_0(x)$ to make $L(f^*(x), f_0(x))$ smaller.
• This is consistent with the fact that we are optimizing over a "vector" $f$, also indexed by $x \in \mathcal{X}$.
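As a concrete worked example (an addition, not in the original slides): for the squared loss $L(y, f) = (y - f)^2$, the functional gradient evaluates to a scaled negative residual,

```latex
\nabla J(f_0)(x)
  = \frac{\partial (\text{y} - \text{f})^2}{\partial \text{f}}
    \bigg\rvert_{\text{f} = f_0(x),\, \text{y} = f^*(x)}
  = -2\left(f^*(x) - f_0(x)\right).
```

Stepping against this gradient means adding a multiple of the residual $f^*(x) - f_0(x)$ to the model, which matches the residual-fitting step of L2Boost.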