Lecture 13: Boosting

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: Boosting and Ensembling

We are now going to look at ways in which multiple machine learning models can be combined.

In particular, we will look at a way of combining models called boosting.

Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

Review: Overfitting

Overfitting is one of the most common failure modes of machine learning.

Review: Bagging

The idea of bagging is to reduce overfitting by averaging many models trained on random subsets of the data.

ensemble = Ensemble()  # placeholder container for the fitted models
for i in range(n_models):
    # collect a bootstrap sample and fit a model on it
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = ensemble.average_prediction(X_test)

The data samples are drawn with replacement and are known as bootstrap samples.

Review: Underfitting

Underfitting is another common problem in machine learning.

Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

Weak Learners

A key ingredient of a boosting algorithm is a weak learner: a simple model that performs only slightly better than chance, such as a decision stump (a decision tree of depth one).
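As a concrete example (an illustrative snippet, not from the original notes), a decision stump in scikit-learn is just a depth-one tree:

from sklearn.tree import DecisionTreeClassifier

# a decision stump: a decision tree that makes a single split
stump = DecisionTreeClassifier(max_depth=1)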

Structure of a Boosting Algorithm

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

  1. Fit a weak learner $g_0$ on dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f = g_0$.
  2. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.
  3. Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$.
  4. Set $f \gets f + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

In Python-like pseudocode this looks as follows:

import numpy as np

weights = np.ones(n_data)
for i in range(n_models):
    # fit a weak learner on the weighted dataset
    model = SimpleBaseModel().fit(X, y, weights)
    predictions = model.predict(X)
    # give more weight to the points the current model gets wrong
    weights = update_weights(weights, predictions)
    ensemble.add(model)

# output consensus prediction at test time:
y_test = ensemble.consensus_prediction(X_test)

Origins of Boosting

Boosting algorithms were initially developed in the 90s within theoretical machine learning.

Today, there exist many algorithms that are considered types of boosting, even though they were not derived from a theoretical angle.

Algorithm: Adaboost

Defining Adaboost

One of the first practical boosting algorithms was Adaboost.

We start with uniform weights $w^{(i)} = 1/n$ and $f = 0$. Then for $t = 1, 2, \ldots, T$:

  1. Fit a weak learner $g_t$ on $\mathcal{D}$ with weights $w^{(i)}$.
  2. Compute the weighted misclassification error $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}}{\sum_{i=1}^n w^{(i)}}$.
  3. Compute the model weight $\alpha_t = \log[(1-e_t)/e_t]$. Set $f \gets f + \alpha_t g_t$.
  4. Compute new data weights $w^{(i)} \gets w^{(i)}\exp[\alpha_t \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\} ]$.
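These updates map directly to code. Below is a minimal sketch of Adaboost, assuming labels in $\{-1,+1\}$ and, as an illustrative choice, decision stumps from scikit-learn as the weak learners:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T):
    # Adaboost with decision stumps; assumes labels y in {-1, +1}
    n = X.shape[0]
    w = np.ones(n) / n  # start with uniform data weights
    learners, alphas = [], []
    for t in range(T):
        # 1. fit a weak learner on the weighted dataset
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = g.predict(X)
        # 2. weighted misclassification error (assumed to satisfy 0 < e < 1)
        e = np.sum(w * (pred != y)) / np.sum(w)
        # 3. model weight
        alpha = np.log((1 - e) / e)
        # 4. upweight the points the weak learner got wrong
        w = w * np.exp(alpha * (pred != y))
        learners.append(g)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # consensus prediction: sign of the weighted vote of the weak learners
    scores = sum(a * g.predict(X) for g, a in zip(learners, alphas))
    return np.sign(scores)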

Adaboost: An Example

Let's implement Adaboost on a simple dataset to see what it can do.

Let's start by creating a classification dataset.

We can visualize this dataset using matplotlib.

Let's now train Adaboost on this dataset.
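In scikit-learn, these steps might look as follows (the dataset, its parameters, and the number of estimators are illustrative choices, not necessarily those used in the lecture):

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier

# create a simple two-class dataset and visualize it with matplotlib
X, y = make_circles(n_samples=400, noise=0.15, factor=0.4, random_state=0)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=15)
plt.show()

# train Adaboost on this dataset (decision stumps are the default weak learner)
model = AdaBoostClassifier(n_estimators=100).fit(X, y)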

Visualizing the output of the algorithm, we see that it can learn a highly non-linear decision boundary to separate the two classes.

Ensembling

Boosting and bagging are special cases of ensembling.

The idea of ensembling is to combine many models into one. Bagging and boosting are ensembling techniques that reduce overfitting and underfitting, respectively.

Pros and Cons of Ensembling

Ensembling is a useful technique in machine learning.

Disadvantages include:

Part 2: Additive Models

Next, we are going to see another perspective on boosting and derive new boosting algorithms.

The Components of A Supervised Machine Learning Algorithm

We can define the high-level structure of a supervised learning algorithm as consisting of three components: a model class, an objective function, and an optimizer.

Review: Underfitting

Underfitting is another common problem in machine learning.

Review: Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

Additive Models

Boosting can be seen as a way of fitting an additive model: $$ f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t). $$

This is more general than a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

Example: Boosting Algorithms

Boosting is one way of training additive models.

  1. Fit a weak learner $g_0$ on dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f = g_0$.
  2. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.
  3. Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$.
  4. Set $f \gets f + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

Forward Stagewise Additive Modeling

A general way to fit additive models is the forward stagewise approach: at each step $t$, we hold the previous model $f_{t-1}$ fixed and choose the next weak learner to minimize the overall loss $$(\alpha_t, \phi_t) = \arg\min_{\alpha, \phi} \sum_{i=1}^n L\left(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right),$$ then set $f_t = f_{t-1} + \alpha_t g(\cdot\,; \phi_t)$.

Practical Considerations

Exponential Loss

Given a binary classification problem with labels $\mathcal{Y} = \{-1, +1\}$, the exponential loss is defined as

$$ L(y, f) = \exp(-y \cdot f). $$

Let's visualize the exponential loss and compare it to other losses.
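A small snippet along these lines compares the losses as a function of the margin $y \cdot f$ (an illustrative sketch, not the original notebook code):

import numpy as np
import matplotlib.pyplot as plt

# plot common classification losses as a function of the margin y * f
m = np.linspace(-2, 2, 200)
plt.plot(m, (m < 0).astype(float), label="zero-one")
plt.plot(m, np.exp(-m), label="exponential")
plt.plot(m, np.log(1 + np.exp(-2 * m)), label="logistic")
plt.xlabel("margin")
plt.ylabel("loss")
plt.legend()
plt.show()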

Special Case: Adaboost

Adaboost is an instance of forward stagewise additive modeling with the exponential loss.

At each step $t$ we minimize $$L_t = \sum_{i=1}^n e^{-y^{(i)}(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))} = \sum_{i=1}^n w^{(i)} \exp\left(-y^{(i)}\alpha g(x^{(i)}; \phi)\right) $$ with $w^{(i)} = \exp(-y^{(i)}f_{t-1}(x^{(i)}))$.

We can derive the Adaboost update rules from this equation.

Suppose that $g(x; \phi) \in \{-1,1\}$. With a bit of algebraic manipulation, we get that: \begin{align*} L_t & = e^{\alpha} \sum_{y^{(i)} \neq g(x^{(i)})} w^{(i)} + e^{-\alpha} \sum_{y^{(i)} = g(x^{(i)})} w^{(i)} \\ & = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)})\} + e^{-\alpha} \sum_{i=1}^n w^{(i)} \end{align*} where $\mathbb{I}\{\cdot\}$ is the indicator function.

From there, we get that: \begin{align*} \phi_t & = \arg\min_{\phi} \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi)\} \\ \alpha_t & = \log[(1-e_t)/e_t] \end{align*} where $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi_t)\}}{\sum_{i=1}^n w^{(i)}}$.

These are precisely the Adaboost update rules, and it is not hard to show that the update rule for $w^{(i)}$ is the same as well.
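To see where $\alpha_t$ comes from, set the derivative of $L_t$ with respect to $\alpha$ to zero (a short derivation; the lecture states the result directly): \begin{align*} \frac{\partial L_t}{\partial \alpha} = (e^{\alpha} + e^{-\alpha}) \sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)})\} - e^{-\alpha} \sum_{i=1}^n w^{(i)} = 0 \quad\Rightarrow\quad e^{2\alpha} = \frac{1-e_t}{e_t}, \end{align*} which gives $\alpha = \frac{1}{2}\log[(1-e_t)/e_t]$. This matches the stated rule up to an overall factor of two; rescaling every $\alpha_t$ by the same positive constant does not change the sign of the final classifier.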

Squared Loss

Another popular choice of loss is the squared loss. $$ L(y, f) = (y-f)^2. $$

The resulting algorithm is often called L2Boost. At step $t$ we minimize $$\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \phi))^2, $$ where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.
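In code, this residual-fitting loop is very short. Here is a minimal sketch (the tree depth, the step size, and the use of scikit-learn regression trees are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost_fit(X, y, T, alpha=0.5):
    # sketch of L2Boost: each weak learner is fit to the current residuals
    f = np.zeros(len(y))  # current ensemble predictions, starting from f = 0
    learners = []
    for t in range(T):
        r = y - f  # residuals of the model at time t-1
        g = DecisionTreeRegressor(max_depth=2).fit(X, r)
        f = f + alpha * g.predict(X)  # add the new weak learner to the ensemble
        learners.append(g)
    return learners

def l2boost_predict(X, learners, alpha=0.5):
    return alpha * sum(g.predict(X) for g in learners)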

Logistic Loss

Another common loss is the log-loss. When $\mathcal{Y}=\{-1,1\}$ it is defined as:

$$L(y, f) = \log(1+\exp(-2\cdot y\cdot f)).$$

This looks like the log of the exponential loss; it is less sensitive to outliers since it doesn't penalize large errors as much.

In the context of boosting, we minimize $$J(\alpha, \phi) = \sum_{i=1}^n \log\left(1+\exp\left(-2y^{(i)}\left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)\right)\right).$$

This gives a different weight update compared to Adaboost. The resulting algorithm is called LogitBoost.
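To see why the log-loss is more robust, compare gradients with respect to $\text{f}$ (a quick check, not in the original notes): $$\frac{\partial}{\partial \text{f}} \log(1+\exp(-2y\text{f})) = \frac{-2y}{1+\exp(2y\text{f})}, \qquad \frac{\partial}{\partial \text{f}} \exp(-y\text{f}) = -y\exp(-y\text{f}).$$ The logistic-loss gradient stays bounded even for badly misclassified points ($y\text{f} \to -\infty$), while the exponential-loss gradient grows without bound, which is why Adaboost is more sensitive to outliers.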

Pros and Cons of Boosting

The boosting algorithms we have seen so far improve over Adaboost.

Cons:

Summary

Part 3: Gradient Boosting

We are now going to see another way of deriving boosting algorithms that is inspired by gradient descent.

Review: Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

Review: Additive Models

Boosting can be seen as a way of fitting an additive model: $$ f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t). $$

This is not a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

Review: Forward Stagewise Additive Modeling

A general way to fit additive models is the forward stagewise approach.

Losses for Additive Models

We have seen several losses that can be used with the forward stagewise additive approach.

Limitations of Forward Stagewise Additive Modeling

Forward stagewise additive modeling is not without limitations.

Functional Optimization

Functional optimization offers a different angle on boosting algorithms and a recipe for new algorithms.

To simplify our explanations, we will assume that there exists a true deterministic mapping $$f^* : \mathcal{X} \to \mathcal{Y}$$ between $\mathcal{X}$ and $\mathcal{Y}$, but the algorithm shown here also works without this assumption.

Functional Gradients

Consider solving, using gradient descent, the optimization problem $$\min_f J(f), \quad \text{where } J(f) = \sum_{i=1}^n L(y^{(i)}, f(x^{(i)})).$$

We may define the functional gradient of this loss at $f_0$ as a function $\nabla J(f_0) : \mathcal{X} \to \mathbb{R}$ $$\nabla J(f_0)(x) = \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_0(x), \text{y} = f^*(x)}.$$

Let's make a few observations about the functional gradient $$\nabla J(f_0)(x) = \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_0(x), \text{y} = f^*(x)}.$$

This is best understood via a picture.

Functional Gradient Descent

We can optimize our objective using gradient descent in functional space via the usual update rule: $$f \gets f - \alpha \nabla J(f).$$

As defined, this is not a practical algorithm: the functional gradient is only defined at the $n$ training points (where we know $y^{(i)} = f^*(x^{(i)})$), so the update tells us nothing about how to change $f$ at new inputs $x$, and the result does not generalize.

Modeling Functional Gradients

We will address this problem by learning a model of gradients.

In supervised learning, we define a model $f : \mathcal{X} \to \mathcal{Y}$ within a class $\mathcal{M}$; because the model extrapolates beyond the training set, it generalizes. We will apply the same idea to gradients: we assume a model $g : \mathcal{X} \to \mathbb{R}$ of the functional gradient $\nabla J(f)$ within a class $\mathcal{M}$, so that our model of gradients can also generalize beyond the training set.

Functional descent then has the form: $$\underbrace{f(x)}_\text{new function} \gets \underbrace{f(x) - \alpha g(x)}_\text{old function - gradient step}.$$ If $g$ generalizes, this approximates $f \gets f - \alpha \nabla J(f).$

Fitting Functional Gradients

What does it mean to approximate a functional gradient $g \approx \nabla J(f)$ in practice? We can use standard supervised learning.

Suppose we have a fixed function $f$ and we want to estimate the functional gradient of $L$ $$\frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f(x), \text{y} = f^*(x)}$$ at any $x \in \mathcal{X}$.

  1. We define a loss $L_g$ (e.g., the L2 loss) to measure how well $g \approx \nabla J(f)$.
  2. We compute $\nabla J(f)$ on the training dataset: $$\mathcal{D}_g = \left\{ \left(x^{(i)}, \underbrace{\frac{\partial L(y^{(i)}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f(x^{(i)})}}_\text{functional derivative $\nabla J(f)(x^{(i)})$} \right), i=1,2,\ldots,n \right\} $$
  3. We train a model $g : \mathcal{X} \to \mathbb{R}$ on $\mathcal{D}_g$ to predict functional gradients at any $x$: $$ g(x) \approx \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f(x), \text{y} = f^*(x)}.$$

Gradient Boosting

Gradient boosting is a procedure that performs functional gradient descent with approximate gradients.

Start with $f(x) = 0$. Then, at each step $t = 1, 2, \ldots$:

  1. Create a training dataset $\mathcal{D}_g$ and fit a model $g_t$ using the loss $L_g$: $$ g_t(x) \approx \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_{t-1}(x), \text{y} = f^*(x)}.$$
  2. Take a step of gradient descent using the approximate gradients: $$f_t(x) = f_{t-1}(x) - \alpha \cdot g_t(x).$$
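Putting the two steps together gives a very compact algorithm. Below is a minimal sketch with a pluggable loss gradient; the regression-tree weak learner, its depth, and the step size are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, loss_grad, T=100, alpha=0.1):
    # loss_grad(y, f) returns the pointwise derivative dL/df
    f = np.zeros(len(y))  # current predictions f_{t-1}(x^(i)) on the training set
    learners = []
    for t in range(T):
        # step 1: build D_g and fit a model g_t of the functional gradient
        grads = loss_grad(y, f)
        g = DecisionTreeRegressor(max_depth=3).fit(X, grads)
        # step 2: take a functional gradient descent step
        f = f - alpha * g.predict(X)
        learners.append(g)
    return learners

def gradient_boosting_predict(X, learners, alpha=0.1):
    return -alpha * sum(g.predict(X) for g in learners)

# example: the L2 loss L(y, f) = (y - f)^2 / 2 has dL/df = f - y
# learners = gradient_boosting_fit(X, y, lambda y, f: f - y)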

Interpreting Gradient Boosting

Notice how after $T$ steps we get an additive model of the form $$ f(x) = \sum_{t=1}^T \alpha_t g_t(x). $$ This looks like the output of a boosting algorithm!

Boosting vs. Gradient Boosting

Consider, for example, L2Boost, which optimizes the L2 loss $$ L(y, f) = \frac{1}{2}(y-f)^2. $$

At step $t$ we minimize $$\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \phi))^2, $$ where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual from the model at time $t-1$.

Observe that the residual $$r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$$ is also the negative gradient of the L2 loss with respect to $\text{f}$ at $\text{f} = f_{t-1}(x^{(i)})$: $$r^{(i)}_t = -\frac{\partial L(y^{(i)}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_{t-1}(x^{(i)})}.$$ Fitting weak learners to residuals is therefore the same as fitting them to (negative) functional gradients.

Most boosting algorithms are special cases of gradient boosting in this way.

Losses for Gradient Boosting

Gradient boosting can optimize a wide range of losses.

  1. Regression losses:
    • L2, L1, and Huber (L1/L2 interpolation) losses.
    • Quantile loss: estimates quantiles of distribution of $p(y|x)$.
  2. Classification losses:
    • Log-loss, softmax loss, exponential loss, negative binomial likelihood, etc.
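Many of these losses are available off the shelf. For example (an illustrative snippet, not from the original notebook), scikit-learn's gradient boosting can optimize the quantile loss directly:

from sklearn.ensemble import GradientBoostingRegressor

# estimate the 90th percentile of p(y|x) using the quantile loss
model = GradientBoostingRegressor(loss="quantile", alpha=0.9)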

Practical Considerations

When using gradient boosting these additional facts are useful:

Algorithm: Gradient Boosting

Gradient Boosting: An Example

Let's now try running Gradient Boosted Decision Trees on a small regression dataset.

First we create the dataset.

Next, we train a GBDT regressor.

We may now visualize its predictions.
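In scikit-learn, these three steps might look as follows (the dataset and hyperparameters are illustrative choices, not necessarily those used in the lecture):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# first, create a small 1D regression dataset with noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# next, train a GBDT regressor
model = GradientBoostingRegressor(n_estimators=200, max_depth=2, learning_rate=0.1)
model.fit(X, y)

# finally, visualize its predictions
X_test = np.linspace(0, 5, 500)[:, None]
plt.scatter(X, y, s=10, label="data")
plt.plot(X_test, model.predict(X_test), color="red", label="GBDT prediction")
plt.legend()
plt.show()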

Pros and Cons of Gradient Boosting

Gradient boosted decision trees (GBDTs) are one of the best off-the-shelf ML algorithms that exist, often on par with deep learning.

Their main limitations are:
