# Lecture 13: Boosting

### Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

# Part 1: Boosting and Ensembling

We are now going to look at ways in which multiple machine learning models can be combined.

In particular, we will look at a way of combining models called boosting.

# Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$\underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model}$$

# Review: Overfitting

Overfitting is one of the most common failure modes of machine learning.

• A very expressive model (e.g., a high-degree polynomial) fits the training dataset perfectly.
• The model also makes wildly incorrect predictions outside this dataset and doesn't generalize.

# Review: Bagging

The idea of bagging is to reduce overfitting by averaging many models trained on random subsets of the data.

ensemble = []
for i in range(n_models):
    # collect a bootstrap sample and fit a model on it
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = average_prediction(ensemble, X_test)


The data samples are taken with replacement and known as bootstrap samples.
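This procedure can be run end-to-end with scikit-learn's `BaggingClassifier`, which fits each base model on a bootstrap sample (a minimal sketch, not the lecture's exact code; the synthetic dataset is an illustrative assumption):

```python
# Bagging sketch: 50 decision trees, each fit on a bootstrap sample;
# the ensemble averages their predictions at test time.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base estimator is a decision tree; bootstrap=True
# draws each tree's training set with replacement.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))
```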

# Review: Underfitting

Underfitting is another common problem in machine learning.

• The model is too simple to fit the data well (e.g., approximating a high degree polynomial with linear regression).
• As a result, the model is not accurate on training data and is not accurate on new data.

# Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

• As in bagging, we combine many models $g_t$ into one ensemble $f$.
• Unlike bagging, the $g_t$ are small and tend to underfit.
• Each $g_t$ fits the points where the previous models made errors.

# Weak Learners

A key ingredient of a boosting algorithm is a weak learner.

• Intuitively, this is a model that is slightly better than random.
• Examples of weak learners include: small linear models, small decision trees.

# Structure of a Boosting Algorithm

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

1. Fit a weak learner $g_0$ on dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f=g_0$.
1. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.
1. Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$.
1. Set $f \gets f + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

In Python-like pseudocode this looks as follows:

weights = np.ones(n_data,)
ensemble = []
for i in range(n_models):
    model = SimpleBaseModel().fit(X, y, weights)
    predictions = model.predict(X)
    weights = update_weights(weights, predictions)
    ensemble.append(model)

# output consensus prediction at test time:
y_test = consensus_prediction(ensemble, X_test)


# Origins of Boosting

Boosting algorithms were initially developed in the 90s within theoretical machine learning.

• Originally, boosting addressed a theoretical question of whether weak learners with >50% accuracy can be combined to form a strong learner.
• Eventually, this research led to a practical algorithm called Adaboost.

Today, there exist many algorithms that are considered types of boosting, even though they were not derived from a theoretical angle.

One of the first practical boosting algorithms was Adaboost.

• Type: Supervised learning (classification).
• Model family: Ensembles of weak learners (often decision trees).
• Objective function: Exponential loss.
• Optimizer: Forward stagewise additive model building.

We start with uniform $w^{(i)} = 1/n$ and $f = 0$. Then for $t=1,2,...,T$:

1. Fit weak learner $g_t$ on $\mathcal{D}$ with weights $w^{(i)}$.
1. Compute the misclassification error $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}}{\sum_{i=1}^n w^{(i)}}$.
1. Compute the model weight $\alpha_t = \log[(1-e_t)/e_t]$. Set $f \gets f + \alpha_t g_t$.
1. Compute new data weights $w^{(i)} \gets w^{(i)}\exp[\alpha_t \mathbb{I}\{y^{(i)} \neq g_t(x^{(i)})\}]$.
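The four steps above can be sketched from scratch in NumPy, using depth-1 scikit-learn trees (decision stumps) as the weak learners; labels are assumed to be encoded as $\{-1,+1\}$:

```python
# From-scratch Adaboost sketch with decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=20):
    n = len(y)
    w = np.ones(n) / n                      # start with uniform weights
    models, alphas = [], []
    for t in range(T):
        # 1. Fit a weak learner g_t with the current data weights.
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = g.predict(X) != y
        # 2. Weighted misclassification error e_t (clipped for stability).
        e = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
        # 3. Model weight alpha_t.
        alpha = np.log((1 - e) / e)
        # 4. Increase the weights of misclassified points.
        w = w * np.exp(alpha * miss)
        models.append(g)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Prediction is the sign of the weighted vote f = sum_t alpha_t g_t.
    scores = sum(a * g.predict(X) for g, a in zip(models, alphas))
    return np.sign(scores)
```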

Let's implement Adaboost on a simple dataset to see what it can do.

Let's start by creating a classification dataset.

We can visualize this dataset using matplotlib.

Let's now train Adaboost on this dataset.

Visualizing the output of the algorithm, we see that it can learn a highly non-linear decision boundary to separate the two classes.
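The notebook's code cells are omitted above; a minimal reconstruction of the experiment might look as follows (the `make_circles` dataset is an assumption, chosen because its classes are not linearly separable):

```python
# Adaboost with decision stumps on a 2D dataset whose classes
# cannot be separated by a single linear boundary.
from sklearn.datasets import make_circles
from sklearn.ensemble import AdaBoostClassifier

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# The default weak learner is a depth-1 decision tree (stump).
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Even though each stump is a simple axis-aligned split, the boosted ensemble carves out the curved boundary between the two circles.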

# Ensembling

Boosting and bagging are special cases of ensembling.

The idea of ensembling is to combine many models into one. Bagging and boosting are ensembling techniques that reduce overfitting and underfitting, respectively.

• In stacking, we train $m$ independent models $g_j(x)$ (possibly from different model classes) and then train another model $f(x)$ to predict $y$ from the outputs of the $g_j$.
• The Bayesian approach can also be seen as a form of ensembling, $$P(y\mid x) = \int_\theta P(y\mid x,\theta) P(\theta \mid \mathcal{D}) d\theta,$$ where we average models $P(y\mid x,\theta)$ using weights $P(\theta \mid \mathcal{D})$.
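For instance, stacking can be sketched with scikit-learn's `StackingClassifier`, which trains a final logistic regression on the cross-validated outputs of the base models (the particular base models here are illustrative assumptions):

```python
# Stacking: two independently trained base models g_j, combined by
# a logistic regression f trained on their outputs.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("forest", RandomForestClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```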

# Pros and Cons of Ensembling

Ensembling is a useful technique in machine learning.

• It often helps squeeze out additional performance out of ML algorithms.
• Many algorithms (like Adaboost) are forms of ensembling.

• It can be computationally expensive to train and use ensembles.

Next, we are going to see another perspective on boosting and derive new boosting algorithms.

# The Components of A Supervised Machine Learning Algorithm

We can define the high-level structure of a supervised learning algorithm as consisting of three components:

• A model class: the set of possible models we consider.
• An objective function, which defines how good a model is.
• An optimizer, which finds the best predictive model in the model class according to the objective function.

# Review: Underfitting

Underfitting is another common problem in machine learning.

• The model is too simple to fit the data well (e.g., approximating a high degree polynomial with linear regression).
• As a result, the model is not accurate on training data and is not accurate on new data.

# Review: Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

• As in bagging, we combine many models $g_t$ into one ensemble $f$.
• Unlike bagging, the $g_t$ are small and tend to underfit.
• Each $g_t$ fits the points where the previous models made errors.

Boosting can be seen as a way of fitting an additive model: $$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

• The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.
• The parameters are the $\alpha_t$ plus the parameters $\phi_t$ of each $g$.

This is more general than a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

# Example: Boosting Algorithms

Boosting is one way of training additive models.

1. Fit a weak learner $g_0$ on dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$. Let $f=g_0$.
1. Compute weights $w^{(i)}$ for each $i$ based on model predictions $f(x^{(i)})$ and targets $y^{(i)}$. Give more weight to points with errors.
1. Fit another weak learner $g_1$ on $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$ with weights $w^{(i)}$.
1. Set $f \gets f + \alpha_1 g_1$ for some weight $\alpha_1$. Go to Step 2 and repeat.

A general way to fit additive models is the forward stagewise approach.

• Suppose we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$.
• Start with $f_0 = g(\cdot\,; \phi_0)$, where $\phi_0 = \arg \min_\phi \sum_{i=1}^n L(y^{(i)}, g(x^{(i)}; \phi))$.
• At each iteration $t$ we fit the best addition to the current model. $$\alpha_t, \phi_t = \arg\min_{\alpha, \phi} \sum_{i=1}^n L(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))$$

# Practical Considerations

• Popular choices of $g$ include cubic splines, decision trees and kernelized models.
• We may use a fixed number of iterations $T$ or early stopping when the error on a hold-out set no longer improves.
• An important design choice is the loss $L$.

# Exponential Loss

Given a binary classification problem with labels $\mathcal{Y} = \{-1, +1\}$, the exponential loss is defined as

$$L(y, f) = \exp(-y \cdot f).$$

• When $y=1$, $L$ is small when $f \to \infty$.
• When $y=-1$, $L$ is small when $f \to -\infty$.

Let's visualize the exponential loss and compare it to other losses.
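The plotting cell itself is not shown here; evaluating the losses on a grid of model outputs makes the comparison concrete (a sketch, taking $y = +1$ so that the margin is just $f$):

```python
# Exponential, logistic, and squared losses as a function of the
# model output f, for a positive example y = +1.
import numpy as np

f = np.linspace(-2, 2, 101)
exp_loss = np.exp(-f)
log_loss = np.log(1 + np.exp(-2 * f))
sq_loss = (1 - f) ** 2

# The exponential loss grows much faster than the logistic loss
# for badly misclassified points (large negative f).
print(exp_loss[0], log_loss[0], sq_loss[0])
```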

Adaboost is an instance of forward stagewise additive modeling with the exponential loss.

At each step $t$ we minimize $$L_t = \sum_{i=1}^n e^{-y^{(i)}(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))} = \sum_{i=1}^n w^{(i)} \exp\left(-y^{(i)}\alpha g(x^{(i)}; \phi)\right)$$ with $w^{(i)} = \exp(-y^{(i)}f_{t-1}(x^{(i)}))$.

We can derive the Adaboost update rules from this equation.

Suppose that $g(x; \phi) \in \{-1,1\}$. With a bit of algebra, we get that: \begin{align*} L_t & = e^{\alpha} \sum_{y^{(i)} \neq g(x^{(i)})} w^{(i)} + e^{-\alpha} \sum_{y^{(i)} = g(x^{(i)})} w^{(i)} \\ & = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^n w^{(i)} \mathbb{I}\{{y^{(i)} \neq g(x^{(i)})}\} + e^{-\alpha} \sum_{i=1}^n w^{(i)},\\ \end{align*} where $\mathbb{I}\{\cdot\}$ is the indicator function.

From there, we get that: \begin{align*} \phi_t & = \arg\min_{\phi} \sum_{i=1}^n w^{(i)} \mathbb{I}\{{y^{(i)} \neq g(x^{(i)}; \phi)}\} \\ \alpha_t & = \log[(1-e_t)/e_t] \end{align*} where $e_t = \frac{\sum_{i=1}^n w^{(i)} \mathbb{I}\{y^{(i)} \neq g(x^{(i)}; \phi_t)\}}{\sum_{i=1}^n w^{(i)}}$.

These are the update rules for Adaboost, and it's not hard to show that the update rule for $w^{(i)}$ is the same as well.

# Squared Loss

Another popular choice of loss is the squared loss. $$L(y, f) = (y-f)^2.$$

The resulting algorithm is often called L2Boost. At step $t$ we minimize $$\sum_{i=1}^n (r^{(i)}_t - g(x^{(i)}; \phi))^2,$$ where $r^{(i)}_t = y^{(i)} - f_{t-1}(x^{(i)})$ is the residual of the model at step $t-1$.
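A from-scratch sketch of this procedure, using small regression trees as $g$ (the shrinkage factor `nu` is an added assumption, not part of the derivation above):

```python
# L2Boost sketch: each new weak learner is fit to the residuals
# of the current additive model f.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost_fit(X, y, T=50, nu=0.5):
    f = np.zeros(len(y))              # start from f_0 = 0
    models = []
    for t in range(T):
        residuals = y - f             # r_t = y - f_{t-1}(x)
        g = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        f = f + nu * g.predict(X)     # f_t = f_{t-1} + nu * g
        models.append(g)
    return models

def l2boost_predict(models, X, nu=0.5):
    # The ensemble prediction is the (shrunk) sum of the weak learners.
    return nu * sum(g.predict(X) for g in models)
```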

# Logistic Loss

Another common loss is the log-loss. When $\mathcal{Y}=\{-1,1\}$ it is defined as:

$$L(y, f) = \log(1+\exp(-2\cdot y\cdot f)).$$

This looks like the log of the exponential loss; it is less sensitive to outliers since it doesn't penalize large errors as much.

In the context of boosting, we minimize $$J(\alpha, \phi) = \sum_{i=1}^n \log\left(1+\exp\left(-2y^{(i)}\left(f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi)\right)\right)\right).$$

This gives a different weight update compared to Adaboost. The resulting algorithm is called LogitBoost.

# Pros and Cons of Boosting

The boosting algorithms derived in this part improve over Adaboost.

Pros:

• They optimize a wide range of objectives.
• Thus, they are more robust to outliers and extend beyond classification.

Cons:

• Computational time is still an issue.
• Optimizing greedily over each $\phi_t$ can take time.
• Each loss requires specialized derivations.

# Summary

• Additive models have the form $$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$
• These models can be fit using the forward stagewise additive approach.
• This reproduces Adaboost and can be used to derive new boosting-type algorithms.

We are now going to see another way of deriving boosting algorithms that is inspired by gradient descent.

# Review: Boosting

The idea of boosting is to reduce underfitting by combining models that correct each other's errors.

• As in bagging, we combine many models $g_t$ into one ensemble $f$.
• Unlike bagging, the $g_t$ are small and tend to underfit.
• Each $g_t$ fits the points where the previous models made errors.

Boosting can be seen as a way of fitting an additive model: $$f(x) = \sum_{t=1}^T \alpha_t g(x; \phi_t).$$

• The main model $f(x)$ consists of $T$ smaller models $g$ with weights $\alpha_t$ and parameters $\phi_t$.
• The parameters are the $\alpha_t$ plus the parameters $\phi_t$ of each $g$.

This is not a linear model, because $g$ can be non-linear in $\phi_t$ (therefore so is $f$).

# Review: Forward Stagewise Additive Modeling

A general way to fit additive models is the forward stagewise approach.

• Suppose we have a loss $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$.
• Start with $f_0 = g(\cdot\,; \phi_0)$, where $\phi_0 = \arg \min_\phi \sum_{i=1}^n L(y^{(i)}, g(x^{(i)}; \phi))$.
• At each iteration $t$ we fit the best addition to the current model. $$\alpha_t, \phi_t = \arg\min_{\alpha, \phi} \sum_{i=1}^n L(y^{(i)}, f_{t-1}(x^{(i)}) + \alpha g(x^{(i)}; \phi))$$

We have seen several losses that can be used with the forward stagewise additive approach.

• The exponential loss $L(y,f) = \exp(-yf)$ gives us Adaboost.
• The log-loss $L(y,f) = \log(1+\exp(-2yf))$ is more robust to outliers.
• The squared loss $L(y,f) = (y-f)^2$ can be used for regression.

# Limitations of Forward Stagewise Additive Modeling

Forward stagewise additive modeling is not without limitations.

• There may exist other losses for which it is complex to derive boosting-type weight update rules.
• At each step, we may need to solve a costly optimization problem over $\phi_t$.
• Optimizing each $\phi_t$ greedily may cause us to overfit.

# Functional Optimization

Functional optimization offers a different angle on boosting algorithms and a recipe for new algorithms.

• Consider optimizing a loss over arbitrary functions $f: \mathcal{X} \to \mathcal{Y}$.
• Functional optimization consists in solving the problem $$\min_f \sum_{i=1}^n L(y^{(i)}, f(x^{(i)}))$$ over the space of all possible $f$.
• It's easiest to think about $f$ as an infinite dimensional vector indexed by $x \in \mathcal{X}$.

To simplify our explanations, we will assume that there exists a true deterministic mapping $$f^* : \mathcal{X} \to \mathcal{Y}$$ between $\mathcal{X}$ and $\mathcal{Y}$, but the algorithm shown here also works without this assumption.

Consider solving this optimization problem using gradient descent, where the objective is $$J(f) = \sum_{i=1}^n L(y^{(i)}, f(x^{(i)})).$$
We may define the functional gradient of this loss at $f_0$ as a function $\nabla J(f_0) : \mathcal{X} \to \mathbb{R}$ $$\nabla J(f_0)(x) = \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_0(x), \text{y} = f^*(x)}.$$
Let's make a few observations about the functional gradient $$\nabla J(f_0)(x) = \frac{\partial L(\text{y}, \text{f})}{\partial \text{f}} \bigg\rvert_{\text{f} = f_0(x), \text{y} = f^*(x)}.$$
• It's an object indexed by $x \in \mathcal{X}$.
• At each $x \in \mathcal{X}$, $\nabla J(f_0)(x)$ tells us how to modify $f_0(x)$ to make $L(f^*(x), f_0(x))$ smaller.
• This is consistent with the fact that we are optimizing over a "vector" $f$, also indexed by $x \in \mathcal{X}$.
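As a concrete worked example (an addition, not in the original slides): for the squared loss $L(y, f) = (y - f)^2$, the functional gradient evaluates to a scaled negative residual,

```latex
\nabla J(f_0)(x)
  = \frac{\partial (\text{y} - \text{f})^2}{\partial \text{f}}
    \bigg\rvert_{\text{f} = f_0(x),\, \text{y} = f^*(x)}
  = -2\left(f^*(x) - f_0(x)\right).
```

Stepping against this gradient means adding a multiple of the residual $f^*(x) - f_0(x)$ to the model, which matches the residual-fitting step of L2Boost.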