Suppose you trained an image classifier with 80% accuracy. What's next?
We will next learn how to prioritize these decisions when applying ML.
Overfitting is one of the most common failure modes of machine learning.
Models that overfit are said to be high variance.
Underfitting is another common problem in machine learning.
Because the model cannot fit the data, we say it's high bias.
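To make these two failure modes concrete, here is a minimal sketch (not from the lecture; the synthetic dataset, polynomial degrees, and noise level are illustrative choices) that fits polynomials of increasing degree to noisy 1-D data: degree 1 underfits (high bias, both errors high), while a very high degree overfits (high variance, low train error but high test error).

# A minimal under-/overfitting sketch on synthetic 1-D data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.2 * rng.randn(30)
X_test = rng.uniform(0, 1, size=(200, 1))
y_test = np.sin(2 * np.pi * X_test).ravel() + 0.2 * rng.randn(200)

for degree in [1, 4, 15]:  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # High bias: both errors high. High variance: low train, high test error.
    print("degree %2d: train MSE %.3f, test MSE %.3f" % (degree, train_mse, test_mse))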
Learning curves show performance as a function of training set size.
Learning curves are defined for fixed hyperparameters. Observe that dev set error decreases as we give the model more data.
It is often very useful to have a target upper bound on performance (e.g., human accuracy); it can also be visualized on the learning curve.
In the example below, the dev error has plateaued and we know that adding more data will not be useful.
We can further augment this plot with training set performance.
A few observations can be made here:
Learning curves can reveal when we have a bias problem.
In practice, it can be hard to visually assess whether the dev error has plateaued. Adding the training error makes this easier.
The following plot shows we have high variance.
In this plot, we have both high variance and high bias.
To further illustrate the idea of learning curves, consider the following example.
We will use the sklearn digits dataset, a downscaled version of MNIST.
# Example from https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
We can visualize these digits as follows:
from matplotlib import pyplot as plt

# load_digits(return_X_y=True) above returned only the data arrays, so we
# reload the full dataset object to access the 8x8 images and their labels.
digits = load_digits()

_, axes = plt.subplots(2, 5, figsize=(10, 4))
images_and_labels = list(zip(digits.images, digits.target))
for ax, (image, label) in zip(axes.flatten(), images_and_labels[:10]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Digit %i' % label)
The following is boilerplate code for visualizing learning curves; understanding it is not essential to this example.
import numpy as np
from sklearn.model_selection import learning_curve
def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
"""Generate learning curves for an algorithm."""
if axes is None:
_, axes = plt.subplots(1, 3, figsize=(20, 5))
axes[0].set_title(title)
if ylim is not None:
axes[0].set_ylim(*ylim)
axes[0].set_xlabel("Training examples")
axes[0].set_ylabel("Score")
train_sizes, train_scores, test_scores, fit_times, _ = \
learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
# Plot learning curve
axes[0].grid()
axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training Accuracy")
axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Dev Set Accuracy")
axes[0].legend(loc="best")
return plt
We visualize learning curves for two algorithms:
from sklearn.model_selection import ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# This is a technical detail, but we will obtain dev set performance via
# cross-validation rather than via a separate dev set.
# Cross-validation is a technique that emulates a separate dev set with small data.
# We also use 100 iterations to get smoother mean test and train curves,
# each time with 20% of the data randomly selected as a validation set.
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
title = "Learning Curves (Naive Bayes)"
plot_learning_curve(GaussianNB(), title, X, y, axes=[axes[0]], ylim=(0.7, 1.01), cv=cv, n_jobs=4)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
plot_learning_curve(SVC(gamma=0.001), title, X, y, axes=[axes[1]], ylim=(0.7, 1.01), cv=cv, n_jobs=4)
We can draw a few takeaways: for Naive Bayes, the training and dev accuracies converge to a similar plateau, suggesting a bias-limited model for which more data brings little benefit; for the SVM, training accuracy stays near its maximum while dev accuracy is still improving as data is added, suggesting the model would benefit from more data.
The main limitations of learning curves include their computational cost: every point on the curve requires retraining the model from scratch, and the resulting estimates can be noisy unless averaged over many random splits.
The machine learning development workflow has three steps: training a model on the training set, tuning it on the development set, and evaluating it on the test set.
Many algorithms minimize a loss function using an iterative optimization procedure like gradient descent.
Loss curves plot the training objective as a function of the number of training steps on training or development datasets.
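As a hedged sketch (not part of the original example; the variable names and hyper-parameters are illustrative), we can trace a loss curve on the digits data by training an SGDClassifier with partial_fit and recording the log loss on the training and dev sets after every epoch:

# A minimal loss-curve sketch: log loss vs. number of training epochs.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1] so SGD behaves well
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, test_size=0.2, random_state=0)

# loss='log_loss' is called loss='log' in older scikit-learn versions
clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=1e-3,
                    random_state=0)
classes = np.unique(y)
train_losses, dev_losses = [], []
for epoch in range(50):
    clf.partial_fit(X_tr, y_tr, classes=classes)  # one pass over the data
    train_losses.append(log_loss(y_tr, clf.predict_proba(X_tr), labels=classes))
    dev_losses.append(log_loss(y_dev, clf.predict_proba(X_dev), labels=classes))

plt.plot(train_losses, label="Training loss")
plt.plot(dev_losses, label="Dev loss")
plt.xlabel("Epoch"); plt.ylabel("Log loss"); plt.legend()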
A few observations can be made here:
A failure mode of some machine learning algorithms is overtraining.
A closely related problem is undertraining: not training the model for long enough.
This can be diagnosed via a loss curve showing that the dev set loss is still on an improving trajectory.
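A common remedy for overtraining (and a guard against undertraining) is early stopping: keep training while the dev loss improves and halt once it stalls. A minimal patience-based sketch, reusing the illustrative variables from the loss-curve code above:

# Early stopping on the dev loss with a simple "patience" counter.
clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=1e-3,
                    random_state=0)
best_dev_loss, patience, wait = float('inf'), 5, 0
for epoch in range(500):
    clf.partial_fit(X_tr, y_tr, classes=classes)
    dev_loss = log_loss(y_dev, clf.predict_proba(X_dev), labels=classes)
    if dev_loss < best_dev_loss - 1e-4:  # meaningful improvement: keep going
        best_dev_loss, wait = dev_loss, 0
    else:
        wait += 1
        if wait >= patience:             # dev loss stalled: stop training
            break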
Loss curves also enable diagnosing optimization problems. In the classic illustration, each line is a loss curve obtained with a different learning rate (LR): too large a learning rate makes the loss diverge or oscillate, too small a rate makes progress slow, and a well-chosen rate, not too fast and not too slow, decreases steadily toward a low value.
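To see this concretely, one can rerun the illustrative loop above with several learning rates and overlay the resulting training loss curves; the eta0 values below are arbitrary examples.

# Compare training loss curves across learning rates (illustrative values).
for lr in [1e-5, 1e-3, 1e-1]:
    clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=lr,
                        random_state=0)
    losses = []
    for epoch in range(50):
        clf.partial_fit(X_tr, y_tr, classes=classes)
        losses.append(log_loss(y_tr, clf.predict_proba(X_tr), labels=classes))
    plt.plot(losses, label="eta0 = %g" % lr)  # too slow / good / too fast
plt.xlabel("Epoch"); plt.ylabel("Training log loss"); plt.legend()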
Advantages of using loss curves include that they are essentially free to produce during training (no retraining is needed) and that they reveal overtraining, undertraining, and optimization problems. Loss curves don't, however, diagnose the utility of adding more data; when bias/variance diagnosis is ambiguous, use learning curves.
Validation curves help us understand the effects of different hyper-parameters.
ML models normally have hyper-parameters, e.g. L2 regularization strength, neural net layer size, number of K-Means clusters, etc.
Validation curves plot model performance as a function of hyper-parameter values on the training and development datasets.
Consider the following example, in which we train an SVM on the digits dataset.
Recall the digits dataset introduced earlier in this lecture.
We can train an SVM with an RBF kernel for different values of the bandwidth $\gamma$ using the validation_curve function.
from sklearn.model_selection import validation_curve
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
SVC(), X, y, param_name="gamma", param_range=param_range,
scoring="accuracy", n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
We visualize this as follows.
plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(param_range, train_scores_mean, label="Training accuracy",
color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2,
color="darkorange", lw=lw)
plt.semilogx(param_range, test_scores_mean, label="Validation accuracy",
color="navy", lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2,
color="navy", lw=lw)
plt.legend(loc="best")
This shows that the SVM underfits for very small values of $\gamma$, overfits for very large values, and that medium values of $\gamma$ are just right.
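The validation curve can also drive hyper-parameter selection directly; for example, reusing the param_range and test_scores_mean arrays computed above:

# Pick the gamma value with the highest mean validation accuracy.
best_gamma = param_range[np.argmax(test_scores_mean)]
print("Best gamma: %g" % best_gamma)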
So far, we have assumed that the distributions of our different datasets are relatively similar. When that is not the case, we may run into problems.
When developing machine learning models, it is customary to work with three datasets: a training set, a development (validation) set, and a test set. The development and test sets should come from the data distribution we will see in production.
We talk about distribution mismatch when the previously stated conditions don't hold, i.e. when the training data comes from a different distribution than the dev and test sets.
In order to diagnose mismatch problems between the training and dev sets, we may create a new dataset: the training dev set, a random subset of the training set that is held out and used as a second validation set.
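As a minimal sketch, assuming X_train and y_train hold our (potentially mismatched) training data, we can carve out a training dev set with train_test_split:

from sklearn.model_selection import train_test_split

# Hold out 20% of the training set as a "training dev" set. Unlike the dev
# set, it follows the training distribution, which is what lets us separate
# variance from distribution mismatch.
X_train, X_train_dev, y_train, y_train_dev = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)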
We may use this new dataset to diagnose distribution mismatch. Suppose the dev set error is high. As an example, suppose we are building a cat image classifier.
Consider the following example: the training error is low, but both the training dev and dev set errors are considerably higher and comparable to each other. This is a typical example of high variance (overfitting).
Next, consider another example: the training error itself is high relative to the optimal (e.g., human) error rate, with the training dev and dev errors only slightly higher still. This looks like an example of high avoidable bias (underfitting).
Finally, suppose you see the following: the training and training dev errors are both low, but the dev set error is much higher. This is a model that is generalizing to the training dev set, but not to the standard dev set: distribution mismatch is a problem.
We may quantify this issue more precisely using the following decomposition:
\begin{align*}
\text{dev error} & = (\underbrace{\text{dev error} - \text{train-dev error}}_\text{distribution mismatch}) \\
& + (\underbrace{\text{train-dev error} - \text{train error}}_\text{variance}) + (\underbrace{\text{train error} - \text{opt error}}_\text{avoidable bias}) \\
& + \underbrace{\text{opt error}}_\text{unavoidable bias}
\end{align*}
We may also apply this analysis to the dev and test sets to determine if they are stale.
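As a purely hypothetical numeric illustration (these error rates are invented for the example): with an optimal (human) error of 2%, a train error of 6%, a train-dev error of 8%, and a dev error of 15%, the decomposition gives 2% unavoidable bias, 4% avoidable bias, 2% variance, and 7% distribution mismatch, identifying mismatch as the dominant problem.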
Correcting data mismatch requires making the training data more similar to the dev and test data, either by collecting more examples from the distribution we will see in production or by transforming and synthesizing training data to resemble it.
The best way to understand data mismatch is using error analysis.