Suppose you trained an image classifier with 80% accuracy. What's next?
We will next learn how to prioritize these decisions when applying ML.
The key to building great ML systems is to be data-driven:
This process can compensate for an initial lack of domain expertise.
A crucial part of understanding model performance is to systematically examine its errors.
Error analysis is a formal process by which this can be done.
The machine learning development workflow has three steps:
When developing machine learning models, it is customary to work with three datasets:
How do we iterate on a model given its results on the development set? We can perform error analysis, bias/variance analysis, comparisons against baselines, and many more types of analyses!
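Before turning to these analyses, here is a minimal sketch of the customary three-way split described above. The use of the digits dataset and the 60/20/20 proportions are purely illustrative choices, not something prescribed by these notes.
# A minimal sketch of a train/dev/test split (proportions are illustrative).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
X_all, y_all = load_digits(return_X_y=True)
# First hold out a test set, then split the remainder into training and dev sets.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
X_train_demo, X_dev_demo, y_train_demo, y_dev_demo = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train_demo), len(X_dev_demo), len(X_test_demo))  # roughly 60% / 20% / 20% of the data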
Suppose you trained an image classifier over animal photos.
The key to deciding what to improve next is to closely examine the model's actual performance.
Suppose you need to decide if it's worth fixing a certain type of error.
If 5% of misclassified examples have that problem, it's probably not important. If 50% do, then it's important.
Error analysis systematically identifies the most common errors made by the model.
You should prioritize the most common error categories.
Error analysis involves classifying errors into categories.
Suppose you just trained a new image classifier.
We then go through the random subset of errors and assign them to categories.
|         | Blurry | Flipped | Mislabeled |
|---------|--------|---------|------------|
| Image 1 | X      | X       |            |
| Image 2 |        | X       |            |
| Image 3 |        |         | X          |
| ...     |        |         |            |
| Total   | 20%    | 50%     | 30%        |
From the totals, we see that the most important fix is to correct for flipped images.
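As a minimal sketch of how such a tally can be kept in code, suppose we record the category tags assigned to each inspected error. The tag names below mirror the table above, but the tag assignments themselves are hypothetical, and only the first three rows of the table are spelled out.
# Tally how often each (manually assigned) error category occurs.
from collections import Counter
error_tags = [
    ["blurry", "flipped"],  # image 1
    ["flipped"],            # image 2
    ["mislabeled"],         # image 3
    # ... one entry per inspected misclassified example
]
counts = Counter(tag for tags in error_tags for tag in tags)
for tag, count in counts.most_common():
    print('%s: %.0f%% of inspected errors' % (tag, 100 * count / len(error_tags)))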
Real-world data is often messy, and labels are not always correct.
It's important to fix labeling issues if they prevent us from measuring model error.
How big should the dev set be? Error analysis suggests a lower bound.
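As a rough back-of-the-envelope sketch of that lower bound: the dev set needs to yield enough errors to categorize. The target number of errors and the assumed error rate below are made-up values for illustration.
# The dev set must contain at least (target errors / expected error rate) examples.
target_errors = 100          # hypothetical: errors we want to inspect and categorize
expected_error_rate = 0.07   # hypothetical: the model's approximate dev error
min_dev_size = int(target_errors / expected_error_rate)
print(min_dev_size)          # about 1400 examples as a lower bound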
Also, remember to periodically update the dev set to avoid overfitting to it.
Should we perform error analysis on the dev set or the training set? Errors on the dev set tell us how the model generalizes, but a model can also make errors on the training set, for example when it underfits or when training labels are wrong. Hence, analyzing and fixing training set errors is also important.
Let's look at another example of error analysis on a small toy dataset.
We will use the sklearn digits dataset, a downscaled version of MNIST.
from sklearn.datasets import load_digits
digits = load_digits()
We can visualize these digits as follows:
from matplotlib import pyplot as plt
_, axes = plt.subplots(2, 5)
images_and_labels = list(zip(digits.images, digits.target))
for ax, (image, label) in zip(axes.flatten(), images_and_labels[:10]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Digit %i' % label)
Let's separate this data into two equal-sized training and dev sets.
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
from sklearn.model_selection import train_test_split
# Split data into equal-sized train and dev subsets
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
X_train, X_dev, y_train, y_dev = train_test_split(
data, digits.target, test_size=0.5, shuffle=False)
We can train a simple softmax (multinomial logistic regression) model on this data.
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=int(1e7))
# Train the classifier on the first half of the digits
classifier.fit(X_train, y_train)
# Now predict the value of the digit on the second half:
predicted = classifier.predict(X_dev)
It achieves the following accuracy.
(predicted == y_dev).mean()
0.9332591768631813
We hypothesize that certain digits are misclassified more than others.
# these dev set digits are classified incorrectly
X_error = X_dev[predicted != y_dev]
y_error = y_dev[predicted != y_dev]
p_error = predicted[predicted != y_dev]
# these dev set digits are classified correctly
X_corr = X_dev[predicted == y_dev]
y_corr = y_dev[predicted == y_dev]
p_corr = predicted[predicted == y_dev]
# show the histogram
plt.xticks(range(10))
plt.hist(y_error)
(array([ 3., 11., 2., 14., 7., 4., 1., 4., 9., 5.]), array([0. , 0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3, 7.2, 8.1, 9. ]), <BarContainer object of 10 artists>)
The most common misclassified digit is a 3.
We can investigate the issue by looking at a subset of misclassified 3's (top row) and compare them to correctly classified 3's (bottom row).
_, axes = plt.subplots(2, 8)
# these images are classified incorrectly
images_and_labels = list(zip(X_error[y_error==3], p_error[y_error==3]))
for ax, (image, label) in zip(axes[0,:], images_and_labels[:8]):
    ax.set_axis_off()
    ax.imshow(image.reshape((8,8)), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('f(x)=%i' % label)
# these images are classified correctly
images_and_labels = list(zip(X_corr[y_corr==3], p_corr[y_corr==3]))
for ax, (image, label) in zip(axes[1,:], images_and_labels[:8]):
    ax.set_axis_off()
    ax.imshow(image.reshape((8,8)), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('f(x)=%i' % label)
We discover that the model is misclassifying a particular style of 3, and we can focus our efforts on this type of error.
The main limitation of error analysis is that it identifies the model's most common errors, but does not by itself explain why the model makes them.
Hence, we perform other analyses to explain and diagnose errors.
Another way to understand the performance of the model is to examine the extent to which it's overfitting or underfitting the data.
We refer to this as bias/variance analysis.
Recall that error analysis systematically identifies the most common errors made by the model, and that we should prioritize fixing the most common error categories.
Overfitting is one of the most common failure modes of machine learning.
Recall this example, where we randomly sample around a true function.
import numpy as np
np.random.seed(1)
n_samples = 40
true_fn = lambda X: np.cos(1.5 * np.pi * X)
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
X_line = np.linspace(0, 1, 100)
plt.plot(X_line, true_fn(X_line), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
Below, we fit a high-degree polynomial on random samples of 30 points from this dataset.
Each small subset of the data that we train on results in a very different model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
n_plots, X_line = 3, np.linspace(0,1,20)
plt.figure(figsize=(14, 5))
for i in range(n_plots):
    ax = plt.subplot(1, n_plots, i + 1)
    random_idx = np.random.randint(0, 30, size=(30,))
    X_random, y_random = X[random_idx], y[random_idx]
    # fit a high-degree polynomial to this random subset of the data
    polynomial_features = PolynomialFeatures(degree=20, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X_random[:, np.newaxis], y_random)
    ax.plot(X_line, true_fn(X_line), label="True function")
    ax.plot(X_line, pipeline.predict(X_line[:, np.newaxis]), label="Model")
    ax.scatter(X_random, y_random, edgecolor='b', s=20, label="Samples", alpha=0.2)
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title('Random sample %d' % i)
An algorithm that has a tendency to overfit is also called high variance, because it outputs a predictive model that varies a lot if we slightly perturb the dataset.
Underfitting is another common problem in machine learning.
Because the model cannot fit the data, we say it's high bias.
We may compare overfitting vs underfitting on our polynomial dataset.
degrees = [1, 20, 5]
titles = ['Underfitting', 'Overfitting', 'A Good Fit']
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("pf", polynomial_features), ("lr", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    ax.plot(X_line, true_fn(X_line), label="True function")
    ax.plot(X_line, pipeline.predict(X_line[:, np.newaxis]), label="Model")
    ax.scatter(X, y, edgecolor='b', s=20, label="Samples", alpha=0.2)
    ax.set_xlim((0, 1))
    ax.set_ylim((-2, 2))
    ax.legend(loc="best")
    ax.set_title("{} (Degree {})".format(titles[i], degrees[i]))
To a first approximation, every error in machine learning is either underfitting (bias) or overfitting (variance).
By definition, if we have no bias and no variance, we have a perfect model. Hence, bias/variance is important to understand.
We approximately quantify the bias and the variance of a model as follows.
$$\text{dev error} = (\underbrace{\text{dev error} - \text{train error}}_\text{variance}) + \underbrace{\text{train error}}_\text{bias}$$

It's important to consider both types of errors.
We can make different changes to the algorithm to address both of these issues.
We may use this observation to diagnose bias/variance in practice.
Consider the following example: the training error is high, and the dev error is about as high. This is a typical example of high bias (underfitting).
Next, consider another example: the training error is low, but the dev error is much higher. This is an example of high variance (overfitting).
Finally, suppose you see that both the training error and the dev error are low. This is a model that seems to work quite well!
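To make these three scenarios concrete, here is a minimal sketch that applies the decomposition above to pairs of hypothetical train/dev error rates; the helper function, the numbers, and the tolerance threshold are all arbitrary illustrations, not part of the notes above.
# Illustrative helper: diagnose bias/variance from train and dev errors.
def diagnose(train_error, dev_error, tolerance=0.02):
    # variance: gap between dev and train error; bias: the train error itself
    variance = dev_error - train_error
    bias = train_error
    if bias > tolerance:
        print('bias problem: train error = %.2f' % bias)
    if variance > tolerance:
        print('variance problem: dev - train = %.2f' % variance)
    if bias <= tolerance and variance <= tolerance:
        print('the model seems to work well')

diagnose(train_error=0.15, dev_error=0.16)  # high bias (underfitting)
diagnose(train_error=0.01, dev_error=0.12)  # high variance (overfitting)
diagnose(train_error=0.01, dev_error=0.02)  # a good fit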
The best way to reduce variance is to give the model more data.
However, this may not be feasible because of high costs for compute or data acquisition.
Alternative options for reducing variance include:
The best way to reduce bias is to increase the expressivity of the model.
However, this may not be feasible because of high costs for compute.
Alternative options for reducing bias include:
For both bias and variance reduction, we can use error analysis to guide our changes, e.g.:
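As a small illustration of the bias-reduction route (increasing model expressivity), here is a sketch on the polynomial data from earlier; the two degrees compared are arbitrary choices that mirror the underfitting and good-fit panels above.
# Increasing the polynomial degree makes the model more expressive
# and lowers the training error (i.e., reduces bias).
from sklearn.metrics import mean_squared_error
for degree in [1, 5]:
    pipeline = Pipeline([
        ("pf", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression()),
    ])
    pipeline.fit(X[:, np.newaxis], y)
    train_mse = mean_squared_error(y, pipeline.predict(X[:, np.newaxis]))
    print('degree %d: training MSE = %.3f' % (degree, train_mse))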
Let's use our earlier example with the sklearn digits dataset to illustrate this approach.
Recall our digits dataset from earlier:
from matplotlib import pyplot as plt
_, axes = plt.subplots(2, 5)
images_and_labels = list(zip(digits.images, digits.target))
for ax, (image, label) in zip(axes.flatten(), images_and_labels[:10]):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Digit %i' % label)
We can train a small fully-connected neural network on this data.
# https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()
# Train the network on the first half of the digits
classifier.fit(X_train, y_train)
# Now predict the value of the digit:
predicted = classifier.predict(X_dev)
predicted_train = classifier.predict(X_train)
It achieves the following accuracy.
print('Training set accuracy: %.3f ' % (predicted_train == y_train).mean())
print('Development set accuracy: %.3f ' % (predicted == y_dev).mean())
Training set accuracy: 1.000
Development set accuracy: 0.937
The model has clearly memorized the training set and is overfitting. Let's increase regularization.
classifier = MLPClassifier(max_iter=1000, alpha=1)
# Train the network on the first half of the digits
classifier.fit(X_train, y_train)
# Now predict the value of the digit:
predicted = classifier.predict(X_dev)
predicted_train = classifier.predict(X_train)
By increasing L2 regularization (alpha), we improve performance by 1%.
(Although we still somewhat overfit)
print('Training set accuracy: %.3f ' % (predicted_train == y_train).mean())
print('Development set accuracy: %.3f ' % (predicted == y_dev).mean())
Training set accuracy: 1.000
Development set accuracy: 0.947
These two analyses reveal different types of problems:
These two analyses also complement each other.
Bias/variance analysis also helps guide hyperparameter search.
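For example, here is a minimal sketch of using the train/dev gap to choose the MLP's regularization strength on the digits data; the candidate alpha values below are arbitrary, and a full grid search would work similarly.
# Sweep the regularization strength and inspect the train/dev gap:
# a large gap suggests high variance, a high training error suggests high bias.
for alpha in [1e-4, 1e-2, 1, 10]:
    clf = MLPClassifier(max_iter=1000, alpha=alpha, random_state=0)
    clf.fit(X_train, y_train)
    print('alpha=%g: train accuracy=%.3f, dev accuracy=%.3f'
          % (alpha, clf.score(X_train, y_train), clf.score(X_dev, y_dev)))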
In summary, ML model development can be seen as alternating between the following two steps:
Error analysis guides specific changes in this process.
In order to understand model performance, we need to put it in context.
Baselines represent a benchmark against which we compare performance.
Suppose you train a regression model with a mean L1 error of 20. Is that good or bad? On its own, this number is hard to interpret.
Thus, we need to put our results in context by comparing to other models.
A baseline is another model against which we compare ourselves.
Examples of baselines include:
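As one concrete example (my choice of illustration, not prescribed by the notes), we can compute a majority-class baseline on the digits data from earlier using scikit-learn's DummyClassifier.
# A majority-class baseline gives context for the accuracies reported above.
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print('Baseline dev accuracy: %.3f' % baseline.score(X_dev, y_dev))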
In practice, we also want to set a target upper bound on our performance.
There are different ways to compute an upper bound:
Our target optimal error helps us better quantify bias and variance:
$$\text{dev error} = (\underbrace{\text{dev error} - \text{train error}}_\text{variance}) + (\underbrace{\text{train error} - \text{opt error}}_\text{avoidable bias}) + \underbrace{\text{opt error}}_\text{unavoidable bias}$$

Consider the following example: the training error is close to the optimal error, but the dev error is much higher.
The bias is almost ideal. We have a variance problem.
Next, consider a scenario in which the training error falls below the optimal error, while the dev error remains much higher. Training error less than the optimal error means that we have overfit the training set. We have a variance problem.
Finally, consider another example, in which both the training error and the dev error are close to the optimal error. We are close to being optimal!
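To wrap up, here is a minimal sketch that applies this refined decomposition to the three scenarios above; the helper function, the error rates, and the optimal-error estimate are hypothetical values chosen only to match the qualitative descriptions.
# Illustrative helper: split dev error into variance, avoidable bias, and unavoidable bias.
def decompose(train_error, dev_error, optimal_error):
    variance = dev_error - train_error
    avoidable_bias = train_error - optimal_error
    print('variance=%.3f, avoidable bias=%.3f, unavoidable bias=%.3f'
          % (variance, avoidable_bias, optimal_error))

decompose(train_error=0.02, dev_error=0.15, optimal_error=0.01)   # mostly a variance problem
decompose(train_error=0.005, dev_error=0.15, optimal_error=0.01)  # train error below optimal: overfitting
decompose(train_error=0.02, dev_error=0.03, optimal_error=0.01)   # close to optimal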