Suppose you trained an image classifier with 80% accuracy. What's next?
We look at how to prioritize decisions to produce performant ML systems.
In order to iterate and improve upon machine learning models, practitioners follow a development workflow.
We first define it at a high level. Afterwards, we will describe each step in more detail.
In machine learning, we typically assume that data comes from a probability distribution $\mathbb{P}$, which we will call the data distribution:
$$ x, y \sim \mathbb{P}. $$ The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.
A hold-out set $\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1,2,...,n\}$ also consists of independent and identically distributed (IID) samples from $\mathbb{P}$ and is distinct from the training set.
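As a small illustration, the sketch below draws IID training and hold-out samples from a synthetic data distribution; the particular choice of $\mathbb{P}$ (Gaussian inputs with noisy linear targets) is an assumption made purely for this example.
import numpy as np
rng = np.random.default_rng(0)
def sample_from_P(n):
    # draw n IID samples (x, y) from a toy data distribution P:
    # x is standard Gaussian and y is a noisy linear function of x
    x = rng.normal(size=n)
    y = 2 * x + rng.normal(scale=0.1, size=n)
    return x, y
# a training set and a distinct hold-out set, both IID from P
x_train, y_train = sample_from_P(100)
x_heldout, y_heldout = sample_from_P(100)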
A model that generalizes is accurate on a hold-out set.
We present a workflow for developing accurate models that generalize.
When developing machine learning models, it is customary to work with three datasets: a training set, a development (validation) set, and a test set.
The typical way in which these datasets are used is to fit candidate models on the training set, compare them and tune hyperparameters on the development set, and evaluate the final model once on the test set.
A note about this procedure: the test set is used to estimate real-world performance, so it should not influence any modeling decisions.
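As a minimal sketch of producing such a split (the tiny synthetic arrays and the 60/20/20 proportions are assumptions for illustration), we can call train_test_split twice:
import numpy as np
from sklearn.model_selection import train_test_split
# a tiny synthetic dataset, purely for illustration
X_all = np.arange(20).reshape(10, 2)
y_all = np.arange(10)
# first carve out a 20% test set, then split the remainder into train and dev
X_rest, X_test, y_rest, y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print('train/dev/test sizes:', len(X_train), len(X_dev), len(X_test))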
How should one choose the development and test set? We highlight two considerations.
Distributional Consistency: The development and test sets should come from the same data distribution that we will encounter in production.
Dataset Size: The development and test sets need to be large enough to provide statistically reliable estimates of future performance (see the sketch below).
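To build intuition for how large these sets need to be, the following sketch approximates the standard error of an estimated accuracy under a simplifying assumption (each prediction is an independent Bernoulli trial); the specific sizes and the 90% accuracy figure are purely illustrative.
import numpy as np
def accuracy_standard_error(p, n):
    # approximate standard error of an accuracy estimate computed on n examples,
    # assuming each prediction is correct independently with probability p
    return np.sqrt(p * (1 - p) / n)
for n in [100, 1000, 10000]:
    print('n = %5d -> std. error of a 90%% accurate model: %.3f' % (n, accuracy_standard_error(0.9, n)))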
Here, we again highlight two considerations.
Choosing Metrics: The model development workflow requires optimizing a single performance metric.
Updating the Model: We select hyperparameters based on dev set performance and rely on intuition about the model and its errors to decide what to try next.
We will provide much more detail on the intuition part later!
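To make the model-updating step concrete, here is a minimal sketch that selects a hyperparameter by dev-set accuracy; the candidate values of the regularization parameter C, the use of LogisticRegression, and the 70/30 split are all assumptions for illustration.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# toy setup: split a dataset into training and development sets
X_toy, y_toy = load_iris(return_X_y=True)
X_tr, X_dev, y_tr, y_dev = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
best_C, best_acc = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate hyperparameter values
    candidate = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    acc = candidate.score(X_dev, y_dev)  # dev-set accuracy
    if acc > best_acc:
        best_C, best_acc = C, acc
print('Selected C = %s with dev accuracy %.2f' % (best_C, best_acc))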
Consider a workflow for building a neural image classifier.
You may encounter a number of issues: for example, the dev/test data may stop being representative of what the model sees in production, or the metric may stop reflecting what you actually care about.
In such cases you need to collect more data and/or change the metric.
The first step towards building better ML models is to determine how to evaluate them.
We will start by talking about how to evaluate classification models.
Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.
We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$: classification, in which the targets are discrete class labels, and regression, in which they are continuous values.
When classification labels take $K=2$ values, we perform binary classification.
A classic example of a classification task is identifying flower species in the Iris dataset.
# import standard machine learning libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris() # load the Iris dataset
X, y = iris.data[:120, :2], iris.target[:120] # create imbalanced classes and only use first 2 features
X, X_holdout, y, y_holdout = train_test_split(X, y, test_size=50, random_state=0) # hold out 50 examples for evaluation
We may visualize this dataset in 2D.
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
# Visualize the Iris flower dataset
setosa_flowers = (iris.target == 0)
plt.scatter(X[:,0], X[:,1], c=y, cmap=plt.cm.Paired)
plt.ylabel("Sepal width (cm)")
plt.xlabel("Sepal length (cm)")
A machine learning model is a function $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.
Below, we fit a softmax regression model to the Iris dataset.
from sklearn.linear_model import LogisticRegression
# fit a softmax regression model (implemented in LogisticRegression in sklearn)
model = LogisticRegression()
model.fit(X,y)
y_pred = model.predict(X_holdout)
The simplest and most natural metric for classification algorithms is accuracy: $$\text{acc}(f) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{f(x^{(i)}) = y^{(i)}\},$$ where $\mathbb{I}\{\cdot\}$ is an indicator function (equals 1 if its input is true and zero otherwise).
accuracy = (y_pred == y_holdout).mean()
print('Iris holdout set accuracy: %.2f' % accuracy)
Iris holdout set accuracy: 0.84
We can better understand classification error via a confusion matrix.
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(model, X_holdout, y_holdout)
Accuracy is a problematic metric when classes are imbalanced.
It is easy to achieve high accuracy simply by being accurate on the more frequent class (for example, by always predicting it).
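As a quick illustration (the 95/5 class split below is a made-up toy example), a classifier that always predicts the majority class achieves high accuracy while misclassifying every example of the minority class.
import numpy as np
# toy imbalanced labels: 95 negatives and 5 positives
y_true_toy = np.array([0] * 95 + [1] * 5)
y_majority = np.zeros_like(y_true_toy)  # always predict the majority class
print('Accuracy of always predicting class 0: %.2f' % (y_majority == y_true_toy).mean())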
We can look at performance in a more precise way when we do binary classification. Consider the following confusion matrix.
$ $ | Predicted positive $\hat y=1$ | Predicted negative $\hat y=0$ |
---|---|---|
Positive class $y=1$ | True positive (TP) | False negative (FN) |
Negative class $y=0$ | False positive (FP) | True negative (TN) |
We can define accuracy as follows: $$\text{accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$ This is the number of correct predictions divided by the total number of predictions.
We can also look at "accuracy" on each class separately. This reveals problems with imbalanced classes. \begin{align*} \underset{\text{(recall, true positive rate)}}{\text{sensitivity}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \underset{\text{(true negative rate)}}{\text{specificity}} & = \frac{\text{TN}}{\text{negative class}} = \frac{\text{TN}}{\text{TN} + \text{FP}} \\ \end{align*}
We can combine these into a single measure called balanced accuracy:
\begin{align*} \text{balanced accuracy} & = \frac{1}{2}\left(\text{specificity} + \text{sensitivity}\right) \\ & = \frac{1}{2}\left(\frac{\text{TN}}{\text{TN} + \text{FP}} + \frac{\text{TP}}{\text{TP} + \text{FN}}\right) \end{align*}
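To make these definitions concrete, the sketch below computes sensitivity, specificity, and balanced accuracy from the entries of a binary confusion matrix; it reuses y_holdout and y_pred from the Iris example above, and treating Virginica (class 2) as the positive class is an arbitrary choice for illustration.
import numpy as np
from sklearn.metrics import confusion_matrix
# binarize the Iris problem: class 2 (Virginica) vs. the rest
y_true_bin = (y_holdout == 2).astype(int)
y_pred_bin = (y_pred == 2).astype(int)
# for binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()
sensitivity = tp / (tp + fn)  # recall, true positive rate
specificity = tn / (tn + fp)  # true negative rate
balanced_accuracy = 0.5 * (sensitivity + specificity)
print('sensitivity: %.2f, specificity: %.2f, balanced accuracy: %.2f' % (sensitivity, specificity, balanced_accuracy))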
An alternative set of measures is precision and recall. \begin{align*} \underset{\text{(positive predictive value)}}{\text{precision}} & = \frac{\text{TP}}{\text{predicted positive}} = \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \underset{\text{(sensitivity, true positive rate)}}{\text{recall}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \end{align*}
Imagine we are building a search engine, where the positive examples are the pages that are relevant to the user. Precision is the fraction of returned pages that are truly relevant, and recall is the fraction of relevant pages that we successfully return.
Notice that we don't directly report performance on negatives (i.e., what fraction of irrelevant pages were labeled as such).
When do we choose precision and recall vs. sensitivity and specificity?
The F-Score is the harmonic mean of precision and recall. $$\text{F-Score} = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}$$
It equals one at perfect precision and recall and zero if either precision or recall is zero.
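Continuing with the same class-2-vs-rest labels (again an illustrative choice), precision, recall, and the F-score can be computed directly with scikit-learn:
from sklearn.metrics import precision_score, recall_score, f1_score
# class 2 (Virginica) vs. the rest, as in the sketch above
y_true_bin = (y_holdout == 2).astype(int)
y_pred_bin = (y_pred == 2).astype(int)
print('precision: %.2f' % precision_score(y_true_bin, y_pred_bin))
print('recall: %.2f' % recall_score(y_true_bin, y_pred_bin))
print('F-score: %.2f' % f1_score(y_true_bin, y_pred_bin))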
Suppose that true positives are more important to us than true negatives. In that case, we can trade off specificity for sensitivity by adjusting the classification threshold.
Most classifiers output confidence scores that make this easy to do.
With our softmax model, we can simply obtain the class probabilities.
The default threshold for predicting class 1 in binary classification is a probability greater than 50%, but we can set it higher or lower.
pred_probabilities = model.predict_proba(X_holdout)
print('Predicted probabilities of class 0 from the model:')
print(pred_probabilities[:10,0])
Predicted probabilities of class 0 from the model:
[0.90500627 0.1628024  0.26248388 0.91899931 0.05452794 0.98241773 0.90444375 0.87543531 0.83226506 0.12415388]
In binary classification, the Receiver Operating Characteristic (ROC) curve plots the true positive rate and the false positive rate (FPR) as we vary the threshold for labeling a positive example.
\begin{align*} \text{TPR} = \underset{\text{(recall, sensitivity)}}{\text{true positive rate}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \text{FPR} = 1-\underset{\text{(true negative rate)}}{\text{specificity}} & = 1 - \frac{\text{TN}}{\text{negative class}} = \frac{\text{FP}}{\text{FP} + \text{TN}} \\ \end{align*}
Suppose we want to improve sensitivity for Class 2 on the Iris dataset. We first compute the probability $p(y=2|x)$ for each input $x$. For any threshold $t>0$, we label $x$ as Class 2 if $p(y=2|x)>t$.
Below, the ROC curve measures the TPR and FPR as we vary $t$.
from sklearn.metrics import roc_curve
class2_scores = pred_probabilities[:,2] # we take class 2 as the "positive" class
# create labels where class 2 is the "positive" class
class2_y = np.zeros(y_holdout.shape)
class2_y[y_holdout==2] = 1
print('First class 2 scores: ', class2_scores[:4])
fpr, tpr, thresholds = roc_curve(class2_y, class2_scores)
First class 2 scores: [0.02495207 0.15064683 0.17470687 0.00545101]
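Before plotting the full curve, it can help to look at a single operating point. The sketch below (the threshold value $t=0.3$ is an arbitrary choice for illustration) labels an example as Class 2 whenever its predicted probability exceeds $t$ and computes the resulting TPR and FPR by hand.
import numpy as np
t = 0.3  # an arbitrary probability threshold for predicting class 2
pred_at_t = (class2_scores > t).astype(int)
tp = np.sum((pred_at_t == 1) & (class2_y == 1))
fn = np.sum((pred_at_t == 0) & (class2_y == 1))
fp = np.sum((pred_at_t == 1) & (class2_y == 0))
tn = np.sum((pred_at_t == 0) & (class2_y == 0))
print('At threshold %.1f: TPR = %.2f, FPR = %.2f' % (t, tp / (tp + fn), fp / (fp + tn)))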
We can visualize the TPR vs. the FPR at various thresholds.
plt.figure(figsize=(6,6))
plt.plot(fpr, tpr, color='darkorange')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
We highlight the following properties of the ROC curve: the dashed diagonal corresponds to a classifier that guesses at random, a perfect classifier passes through the top-left corner (a TPR of one at an FPR of zero), and curves that lie closer to the top-left corner indicate better performance.
We can use the area under the curve (AUC) as a single measure of classifier performance.
We may compute the AUC of the above ROC curve as follows.
from sklearn.metrics import auc
print('AUC-ROC: %.4f' % auc(fpr, tpr))
AUC-ROC: 0.8555
We can also define sensitivity, specificity, and other metrics for each class in multi-class classification.
In multi-class settings, we average binary metrics in various ways:
macro
: We average binary one-vs-all metrics for each class.
$$\text{precision}_\text{macro} = \frac{1}{K} \sum_{k=1}^K \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}$$
micro
: We average binary metrics for each point.
$$\text{precision}_\text{micro} = \frac{\sum_{k=1}^K \text{TP}_k}{\sum_{k=1}^K (\text{TP}_k + \text{FP}_k)}$$
See the scikit-learn guide for more on model evaluation.
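As a quick check of these definitions (reusing y_holdout and y_pred from the Iris example above), scikit-learn computes both averages directly:
from sklearn.metrics import precision_score
print('macro-averaged precision: %.2f' % precision_score(y_holdout, y_pred, average='macro'))
print('micro-averaged precision: %.2f' % precision_score(y_holdout, y_pred, average='micro'))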
from sklearn.metrics import classification_report
print(classification_report(y_holdout, y_pred, target_names=['Setosa', 'Versicolor', 'Virginica']))
              precision    recall  f1-score   support

      Setosa       0.95      1.00      0.97        19
  Versicolor       0.79      0.92      0.85        24
   Virginica       0.50      0.14      0.22         7

    accuracy                           0.84        50
   macro avg       0.75      0.69      0.68        50
weighted avg       0.81      0.84      0.81        50
The first step towards building better ML models is to determine how to evaluate them.
Next, we look at regression models.
Recall our training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$ and a model $f : \mathcal{X} \to \mathcal{Y}$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.
In regression, the targets $y^{(i)}$ are continuous real-valued quantities rather than discrete class labels.
Most standard regression losses can be used as evaluation metrics.
from sklearn.metrics import mean_squared_error, mean_absolute_error
y1 = np.array([1, 2, 3, 4])
y2 = np.array([-1, 1, 3, 5])
print('Mean squared error: %.2f' % mean_squared_error(y1, y2))
print('Mean absolute error: %.2f' % mean_absolute_error(y1, y2))
Mean squared error: 1.50
Mean absolute error: 1.00
These metrics have a number of limitations: for example, they depend on the scale of the targets, so the same numeric error can be large or small depending on the magnitude of the targets.
To account for differences in scale, we may work with scaled losses, such as the mean absolute percentage error (MAPE) $$\text{MAPE} = \frac{1}{n} \sum_{i=1}^n \left| \frac{y^{(i)} - f(x^{(i)})}{y^{(i)}} \right|$$ and its symmetric variant, the symmetric mean absolute percentage error (SMAPE) $$\text{SMAPE} = \frac{1}{n} \sum_{i=1}^n \frac{\left| y^{(i)} - f(x^{(i)}) \right|}{\left( |y^{(i)}| + |f(x^{(i)})| \right)/2}.$$
Unlike the MAPE, the SMAPE allows either $y^{(i)}$ or $f(x^{(i)})$ to be small (or zero).
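A minimal numpy sketch of these two quantities, reusing the toy vectors y1 (targets) and y2 (predictions) defined above:
import numpy as np
mape = np.mean(np.abs(y1 - y2) / np.abs(y1))
smape = np.mean(np.abs(y1 - y2) / ((np.abs(y1) + np.abs(y2)) / 2))
print('MAPE: %.2f' % mape)
print('SMAPE: %.2f' % smape)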
Another way to account for error in $f(x^{(i)})$ relative to $y^{(i)}$ is by taking the log of both values. This puts them on the same scale. $$\frac{1}{n} \sum_{i=1}^n \left| \log(1 + y^{(i)}) - \log(1 + f(x^{(i)})) \right|$$ This is called the mean absolute logarithmic error (MALE).
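A sketch of this metric on a pair of hypothetical non-negative vectors (the values below are made up for illustration; the logarithm requires targets greater than $-1$):
import numpy as np
# hypothetical non-negative targets and predictions, chosen only for illustration
y_true_log = np.array([1.0, 10.0, 100.0, 1000.0])
y_model_log = np.array([2.0, 8.0, 120.0, 900.0])
male = np.mean(np.abs(np.log(1 + y_true_log) - np.log(1 + y_model_log)))
print('Mean absolute logarithmic error: %.2f' % male)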
The coefficient of determination, usually denoted by $R^2$, measures the accuracy of the predictions relative to constantly predicting the average $\bar y = \frac{1}{n}\sum_{i=1}^n y^{(i)}$: $$R^2 = 1 - \left(\frac{\sum_{i=1}^n \left( f(x^{(i)}) - y^{(i)} \right)^2}{\sum_{i=1}^n \left( \bar y - y^{(i)} \right)^2}\right).$$
An $R^2$ of one corresponds to perfect accuracy. An $R^2$ of zero means that $f$ is not better than the average prediction.
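The sketch below checks this formula against scikit-learn's built-in r2_score, reusing the y1 (targets) and y2 (predictions) vectors from above.
import numpy as np
from sklearn.metrics import r2_score
ss_res = np.sum((y2 - y1) ** 2)  # squared error of the model's predictions
ss_tot = np.sum((np.mean(y1) - y1) ** 2)  # squared error of always predicting the mean
print('R^2 from the definition: %.2f' % (1 - ss_res / ss_tot))
print('R^2 from scikit-learn: %.2f' % r2_score(y1, y2))
Here the result is negative, which happens whenever the predictions are worse than simply predicting the average.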