Lecture 20: Evaluating Machine Learning Models

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Practical Considerations When Applying Machine Learning

Suppose you trained an image classifier with 80% accuracy. What's next?

We look at how to prioritize decisions to produce performant ML systems.

Part 1: Machine Learning Development Workflow

In order to iterate and improve upon machine learning models, practitioners follow a development workflow.

We first define it at a high level. Afterwards, we will describe each step in more detail.

Review: Data Distribution

In machine learning, we typically assume that data comes from a probability distribution $\mathbb{P}$, which we will call the data distribution:

$$ x, y \sim \mathbb{P}. $$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

Review: Hold-Out Set

A hold-out set $\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$ and is distinct from the training set.

A model that generalizes is accurate on a hold-out set.

We present a workflow for developing accurate models that generalize.

Datasets for Model Development

When developing machine learning models, it is customary to work with three datasets: the training set, the development set (also called the validation or hold-out set), and the test set.

Model Development Workflow

The typical way in which these datasets are used is:

  1. Training: Try a new model and fit it on the training set.
  2. Model Selection: Estimate performance on the development set using metrics. Based on the results, return to step #1 and try a new model idea.
  3. Evaluation: Finally, estimate real-world performance on the test set.
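
As a concrete illustration, here is a minimal sketch of producing the three splits with scikit-learn's `train_test_split`; the synthetic dataset and the 60/20/20 proportions are illustrative choices, not part of the lecture.

```python
# A minimal sketch of a train/dev/test split (illustrative 60/20/20 proportions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```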

A few extra notes about this procedure:

Development and Test Sets

Choosing a Test Set

The test set is used to estimate real-world performance.

Choosing Dev and Test Sets

How should one choose the development and test set? We highlight two considerations.

Distributional Consistency: The development and test sets should be from the data distribution we will see in production.

Dataset Size: The dev and test sets need to be large enough to yield accurate estimates of future performance.

Model Selection

Here, we again highlight two considerations.

Choosing Metrics: The model development workflow requires optimizing a single performance metric.

Updating the Model: We select hyperparameters based on dev set performance and:

We will provide much more detail on the intuition part later!

Example: Training a Neural Net

Consider a workflow for building a neural image classifier.

  1. We start with a standard CNN that gets 90% dev set accuracy.
  2. We tune dropout via grid search on the dev set; accuracy is now 95%.
  3. We try a new idea -- we add residual connections to the CNN and retrain it. This brings dev set accuracy to 99%!
  4. We are happy with this performance. We measure test set accuracy: 97%, still quite good!

Limitations of the ML Workflow

You may encounter a number of issues:

  1. Overfitting dev set after repeatedly choosing the best model on it.
  2. Dev and test sets may no longer represent true data distribution.
  3. The metric may no longer measure true performance.

In such cases you need to collect more data and/or change the metric.

Part 2: Evaluating Classification Models

The first step towards building better ML models is to determine how to evaluate them.

We will start by talking about how to evaluate classification models.

Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete, and takes on one of $K$ possible values.

When classification labels take $K=2$ values, we perform binary classification.

An example of a classification task is the Iris flower dataset.

We may visualize this dataset in 2D.

Review: Machine Learning Models

A machine learning model is a function $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

Below, we fit a Softmax model to the Iris dataset.
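
A minimal sketch of such a fit, assuming scikit-learn's `LogisticRegression` (whose default `lbfgs` solver performs multinomial, i.e. softmax, regression in recent versions); this is an illustration, not necessarily the lecture's exact code:

```python
# Fit a softmax (multinomial logistic regression) model to two Iris features.
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target   # keep two features so we can plot in 2D
model = LogisticRegression()           # multinomial softmax with the default lbfgs solver
model.fit(X, y)
```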

Classification Accuracy

The simplest and most natural metric for classification algorithms is accuracy: $$\text{acc}(f) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{f(x^{(i)}) = y^{(i)}\},$$ where $\mathbb{I}\{\cdot\}$ is an indicator function (equals 1 if its input is true and zero otherwise).
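
For example, accuracy can be computed directly from this definition or with `sklearn.metrics.accuracy_score`; the sketch below reuses the `model`, `X`, and `y` from the Iris snippet above.

```python
# Accuracy as the fraction of correct predictions.
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = model.predict(X)
acc_manual = np.mean(y_pred == y)        # (1/n) * sum_i I{f(x^(i)) == y^(i)}
acc_sklearn = accuracy_score(y, y_pred)  # same value via scikit-learn
```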

Confusion Matrix

We can better understand classification error via a confusion matrix.
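A sketch of computing it with scikit-learn, reusing `y` and `y_pred` from the accuracy example:

```python
# Confusion matrix: C[i, j] counts true class i examples predicted as class j.
from sklearn.metrics import confusion_matrix

C = confusion_matrix(y, y_pred)
```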

Accuracy is a problematic metric when classes are imbalanced.

It is easy to achieve high accuracy just by being accurate on the more frequent class (by always predicting it, for example).

Metrics for Binary Classification

We can look at performance in a more precise way when we do binary classification. Consider the following confusion matrix.

| | Predicted positive $\hat y=1$ | Predicted negative $\hat y=0$ |
|---|---|---|
| Positive class $y=1$ | True positive (TP) | False negative (FN) |
| Negative class $y=0$ | False positive (FP) | True negative (TN) |

We can define accuracy as follows: $$\text{accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$ This is the number of correct predictions divided by the total number of predictions.

Sensitivity and Specificity

| | Predicted positive $\hat y=1$ | Predicted negative $\hat y=0$ |
|---|---|---|
| Positive class $y=1$ | True positive (TP) | False negative (FN) |
| Negative class $y=0$ | False positive (FP) | True negative (TN) |

We can also look at "accuracy" on each class separately. This reveals problems with imbalanced classes. \begin{align*} \underset{\text{(recall, true positive rate)}}{\text{sensitivity}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \underset{\text{(true negative rate)}}{\text{specificity}} & = \frac{\text{TN}}{\text{negative class}} = \frac{\text{TN}}{\text{TN} + \text{FP}} \\ \end{align*}

We can combine these into a single measure called balanced accuracy:

\begin{align*} \text{balanced accuracy} & = \frac{1}{2}\left(\text{specificity} + \text{sensitivity}\right) \\ & = \frac{1}{2}\left(\frac{\text{TN}}{\text{TN} + \text{FP}} + \frac{\text{TP}}{\text{TP} + \text{FN}}\right) \end{align*}
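
As a sketch, these quantities can be computed from a binary confusion matrix; the labels below are made-up illustrative data, not from the lecture.

```python
# Sensitivity, specificity, and balanced accuracy on illustrative binary labels.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true_bin = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_bin = np.array([1, 1, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced = 0.5 * (sensitivity + specificity)
assert np.isclose(balanced, balanced_accuracy_score(y_true_bin, y_pred_bin))
```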

Precision and Recall

| | Predicted positive $\hat y=1$ | Predicted negative $\hat y=0$ |
|---|---|---|
| Positive class $y=1$ | True positive (TP) | False negative (FN) |
| Negative class $y=0$ | False positive (FP) | True negative (TN) |

An alternative set of measures is precision and recall. \begin{align*} \underset{\text{(positive predictive value)}}{\text{precision}} & = \frac{\text{TP}}{\text{predicted positive}} = \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \underset{\text{(sensitivity, true positive rate)}}{\text{recall}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \end{align*}
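
A sketch with scikit-learn, reusing the illustrative binary labels from the previous snippet:

```python
# Precision and recall for the positive class.
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true_bin, y_pred_bin)  # TP / (TP + FP)
recall = recall_score(y_true_bin, y_pred_bin)        # TP / (TP + FN)
```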

Imagine we are building a search engine. The positive examples are the pages that are relevant to the users.

Notice that we don't directly report performance on negatives (what % of irrelevant pages were labeled as such).

When do we choose precision and recall vs. sensitivity and specificity?

F-Score

The F-Score is the harmonic mean of precision and recall. $$\text{F-Score} = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}$$

It equals one at perfect precision and recall, and zero if either precision or recall is zero.
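
A sketch checking the harmonic-mean formula against scikit-learn's `f1_score`, on the same illustrative labels as above:

```python
# F-score as the harmonic mean of precision and recall.
import numpy as np
from sklearn.metrics import f1_score

f_manual = 2 / (1 / precision + 1 / recall)
assert np.isclose(f_manual, f1_score(y_true_bin, y_pred_bin))
```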

Part 3: Advanced Classification Metrics

Next, we look at a few more advanced classification metrics.

Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete, and takes on one of $K$ possible values.

When classification labels take $K=2$ values, we perform binary classification.

Review: Sensitivity and Specificity

| | Predicted positive $\hat y=1$ | Predicted negative $\hat y=0$ |
|---|---|---|
| Positive class $y=1$ | True positive (TP) | False negative (FN) |
| Negative class $y=0$ | False positive (FP) | True negative (TN) |

We can also look at "accuracy" on each class separately. This reveals problems with imbalanced classes. \begin{align*} \underset{\text{(recall, true positive rate)}}{\text{sensitivity}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \underset{\text{(true negative rate)}}{\text{specificity}} & = \frac{\text{TN}}{\text{negative class}} = \frac{\text{TN}}{\text{TN} + \text{FP}} \\ \end{align*}

Trading Off Sensitivity and Specificity

Suppose that true positives are more important than true negatives.

Most classifiers come with confidence scores that make this easy to do.

With our softmax model, we can simply obtain the class probabilities.

The default threshold for predicting class 1 in binary classification is a probability greater than 50%. But we can set it higher or lower.
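
A sketch of thresholding the predicted probabilities of the Iris softmax model from Part 2; the 0.3 threshold is an arbitrary illustrative choice.

```python
# Predict class 2 whenever its probability exceeds a custom threshold.
probs = model.predict_proba(X)              # shape (n, K): p(y = k | x) for each class k
p_class2 = probs[:, 2]                      # probability of class 2
pred_class2 = (p_class2 > 0.3).astype(int)  # lower threshold -> higher sensitivity
```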

Receiver Operating Characteristic (ROC)

In binary classification, the Receiver Operating Characteristic (ROC) curve plots the true positive rate and the false positive rate (FPR) as we vary the threshold for labeling a positive example.

\begin{align*} \text{TPR} = \underset{\text{(recall, sensitivity)}}{\text{true positive rate}} & = \frac{\text{TP}}{\text{positive class}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \text{FPR} = 1-\underset{\text{(true negative rate)}}{\text{specificity}} & = 1 - \frac{\text{TN}}{\text{negative class}} = \frac{\text{FP}}{\text{FP} + \text{TN}} \\ \end{align*}

Suppose we want to improve sensitivity for Class 2 on the Iris dataset. We first compute the probability $p(y=2|x)$ for each input $x$. For any threshold $t>0$, we label $x$ as Class 2 if $p(y=2|x)>t$.

Below, the ROC curve measures the TPR and FPR as we vary $t$.

We can visualize the TPR vs. the FPR at various thresholds.
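
A sketch of computing the curve with scikit-learn's `roc_curve`, treating class 2 as the positive class (one-vs-rest) and reusing `y` and `p_class2` from the previous snippets:

```python
# ROC curve for class 2 on Iris (one-vs-rest).
from sklearn.metrics import roc_curve

y_class2 = (y == 2).astype(int)                      # 1 if the true label is class 2
fpr, tpr, thresholds = roc_curve(y_class2, p_class2)
```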

We highlight the following properties of the ROC curve:

Area Under the Curve

We can use the area under the curve (AUC) as a single measure of classifier performance.

We may compute the AUC of the above ROC curve as follows.
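
A sketch, either by integrating the curve computed above or directly from the scores:

```python
# Area under the ROC curve.
from sklearn.metrics import auc, roc_auc_score

auc_from_curve = auc(fpr, tpr)                   # trapezoidal integration of the curve
auc_direct = roc_auc_score(y_class2, p_class2)   # equivalent single call
```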

Multi-Class Generalizations

We can also define sensitivity, specificity, and other metrics for each class in multi-class classification.

In multi-class settings, we average binary metrics in various ways:
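
For instance, scikit-learn metrics take an `average` argument; a sketch reusing the Iris predictions `y` and `y_pred` from the accuracy example in Part 2 (assuming those variables are still in scope):

```python
# Macro- vs. micro-averaged precision in the multi-class setting.
from sklearn.metrics import precision_score

macro = precision_score(y, y_pred, average="macro")  # unweighted mean over classes
micro = precision_score(y, y_pred, average="micro")  # pooled over all predictions
```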

See the scikit-learn guide for more on model evaluation.

Part 4: Evaluating Regression Models

The first step towards building better ML models is to determine how to evaluate them.

Next, we look at regression models.

Review: Regression

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete, and takes on one of $K$ possible values.

Review: Machine Learning Models

A machine learning model is a function $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

Regression Losses

Most standard regression losses can be used as evaluation metrics.
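
For example, the mean squared and mean absolute errors; the sketch below uses made-up arrays, not data from the lecture.

```python
# Standard regression losses used as evaluation metrics.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # illustrative targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # illustrative predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
```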

These metrics have a number of limitations:

Scaled Losses

To account for differences in scale, we may work with scaled losses:
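
For reference, one common way to define two such losses, the mean absolute percentage error (MAPE) and its symmetric variant (SMAPE), is as follows; the exact variants used on the slides may differ:

$$\text{MAPE} = \frac{1}{n} \sum_{i=1}^n \frac{\left|y^{(i)} - f(x^{(i)})\right|}{\left|y^{(i)}\right|} \qquad \text{SMAPE} = \frac{1}{n} \sum_{i=1}^n \frac{\left|y^{(i)} - f(x^{(i)})\right|}{\left(\left|y^{(i)}\right| + \left|f(x^{(i)})\right|\right)/2}$$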

The SMAPE remains well-defined when either $y^{(i)}$ or $f(x^{(i)})$ is small (or even zero).

Scaled Logarithmic Losses

Another way to account for error in $f(x^{(i)})$ relative to $y^{(i)}$ is by taking the log of both values. This puts them on the same scale. $$\frac{1}{n} \sum_{i=1}^n \left| \log(1 + y^{(i)}) - \log(1 + f(x^{(i)})) \right|$$ This is called the mean absolute logarithmic error (MALE).
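
A sketch using `np.log1p`, reusing the illustrative regression arrays from above:

```python
# Mean absolute logarithmic error on the illustrative arrays.
import numpy as np

male = np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred)))
```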

The Coefficient of Determination

The coefficient of determination, usually denoted by $R^2$, measures the accuracy of the predictions relative to always predicting the average $\bar y = \frac{1}{n}\sum_{i=1}^n y^{(i)}$: $$R^2 = 1 - \left(\frac{\sum_{i=1}^n \left( f(x^{(i)}) - y^{(i)} \right)^2}{\sum_{i=1}^n \left( \bar y - y^{(i)} \right)^2}\right).$$

An $R^2$ of one corresponds to perfect predictions. An $R^2$ of zero means that $f$ is no better than the average prediction.
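
A sketch of this formula, checked against scikit-learn's `r2_score` on the same illustrative arrays:

```python
# Coefficient of determination R^2.
import numpy as np
from sklearn.metrics import r2_score

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```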

Part 5: Cross-Validation