Lecture 21: Model Iteration and Improvement

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Practical Considerations When Applying Machine Learning

Suppose you trained an image classifier with 80% accuracy. What's next?

We will next learn how to prioritize these decisions when applying ML.

Fast & Data-Driven Model Iteration

The key to building great ML systems is to be data-driven:

This process can compensate for an initial lack of domain expertise.

Part 1: Error Analysis

A crucial part of understanding model performance is to systematically examine its errors.

Error analysis is a formal process by which this can be done.

Review: Model Development Workflow

The machine learning development workflow has three steps:

  1. Training: Try a new model and fit it on the training set.
  2. Model Selection: Estimate performance on the development set using metrics. Based on the results, go back to step 1 and try a new model idea.
  3. Evaluation: Finally, estimate real-world performance on the test set.

Review: Datasets for Model Development

When developing machine learning models, it is customary to work with three datasets:

How to Analyze Models

How do we iterate on a model given its results on the development set?

And many more types of analyses!

Prioritizing Model Improvements: An Example

Suppose you trained an image classifier over animal photos.

The key to this decision is to closely examine actual performance.

Suppose you need to decide if it's worth fixing a certain type of error.

If 5% of misclassified examples have that problem, it's probably not important. If 50% do, then it's important.

Error Analysis

Error analysis systematically identifies the most common errors made by the model.

  1. Collect a random subset of misclassified dev set examples.
  2. Manually examine them; identify the most common types or categories of errors.
  3. Count the % of points affected by each error category.

You should prioritize the most common error categories.
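As a rough sketch, this process might look as follows in code. Here we assume a trained classifier `model`, a labeled dev set `(X_dev, y_dev)`, and hand-assigned category tags (the tag names below are placeholders):

```python
from collections import Counter
import numpy as np

# Assumes a trained classifier `model` and a labeled dev set (X_dev, y_dev).
pred = model.predict(X_dev)
misclassified = np.where(pred != y_dev)[0]

# Step 1: collect a random subset of misclassified dev set examples.
rng = np.random.default_rng(0)
subset = rng.choice(misclassified, size=min(100, len(misclassified)), replace=False)

# Step 2: manually examine each example and record its error categories.
# This dictionary is filled in by hand during review; "flipped" is a placeholder tag.
categories = {i: ["flipped"] for i in subset}

# Step 3: count the % of sampled errors affected by each category.
counts = Counter(tag for tags in categories.values() for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {100 * n / len(subset):.0f}%")
```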

Error Categories

Error analysis involves classifying errors into categories.

Error Analysis: An Example

Suppose you just trained a new image classifier.

We then go through the random subset of errors and assign them to categories.

|         | Blurry | Flipped | Mislabeled |
|---------|--------|---------|------------|
| Image 1 | X      |         | X          |
| Image 2 |        | X       |            |
| Image 3 |        | X       |            |
| ...     |        |         |            |
| Total   | 20%    | 50%     | 30%        |

We know that the most important fix is to correct for flipped images.

Mislabeled Data

Real-world data is often messy, and labels are not always correct.

It's important to fix labeling issues if they prevent us from measuring model error.

Development Set Size

How big should the dev set be? Error analysis suggests a lower bound.

Also, remember to periodically update the dev set to minimize overfitting.

Error Analysis on the Training Set

Should we perform error analysis on the dev set or the training set?

Hence, analyzing and fixing training set errors is also important.

Error Analysis: Another Example

Let's look at another example of error analysis on a small toy dataset.

We will use the sklearn digits dataset, a downscaled version of MNIST.

We can visualize these digits as follows:
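For instance, one way to load the dataset and plot a few digits (a minimal sketch):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()  # 1797 examples of 8x8 grayscale digits, flattened to 64 features

# Show the first eight digits with their labels.
fig, axes = plt.subplots(1, 8, figsize=(8, 1.5))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap="gray_r")
    ax.set_title(label)
    ax.axis("off")
plt.show()
```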

Let's separate this data into two equal-sized training and dev sets.
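For example, using `train_test_split` with a 50/50 split:

```python
from sklearn.model_selection import train_test_split

# Equal-sized training and dev sets.
X_train, X_dev, y_train, y_dev = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0
)
```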

We can train a simple Softmax model on this data.

It achieves the following accuracy.
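A minimal sketch of this step, using multinomial logistic regression (softmax) from sklearn on the splits defined above:

```python
from sklearn.linear_model import LogisticRegression

# Softmax (multinomial logistic) regression.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print("Train accuracy:", model.score(X_train, y_train))
print("Dev accuracy:  ", model.score(X_dev, y_dev))
```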

We hypothesize that certain digits are misclassified more than others.
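We can check this by tallying dev set errors by their true label, e.g.:

```python
import numpy as np

pred = model.predict(X_dev)
errors = pred != y_dev

# Count how often each true digit is misclassified on the dev set.
for digit in range(10):
    n_wrong = int(np.sum(errors & (y_dev == digit)))
    print(f"digit {digit}: {n_wrong} misclassified")
```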

The most common misclassified digit is a 3.

We can investigate the issue by looking at a subset of misclassified 3's (top row) and compare them to correctly classified 3's (bottom row).
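One way to produce such a visualization (a sketch, reusing `model` and the dev split from above):

```python
import matplotlib.pyplot as plt

pred = model.predict(X_dev)
wrong_threes = X_dev[(y_dev == 3) & (pred != 3)]
right_threes = X_dev[(y_dev == 3) & (pred == 3)]

# Top row: misclassified 3's; bottom row: correctly classified 3's.
n = min(8, len(wrong_threes), len(right_threes))
fig, axes = plt.subplots(2, n, figsize=(n, 2.5))
for j in range(n):
    axes[0, j].imshow(wrong_threes[j].reshape(8, 8), cmap="gray_r")
    axes[1, j].imshow(right_threes[j].reshape(8, 8), cmap="gray_r")
    axes[0, j].axis("off")
    axes[1, j].axis("off")
plt.show()
```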

We discover that the model is misclassifying a particular style of 3, and we can focus our efforts on this type of error.

Limitations of Error Analysis

The main limitations of error analysis include:

  1. It is particularly easy to overfit the dev set, since we prioritize fixing dev set errors.
  2. It can be laborious (but still important!)
  3. Certain bigger trends (e.g. overfitting the data) may be less obvious.

Hence, we perform other analyses to explain and diagnose errors.

Part 2: Bias/Variance Analysis

Another way to understand the performance of the model is to examine the extent to which it's overfitting or underfitting the data.

We refer to this as bias/variance analysis.

Review: Error Analysis

Error analysis systematically identifies the most common errors made by the model.

  1. Collect a random subset of misclassified dev set examples.
  2. Manually examine them; identify the most common types or categories of errors.
  3. Count the % of points affected by each error category.

You should prioritize the most common error categories.

Review: Overfitting (Variance)

Overfitting is one of the most common failure modes of machine learning.

Recall this example, where we randomly sample around a true function.

Below, we fit a high degree polynomial on random samples of 30 points from this dataset.

Each small subset of the data that we train on results in a very different model.

An algorithm that has a tendency to overfit is also called high variance, because it outputs a predictive model that varies a lot if we slightly perturb the dataset.

Review: Underfitting (Bias)

Underfitting is another common problem in machine learning.

Because the model cannot fit the data, we say it's high bias.

We may compare overfitting vs underfitting on our polynomial dataset.

On the Significance of Bias and Variance

Every error in machine learning is either underfitting (bias) or overfitting (variance).

By definition, if we have no bias and variance, we have a perfect model. Hence, bias/variance is important to understand.

Quantifying Bias and Variance

We approximately quantify the bias and the variance of a model as follows.

$$\text{dev error} = (\underbrace{\text{dev error} - \text{train error}}_\text{variance}) + \underbrace{\text{train error}}_\text{bias}$$

It's important to consider both types of errors.

We can make different changes to the algorithm to address both of these issues.
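In code, this decomposition is just two numbers computed from the training and dev errors; for example, with the digits classifier from the error analysis example above:

```python
# Approximate bias/variance decomposition from train and dev errors.
train_error = 1 - model.score(X_train, y_train)
dev_error = 1 - model.score(X_dev, y_dev)

bias = train_error                  # how poorly we fit the training data
variance = dev_error - train_error  # how much worse we do on held-out data

print(f"bias (train error):     {bias:.3f}")
print(f"variance (dev - train): {variance:.3f}")
```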

Diagnosing Bias and Variance

We may use this observation to diagnose bias/variance in practice.

Consider the following example:

This is a typical example of high bias (underfitting).

Next, consider another example:

This is an example of high variance (overfitting).

Finally, suppose you see the following:

This is a model that seems to work quite well!

Addressing Variance

The best way to reduce variance is to give the model more data.

However, this may not be feasible because of the high cost of compute or data acquisition.

Alternative options for reducing variance include:

Addressing Bias

The best way to reduce bias is to increase the expressivity of the model.

However, this may not be feasible because of the high cost of compute.

Alternative options for reducing bias include:

For both bias and variance reduction, we can use error analysis to guide our changes, e.g.:

Bias/Variance Analysis: An Example

Let's use our earlier example with the sklearn digits dataset to illustrate this approach.

Recall our digits dataset from earlier:

We can train a small fully-connected neural network on this data.

It achieves the following accuracy.
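A sketch of this step using sklearn's `MLPClassifier` (the exact architecture and hyperparameters here are assumptions):

```python
from sklearn.neural_network import MLPClassifier

# A small fully-connected neural network with weak L2 regularization.
nn = MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-4, max_iter=2000, random_state=0)
nn.fit(X_train, y_train)

print("Train accuracy:", nn.score(X_train, y_train))
print("Dev accuracy:  ", nn.score(X_dev, y_dev))
```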

We have clearly memorized our dataset, and are overfitting. Let's increase regularization.

By increasing L2 regularization (alpha), we improve performance by 1% (although we still somewhat overfit).
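For instance, re-fitting the same network with a larger `alpha` (the specific value is an assumption):

```python
from sklearn.neural_network import MLPClassifier

# Same network, stronger L2 regularization.
nn_reg = MLPClassifier(hidden_layer_sizes=(64,), alpha=1.0, max_iter=2000, random_state=0)
nn_reg.fit(X_train, y_train)

print("Train accuracy:", nn_reg.score(X_train, y_train))
print("Dev accuracy:  ", nn_reg.score(X_dev, y_dev))
```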

Error vs. Bias/Variance Analyses

These two analyses reveal different types of problems:

These two analyses also complement each other.

Bias/Variance Analysis vs Hyperparameter Search

Bias/variance analysis also helps guide hyperparameter search.

Model Iteration Cycle

In summary, ML model development can be seen as alternating between the following two steps:

Error analysis guides specific changes in this process.

Part 3: Baselines

In order to understand model performance, we need to put it in context.

Baselines represent a benchmark against which we compare performance.

Motivation

Suppose you train a regression model with a mean L1 error of 20.

Thus, we need to put our results in context by comparing to other models.

Baselines

A baseline is another model against which we compare our performance.

Examples of baselines include:

Optimal Performance

In practice, we also want to set a target upper bound on our performance.

Estimating the Optimal Error Rate

There are different ways to compute an upper bound:

Quantifying Bias Using Optimal Error

Our target optimal error helps us better quantify bias and variance:

$$\text{dev error} = (\underbrace{\text{dev error} - \text{train error}}_\text{variance}) + (\underbrace{\text{train error} - \text{opt error}}_\text{avoidable bias}) + \underbrace{\text{opt error}}_\text{unavoidable bias}$$
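As a hypothetical illustration (these numbers are made up): suppose the optimal error is 2%, the training error is 10%, and the dev error is 11%. Then

$$11\% = (\underbrace{11\% - 10\%}_{\text{variance}\,=\,1\%}) + (\underbrace{10\% - 2\%}_{\text{avoidable bias}\,=\,8\%}) + \underbrace{2\%}_{\text{unavoidable bias}},$$

so most of the error is avoidable bias, and reducing underfitting should be the priority.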

Consider the following example:

The bias is almost ideal. We have a variance problem.

Next, consider this scenario:

Training error is less than the ideal error! This means that we have overfit the training set. We have a variance problem.

Finally, consider another example:

We are close to being optimal!