Lecture 4: Foundations of Supervised Learning

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Why Does Supervised Learning Work?

Previously, we learned about supervised learning, derived our first algorithm, and used it to predict diabetes risk.

In this lecture, we are going to dive deeper into why supervised learning really works.

Part 1: Data Distribution

First, let's look at the data and define where it comes from.

Later, this will be useful to precisely define when supervised learning is guaranteed to work.

Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer} \to \text{Predictive Model} $$

Where does the dataset come from?

Data Distribution

We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the data distribution. We will denote this as $$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

Data Distribution: IID Sampling

The key assumption is that the training examples are independent and identically distributed (IID).

Example: Flipping a coin. Each flip has the same probability of heads and tails and doesn't depend on the previous flips.
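
Here is a quick numpy sketch of IID sampling, using a fair coin as an illustrative assumption:

```python
import numpy as np

np.random.seed(0)

# Ten IID coin flips: each is a Bernoulli(0.5) draw,
# independent of all the previous flips.
flips = np.random.binomial(n=1, p=0.5, size=10)
print(flips)
```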

Counter-Example: Yearly census data. The population in each year is close to that of the previous year, so consecutive samples are not independent.

Data Distribution: Example

Let's implement an example of a data distribution in numpy.
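
Here is one minimal sketch; the particular choice of $f$ (a cosine on $[0, 1]$) is just an illustrative assumption:

```python
import numpy as np

np.random.seed(0)

def true_fn(X):
    """The true underlying function f relating inputs x to targets y."""
    return np.cos(1.5 * np.pi * X)
```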

Let's visualize it.
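
Continuing the sketch above, we can plot $f$ on a fine grid with matplotlib:

```python
import matplotlib.pyplot as plt

# Plot the underlying function f over [0, 1].
X_line = np.linspace(0, 1, 100)
plt.plot(X_line, true_fn(X_line), label='true function f(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```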

Let's now draw samples from the distribution. We will generate random $x$, and then generate random $y$ using $$ y = f(x) + \epsilon $$ for a random noise variable $\epsilon$.
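
A sketch of this sampling step, assuming uniformly random inputs on $[0, 1]$ and Gaussian noise $\epsilon$:

```python
n_samples = 30

# Draw inputs x uniformly at random from [0, 1].
X = np.sort(np.random.rand(n_samples))

# Generate targets y = f(x) + eps, with Gaussian noise eps.
y = true_fn(X) + np.random.randn(n_samples) * 0.1
```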

We can visualize the samples.
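
Continuing the same sketch:

```python
# Scatter the sampled (x, y) pairs against the true function.
plt.plot(X_line, true_fn(X_line), label='true function f(x)')
plt.scatter(X, y, s=20, label='samples')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```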

Data Distribution: Motivation

Why assume that the dataset is sampled from a distribution? Because it formalizes the idea that past data is representative of future data: if new examples come from the same distribution $\mathbb{P}$ as the training set, then a model that performs well on samples from $\mathbb{P}$ can be expected to perform well on new data.

Part 2: Why Does Supervised Learning Work?

We made the assumption that the training dataset is sampled from a data distribution.

Let's now use it to gain intuition about why supervised learning works.

Review: Data Distribution

We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the data distribution. We will denote this as $$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

Review: Supervised Learning Model

We'll say that a model is a function $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

What Makes A Good Model?

There are several things we may want out of a good model:

  1. Interpretable features that explain how $x$ affects $y$.
  2. Confidence intervals around $y$ (we will see later how to obtain these).
  3. Accurate predictions of the targets $y$ from inputs $x$.

In this lecture, we will focus on the last of these: accurate predictions.

What Makes A Good Model?

A good predictive model is one that makes accurate predictions on new data that it has not seen at training time.

Hold-Out Dataset: Definition

A hold-out dataset $$\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1,2,...,m\}$$ is another dataset sampled IID from the same distribution $\mathbb{P}$ as the training dataset $\mathcal{D}$; the two datasets are disjoint.

Let's generate a hold-out dataset for the example we saw earlier.
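
A minimal sketch, reusing the `true_fn` and noise model assumed earlier; a hold-out set is simply a fresh batch of IID draws:

```python
m_holdout = 30

# Fresh IID samples from the same distribution P as the training set.
X_holdout = np.sort(np.random.rand(m_holdout))
y_holdout = true_fn(X_holdout) + np.random.randn(m_holdout) * 0.1

plt.scatter(X, y, label='training samples')
plt.scatter(X_holdout, y_holdout, marker='x', label='hold-out samples')
plt.legend()
plt.show()
```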

Defining What is an Accurate Prediction

Suppose that we have a function $\texttt{isaccurate}(y, y')$ that determines if $y$ is an accurate estimate of $y'$, e.g.:
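
For instance (two illustrative choices): exact match $\mathbb{1}\{y = y'\}$ for discrete targets, or closeness within a tolerance $\mathbb{1}\{|y - y'| \leq \delta\}$ for continuous targets. In code:

```python
# Two illustrative per-example accuracy functions.

def isaccurate_classification(y, y_pred):
    # Exact match, suitable for discrete targets.
    return y == y_pred

def isaccurate_regression(y, y_pred, delta=0.1):
    # Within a tolerance delta of the target, for continuous targets.
    return abs(y - y_pred) <= delta
```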

This defines accuracy on a data point. We say a supervised learning model is accurate if it correctly predicts the target on new (held-out) data.

Defining What is an Accurate Model

We can say that a predictive model $f$ is accurate if its probability of making an error on a random hold-out sample is small:

$$ 1 - \mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \leq \epsilon $$

for $\dot{x}, \dot{y} \sim \mathbb{P}$, for some small $\epsilon > 0$ and some definition of accuracy.

We can also say that a predictive model $f$ is inaccurate if its probability of making an error on a random hold-out sample is large:

$$ 1 - \mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \geq \epsilon $$

or equivalently

$$\mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \leq 1-\epsilon.$$
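
In practice, we can estimate this probability by averaging $\texttt{isaccurate}$ over the hold-out set. A minimal sketch, assuming the regression-style accuracy function and the hold-out data generated earlier:

```python
import numpy as np

def holdout_accuracy(f, X_holdout, y_holdout, delta=0.1):
    """Empirical estimate of P[isaccurate(y, f(x))] on hold-out data."""
    return np.mean(np.abs(y_holdout - f(X_holdout)) <= delta)

# Example with the (idealized) model f = true_fn from our running example.
print(holdout_accuracy(true_fn, X_holdout, y_holdout))
```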

Generalization

In machine learning, generalization is the ability of a predictive model to achieve good performance on new, held-out data that is distinct from the training set.

Will supervised learning return a model that generalizes?

Recall: Supervised Learning

Recall our intuitive definition of supervised learning.