# Lecture 4: Foundations of Supervised Learning

### Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

# Why Does Supervised Learning Work?

Previously, we learned about supervised learning, derived our first algorithm, and used it to predict diabetes risk.

In this lecture, we are going to dive deeper into why supervised learning really works.

# Part 1: Data Distribution

First, let's look at the data, and define where it comes from.

Later, this will be useful to precisely define when supervised learning is guaranteed to work.

# Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$\underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model}$$

Where does the dataset come from?

# Data Distribution

We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the data distribution. We will denote this as $$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

# Data Distribution: IID Sampling

The key assumption is that the training examples are independent and identically distributed (IID).

• Each training example is drawn from the same distribution.
• Each example is sampled independently of the previous examples.

Example: Flipping a coin. Each flip has the same probability of heads and tails and doesn't depend on the previous flips.

Counter-Example: Yearly census data. The population in each year will be close to that of the previous year.

# Data Distribution: Example

Let's implement an example of a data distribution in numpy.
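A minimal sketch of such a distribution. The specific function $f$ and the noise level are illustrative assumptions, not necessarily the ones from the original lecture.

```python
import numpy as np

# A hypothetical data distribution over (x, y):
#   x is drawn uniformly from [0, 1]
#   y = f(x) + eps, where eps is Gaussian noise
# The specific f and noise level below are illustrative choices.

def true_f(x):
    """The deterministic component of the data distribution."""
    return np.cos(1.5 * np.pi * x)

noise_std = 0.2  # standard deviation of the noise term eps
```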

Let's visualize it.
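One way to visualize it is to plot $f$ over a dense grid of inputs (the function below is the same illustrative choice as above):

```python
import numpy as np
import matplotlib.pyplot as plt

def true_f(x):
    # the deterministic part of the (assumed) data distribution
    return np.cos(1.5 * np.pi * x)

x_grid = np.linspace(0, 1, 100)  # a dense grid of inputs in [0, 1]
plt.plot(x_grid, true_f(x_grid), label='f(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```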

Let's now draw samples from the distribution. We will generate random $x$, and then generate random $y$ using $$y = f(x) + \epsilon$$ for a random noise variable $\epsilon$.
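A sketch of the sampling procedure; the sample size, function $f$, and noise level are illustrative assumptions:

```python
import numpy as np

def true_f(x):
    # the deterministic part of the (assumed) data distribution
    return np.cos(1.5 * np.pi * x)

rng = np.random.default_rng(0)
n = 30                              # number of training samples
x = rng.uniform(0, 1, size=n)       # random inputs x ~ Uniform[0, 1]
eps = rng.normal(0, 0.2, size=n)    # random noise eps ~ N(0, 0.2^2)
y = true_f(x) + eps                 # targets y = f(x) + eps
```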

We can visualize the samples.
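For instance, we can scatter the noisy samples on top of the underlying function (same illustrative distribution as before):

```python
import numpy as np
import matplotlib.pyplot as plt

def true_f(x):
    return np.cos(1.5 * np.pi * x)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = true_f(x) + rng.normal(0, 0.2, size=30)

# plot the noisy samples on top of the underlying function
x_grid = np.linspace(0, 1, 100)
plt.plot(x_grid, true_f(x_grid), label='f(x)')
plt.scatter(x, y, color='red', marker='x', label='samples')
plt.legend()
plt.show()
```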

# Data Distribution: Motivation

Why assume that the dataset is sampled from a distribution?

• There is inherent uncertainty in the data. The data may consist of noisy measurements (readings from an imperfect thermometer).
• There is uncertainty in the process we model. If $y$ is a stock price, there is randomness in the market that cannot be modeled.
• We can use probability and statistics to analyze supervised learning algorithms and prove that they work.

# Part 2: Why Does Supervised Learning Work?

We made the assumption that the training dataset is sampled from a data distribution.

Let's now use it to gain intuition about why supervised learning works.

# Review: Data Distribution

We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the data distribution. We will denote this as $$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

# Review: Supervised Learning Model

We'll say that a model is a function $$f : \mathcal{X} \to \mathcal{Y}$$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

# What Makes A Good Model?

There are several things we may want out of a good model:

1. Interpretable features that explain how $x$ affects $y$.
2. Confidence intervals around $y$ (we will see later how to obtain these)
3. Accurate predictions of the targets $y$ from inputs $x$.

In this lecture, we will focus on the last of these: accurate predictions.

# What Makes A Good Model?

A good predictive model is one that makes accurate predictions on new data that it has not seen at training time.

# Hold-Out Dataset: Definition

A hold-out dataset $$\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1,2,...,m\}$$ is another dataset that is sampled IID from the same distribution $\mathbb{P}$ as the training dataset $\mathcal{D}$, but is disjoint from it.

Let's generate a hold-out dataset for the example we saw earlier.
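A sketch, reusing the illustrative distribution from before with a fresh random seed so the hold-out samples are independent of the training samples:

```python
import numpy as np

def true_f(x):
    # the deterministic part of the (assumed) data distribution
    return np.cos(1.5 * np.pi * x)

rng = np.random.default_rng(1)  # a different seed than the training set
m = 30                          # number of hold-out samples
holdout_x = rng.uniform(0, 1, size=m)
holdout_y = true_f(holdout_x) + rng.normal(0, 0.2, size=m)
```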

# Defining an Accurate Prediction

Suppose that we have a function $\texttt{isaccurate}(y, y')$ that determines if $y$ is an accurate estimate of $y'$, e.g.:

• Is the predicted target close enough to the true target? $$\texttt{isaccurate}(y, y') = \text{true if } |y - y'| \text{ is small, else false}$$
• Did we predict the right class? $$\texttt{isaccurate}(y, y') = \text{true if } y = y' \text{, else false}$$

This defines accuracy on a data point. We say a supervised learning model is accurate if it correctly predicts the target on new (held-out) data.
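These two notions can be written down directly; the tolerance `tol` below is an illustrative choice:

```python
import numpy as np

def isaccurate_regression(y, y_prime, tol=0.1):
    """True if the predicted target is within tol of the true target."""
    return np.abs(y - y_prime) <= tol

def isaccurate_classification(y, y_prime):
    """True if we predicted exactly the right class."""
    return y == y_prime
```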

# Defining an Accurate Model

We can say that a predictive model $f$ is accurate if its probability of making an error on a random holdout sample is small:

$$1 - \mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \leq \epsilon$$

for $\dot{x}, \dot{y} \sim \mathbb{P}$, for some small $\epsilon > 0$ and some definition of accuracy.

We can also say that a predictive model $f$ is inaccurate if its probability of making an error on a random holdout sample is large:

$$1 - \mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \geq \epsilon$$

or equivalently

$$\mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \leq 1-\epsilon.$$
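We usually cannot evaluate the probability $\mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right]$ in closed form, but we can estimate it as the fraction of held-out samples on which the model is accurate. A sketch under the same illustrative distribution as before, using the noise-free function itself as a (deliberately good) model, so errors come only from the noise:

```python
import numpy as np

def true_f(x):
    return np.cos(1.5 * np.pi * x)

def model(x):
    # a hypothetical predictive model; here we cheat and use f itself,
    # so the only source of errors is the noise eps
    return true_f(x)

rng = np.random.default_rng(2)
m = 10_000  # many held-out samples for a stable estimate
holdout_x = rng.uniform(0, 1, size=m)
holdout_y = true_f(holdout_x) + rng.normal(0, 0.2, size=m)

# estimate P[isaccurate] as the fraction of accurate predictions
tol = 0.5
accuracy = np.mean(np.abs(model(holdout_x) - holdout_y) <= tol)
error_probability = 1 - accuracy  # estimate of 1 - P[isaccurate]
```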

# Generalization

In machine learning, generalization is the ability of a predictive model to achieve good performance on new, held-out data that is distinct from the training set.

Will supervised learning return a model that generalizes?

# Recall: Supervised Learning

Recall our intuitive definition of supervised learning.