Previously, we learned about supervised learning, derived our first algorithm, and used it to predict diabetes risk.

In this lecture, we are going to dive deeper into why supervised learning really works.

First, let's look at the data, and define where it comes from.

Later, this will be useful to precisely define when supervised learning is guaranteed to work.

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Attributes + Features} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

Where does the dataset come from?

We will assume that the dataset is sampled from a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as
$$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of *independent and identically distributed* (IID) samples from $\mathbb{P}$.

The key assumption is that the training examples are *independent and identically distributed* (IID).

- Each training example is from the same distribution.
- This distribution doesn't depend on previous training examples.

**Example**: Flipping a coin. Each flip has the same probability of heads and tails and doesn't depend on previous flips.

**Counter-Example**: Yearly census data. The population in each year will be close to that of the previous year.
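
One rough way to see the difference between the two cases is to check how strongly each value in a sequence correlates with the one before it. The sketch below is only an illustration: the helper `lag1_corr`, the toy "population" series, and all of its numbers are assumptions made up for this example, not real census data.

```python
import numpy as np

rng = np.random.default_rng(0)

def lag1_corr(series):
    """Correlation between each value and the value right before it."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]

# IID: 10,000 fair coin flips (1 = heads, 0 = tails).
flips = rng.integers(0, 2, size=10_000)

# Not IID: a toy "population" that changes by a small random amount at
# each step, so every value stays close to the previous one.
# The starting value and step size are invented for illustration.
population = 1_000_000 + np.cumsum(rng.normal(0, 1_000, size=10_000))

print(f"coin flips lag-1 correlation: {lag1_corr(flips):.3f}")
print(f"population lag-1 correlation: {lag1_corr(population):.3f}")
```

For the coin flips the correlation should be close to zero, while for the toy population series it should be close to one, which is the dependence on previous values that the IID assumption rules out.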

Let's implement an example of a data distribution in numpy.

In [1]:

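```
import numpy as np

np.random.seed(0)  # fix the seed so the samples below are reproducible

def true_fn(X):
    # Noise-free relationship between the input X and the target y
    return np.cos(1.5 * np.pi * X)
```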

Let's visualize it.

In [2]:

```
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), label="True function")
plt.legend()
```


Let's now draw samples from the distribution. We will generate random $x$, and then generate random $y$ using $$ y = f(x) + \epsilon $$ for a random noise variable $\epsilon$.

In [3]:

```
n_samples = 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
```

We can visualize the samples.

In [4]:

```
plt.plot(X_test, true_fn(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.legend()
```


Why assume that the dataset is sampled from a distribution?

- There is inherent uncertainty in the data. The data may consist of noisy measurements (e.g., readings from an imperfect thermometer).

- There is uncertainty in the process we model. If $y$ is a stock price, there is randomness in the market that cannot be modeled.

- We can use probability and statistics to analyze supervised learning algorithms and prove that they work.

We made the assumption that the training dataset is sampled from a data distribution.

Let's now use it to gain intuition about why supervised learning works.


We'll say that a model is a function $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

There are several things we may want out of a good model:

- Interpretable features that explain how $x$ affects $y$.
- Confidence intervals around $y$ (we will see later how to obtain these).
- Accurate predictions of the targets $y$ from inputs $x$.

In this lecture, we will focus on the last of these: accurate predictions.

A good predictive model is one that makes **accurate predictions** on **new data** that it has not seen at training time.

A hold-out dataset $$\dot{\mathcal{D}} = \{(\dot{x}^{(i)}, \dot{y}^{(i)}) \mid i = 1,2,...,m\}$$ is another dataset sampled IID from the same distribution $\mathbb{P}$ as the training dataset $\mathcal{D}$. The two datasets are disjoint.

Let's return to the example we saw earlier, starting from the true function.

In [5]:

```
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # reset the seed so we redraw the same training set as before

def true_fn(X):
    return np.cos(1.5 * np.pi * X)

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, true_fn(X_test), label="True function")
plt.legend()
```


Let's generate a hold-out dataset for the example we saw earlier.

In [6]:

```
n_samples, n_holdout_samples = 30, 30
X = np.sort(np.random.rand(n_samples))
y = true_fn(X) + np.random.randn(n_samples) * 0.1
X_holdout = np.sort(np.random.rand(n_holdout_samples))
y_holdout = true_fn(X_holdout) + np.random.randn(n_holdout_samples) * 0.1
plt.plot(X_test, true_fn(X_test), label="True function")
plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
plt.scatter(X_holdout, y_holdout, edgecolor='r', s=20, label="Holdout Samples")
plt.legend()
```


Suppose that we have a function $\texttt{isaccurate}(y, y')$ that determines if $y$ is an accurate estimate of $y'$, e.g.:

- Is the predicted target close enough to the true target? $$\texttt{isaccurate}(y, y') = \text{true if } (|y - y'| \text{ is small), else false}$$

- Did we predict the right class? $$\texttt{isaccurate}(y, y') = \text{true if } (y = y') \text{ else false} $$
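
The two notions of accuracy above can be sketched in numpy as follows. The function names and the tolerance `tol=0.1` are assumptions chosen for illustration; what counts as "small" depends on the problem.

```python
import numpy as np

def isaccurate_regression(y, y_true, tol=0.1):
    """True where the estimate y is within tol of the true target."""
    return np.abs(np.asarray(y) - np.asarray(y_true)) <= tol

def isaccurate_classification(y, y_true):
    """True where the predicted class equals the true class."""
    return np.asarray(y) == np.asarray(y_true)

# First estimate is off by 0.02 (accurate), second by 0.5 (not accurate).
print(isaccurate_regression([1.02, 2.5], [1.0, 2.0]))

# First class matches, second does not.
print(isaccurate_classification(["cat", "dog"], ["cat", "cat"]))
```

Both functions work elementwise, so they can be applied to a whole dataset at once.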

This defines accuracy on a data point. We say a supervised learning model is accurate if it correctly predicts the target on *new (held-out) data*.

We can say that a predictive model $f$ is accurate if its probability of making an error on a random holdout sample is small:

$$ 1 - \mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \leq \epsilon $$

for $\dot{x}, \dot{y} \sim \mathbb{P}$, for some small $\epsilon > 0$ and some definition of accuracy.
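
In practice we cannot compute this probability exactly, but we can estimate it by the fraction of errors on a large hold-out sample. The sketch below does this for the running example under several assumptions that are not part of the definition: a tolerance of `0.2` inside `isaccurate`, a threshold `epsilon = 0.1`, and two hypothetical models (the noise-free `true_fn` itself and a model that always predicts zero).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.cos(1.5 * np.pi * x)

def isaccurate(y, y_pred, tol=0.2):
    # Assumed accuracy rule: the prediction is within tol of the target.
    return np.abs(y - y_pred) <= tol

def error_rate(f, x, y):
    """Fraction of samples on which f is not accurate."""
    return 1.0 - np.mean(isaccurate(y, f(x)))

def constant_model(x):
    # Deliberately poor model: always predicts zero.
    return np.zeros_like(x)

# Large hold-out sample from the same distribution as the running example.
m = 100_000
x_holdout = rng.random(m)
y_holdout = true_fn(x_holdout) + rng.normal(0, 0.1, size=m)

epsilon = 0.1
errors = {
    "true_fn": error_rate(true_fn, x_holdout, y_holdout),
    "constant": error_rate(constant_model, x_holdout, y_holdout),
}
for name, err in errors.items():
    print(f"{name}: estimated error {err:.3f}, accurate: {err <= epsilon}")
```

With these choices, `true_fn` errs only when the noise exceeds the tolerance, so its estimated error should fall below `epsilon`, while the constant model should fall well above it.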

We can also say that a predictive model $f$ is inaccurate if its probability of making an error on a random holdout sample is large:

$$ 1 - \mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \geq \epsilon $$

or equivalently

$$\mathbb{P} \left[ \texttt{isaccurate}(\dot y, f(\dot x)) \right] \leq 1-\epsilon.$$

In machine learning, **generalization** is the property of predictive models to achieve good performance on new, held-out data that is distinct from the training set.
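
As a rough illustration of what checking generalization looks like, the sketch below fits a model on a training set and compares its performance on the training set and on a separate hold-out set. The learning algorithm (a degree-4 polynomial fit with `np.polyfit`) and the performance measure (mean squared error) are stand-ins chosen for this example, not a method prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.cos(1.5 * np.pi * x)

def sample(n):
    """Draw n IID (x, y) pairs from the running example's distribution."""
    x = rng.random(n)
    return x, true_fn(x) + rng.normal(0, 0.1, size=n)

x_train, y_train = sample(30)
x_holdout, y_holdout = sample(30)

# Stand-in learning algorithm: least-squares fit of a degree-4 polynomial,
# using the training set only.
coeffs = np.polyfit(x_train, y_train, deg=4)

def mse(x, y):
    """Mean squared error of the fitted polynomial on (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

train_mse = mse(x_train, y_train)
holdout_mse = mse(x_holdout, y_holdout)
print(f"training MSE: {train_mse:.4f}")
print(f"hold-out MSE: {holdout_mse:.4f}")
```

If the hold-out error is comparable to the training error, the fitted model generalizes in the sense defined above; a hold-out error that is much larger would indicate that it does not.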

Will supervised learning return a model that generalizes?