# Lecture 7: Generative Algorithms

### Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

# Part 1: Generative Models

In this lecture, we are going to look at generative algorithms and their applications to classification.

We will start by defining the concept of a generative model.

# Review: Components of a Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$\underbrace{\text{Training Dataset}}_\text{Attributes + Features} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model}$$

# Review: Probabilistic Models

A (parametric) probabilistic model with parameters $\theta$ is a probability distribution $$P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1].$$ This model can approximate the data distribution $\mathbb{P}(x,y)$.

If we know $P_\theta(x,y)$, we can compute predictions using the formula $$P_\theta(y|x) = \frac{P_\theta(x,y)}{P_\theta(x)} = \frac{P_\theta(x,y)}{\sum_{y' \in \mathcal{Y}} P_\theta(x, y')}.$$
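As a minimal sketch of this formula, assume both $x$ and $y$ are discrete so that the joint distribution fits in a small table. The table values below are made up for illustration; `posterior` is a hypothetical helper, not part of the lecture code.

```python
import numpy as np

# Hypothetical joint table P(x, y): rows index 3 values of x, columns index 2 classes y.
joint = np.array([
    [0.10, 0.20],
    [0.30, 0.05],
    [0.15, 0.20],
])
assert np.isclose(joint.sum(), 1.0)  # a joint distribution sums to one

def posterior(joint, x):
    """Return P(y | x) for every y by normalizing row x of the joint table."""
    row = joint[x]            # P(x, y) for each y
    return row / row.sum()    # divide by P(x) = sum over y' of P(x, y')

p_y_given_x0 = posterior(joint, 0)
print(p_y_given_x0)
```

The denominator is just the row sum, which is why a model of the joint distribution is enough to make predictions.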

# Review: Maximum Likelihood Learning

In order to fit probabilistic models, we use the following objective: $$\max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(x, y).$$ This seeks to find a model that assigns high probability to the training data.

# Review: Conditional Probabilistic Models

Alternatively, we may define a model of the conditional probability distribution: $$P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1].$$

These are trained using conditional maximum likelihood: $$\max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(y|x).$$ This seeks to find a model that assigns high conditional probability to the target $y$ for each $x$.

Logistic regression is an example of this approach.

# Discriminative vs. Generative Models

These two types of models are also known as generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

• The models parametrize different kinds of probabilities
• They involve different training objectives and make different predictions
• Their uses are different (e.g., prediction, generation); more later!

# Classification Dataset: Iris Flowers

To demonstrate the two approaches, we are going to use the Iris flower dataset.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

If we only consider the first two feature columns, we can visualize the dataset in 2D.

# Example: Discriminative Model

An example of a discriminative model is logistic or softmax regression.

• Discriminative models directly partition the feature space into regions associated with each class and separated by a decision boundary.
• Given features $x$, discriminative models directly map to predicted classes (e.g., via the function $\sigma(\theta^\top x)$ for logistic regression).
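To make the direct mapping concrete, here is a minimal sketch of binary logistic regression prediction. The parameter vector `theta` and the inputs `X` are illustrative assumptions, not fitted values from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs (two features per example).
theta = np.array([1.0, -2.0])
X = np.array([
    [2.0, 0.5],
    [0.0, 1.0],
    [1.0, 0.5],
])

probs = sigmoid(X @ theta)           # P(y = 1 | x) for each row of X
preds = (probs >= 0.5).astype(int)   # threshold at 0.5 to get a class label
print(probs, preds)
```

Note that the model never describes how $x$ itself is distributed; the decision boundary is the set of points where $\theta^\top x = 0$.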

# Example: Generative Model

Generative modeling can be seen as taking a different approach:

1. In the Iris example, we first build a model of how each type of flower looks, i.e., we learn the distribution $$p(x | y=k) \; \text{for each class } k.$$ It defines a model of how each flower is generated, hence the name.
2. Given a new flower datapoint $x'$, we can match it against each flower model and find the type of flower that looks most similar to it. Mathematically, this corresponds to: \begin{align*} \arg \max_y \log p(y | x') & = \arg \max_y \log \frac{p(x' | y) p(y)}{p(x')} \\ & = \arg \max_y \log p(x' | y) p(y), \end{align*} where we have applied Bayes' rule in the first line and dropped $p(x')$ in the second because it does not depend on $y$.

# Generative vs. Discriminative Approaches

How do we know which approach is better?

• If we only care about prediction, we don't need a model of $P(x)$. We can solve precisely the problem we care about.
• Discriminative models will often be more accurate.
• If we care about other tasks (generation, dealing with missing values, etc.) or if we know the true model is generative, we want to use the generative approach.

More on this later!

# Part 2: Gaussian Discriminant Analysis

We are now going to continue our discussion of classification.

• We will see a new classification algorithm, Gaussian Discriminant Analysis.
• This will be our first example of a generative machine learning model.

# Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots, y_K\}$. Each discrete value corresponds to a class that we want to predict.

# Review: Generative Models

There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

• They involve different training objectives and make different predictions
• Their uses are different (e.g., prediction, generation); more later!

# Mixtures of Gaussians

A mixture of $K$ Gaussians is a distribution $P(x)$ of the form:

$$\phi_1 \mathcal{N}(x; \mu_1, \Sigma_1) + \phi_2 \mathcal{N}(x; \mu_2, \Sigma_2) + \ldots + \phi_K \mathcal{N}(x; \mu_K, \Sigma_K).$$
• Each $\mathcal{N}(x; \mu_k, \Sigma_k)$ is a (multivariate) Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$.
• The $\phi_k$ are weights, and the above sum is a weighted average of the $K$ Gaussians.

We can easily visualize this in 1D:
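As a minimal sketch, the following evaluates a two-component 1D mixture density on a grid and checks numerically that it integrates to about one. The weights, means, and standard deviations are illustrative assumptions, not values from the lecture; the resulting `density` array could be passed to any plotting library.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, phis, mus, sigmas):
    """Weighted average of K 1D Gaussian densities, evaluated at each point in x."""
    x = np.asarray(x, dtype=float)
    return sum(phi * gaussian_pdf(x, mu, s) for phi, mu, s in zip(phis, mus, sigmas))

# Hypothetical mixture parameters; the weights must sum to one.
phis = [0.3, 0.7]
mus = [-2.0, 1.5]
sigmas = [0.5, 1.0]

xs = np.linspace(-10.0, 10.0, 4001)
density = mixture_pdf(xs, phis, mus, sigmas)

# Riemann-sum approximation of the integral of the density over the grid.
area = np.sum(density) * (xs[1] - xs[0])
print(area)
```

Because the weights sum to one and each component is a valid density, the mixture is itself a valid density.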

# Gaussian Discriminant Model

We may use this approach to define a model $P_\theta$. This will be the basis of an algorithm called Gaussian Discriminant Analysis.

• The distribution over classes is Categorical, denoted $\text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$. Thus, $P_\theta(y=k) = \phi_k$.
• The conditional probability $P_\theta(x\mid y=k)$ of the data under class $k$ is a multivariate Gaussian $\mathcal{N}(x; \mu_k, \Sigma_k)$ with mean and covariance $\mu_k, \Sigma_k$.

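Thus, the joint distribution is $P_\theta(x, y=k) = \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$, and the marginal distribution of $x$ is a mixture of $K$ Gaussians: $$P_\theta(x) = \sum_{k=1}^K P_\theta(y=k) P_\theta(x|y=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k).$$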

Intuitively, this model defines a story for how the data was generated. To obtain a data point,

• First, we sample a class $y \sim \text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$ with class proportions given by the $\phi_k$.
• Then, we sample an $x$ from a Gaussian distribution $\mathcal{N}(\mu_k, \Sigma_k)$ specific to that class.

Such a story can be constructed for most generative algorithms and helps us understand them.
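The two-step story above can be sketched directly as a sampling procedure. This is a minimal illustration with made-up parameters for $K=3$ classes in two dimensions; `sample_gda` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model parameters: class proportions, means, and covariances.
phis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
Sigmas = np.stack([
    np.eye(2),
    0.5 * np.eye(2),
    np.array([[1.0, 0.6], [0.6, 1.0]]),
])

def sample_gda(n, phis, mus, Sigmas, rng):
    """Draw n pairs (x, y): first y ~ Categorical(phis), then x ~ N(mu_y, Sigma_y)."""
    ys = rng.choice(len(phis), size=n, p=phis)
    xs = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ys])
    return xs, ys

X, y = sample_gda(1000, phis, mus, Sigmas, rng)
print(X.shape, np.bincount(y, minlength=3) / len(y))
```

With enough samples, the empirical class frequencies should be close to the $\phi_k$.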

# Classification Dataset: Iris Flowers

To demonstrate this approach, we are going to use the Iris flower dataset.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

If we only consider the first two feature columns, we can visualize the dataset in 2D.

# Example: Iris Flower Classification

Let's see how this approach can be used in practice on the Iris dataset.

• We will "guess" a good set of parameters for a Gaussian Discriminant model
• We will sample from the model and compare to the true data

Our Gaussian Discriminant model generates data that looks similar to the real data.

Let's now see how we can learn parameters from data and use the model to make predictions.

# Part 3: Gaussian Discriminant Analysis: Learning

We continue our discussion of Gaussian Discriminant Analysis, and look at:

• How to learn parameters of the mixture model
• How to use the model to make predictions

# Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots, y_K\}$. Each discrete value corresponds to a class that we want to predict.

# Review: Gaussian Discriminant Model

We may define a model $P_\theta$ as follows. This will be the basis of an algorithm called Gaussian Discriminant Analysis.

• The distribution over classes is Categorical, denoted $\text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$. Thus, $P_\theta(y=k) = \phi_k$.
• The conditional probability $P_\theta(x\mid y=k)$ of the data under class $k$ is a multivariate Gaussian $\mathcal{N}(x; \mu_k, \Sigma_k)$ with mean and covariance $\mu_k, \Sigma_k$.

Thus, the joint distribution is $P_\theta(x, y=k) = \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$, and the marginal distribution of $x$ is a mixture of $K$ Gaussians: $$P_\theta(x) = \sum_{k=1}^K P_\theta(y=k) P_\theta(x|y=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k).$$

# Review: Maximum Likelihood Learning

In order to fit probabilistic models, we use the following objective: $$\max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(x, y).$$ This seeks to find a model that assigns high probability to the training data.

Let's use maximum likelihood to fit the Gaussian Discriminant model. Note that the model parameters $\theta$ are the union of the parameters of each sub-model: $$\theta = (\mu_1, \Sigma_1, \phi_1, \ldots, \mu_K, \Sigma_K, \phi_K).$$

Mathematically, the components of the model $P_\theta(x,y)$ are as follows. \begin{align*} P_\theta(y) & = \frac{\prod_{k=1}^K \phi_k^{\mathbb{I}\{y = k\}}}{\sum_{l=1}^K \phi_l} \\ P_\theta(x|y=k) & = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_k)^\top\Sigma_k^{-1}(x-\mu_k)\right) \end{align*}

# Optimizing the Log Likelihood

Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\mid i=1,2,\ldots,n\}$, we want to optimize the log-likelihood $\ell(\theta)$: \begin{align*} \ell(\theta) & = \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)}) = \sum_{i=1}^n \log P_\theta(x^{(i)} | y^{(i)}) + \sum_{i=1}^n \log P_\theta(y^{(i)}) \\ & = \sum_{k=1}^K \underbrace{\sum_{i : y^{(i)} = k} \log P(x^{(i)} | y^{(i)} ; \mu_k, \Sigma_k)}_\text{all the terms that involve $\mu_k, \Sigma_k$} + \underbrace{\sum_{i=1}^n \log P(y^{(i)} ; \vec \phi)}_\text{all the terms that involve $\vec \phi$}. \end{align*}

Notice that each pair of parameters $(\mu_k, \Sigma_k)$ appears in exactly one term of the summation over the $K$ classes, while all of the $\phi_k$ appear together in the single final term.

Since each $(\mu_k, \Sigma_k)$ for $k=1,2,\ldots,K$ is found in one term, optimization over $(\mu_k, \Sigma_k)$ can be carried out independently of all the other parameters by just looking at that term: \begin{align*} \max_{\mu_k, \Sigma_k} \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)}) & = \max_{\mu_k, \Sigma_k} \sum_{l=1}^K \sum_{i : y^{(i)} = l} \log P_\theta(x^{(i)} | y^{(i)} ; \mu_l, \Sigma_l) \\ & = \max_{\mu_k, \Sigma_k} \sum_{i : y^{(i)} = k} \log P_\theta(x^{(i)} | y^{(i)} ; \mu_k, \Sigma_k). \end{align*}

Similarly, optimizing for $\vec \phi = (\phi_1, \phi_2, \ldots, \phi_K)$ only involves a single term: $$\max_{\vec \phi} \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)} ; \theta) = \max_{\vec\phi} \ \sum_{i=1}^n \log P_\theta(y^{(i)} ; \vec \phi).$$

# Optimizing the Class Probabilities

These observations greatly simplify the optimization of the model. Let's first consider the optimization over $\vec \phi = (\phi_1, \phi_2, \ldots, \phi_K)$. From the previous analysis, our objective $J(\vec \phi)$ equals \begin{align*} J(\vec\phi) & = \sum_{i=1}^n \log P_\theta(y^{(i)} ; \vec \phi) \\ & = \sum_{i=1}^n \log \phi_{y^{(i)}} - n \cdot \log \sum_{k=1}^K \phi_k \\ & = \sum_{k=1}^K \sum_{i : y^{(i)} = k} \log \phi_k - n \cdot \log \sum_{k=1}^K \phi_k \end{align*}

Taking the derivative and setting it to zero, we obtain $$\frac{\phi_k}{\sum_l \phi_l} = \frac{n_k}{n}$$ for each $k$, where $n_k = |\{i : y^{(i)} = k\}|$ is the number of training targets with class $k$.
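For reference, the intermediate step can be written out as follows. Writing the objective as $J(\vec\phi) = \sum_{k=1}^K n_k \log \phi_k - n \log \sum_{l=1}^K \phi_l$, the partial derivative with respect to a single $\phi_k$ is

$$\frac{\partial J}{\partial \phi_k} = \frac{n_k}{\phi_k} - \frac{n}{\sum_{l=1}^K \phi_l} = 0 \quad \Longrightarrow \quad \frac{\phi_k}{\sum_{l=1}^K \phi_l} = \frac{n_k}{n}.$$

The condition only fixes the normalized ratios; if we additionally require $\sum_l \phi_l = 1$, it reduces to $\phi_k = n_k / n$.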

Thus, the optimal $\phi_k$ is just the proportion of data points with class $k$ in the training set!
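In code, this estimate is a single counting step. The sketch below assumes the classes are encoded as integers $0, \ldots, K-1$; the label array is made up for illustration.

```python
import numpy as np

# Hypothetical training targets with K = 3 classes encoded as 0, 1, 2.
y = np.array([0, 2, 1, 0, 0, 2, 1, 0])
K = 3

n_k = np.bincount(y, minlength=K)   # number of training targets in each class
phi = n_k / len(y)                  # maximum likelihood class proportions
print(n_k, phi)
```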

# Optimizing Conditional Probabilities

Similarly, we can maximize the likelihood $$\max_{\mu_k, \Sigma_k} \sum_{i : y^{(i)} = k} \log P(x^{(i)} | y^{(i)} ; \mu_k, \Sigma_k) = \max_{\mu_k, \Sigma_k} \sum_{i : y^{(i)} = k} \log \mathcal{N}(x^{(i)} ; \mu_k, \Sigma_k)$$ over the Gaussian parameters.

Computing the derivative and setting it to zero, we obtain closed-form solutions: \begin{align*} \mu_k & = \frac{\sum_{i: y^{(i)} = k} x^{(i)}}{n_k} \\ \Sigma_k & = \frac{\sum_{i: y^{(i)} = k} (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top}{n_k} \end{align*} These are just the empirical means and covariances of each class.
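A minimal sketch of these closed-form estimates is given below. It assumes integer class labels $0, \ldots, K-1$ and uses a tiny made-up dataset; `fit_gaussians` is a hypothetical helper name. Note the division by $n_k$ (not $n_k - 1$), which matches the maximum likelihood formula above.

```python
import numpy as np

def fit_gaussians(X, y, K):
    """Return per-class means (K, d) and maximum likelihood covariances (K, d, d)."""
    d = X.shape[1]
    mus = np.zeros((K, d))
    Sigmas = np.zeros((K, d, d))
    for k in range(K):
        X_k = X[y == k]                               # points belonging to class k
        mus[k] = X_k.mean(axis=0)                     # empirical mean
        centered = X_k - mus[k]
        Sigmas[k] = centered.T @ centered / len(X_k)  # sum of outer products over n_k
    return mus, Sigmas

# Hypothetical toy data: four points in class 0, two points in class 1.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1])

mus, Sigmas = fit_gaussians(X, y, K=2)
print(mus)
print(Sigmas)
```

In practice a class with few points can produce a singular covariance, as class 1 does here, so some regularization may be needed before inverting $\Sigma_k$.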

# Querying the Model

How do we ask the model for predictions? As discussed earlier, we can apply Bayes' rule: $$\arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P_\theta(y).$$ Thus, we can evaluate $P_\theta(x|y=k)P_\theta(y=k)$ for each class $k$ and choose the class that explains the data best.
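The prediction rule can be sketched as follows, working in log space for numerical stability. The parameters and query points are illustrative assumptions, and `log_gaussian` and `predict` are hypothetical helper names.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log density of N(x; mu, Sigma) for each row x of X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    # Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu) for each row.
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Sigma), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def predict(X, phis, mus, Sigmas):
    """Return argmax over k of log P(x | y=k) + log P(y=k) for each row of X."""
    scores = np.stack(
        [np.log(phis[k]) + log_gaussian(X, mus[k], Sigmas[k]) for k in range(len(phis))],
        axis=1,
    )
    return scores.argmax(axis=1)

# Hypothetical two-class model with identity covariances.
phis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = np.stack([np.eye(2), np.eye(2)])

X_new = np.array([[0.5, -0.2], [3.8, 4.1], [1.0, 1.0]])
pred = predict(X_new, phis, mus, Sigmas)
print(pred)
```

Since the normalizer $P_\theta(x)$ is the same for every class, it never needs to be computed for the argmax.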

# Classification Dataset: Iris Flowers

To demonstrate this approach, we are going to use the Iris flower dataset.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

If we only consider the first two feature columns, we can visualize the dataset in 2D.

# Example: Iris Flower Classification

Let's see how this approach can be used in practice on the Iris dataset.

• We will learn a good set of parameters for a Gaussian Discriminant model
• We will compare the model's predictions to the true labels.

Let's start by computing the maximum likelihood parameters on our dataset.

We can compute predictions using Bayes' rule.

We visualize the decision boundaries like we did earlier.