Lecture 17: Density Estimation

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: Unsupervised Probabilistic Models

Density estimation is the problem of estimating a probability distribution from data.

As a first step, we will introduce probabilistic models for unsupervised learning.

Review: Unsupervised Learning

We have a dataset without labels. Our goal is to learn something interesting about the structure of the data, such as hidden clusters or outliers.

Components of an Unsupervised Learning Problem

At a high level, an unsupervised machine learning problem has the following structure:

$$ \underbrace{\text{Dataset}}_\text{Attributes} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Unsupervised Model} $$

The unsupervised model describes interesting structure in the data. For instance, it can identify interesting hidden clusters.

Review: Data Distribution

We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$

The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.

Review: Unsupervised Models

We'll say that a model is a function $$ f : \mathcal{X} \to \mathcal{S} $$ that maps inputs $x \in \mathcal{X}$ to some notion of structure $s \in \mathcal{S}$.

Structure can have many definitions (clusters, low-dimensional representations, etc.), and we will see many examples.

Often, models have parameters $\theta \in \Theta$ living in a set $\Theta$. We will then write the model as $$ f_\theta : \mathcal{X} \to \mathcal{S} $$ to denote that it's parametrized by $\theta$.

Unsupervised Probabilistic Models

An unsupervised probabilistic model is a probability distribution $$P(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$.

Probabilistic models also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x) : \mathcal{X} \to [0,1].$$

Why Use Probabilistic Models?

There are many tasks that we can solve with a good model $P_\theta$.

  1. Generation: sample new objects from $P_\theta$, such as images.
  2. Representation learning: find interesting structure in $P_\text{data}$.
  3. Density estimation: approximate $P_\theta \approx P_\text{data}$ and use it to solve any downstream task (generation, clustering, outlier detection, etc.).

We are going to focus on the last of these: density estimation.

Kullback-Leibler Divergence

In order to approximate $P_\text{data}$ with $P_\theta$, we need a measure of distance between distributions.

A standard measure of dissimilarity between distributions is the Kullback-Leibler (KL) divergence between two distributions $p$ and $q$, defined as $$ D(p \| q) = \sum_{{\bf x}} p({\bf x}) \log \frac{p({\bf x})}{q({\bf x})}. $$

Observations:

  1. $D(p \| q) \geq 0$ for all $p, q$, with equality if and only if $p = q$.
  2. The KL divergence is asymmetric: in general $D(p \| q) \neq D(q \| p)$, so it is not a true distance metric.
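To make the definition concrete, here is a minimal numerical sketch (not part of the original notebook) that computes the KL divergence between two discrete distributions with numpy; the distributions $p$ and $q$ are arbitrary examples:

```python
# A minimal sketch: KL divergence between two discrete distributions.
import numpy as np

p = np.array([0.3, 0.7])  # "data" distribution over two outcomes
q = np.array([0.5, 0.5])  # model distribution

kl_pq = np.sum(p * np.log(p / q))  # D(p || q)
kl_qp = np.sum(q * np.log(q / p))  # D(q || p)
print(kl_pq, kl_qp)  # the two values differ: KL is not symmetric
```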

Learning Models Using KL Divergence

We may now learn a probabilistic model $P_\theta(x)$ that approximates $P_\text{data}(x)$ via the KL divergence: \begin{align*} D(P_{\textrm{data}} \mid \mid {P_\theta}) & = \mathbb{E}_{x \sim P_{\textrm{data}}}{\log\left( \frac{P_{\textrm{data}}(x)}{P_\theta(x)} \right)} \\ & = \sum_{{x}} P_{\textrm{data}}({x}) \log \frac{P_{\textrm{data}}({x})}{P_\theta(x)} \end{align*}

Note that $D(P_{\textrm{data}} \mid \mid {P_\theta})=0$ iff the two distributions are the same.

From KL Divergence to Log Likelihood

$ \newcommand{\x}{x} \newcommand{\ex}[2]{\mathbb{E}_{#1 \sim #2}} \newcommand{\en}[2]{D(#1 \mid \mid #2)} $

We can learn $P_\theta$ that approximates $P_\text{data}$ by minimizing $D(P_{\textrm{data}} \mid \mid {P_\theta})$. This objective further simplifies as: \begin{align*} \en{P_{\textrm{data}}}{P_\theta} &= \ex{\x}{P_{\textrm{data}}}{\log\left( \frac{P_{\textrm{data}}(\x)}{P_\theta(\x)} \right)} \\ &= \ex{\x}{P_{\textrm{data}}}{\log P_{\textrm{data}}(\x)} - \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} \end{align*}

The first term does not depend on $P_\theta$: minimizing KL divergence is equivalent to maximizing the expected log-likelihood.

\begin{align*} \arg\min_{P_\theta} \en{P_{\textrm{data}}}{P_\theta} & = \arg\min_{P_\theta} - \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} \\ & = \arg\max_{P_\theta} \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} \end{align*}

Problem: In general we do not know $P_{\textrm{data}}$, hence this expected value cannot be computed exactly.

Maximum Likelihood Estimation

$ \newcommand{\exd}[2]{\mathbb{E}_{#1 \sim #2}} \newcommand{\cd}{\mathcal{D}} $

Applying Monte Carlo estimation, we may approximate the expected log-likelihood $$ \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} $$ with the empirical log-likelihood: $$ \exd{\x}{\cd} \log P_\theta(\x) = \frac{1}{|\cd|}\sum_{\x\in \cd} \log P_\theta(\x). $$

Maximum likelihood learning is then: $$ \max_{P_\theta} \hspace{2mm} \frac{1}{|\cd|}\sum_{\x\in \cd} \log P_\theta(\x). $$

Example: Flipping a Random Coin

How should we choose $P_\theta(x)$ if 3 out of 5 coin tosses are heads? Let's apply maximum likelihood learning.

We optimize for the $\theta$ that makes $\cd$ most likely. What is the solution in this case?
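For completeness, here is a sketch of the solution. A natural model treats each toss as a Bernoulli random variable with $P_\theta(x = \text{heads}) = \theta$. The likelihood of observing 3 heads and 2 tails is $$L(\theta) = \theta^3 (1-\theta)^2, \qquad \log L(\theta) = 3 \log \theta + 2 \log (1-\theta).$$ Setting the derivative $\frac{3}{\theta} - \frac{2}{1-\theta}$ to zero gives $\theta^* = 3/5 = 0.6$: the maximum likelihood estimate is the empirical frequency of heads.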

Part 2: Kernel Density Estimation

Next, let's look at a first example of probabilistic models and how they are used to perform density estimation.

Review: Data Distribution

We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$

The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.

Review: Unsupervised Probabilistic Models

An unsupervised probabilistic model is a probability distribution $$P_\theta(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$. It may have parameters $\theta$.

Density Estimation

The problem of density estimation is to approximate the data distribution $P_\text{data}$ with the model $P$. $$ P \approx P_\text{data}. $$

It's also a general learning task: given a good model $P$, we can solve many downstream tasks, such as generation, clustering, and outlier detection.

Histogram Density Estimation

Perhaps the simplest approach to density estimation is by forming a histogram.

A histogram partitions the input space $\mathcal{X}$ into a $d$-dimensional grid and counts the number of points in each cell.

This is best illustrated by an example.

Let's start by creating a simple 1D dataset coming from a mixture of two Gaussians:

$$P_\text{data}(x) = 0.3 \cdot \mathcal{N}(x ; \mu=0, \sigma=1) + 0.7 \cdot \mathcal{N}(x ; \mu=5, \sigma=1)$$
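The original notebook generates this dataset in code; the following is a minimal sketch using numpy (the sample size and variable names are our own choices):

```python
# Sample n points from the mixture: with probability 0.3 draw from N(0, 1),
# otherwise draw from N(5, 1).
import numpy as np

np.random.seed(0)
n = 1000
from_first_component = np.random.rand(n) < 0.3
data = np.where(from_first_component,
                np.random.normal(loc=0, scale=1, size=n),
                np.random.normal(loc=5, scale=1, size=n))
```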

We can now estimate the density using a histogram.
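A minimal sketch with matplotlib, using the `data` array generated above (`density=True` normalizes the counts so the bars approximate a probability density):

```python
# Histogram-based density estimate of the 1D dataset.
import matplotlib.pyplot as plt

plt.hist(data, bins=20, density=True, alpha=0.7)
plt.xlabel('x')
plt.ylabel('estimated density')
plt.show()
```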

Limitations of Histograms

Histogram-based methods have a number of shortcomings:

  1. They scale poorly to high-dimensional data, since the number of grid cells grows exponentially with the dimension.
  2. The shape of the estimate depends on the exact positions and widths of the bins.
  3. The resulting density estimate is not smooth.

We will now try to address the last two limitations.

Let's also visualize what we mean when we say that the shape of the histogram depends on the choice of bins.
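A minimal sketch of such a visualization (the particular bin edges and the half-bin shift are our own illustrative choices):

```python
# Two histograms of the same data whose bin edges are shifted by half a bin;
# the resulting density estimates look noticeably different.
import numpy as np
import matplotlib.pyplot as plt

bins1 = np.linspace(-4, 9, 14)   # one choice of bin edges (width 1)
bins2 = bins1 + 0.5              # the same bins, shifted by half a bin width
fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist(data, bins=bins1, density=True)
axes[1].hist(data, bins=bins2, density=True)
plt.show()
```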

Kernel Density Estimation: Idea

Kernel density estimation (KDE) is an alternative to histogram-based density estimation: instead of relying on a fixed grid of bins, it effectively centers a bin around each point at which we evaluate the density.

Tophat Kernel Density Estimation

The simplest form of this strategy (Tophat KDE) assumes a model of the form $$P_\delta(x) = \frac{N(x; \delta)}{n},$$ where $$ N(x; \delta) = |\{x^{(i)} : ||x^{(i)} - x || \leq \delta/2\}| $$ is the number of points that are within a bin of width $\delta$ centered at $x$.

This is best understood via a picture.
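A minimal sketch of Tophat KDE using scikit-learn's `KernelDensity` estimator on the data from above (the bandwidth value is an arbitrary choice for illustration):

```python
# Tophat KDE: the density at x is proportional to the number of training
# points within distance `bandwidth` of x.
import numpy as np
from sklearn.neighbors import KernelDensity

kde = KernelDensity(kernel='tophat', bandwidth=0.75)
kde.fit(data.reshape(-1, 1))                  # sklearn expects a 2D array

x_query = np.linspace(-4, 9, 500).reshape(-1, 1)
density = np.exp(kde.score_samples(x_query))  # score_samples returns log P(x)
```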

The above algorithm still has the problem of producing a density estimate that is not smooth.

We are going to resolve this by replacing histogram counts with weighted averages.

Review: Kernels

A kernel function $K : \mathcal{X} \times \mathcal{X} \to [0, \infty]$ maps pairs of vectors $x, z \in \mathcal{X}$ to a real-valued score $K(x,z)$.

A kernel can be interpreted either as a measure of similarity between $x$ and $z$, or as an inner product between feature representations of $x$ and $z$. We will use the first interpretation here.

Kernel Density Estimation

A kernelized density model $P$ takes the form: $$P(x) \propto \sum_{i=1}^n K(x, x^{(i)}).$$ This can be interpreted in two ways: each data point $x^{(i)}$ contributes a "soft count" $K(x, x^{(i)})$ to the density at $x$, or, equivalently, we place a small kernel-shaped "bump" at each data point and sum the bumps.

Types of Kernels

We have seen several types of kernels in the context of support vector machines.

There are additional kernels that are popular for density estimation.

The following kernels are available in scikit-learn: Gaussian, Tophat, Epanechnikov, Exponential, Linear, and Cosine.

It's easier to understand these kernels by looking at a figure.
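Since the figure is not reproduced here, the following sketch plots the kernel shapes ourselves (up to normalization constants, with the bandwidth fixed to 1); the formulas are the standard ones used for density estimation:

```python
# Plot the six kernel shapes supported by sklearn.neighbors.KernelDensity,
# up to normalization, on the interval [-1.5, 1.5] with bandwidth h = 1.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1.5, 1.5, 500)
kernels = {
    'gaussian':     np.exp(-x**2 / 2),
    'tophat':       (np.abs(x) < 1).astype(float),
    'epanechnikov': np.clip(1 - x**2, 0, None),
    'exponential':  np.exp(-np.abs(x)),
    'linear':       np.clip(1 - np.abs(x), 0, None),
    'cosine':       np.cos(np.pi * x / 2) * (np.abs(x) < 1),
}
for name, values in kernels.items():
    plt.plot(x, values, label=name)
plt.legend()
plt.show()
```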

Kernel Density Estimation: Example

Let's look at an example in the context of the 1D points we have seen earlier.

We will fit a model of the form $$P(x) \propto \sum_{i=1}^n K(x, x^{(i)})$$ with a Gaussian kernel $K(x,z; \delta) \propto \exp(-||x-z||^2/2\delta^2)$.
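A minimal sketch of this fit with scikit-learn, overlaid on a histogram of the same data (the bandwidth is chosen by hand):

```python
# Gaussian-kernel KDE on the 1D dataset from above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(data.reshape(-1, 1))
x_query = np.linspace(-4, 9, 500).reshape(-1, 1)
plt.plot(x_query[:, 0], np.exp(kde.score_samples(x_query)), label='KDE')
plt.hist(data, bins=20, density=True, alpha=0.3, label='histogram')
plt.legend()
plt.show()
```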

KDE in Higher Dimensions

In principle, kernel density estimation also works in higher dimensions.

However, the number of datapoints needed for a good fit increases exponentially with the dimension, which limits the applicability of this model in high dimensions.

Choosing Hyperparameters

Each kernel has a notion of "bandwidth" $\delta$. This is a hyperparameter that controls the "smoothness" of the fit.

Let's illustrate how the bandwidth affects smoothness via an example.
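A minimal sketch of such a comparison (the specific bandwidth values are arbitrary): very small bandwidths give a jagged, overfit estimate, while very large ones oversmooth the two modes into one.

```python
# Compare Gaussian KDE fits with several bandwidths.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

x_query = np.linspace(-4, 9, 500).reshape(-1, 1)
for bandwidth in [0.05, 0.5, 3.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(data.reshape(-1, 1))
    plt.plot(x_query[:, 0], np.exp(kde.score_samples(x_query)),
             label=f'bandwidth = {bandwidth}')
plt.legend()
plt.show()
```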

Algorithm: Kernel Density Estimation
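To make the algorithm explicit, here is a minimal from-scratch sketch of a 1D Gaussian-kernel density estimator (our own illustration rather than the lecture's code); note that "fitting" amounts to storing the dataset, and all the work happens at evaluation time.

```python
import numpy as np

def kde_density(x_query, data, bandwidth=0.5):
    """Evaluate a 1D Gaussian-kernel density estimate at the points x_query."""
    # differences between every query point and every data point
    diffs = x_query[:, None] - data[None, :]
    # one normalized Gaussian bump per data point ...
    bumps = np.exp(-diffs ** 2 / (2 * bandwidth ** 2))
    bumps /= np.sqrt(2 * np.pi) * bandwidth
    # ... averaged over the n data points
    return bumps.mean(axis=1)

# Usage: densities = kde_density(np.linspace(-4, 9, 500), data)
```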

Pros and Cons of KDE

Pros:

  1. KDE is non-parametric: given enough data, it can approximate essentially any density.
  2. There is no training procedure; we simply store the dataset.

Cons:

  1. The kernel and its bandwidth need to be chosen carefully, e.g., via cross-validation.
  2. The entire dataset must be stored, and evaluating $P(x)$ requires iterating over all $n$ points.
  3. It scales poorly to high-dimensional data.

Part 3: Latent Variable Models

Probabilistic models we have seen earlier often need to approximate complex distributions.

In order to make our models more expressive, we introduce additional structure in the form of latent variables.

Review: Probabilistic Models

An unsupervised probabilistic model is a probability distribution $$P(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$.

Probabilistic models also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x) : \mathcal{X} \to [0,1].$$

Review: Maximum Likelihood

In maximum likelihood learning, we maximize the empirical log-likelihood $$ \max_{P_\theta} \hspace{2mm} \frac{1}{|\cd|}\sum_{\x\in \cd} \log P_\theta(\x), $$ where $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ is a dataset of independent and identically distributed (IID) samples from $P_\text{data}$.

Latent Variable Models: Motivation

Consider the following dataset of human faces. It exhibits many factors of variation, such as pose, hair color, and facial expression, that are not explicitly labeled in the data.

Idea: Explicitly model these factors using latent variables $z$.

Latent Variable Models: Definition

A latent-variable model is a probability distribution $$P_\theta(x, z) : \mathcal{X} \times \mathcal{Z} \to [0,1]$$

containing two sets of variables: observed variables $x \in \mathcal{X}$, which we see in the data, and latent variables $z \in \mathcal{Z}$, which are never observed.

This model defines a marginal distribution $P_\theta(x) = \sum_{z \in \mathcal{Z}} P_\theta(x,z)$ that can approximate the data distribution $P_\text{data}(x)$.

Latent Variable Models: Example

Consider the following example of latent variables.

Only the shaded variables $x$ are observed in the data (pixel values). The latent variables $z$ correspond to high-level features.

Mixtures of Gaussians

A mixture of Gaussians is a probability distribution $P_\theta(x,z)$ that factorizes into two components:

  1. $P_\theta(z = k) = \phi_k$, a categorical distribution over $K$ classes with parameters $\phi_1, \ldots, \phi_K$.
  2. $P_\theta(x \mid z = k) = \mathcal{N}(x; \mu_k, \Sigma_k)$, a Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$ for each class $k$.

Thus, the marginal distribution $P_\theta(x)$ is a mixture of $K$ Gaussians: $$P_\theta(x) = \sum_{k=1}^K P_\theta(z=k) P_\theta(x|z=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$$
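As a sketch (with mixture weights, means, and standard deviations chosen arbitrarily for illustration), the marginal density of a 1D mixture of Gaussians can be evaluated as follows:

```python
# Evaluate P(x) = sum_k phi_k * N(x; mu_k, sigma_k^2) for a 1D mixture.
import numpy as np
from scipy.stats import norm

phi = np.array([0.3, 0.7])    # mixture weights P(z = k)
mu = np.array([0.0, 5.0])     # component means
sigma = np.array([1.0, 1.0])  # component standard deviations

def gmm_density(x):
    return sum(phi[k] * norm.pdf(x, loc=mu[k], scale=sigma[k])
               for k in range(len(phi)))
```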

Mixtures of Gaussians fit more complex distributions than one Gaussian.

(Figure: the raw data, a fit with a single Gaussian, and a fit with a mixture of Gaussians.)

Representational Power of LVMs

An important reason for using LVMs is that they are more expressive than models without latent variables: for example, a mixture of Gaussians is multi-modal, while a single Gaussian is not.

Feature Representations from LVMs

Given $P_\theta(x,z)$ we can compute $P_\theta(z|x)$ to find useful latent representations.

Latent variables are also useful to identify clusters in the data.
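A minimal sketch with scikit-learn's `GaussianMixture` on the 1D data from earlier: `predict_proba` returns the posterior $P_\theta(z = k \mid x)$ for each point, which serves as a soft cluster assignment (the number of components is our choice).

```python
# Fit a 2-component Gaussian mixture and compute P(z | x) for each point.
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(data.reshape(-1, 1))
posterior = gmm.predict_proba(data.reshape(-1, 1))  # shape (n, 2): P(z=k | x)
clusters = posterior.argmax(axis=1)                 # hard cluster assignments
```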

Learning Latent Variable Models

We can learn latent variable models using maximum likelihood: $$ \sum_{\x\in \cd} \log \Pr(\x ; \theta) = \sum_{\x\in \cd} \log \sum_{z \in \mathcal{Z}}\Pr(\x, z; \theta) $$

However, optimizing this objective is almost always intractable: the log of a sum over $z$ does not decompose into simple terms, and the sum itself may range over exponentially many values of $z$.

Approximate Inference in LVMs

In practice, we need to compute the likelihood objective (and its gradients) approximately, for example using sampling-based or variational methods.

Summary of LVMs

Latent-variable models are an important class of machine learning models. They can represent complex, multi-modal distributions and provide interpretable latent representations of the data, such as clusters.

They also have drawbacks: the likelihood is intractable to compute exactly, and learning requires approximate and typically non-convex optimization procedures.