 # Lecture 17: Density Estimation¶

### Applied Machine Learning¶

Volodymyr Kuleshov
Cornell Tech

# Part 1: Unsupervised Probabilistic Models¶

Density estimation is the problem of estimating a probability distribution from data.

As a first step, we will introduce probabilistic models for unsupervised learning.

# Review: Unsupervised Learning¶

We have a dataset without labels. Our goal is to learn something interesting about the structure of the data:

• Clusters hidden in the dataset.
• Outliers: particularly unusual and/or interesting datapoints.
• Useful signal hidden in noise, e.g. human speech over a noisy phone.

# Components of an Unsupervised Learning Problem¶

At a high level, an unsupervised machine learning problem has the following structure:

$$\underbrace{\text{Dataset}}_\text{Attributes} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Unsupervised Model}$$

The unsupervised model describes interesting structure in the data. For instance, it can identify interesting hidden clusters.

# Review: Data Distribution¶

We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$

The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.

# Review: Unsupervised Models¶

We'll say that a model is a function $$f : \mathcal{X} \to \mathcal{S}$$ that maps inputs $x \in \mathcal{X}$ to some notion of structure $s \in \mathcal{S}$.

Structure can have many definitions (clusters, low-dimensional representations, etc.), and we will see many examples.

Often, models have parameters $\theta \in \Theta$ living in a set $\Theta$. We will then write the model as $$f_\theta : \mathcal{X} \to \mathcal{S}$$ to denote that it's parametrized by $\theta$.

# Unsupervised Probabilistic Models¶

An unsupervised probabilistic model is a probability distribution $$P(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$.

Probabilistic models also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x) : \mathcal{X} \to [0,1].$$

# Why Use Probabilistic Models?¶

There are many tasks that we can solve with a good model $P_\theta$.

1. Generation: sample new objects from $P_\theta$, such as images.
2. Representation learning: find interesting structure in $P_\text{data}$.
3. Density estimation: approximate $P_\theta \approx P_\text{data}$ and use it to solve any downstream task (generation, clustering, outlier detection, etc.).

We are going to be interested in the last of these tasks: density estimation.

# Kullback-Leibler Divergence¶

In order to approximate $P_\text{data}$ with $P_\theta$, we need a measure of distance between distributions.

A standard measure of similarity between distributions is the Kullback-Leibler (KL) divergence between two distributions $p$ and $q$, defined as $$D(p \| q) = \sum_{{\bf x}} p({\bf x}) \log \frac{p({\bf x})}{q({\bf x})}.$$

#### Observations:¶

• $D(p \, \| \, q) \geq 0$ for all $p, q$, with equality if and only if $p = q$. Proof: \begin{align*} D(p \,\|\, q) = \mathbb{E}_{{\bf x}\sim p}\left[-\log \frac{q({\bf x})}{p({\bf x})}\right] & \geq -\log \left( \mathbb{E}_{{\bf x}\sim p} \left[\frac{q({\bf x})}{p({\bf x})}\right] \right) \\ & = -\log \left( \sum_{{\bf x}} p({\bf x}) \frac{q({\bf x})}{p({\bf x})} \right) = 0, \end{align*} where the first line uses Jensen's inequality.
• The KL divergence is asymmetric: in general, $D(p \| q) \neq D(q \| p)$.
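To make the definition concrete, here is a minimal numpy sketch that computes the KL divergence between two discrete distributions (the distributions `p` and `q` below are made-up examples) and illustrates the asymmetry.

```python
import numpy as np

def kl_divergence(p, q):
    """Compute D(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Two made-up distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.3, 0.4])

print(kl_divergence(p, p))  # 0.0: equality holds iff the distributions are equal
print(kl_divergence(p, q))  # positive, and in general...
print(kl_divergence(q, p))  # ...not equal to D(p || q): the divergence is asymmetric
```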

# Learning Models Using KL Divergence¶

We may now learn a probabilistic model $P_\theta(x)$ that approximates $P_\text{data}(x)$ via the KL divergence: \begin{align*} D(P_{\textrm{data}} \mid \mid {P_\theta}) & = \mathbb{E}_{x \sim P_{\textrm{data}}}{\log\left( \frac{P_{\textrm{data}}(x)}{P_\theta(x)} \right)} \\ & = \sum_{{x}} P_{\textrm{data}}({x}) \log \frac{P_{\textrm{data}}({x})}{P_\theta(x)} \end{align*}

Note that $D(P_{\textrm{data}} \mid \mid {P_\theta})=0$ iff the two distributions are the same.

# From KL Divergence to Log Likelihood¶

We can learn $P_\theta$ that approximates $P_\text{data}$ by minimizing $D(P_{\textrm{data}} \mid \mid P_\theta)$. This objective further simplifies as: \begin{align*} D(P_{\textrm{data}} \mid \mid P_\theta) &= \mathbb{E}_{x \sim P_{\textrm{data}}}\left[\log\left( \frac{P_{\textrm{data}}(x)}{P_\theta(x)} \right)\right] \\ &= \mathbb{E}_{x \sim P_{\textrm{data}}}\left[\log P_{\textrm{data}}(x)\right] - \mathbb{E}_{x \sim P_{\textrm{data}}}\left[\log P_\theta(x)\right] \end{align*}

The first term does not depend on $P_\theta$: minimizing KL divergence is equivalent to maximizing the expected log-likelihood.

\begin{align*} \arg\min_{P_\theta} D(P_{\textrm{data}} \mid \mid P_\theta) & = \arg\min_{P_\theta} - \mathbb{E}_{x \sim P_{\textrm{data}}}\left[\log P_\theta(x)\right] \\ & = \arg\max_{P_\theta} \mathbb{E}_{x \sim P_{\textrm{data}}}\left[\log P_\theta(x)\right] \end{align*}
• This asks that $P_\theta$ assign high probability to instances sampled from $P_{\textrm{data}}$, so as to reflect the true distribution.
• Because of the $\log$, samples $x$ where $P_\theta(x) \approx 0$ weigh heavily in the objective.

Problem: In general we do not know $P_{\textrm{data}}$, hence the expected value is intractable to compute.

# Maximum Likelihood Estimation¶

Applying Monte Carlo estimation, we may approximate the expected log-likelihood $$\mathbb{E}_{x \sim P_{\textrm{data}}}\left[\log P_\theta(x)\right]$$ with the empirical log-likelihood: $$\mathbb{E}_{\mathcal{D}}\left[\log P_\theta(x)\right] = \frac{1}{|\mathcal{D}|}\sum_{x\in \mathcal{D}} \log P_\theta(x)$$

Maximum likelihood learning is then: $$\max_{P_\theta} \hspace{2mm} \frac{1}{|\mathcal{D}|}\sum_{x\in \mathcal{D}} \log P_\theta(x).$$

# Example: Flipping a Random Coin¶

How should we choose $P_\theta(x)$ if 3 out of 5 coin tosses are heads? Let's apply maximum likelihood learning.

• Our model is $P_\theta(x=H)=\theta$ and $P_\theta(x=T)=1-\theta$
• Our data is: $\cd=\{H,H,T,H,T\}$
• The likelihood of the data is $\prod_{i} P_\theta(x_i)=\theta \cdot \theta \cdot (1-\theta) \cdot \theta \cdot (1-\theta)$.

We optimize for the $\theta$ which makes $\mathcal{D}$ most likely. What is the solution in this case?
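One way to check the answer numerically is the following minimal sketch (the grid of candidate $\theta$ values is an arbitrary choice): the likelihood $\theta^3(1-\theta)^2$ is maximized at $\theta = 3/5$, the empirical fraction of heads.

```python
import numpy as np

# Likelihood of the dataset D = {H, H, T, H, T} under P_theta(x=H) = theta.
thetas = np.linspace(0.01, 0.99, 99)
likelihood = thetas**3 * (1 - thetas)**2

print(thetas[np.argmax(likelihood)])  # approximately 0.6 = 3/5
```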

# Part 2: Kernel Density Estimation¶

Next, let's look at a first example of probabilistic models and how they are used to perform density estimation.

# Review: Data Distribution¶

We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$

The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.

# Review: Unsupervised Probabilistic Models¶

An unsupervised probabilistic model is a probability distribution $$P_\theta(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$. It may have parameters $\theta$.

# Density Estimation¶

The problem of density estimation is to approximate the data distribution $P_\text{data}$ with the model $P$. $$P \approx P_\text{data}.$$

It's also a general learning task. We can solve many downstream tasks using a good model $P$:

• Outlier and novelty detection
• Generating new samples $x$
• Visualizing and understanding the structure of $P_\text{data}$

# Histogram Density Estimation¶

Perhaps the simplest approach to density estimation is by forming a histogram.

A histogram partitions the input space $\mathcal{X}$ into a $d$-dimensional grid and counts the number of points in each cell.

This is best illustrated by an example.

Let's start by creating a simple 1D dataset coming from a mixture of two Gaussians:

$$P_\text{data}(x) = 0.3 \cdot \mathcal{N}(x ; \mu=0, \sigma=1) + 0.7 \cdot \mathcal{N}(x ; \mu=5, \sigma=1)$$
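Below is a minimal sketch of how such a dataset could be generated with numpy (the sample size `n` and the random seed are arbitrary choices).

```python
import numpy as np

np.random.seed(0)
n = 1000  # arbitrary sample size

# Mixture of two Gaussians: with probability 0.3 sample from N(0, 1),
# with probability 0.7 sample from N(5, 1).
component = np.random.rand(n) < 0.3
X = np.where(component,
             np.random.normal(loc=0, scale=1, size=n),
             np.random.normal(loc=5, scale=1, size=n))
```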

We can now estimate the density using a histogram.
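A sketch of a histogram density estimate of this data with matplotlib, reusing the array `X` from the sampling sketch above (the number of bins is an arbitrary choice):

```python
import matplotlib.pyplot as plt

# Normalized histogram: the bar heights integrate to one, giving a density estimate.
plt.hist(X, bins=20, density=True)
plt.xlabel('$x$')
plt.ylabel('Estimated density')
plt.show()
```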

# Limitations of Histograms¶

Histogram-based methods have a number of shortcomings.

• The number of grid cells increases exponentially with dimension $d$.
• The histogram is not "smooth".
• The shape of the histogram depends on the bin positions.

We will now try to address the last two limitations.

Let's also visualize what we mean when we say that the shape of the histogram depends on the bin positions.
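One possible way to visualize this, again reusing `X` from the earlier sketch: plot two histograms with the same bin width but shifted bin edges.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two sets of bins with identical width but edges shifted by half a bin.
bins1 = np.arange(-3, 9, 1.0)
bins2 = bins1 + 0.5

fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist(X, bins=bins1, density=True)
axes[0].set_title('Original bin edges')
axes[1].hist(X, bins=bins2, density=True)
axes[1].set_title('Bin edges shifted by 0.5')
plt.show()
```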

# Kernel Density Estimation: Idea¶

Kernel density estimation (KDE) is a different approach to histogram estimation.

• A histogram has $b$ bins of width $\delta$ at fixed positions.
• KDE effectively places a bin of width $\delta$ at each $x \in \mathcal{X}$.
• To obtain $P(x)$, we count the % of points that fall in the bin centered at $x$.

# Tophat Kernel Density Estimation¶

The simplest form of this strategy (Tophat KDE) assumes a model of the form $$P_\delta(x) = \frac{N(x; \delta)}{n},$$ where $$N(x; \delta) = |\{x^{(i)} : ||x^{(i)} - x || \leq \delta/2\}|$$ is the number of points that are within a bin of width $\delta$ centered at $x$.

This is best understood via a picture.
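Here is a minimal sketch of this estimator, reusing `X` from earlier (the bandwidth value and the query grid are arbitrary choices); it evaluates the fraction of datapoints within $\delta/2$ of each query point.

```python
import numpy as np
import matplotlib.pyplot as plt

def tophat_kde(x_query, X, delta):
    """Fraction of datapoints within delta/2 of each query point."""
    # |x_query - x_i| <= delta/2 for every (query point, datapoint) pair.
    counts = (np.abs(x_query[:, None] - X[None, :]) <= delta / 2).sum(axis=1)
    return counts / len(X)

x_query = np.linspace(-4, 9, 500)
plt.plot(x_query, tophat_kde(x_query, X, delta=1.0))
plt.show()
```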

The above algorithm still has the problem of producing a density estimate that is not smooth.

We are going to resolve this by replacing histogram counts with weighted averages.

# Review: Kernels¶

A kernel function $K : \mathcal{X} \times \mathcal{X} \to [0, \infty]$ maps pairs of vectors $x, z \in \mathcal{X}$ to a real-valued score $K(x,z)$.

• A kernel represents the similarity between $x$ and $z$.
• It also often encodes the dot product between $x$ and $z$ in some high-dimensional feature space.

We will use the first interpretation here.

# Kernel Density Estimation¶

A kernelized density model $P$ takes the form: $$P(x) \propto \sum_{i=1}^n K(x, x^{(i)}).$$ This can be interpreted in different ways:

• We count the number of points "near" $x$, but each $x^{(i)}$ has a weight $K(x, x^{(i)})$ that depends on similarity between $x, x^{(i)}$.
• We place a "micro-density" $K(x, x^{(i)})$ at each $x^{(i)}$; the final density $P(x)$ is their sum.

# Types of Kernels¶

We have seen several types of kernels in the context of support vector machines.

There are additional kernels that are popular for density estimation.

The following kernels are available in scikit-learn.

• Gaussian kernel $K(x,z; \delta) \propto \exp(-||x-z||^2/2\delta^2)$
• Tophat kernel $K(x,z; \delta) = 1 \text{ if } ||x-z|| \leq \delta/2$ else $0$.
• Epanechnikov kernel $K(x,z; \delta) \propto 1 - ||x-z||^2/\delta^2$
• Exponential kernel $K(x,z; \delta) \propto \exp(-||x-z||/\delta)$
• Linear kernel $K(x,z; \delta) \propto (1 - ||x-z||/\delta)^+$

It's easier to understand these kernels by looking at a figure.
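One way to produce such a figure is the following rough sketch, which plots each unnormalized kernel from the list above as a function of the (signed) distance $x - z$ in one dimension, with the bandwidth $\delta$ fixed to an arbitrary value of 1.

```python
import numpy as np
import matplotlib.pyplot as plt

delta = 1.0
u = np.linspace(-3, 3, 500)  # signed distance x - z in 1D

# Unnormalized kernel profiles, following the formulas above.
kernels = {
    'gaussian':     np.exp(-u**2 / (2 * delta**2)),
    'tophat':       (np.abs(u) <= delta / 2).astype(float),
    'epanechnikov': np.clip(1 - u**2 / delta**2, 0, None),
    'exponential':  np.exp(-np.abs(u) / delta),
    'linear':       np.clip(1 - np.abs(u) / delta, 0, None),
}

for name, values in kernels.items():
    plt.plot(u, values, label=name)
plt.legend()
plt.show()
```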

# Kernel Density Estimation: Example¶

Let's look at an example in the context of the 1D points we have seen earlier.

We will fit a model of the form $$P(x) = \sum_{i=1}^n K(x, x^{(i)})$$ with a Gaussian kernel $K(x,z; \delta) \propto \exp(-||x-z||^2/2\delta^2)$.
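A sketch of this fit using `sklearn.neighbors.KernelDensity` on the dataset `X` sampled earlier (the bandwidth value is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

# scikit-learn expects a 2D array of shape (n_samples, n_features).
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X.reshape(-1, 1))

x_query = np.linspace(-4, 9, 500).reshape(-1, 1)
log_density = kde.score_samples(x_query)  # log P(x) at each query point

plt.plot(x_query[:, 0], np.exp(log_density))
plt.show()
```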

# KDE in Higher Dimensions¶

In principle, kernel density estimation also works in higher dimensions.

However, the number of datapoints needed for a good fit increases exponentially with the dimension, which limits the applications of this model in high dimensions.

# Choosing Hyperparameters¶

Each kernel has a notion of "bandwidth" $\delta$. This is a hyperparameter that controls the "smoothness" of the fit.

• We can choose it using inspection or heuristics like we did for $K$ in $K$-Means.
• Because we have a probabilistic model, we can also estimate likelihood on a holdout dataset (more on this later!)

Let's illustrate how the bandwidth affects smoothness via an example.
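A possible illustration, reusing `X` from earlier: fit the same Gaussian KDE with several bandwidth values (the specific values are arbitrary) and compare the resulting densities.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

x_query = np.linspace(-4, 9, 500).reshape(-1, 1)

# Small bandwidths give spiky estimates; large bandwidths oversmooth.
for bandwidth in [0.05, 0.5, 3.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(X.reshape(-1, 1))
    plt.plot(x_query[:, 0], np.exp(kde.score_samples(x_query)),
             label=f'bandwidth = {bandwidth}')
plt.legend()
plt.show()
```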

# Algorithm: Kernel Density Estimation¶

• Type: Unsupervised learning (density estimation).
• Model family: Non-parametric. Sum of $n$ kernels.
• Objective function: Log-likelihood to choose optimal bandwidth.
• Optimizer: Grid search (see the sketch below).
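Since `KernelDensity` exposes the held-out log-likelihood through its `score` method, bandwidth selection by grid search can be sketched with scikit-learn's `GridSearchCV` (the bandwidth grid is an arbitrary choice, and `X` is the 1D dataset sampled earlier):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# KernelDensity.score returns the total log-likelihood of held-out data, so
# cross-validated grid search selects the bandwidth with the best holdout likelihood.
params = {'bandwidth': np.logspace(-1, 1, 20)}
search = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
search.fit(X.reshape(-1, 1))

print(search.best_params_)  # bandwidth with the highest held-out log-likelihood
```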

# Pros and Cons of KDE¶

Pros:

• Can approximate any data distribution arbitrarily well.

Cons:

• Need to store entire dataset to make queries, which is computationally prohibitive.
• The number of datapoints needed scales exponentially with the dimension ("curse of dimensionality").

# Part 3: Latent Variable Models¶

Probabilistic models we have seen earlier often need to approximate complex distributions.

In order to make our models more expressive, we introduce additional structure in the form of latent variables.

# Review: Probabilistic Models¶

An unsupervised probabilistic model is a probability distribution $$P(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$.

Probabilistic models also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x) : \mathcal{X} \to [0,1].$$

# Review: Maximum Likelihood¶

In maximum likelihood learning, we maximize the empirical log-likelihood $$\max_{P_\theta} \hspace{2mm} \frac{1}{|\mathcal{D}|}\sum_{x\in \mathcal{D}} \log P_\theta(x),$$ where $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ is a dataset of independent and identically distributed (IID) samples from $P_\text{data}$.

# Latent Variable Models: Motivation¶

Consider the following dataset of human faces.