Density estimation is the problem of estimating a probability distribution from data.
As a first step, we will introduce probabilistic models for unsupervised learning.
We have a dataset without labels. Our goal is to learn something interesting about the structure of the data:
At a high level, an unsupervised machine learning problem has the following structure:
$$ \underbrace{\text{Dataset}}_\text{Attributes} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Unsupervised Model} $$The unsupervised model describes interesting structure in the data. For instance, it can identify interesting hidden clusters.
We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$
The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.
We'll say that a model is a function $$ f : \mathcal{X} \to \mathcal{S} $$ that maps inputs $x \in \mathcal{X}$ to some notion of structure $s \in \mathcal{S}$.
Structure can have many definitions (clusters, low-dimensional representations, etc.), and we will see many examples.
Often, models have parameters $\theta \in \Theta$ living in a set $\Theta$. We will then write the model as $$ f_\theta : \mathcal{X} \to \mathcal{S} $$ to denote that it's parametrized by $\theta$.
An unsupervised probabilistic model is a probability distribution $$P(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$.
Probabilistic models also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x) : \mathcal{X} \to [0,1].$$
There are many tasks that we can solve with a good model $P_\theta$.
We are going to be particularly interested in density estimation: using $P_\theta$ to approximate $P_\text{data}$ itself.
In order to approximate $P_\text{data}$ with $P_\theta$, we need a measure of distance between distributions.
A standard way to measure how different two distributions $p$ and $q$ are is the Kullback-Leibler (KL) divergence, defined as $$ D(p \| q) = \sum_{{\bf x}} p({\bf x}) \log \frac{p({\bf x})}{q({\bf x})}. $$
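As a quick numerical sketch, we can evaluate the KL divergence between two small discrete distributions directly; the arrays `p` and `q` below are made-up examples, and `kl_divergence` is our own helper name.
import numpy as np
def kl_divergence(p, q):
    # D(p || q) = sum_x p(x) log(p(x) / q(x)) for discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))
p = np.array([0.1, 0.6, 0.3])  # an example distribution p
q = np.array([0.2, 0.5, 0.3])  # an example distribution q
print(kl_divergence(p, q))     # positive, since p != q
print(kl_divergence(p, p))     # exactly 0: the divergence vanishes iff the distributions match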
We may now learn a probabilistic model $P_\theta(x)$ that approximates $P_\text{data}(x)$ via the KL divergence: \begin{align*} D(P_{\textrm{data}} \mid \mid {P_\theta}) & = \mathbb{E}_{x \sim P_{\textrm{data}}}{\log\left( \frac{P_{\textrm{data}}(x)}{P_\theta(x)} \right)} \\ & = \sum_{{x}} P_{\textrm{data}}({x}) \log \frac{P_{\textrm{data}}({x})}{P_\theta(x)} \end{align*}
Note that $D(P_{\textrm{data}} \mid \mid {P_\theta})=0$ iff the two distributions are the same.
$ \newcommand{\x}{x} \newcommand{\ex}[2]{\mathbb{E}_{#1 \sim #2}} \newcommand{\en}[2]{D(#1 \mid \mid #2)} $
We can learn a $P_\theta$ that approximates $P_\text{data}$ by minimizing $D(P_{\textrm{data}} \mid \mid P_\theta)$. This objective further simplifies as: \begin{align*} \en{P_{\textrm{data}}}{P_\theta} & = \ex{\x}{P_{\textrm{data}}}{\log\left( \frac{P_{\textrm{data}}(\x)}{P_\theta(\x)} \right)} \\ & = \ex{\x}{P_{\textrm{data}}}{\log P_{\textrm{data}}(\x)} - \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} \end{align*}
The first term does not depend on $P_\theta$: minimizing KL divergence is equivalent to maximizing the expected log-likelihood.
\begin{align*} \arg\min_{P_\theta} \en{P_{\textrm{data}}}{P_\theta} & = \arg\min_{P_\theta} - \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} \\ & = \arg\max_{P_\theta} \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} \end{align*}Problem: in general we do not know $P_{\textrm{data}}$, hence the expectation over it cannot be computed exactly.
$ \newcommand{\exd}[2]{\mathbb{E}_{#1 \sim #2}} \newcommand{\cd}{\mathcal{D}} $
Applying Monte Carlo estimation, we may approximate the expected log-likelihood $$ \ex{\x}{P_{\textrm{data}}}{\log P_\theta(\x)} $$ with the empirical log-likelihood: $$ \exd{\x}{\cd}{\log P_\theta(\x)} = \frac{1}{|\cd|}\sum_{\x\in \cd} \log P_\theta(\x) $$
Maximum likelihood learning is then: $$ \max_{P_\theta} \hspace{2mm} \frac{1}{|\cd|}\sum_{\x\in \cd} \log P_\theta(\x). $$
How should we choose $P_\theta(x)$ if 3 out of 5 coin tosses are heads? Let's apply maximum likelihood learning.
We optimize for $\theta$ which makes $\cd$ most likely. What is the solution in this case?
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
# our dataset is {H, H, T, H, T}; if theta = P(x=H), we get:
coin_likelihood = lambda theta: theta*theta*(1-theta)*theta*(1-theta)
theta_vals = np.linspace(0,1)
plt.plot(theta_vals, coin_likelihood(theta_vals))
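We can also find the maximizer analytically as a check on the plot. The likelihood of the dataset is $L(\theta) = \theta^3 (1-\theta)^2$, and setting the derivative of the log-likelihood to zero gives $$ \frac{d}{d\theta}\left[3 \log \theta + 2 \log (1-\theta)\right] = \frac{3}{\theta} - \frac{2}{1-\theta} = 0 \quad\Longrightarrow\quad \theta^* = \frac{3}{5} = 0.6, $$ which matches the peak of the curve: the maximum likelihood estimate equals the empirical frequency of heads.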
Next, let's look at a first example of probabilistic models and how they are used to perform density estimation.
We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$
The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.
An unsupervised probabilistic model is a probability distribution $$P_\theta(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$. It may have parameters $\theta$.
The problem of density estimation is to approximate the data distribution $P_\text{data}$ with the model $P$. $$ P \approx P_\text{data}. $$
Density estimation is also a general learning task: many downstream tasks can be solved using a good model $P$.
Perhaps the simplest approach to density estimation is by forming a histogram.
A histogram partitions the input space $\mathcal{X}$ into a $d$-dimensional grid and counts the number of points in each cell.
This is best illustrated by an example.
Let's start by creating a simple 1D dataset coming from a mixture of two Gaussians:
$$P_\text{data}(x) = 0.3 \cdot \mathcal{N}(x ; \mu=0, \sigma=1) + 0.7 \cdot \mathcal{N}(x ; \mu=5, \sigma=1)$$
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html
import numpy as np
np.random.seed(1)
N = 20 # number of points
# concat samples from two Gaussians:
X = np.concatenate((
np.random.normal(0, 1, int(0.3 * N)),
np.random.normal(5, 1, int(0.7 * N))
))[:, np.newaxis]
bins = np.linspace(-5, 10, 10) # locations of the bins
# print out X
print(X.flatten())
[ 1.62434536 -0.61175641 -0.52817175 -1.07296862 0.86540763 -2.3015387 6.74481176 4.2387931 5.3190391 4.75062962 6.46210794 2.93985929 4.6775828 4.61594565 6.13376944 3.90010873 4.82757179 4.12214158 5.04221375 5.58281521]
We can now estimate the density using a histogram.
import matplotlib.pyplot as plt
plt.hist(X[:, 0], bins=bins, density=True) # plot the histogram
plt.plot(X[:, 0], np.full(X.shape[0], -0.01), '.k', markersize=10) # plot the points in X
plt.xlim(-4, 9)
plt.ylim(-0.02, 0.25)
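To make explicit what `density=True` computes, here is a minimal sketch using `np.histogram` directly: each bar height is the bin count divided by the total number of points times the bin width, so that the bars integrate to one (the variable names below are our own).
counts, bin_edges = np.histogram(X[:, 0], bins=bins)  # raw counts per bin
bin_widths = np.diff(bin_edges)  # width of each bin
density = counts / (counts.sum() * bin_widths)  # normalize so the histogram integrates to 1
print(density)  # should match the bar heights plotted above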
Histogram-based methods have a number of shortcomings: the number of grid cells grows exponentially with the dimension $d$, the resulting density estimate is not smooth, and its shape depends on the exact placement of the bins.
We will now try to address the last two limitations.
Let's also visualize what we mean when we say that the shape of the histogram depends on the placement of the bins.
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12,4))
ax[0].hist(X[:, 0], bins=bins, density=True) # plot the histogram
ax[1].hist(X[:, 0], bins=bins+0.75, density=True) # plot the histogram with bin edges shifted by 0.75
for axi in ax.ravel():
axi.plot(X[:, 0], np.full(X.shape[0], -0.01), '.k', markersize=10) # plot the points in X
axi.set_xlim(-4, 9)
axi.set_ylim(-0.02, 0.3)
Kernel density estimation (KDE) is an alternative approach to density estimation.
The simplest form of this strategy (Tophat KDE) assumes a model of the form $$P_\delta(x) = \frac{N(x; \delta)}{n},$$ where $$ N(x; \delta) = |\{x^{(i)} : ||x^{(i)} - x || \leq \delta/2\}|, $$ is the number of points that are within a bin of width $\delta$ centered at $x$.
This is best understood via a picture.
from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='tophat', bandwidth=0.75).fit(X) # fit a KDE model
x_ticks = np.linspace(-5, 10, 1000)[:, np.newaxis] # choose 1000 points on x-axis
log_density = kde.score_samples(x_ticks) # compute density at 1000 points
plt.fill(x_ticks[:, 0], np.exp(log_density)) # plot the density estimate
plt.plot(X[:, 0], np.full(X.shape[0], -0.01), '.k', markersize=10) # plot the points in X
plt.xlim(-4, 9)
plt.ylim(-0.02, 0.32)
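To connect the formula $P_\delta(x) = N(x;\delta)/n$ with the curve above, here is a from-scratch sketch of the same estimate (the function name `tophat_density` is our own). To obtain a proper density that integrates to one, the count is additionally divided by the bin width $\delta$; a `bandwidth` of 0.75 in sklearn corresponds to $\delta/2 = 0.75$, i.e. $\delta = 1.5$.
def tophat_density(x, X, delta):
    # N(x; delta): number of training points within a bin of width delta centered at x
    count = np.sum(np.abs(X[:, 0] - x) <= delta / 2)
    # divide by n * delta so that the estimate integrates to 1
    return count / (len(X) * delta)
manual = np.array([tophat_density(x, X, delta=1.5) for x in x_ticks[:, 0]])
print(np.max(np.abs(manual - np.exp(log_density))))  # should be (close to) zero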
The above algorithm still has the problem of producing a density estimate that is not smooth.
We are going to resolve this by replacing histogram counts with weighted averages.
A kernel function $K : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ maps pairs of vectors $x, z \in \mathcal{X}$ to a real-valued score $K(x,z)$. A kernel can be interpreted either as a measure of similarity between $x$ and $z$, or as a dot product between $x$ and $z$ in some high-dimensional feature space.
We will use the first interpretation (kernels as similarity measures) here.
A kernelized density model $P$ takes the form: $$P(x) \propto \sum_{i=1}^n K(x, x^{(i)}).$$ This can be interpreted in two ways: the density at $x$ is a smoothed count of the training points, weighted by their similarity $K(x, x^{(i)})$ to $x$; equivalently, the model places a small "bump" of probability mass, shaped by $K$, around each training point.
We have seen several types of kernels in the context of support vector machines.
There are additional kernels that are popular for density estimation.
The following kernels are available in scikit-learn: Gaussian, Tophat, Epanechnikov, Exponential, Linear, and Cosine.
It's easier to understand these kernels by looking at a figure.
# https://scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html
X_plot = np.linspace(-6, 6, 1000)[:, None]
X_src = np.zeros((1, 1))
fig, ax = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(12,4))
fig.subplots_adjust(left=0.05, right=0.95, hspace=0.05, wspace=0.05)
def format_func(x, loc):  # label the x-axis ticks in multiples of the bandwidth (delta/2)
    if x == 0:
        return '0'
    elif x == 1:
        return r'$\delta/2$'
    elif x == -1:
        return r'-$\delta/2$'
    else:
        return r'%i$\delta$' % (int(x / 2))
for i, kernel in enumerate(['gaussian', 'tophat', 'epanechnikov',
'exponential', 'linear', 'cosine']):
axi = ax.ravel()[i]
log_dens = KernelDensity(kernel=kernel).fit(X_src).score_samples(X_plot)
axi.fill(X_plot[:, 0], np.exp(log_dens), '-k', fc='#AAAAFF')
axi.text(-2.6, 0.95, kernel)
axi.xaxis.set_major_formatter(plt.FuncFormatter(format_func))
axi.xaxis.set_major_locator(plt.MultipleLocator(1))
axi.yaxis.set_major_locator(plt.NullLocator())
axi.set_ylim(0, 1.05)
axi.set_xlim(-2.9, 2.9)
ax[0, 1].set_title('Available Kernels')
Let's look at an example in the context of the 1D points we have seen earlier.
We will fit a model of the form $$P(x) \propto \sum_{i=1}^n K(x, x^{(i)})$$ with a Gaussian kernel $K(x,z; \delta) \propto \exp(-||x-z||^2/(2\delta^2))$.
from sklearn.neighbors import KernelDensity
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X) # fit a KDE model
x_ticks = np.linspace(-5, 10, 1000)[:, np.newaxis] # choose 1000 points on x-axis
log_density = kde.score_samples(x_ticks) # compute density at 1000 points
gaussian_kernel = lambda z : lambda x: np.exp(-np.abs(x-z)**2/(2*0.75**2)) # unnormalized Gaussian kernel with bandwidth 0.75
kernel_linspace = lambda x : np.linspace(x-1.2,x+1.2,30)
plt.figure(figsize=(12,4))
plt.plot(x_ticks[:, 0], np.exp(log_density)) # plot the density estimate
plt.plot(X[:, 0], np.full(X.shape[0], -0.01), '.k', markersize=10) # plot the points in X
plt.plot(kernel_linspace(4), 0.07*gaussian_kernel(4)(kernel_linspace(4)), '--', color='r', alpha=0.75)
plt.plot(kernel_linspace(5), 0.07*gaussian_kernel(5)(kernel_linspace(5)), '--', color='r', alpha=0.75)
plt.plot(kernel_linspace(1), 0.07*gaussian_kernel(1)(kernel_linspace(1)), '--', color='r', alpha=0.75)
plt.xlim(-4, 9)
plt.ylim(-0.02, 0.32)
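As a sanity check (our own code, using scipy), the fitted model above is just an average of $n$ Gaussian bumps $\mathcal{N}(x; x^{(i)}, \delta)$ with $\delta = 0.75$, one centered at each training point, which we can compare directly against `kde.score_samples`:
from scipy.stats import norm
# average of n Gaussian densities, one centered at each data point, with std = bandwidth
manual_density = norm.pdf(x_ticks, loc=X.flatten(), scale=0.75).mean(axis=1)
print(np.max(np.abs(manual_density - np.exp(log_density))))  # should be (close to) zero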
In principle, kernel density estimation also works in higher dimensions.
However, the number of datapoints needed for a good fit increases exponentially with the dimension, which limits the applicability of this model in high dimensions.
Each kernel has a notion of "bandwidth" $\delta$. This is a hyperparameter that controls the "smoothness" of the fit.
Let's illustrate how the bandwidth affects smoothness via an example.
from sklearn.neighbors import KernelDensity
kde1 = KernelDensity(kernel='gaussian', bandwidth=3).fit(X) # fit a KDE model
kde2 = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X) # fit a KDE model
fig, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(12,4))
ax[0].fill(x_ticks[:, 0], np.exp(kde1.score_samples(x_ticks))) # plot the density estimate
ax[1].fill(x_ticks[:, 0], np.exp(kde2.score_samples(x_ticks))) # plot the density estimate
ax[0].set_title('Bandwidth Too High')
ax[1].set_title('Bandwidth Too Low')
for axi in ax.ravel():
axi.plot(X[:, 0], np.full(X.shape[0], -0.01), '.k', markersize=10) # plot the points in X
axi.set_xlim(-4, 9)
axi.set_ylim(-0.02, 0.4)
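In practice, the bandwidth is usually chosen by cross-validation on held-out log-likelihood. Below is a minimal sketch using scikit-learn's `GridSearchCV` (the grid of candidate bandwidths is our own choice); this works because `KernelDensity.score` returns the total log-likelihood of held-out data.
from sklearn.model_selection import GridSearchCV
# score each candidate bandwidth by cross-validated held-out log-likelihood
params = {'bandwidth': np.linspace(0.1, 3.0, 30)}
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(X)
print(grid.best_params_)  # the bandwidth with the highest held-out log-likelihood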
Pros: KDE is non-parametric: it can fit complex, multi-modal distributions without assuming a particular parametric form, and with a suitable kernel the resulting estimate is smooth.
Cons: the model must store (and scan) the entire training set to evaluate the density at a new point, it scales poorly to high-dimensional data, and the fit is sensitive to the choice of kernel and bandwidth.
The probabilistic models we have seen so far often need to approximate complex distributions.
In order to make our models more expressive, we introduce additional structure in the form of latent variables.
An unsupervised probabilistic model is a probability distribution $$P(x) : \mathcal{X} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}$.
Probabilistic models also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x) : \mathcal{X} \to [0,1].$$
In maximum likelihood learning, we maximize the empirical log-likelihood $$ \max_{P_\theta} \hspace{2mm} \frac{1}{|\cd|}\sum_{\x\in \cd} \log P_\theta(\x), $$ where $\mathcal{D} = \{x^{(i)} \mid i = 1,2,...,n\}$ is a dataset of independent and identically distributed (IID) samples from $P_\text{data}$.
Consider the following dataset of human faces. The images are not labeled, but they clearly vary along a number of hidden factors of variation, such as pose, lighting, or hair color.
Idea: explicitly model these factors using latent variables $z$.
A latent-variable model is a probability distribution $$P_\theta(x, z) : \mathcal{X} \times \mathcal{Z} \to [0,1]$$
containing two sets of variables: observed variables $x \in \mathcal{X}$ (the data, e.g., pixel values) and latent variables $z \in \mathcal{Z}$, which are never observed directly but capture hidden structure in the data.
This model defines a marginal distribution $P_\theta(x) = \sum_{z \in \mathcal{Z}} P_\theta(x,z)$ that can approximate the data distribution $P_\text{data}(x)$.
Consider the following example of latent variables for the faces dataset.
Only the shaded variables $x$ are observed in the data (the pixel values); the latent variables $z$ correspond to high-level features.
A mixture of Gaussians is a probability distribution $P_\theta(x,z)$ that factorizes into two components: a categorical prior $P_\theta(z=k) = \phi_k$ over $K$ possible cluster assignments $z$, and a Gaussian $P_\theta(x \mid z=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$ for each cluster $k$.
Thus, the marginal distribution $P_\theta(x)$ is a mixture of $K$ Gaussians: $$P_\theta(x) = \sum_{k=1}^K P_\theta(z=k) P_\theta(x|z=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$$
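As a concrete sketch, the 1D dataset from the density estimation example above was generated from exactly such a model with $K=2$, $\phi = (0.3, 0.7)$, $\mu = (0, 5)$ and unit variances. Below we sample from this mixture by ancestral sampling (first $z$, then $x \mid z$) and evaluate its marginal density; the variable names are our own.
from scipy.stats import norm
phi = np.array([0.3, 0.7])    # mixing weights P(z = k)
mu = np.array([0.0, 5.0])     # component means
sigma = np.array([1.0, 1.0])  # component standard deviations
# ancestral sampling: draw z ~ Categorical(phi), then x ~ N(mu_z, sigma_z^2)
z = np.random.choice(2, size=1000, p=phi)
x = np.random.normal(mu[z], sigma[z])
# the marginal P(x) is the weighted sum of the component densities
mixture_density = lambda x: np.sum(phi * norm.pdf(x[:, None], mu, sigma), axis=1)
print(mixture_density(np.array([0.0, 2.5, 5.0])))  # high near the two modes, low in between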
Mixtures of Gaussians fit more complex distributions than one Gaussian.
(Figure: the raw data, a single Gaussian fit, and a mixture-of-Gaussians fit.)
An important reason for using LVMs is that they are more expressive models.
Given $P_\theta(x,z)$ we can compute $P_\theta(z|x)$ to find useful latent representations.
Latent variables are also useful to identify clusters in the data.
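Continuing the sketch above, the posterior over the cluster assignment follows from Bayes' rule, $$P_\theta(z=k \mid x) = \frac{\phi_k \, \mathcal{N}(x; \mu_k, \Sigma_k)}{\sum_{j=1}^K \phi_j \, \mathcal{N}(x; \mu_j, \Sigma_j)},$$ and gives a soft clustering of each point:
def posterior_z(x):
    # joint P(x, z=k) = prior times likelihood, for each component k
    joint = phi * norm.pdf(x[:, None], mu, sigma)    # shape (num_points, K)
    return joint / joint.sum(axis=1, keepdims=True)  # normalize over k to get P(z=k | x)
print(posterior_z(np.array([0.0, 2.5, 5.0])).round(3))
# points near 0 are assigned to the first component, points near 5 to the second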
We can learn latent variable models using maximum likelihood: $$ \sum_{\x\in \cd} \log P_\theta(\x) = \sum_{\x\in \cd} \log \sum_{z \in \mathcal{Z}} P_\theta(\x, z) $$
However, this objective is almost always intractable to optimize directly: the sum over $z$ appears inside the logarithm, and it may range over exponentially many (or even continuous) configurations of $z$.
In practice, we need to compute the likelihood objective (and its gradients) approximately.
Latent-variable models are an important class of machine learning models.
They also have drawbacks: the likelihood objective is intractable to compute exactly, so learning and inference require approximations, and the resulting optimization problems are typically harder than those for fully observed models.