Clustering is a common unsupervised learning problem with numerous applications.
We will start by defining the problem and outlining some models for it.
We have a dataset without labels. Our goal is to learn something interesting about the structure of the data.
We will assume that the dataset is sampled from a probability distribution $P_\text{data}$, which we will call the data distribution. We will denote this as $$x \sim P_\text{data}.$$
The dataset $\mathcal{D} = \{x^{(i)} \mid i = 1,2,\ldots,n\}$ consists of independent and identically distributed (IID) samples from $P_\text{data}$.
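As a concrete (and purely illustrative) setup, the sketch below draws an unlabeled IID dataset from a hypothetical $P_\text{data}$, here a mixture of three 2D Gaussians; the means, noise scale, and sample size are assumptions made for the example, and the latent component labels are discarded so that only the samples $x^{(i)}$ remain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # number of IID samples

# Hypothetical parameters of P_data: three well-separated components.
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

# Draw a latent component for each sample, then throw the labels away.
components = rng.integers(low=0, high=3, size=n)
X = means[components] + rng.normal(scale=0.8, size=(n, 2))

print(X.shape)  # (300, 2): the dataset D = {x^(1), ..., x^(n)}, with no labels
```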
Clustering is the problem of identifying distinct components in the data distribution.
$K$-Means is the simplest example of a clustering algorithm.
We seek centroids $c_k$ such that the distance between the points and their closest centroid is minimized: $$J(\theta) = \sum_{i=1}^n || x^{(i)} - \text{centroid}(f_\theta(x^{(i)})) ||,$$ where $f_\theta(x^{(i)})$ denotes the cluster to which $x^{(i)}$ is assigned and $\text{centroid}(k) = c_k$ denotes the centroid of cluster $k$.
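This objective is typically minimized with Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. Below is a minimal NumPy sketch under that scheme; the function name `kmeans`, the iteration count, and the random initialization are illustrative choices, not part of these notes.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch (Lloyd's algorithm): alternate between
    assigning points to their closest centroid and updating centroids."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids c_k by picking K distinct data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: f(x^(i)) = index of the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        for k in range(K):
            if np.any(assignments == k):
                centroids[k] = X[assignments == k].mean(axis=0)
    return centroids, assignments

# Example usage on the synthetic dataset X from the earlier sketch:
# centroids, assignments = kmeans(X, K=3)
```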
This is best illustrated visually (from Wikipedia):