Lecture 16: Unsupervised Learning

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: What is Unsupervised Learning?

Let's start by understanding what unsupervised learning is at a high level, beginning with a dataset and an algorithm.

Unsupervised Learning

We have a dataset without labels. Our goal is to learn something interesting about the structure of the data.

Components of Unsupervised Learning

At a high level, an unsupervised machine learning problem has the following structure:

$$ \text{Dataset} + \text{Algorithm} \to \text{Unsupervised Model} $$

The unsupervised model describes interesting structure in the data. For instance, it can identify interesting hidden clusters.

An Unsupervised Learning Dataset

As a first example of an unsupervised learning dataset, we will use our Iris flower example, but we will discard the labels.

We start by loading this dataset.
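A minimal sketch of the loading step, assuming scikit-learn is installed and that its bundled copy of Iris (`datasets.load_iris`) is the dataset meant here. The variable name `X` is our own choice.

```python
from sklearn import datasets

# Load the Iris flower dataset that ships with scikit-learn.
iris = datasets.load_iris()

# Keep only the feature matrix. The labels in iris.target are
# deliberately left unused, since this is an unsupervised setting.
X = iris.data

print(X.shape)             # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)  # names of the four measurements
```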

We can visualize this dataset in 2D. Note that we are no longer using label information.
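One way to produce such a plot is sketched below. It assumes matplotlib is available and plots the first two features (sepal length and sepal width); which pair of features the original figure used is an assumption on our part.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Plot the first two features against each other (an assumed choice).
# Every point gets the same color because no labels are used.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(X[:, 0], X[:, 1], alpha=0.5)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_title("Iris dataset without labels")
```

In a notebook the figure renders automatically; in a script, call `plt.show()` afterwards.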

An Unsupervised Learning Algorithm

We can use this dataset as input to a popular unsupervised learning algorithm, $K$-means.

Running $K$-means on this dataset identifies three clusters.
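The clustering step could look roughly like the sketch below, which uses scikit-learn's `KMeans`. Fitting on all four features, and the `n_init` and `random_state` values, are assumptions made for reproducibility rather than settings taken from the lecture.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

# Features only; the labels are never passed to the algorithm.
X = datasets.load_iris().data

# Ask K-means for K = 3 clusters. n_init and random_state are
# illustrative choices that make the result repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = model.fit_predict(X)

print(cluster_ids[:10])               # cluster index assigned to the first 10 flowers
print(model.cluster_centers_.shape)   # (3, 4): one centroid per cluster
```

Note that the cluster indices 0, 1, 2 are arbitrary: K-means has no notion of which cluster is which species.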

These clusters correspond closely to the three types of flowers found in the dataset, which we can check against the labels we discarded.
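To check this claim, one option is to cross-tabulate the cluster assignments against the held-out species labels. The sketch below assumes scikit-learn's `contingency_matrix` and `adjusted_rand_score` as the comparison tools; the lecture itself may have compared the two visually instead.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

iris = datasets.load_iris()
X, species = iris.data, iris.target

# Cluster without labels (illustrative settings, fixed for repeatability).
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows are true species, columns are cluster indices.
# The labels are used only here, for evaluation after clustering.
table = contingency_matrix(species, cluster_ids)
print(table)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(species, cluster_ids)
print(f"Adjusted Rand index: {ari:.2f}")
```

If the clusters match the species well, each row of the table has most of its mass in a single column.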

Applications of Unsupervised Learning

Unsupervised learning has numerous applications:

Application: Discovering Structure in Digits

Unsupervised learning can discover structure in digits without any labels.
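As a small illustration, the sketch below projects scikit-learn's bundled 8x8 digits dataset to two dimensions with PCA. Both the dataset and the choice of PCA are assumptions standing in for whatever the original figure used.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 grayscale images of handwritten digits, each flattened to 64 pixels.
digits = load_digits()
X = digits.data

# Project every image to 2 coordinates using only pixel values;
# digits.target (the true digit) is never given to PCA.
embedding = PCA(n_components=2).fit_transform(X)

print(X.shape)          # (1797, 64)
print(embedding.shape)  # (1797, 2)
```

Plotting `embedding` and coloring the points by `digits.target` afterwards is one way to see whether images of the same digit end up near each other.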

Application: DNA Analysis

Dimensionality reduction applied to DNA data reveals the geography of European countries.