Lecture 6: Classification Algorithms

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: Classification

So far, every supervised learning algorithm that we've seen has been an instance of regression.

Next, let's look at some classification algorithms. First, we will define what classification is.

Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

Regression vs. Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

Binary Classification

An important special case of classification is when the number of classes $K=2$.

In this case, we have an instance of a binary classification problem.

Classification Dataset: Iris Flowers

To demonstrate classification algorithms, we are going to use the Iris flower dataset.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

Here is a visualization of this dataset in 3D. Note that we are using the first 3 features (out of 4) in this dataset.

Understanding Classification

How is classification different from regression?

Let's visualize our Iris dataset to see this. Note that we are using the first 2 features in this dataset.

Let's train a classification algorithm on this data.

Below, we see the regions predicted to be associated with the blue and non-blue classes; the line between them is the decision boundary.

Part 2: Nearest Neighbors

Previously, we have seen what defines a classification problem. Let's now look at our first classification algorithm.

Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

A Simple Classification Algorithm: Nearest Neighbors

Suppose we are given a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. At inference time, we receive a query point $x'$ and we want to predict its label $y'$.

A really simple but surprisingly effective way of returning $y'$ is the nearest neighbors approach.

In the example below on the Iris dataset, the red cross denotes the query $x'$. The closest class to it is "Virginica". (We're only using the first two features in the dataset for simplicity.)

Choosing a Distance Function

How do we select the point $x$ that is the closest to the query point $x'$? There are many options, including the Euclidean distance $d(x, x') = \|x - x'\|_2$ and, more generally, the Minkowski distance $d(x, x') = \left(\sum_{j=1}^d |x_j - x'_j|^p\right)^{1/p}$.

Let's apply Nearest Neighbors to the above dataset using the Euclidean distance (or equivalently, Minkowski with $p=2$).
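As a rough sketch of how this might look in code (using sklearn's KNeighborsClassifier with the first two Iris features, as in the plots above; the query point is made up for illustration):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# first two features of the Iris dataset, as in the visualization above
iris = load_iris()
X, y = iris.data[:, :2], iris.target

# 1-Nearest Neighbor with the Euclidean distance (Minkowski with p=2)
knn = KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
knn.fit(X, y)

# predict the class of a hypothetical query point x'
print(knn.predict([[5.0, 3.5]]))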

In the above example, the regions of the 2D space that are assigned to each class are highly irregular. In areas where the two classes overlap, the predicted class flips back and forth, depending on which training point happens to be closest to the query.

K-Nearest Neighbors

Intuitively, we expect the true decision boundary to be smooth. Therefore, we average the targets of the $K$ nearest neighbors at a query point.

The consensus $y_\mathcal{N}$ can be determined by voting, weighted average, etc.

Let's look at $K$-Nearest Neighbors with a neighborhood of size $K=30$. The decision boundary is much smoother than before.

Review: Data Distribution

We will assume that the dataset is governed by a probability distribution $\mathbb{P}$, which we will call the data distribution. We will denote this as $$ x, y \sim \mathbb{P}. $$

The training set $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

KNN Estimates Data Distribution

Suppose that the output $y'$ of KNN is the average target in the neighborhood $\mathcal{N}(x')$ around the query $x'$. Observe that we can write: $$y' = \frac{1}{K} \sum_{(x, y) \in \mathcal{N}(x')} y \approx \mathbb{E}[y \mid x'].$$
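As a minimal numpy sketch of this estimate (knn_average is our own name; X_train, y_train, and x_query stand for the training arrays and the query point, and K = 30 is an arbitrary choice):

import numpy as np

def knn_average(X_train, y_train, x_query, K=30):
    # distances from the query to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the K closest training points
    neighbors = np.argsort(dists)[:K]
    # average of their targets, approximating E[y | x']
    return y_train[neighbors].mean()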

Algorithm: K-Nearest Neighbors

Pros and Cons of KNN

Pros:

  * Simple and intuitive; there is no training step.
  * Can represent complex, highly non-linear decision boundaries.

Cons:

  * Requires storing (and searching) the entire training set at inference time.
  * Predictions can be slow, and performance tends to degrade in high dimensions.

Part 3: Non-Parametric Models

Nearest neighbors is the first example of an important type of machine learning algorithm called a non-parametric model.

Review: Supervised Learning Model

We'll say that a model is a function $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

Often, models have parameters $\theta \in \Theta$ living in a set $\Theta$. We will then write the model as $$ f_\theta : \mathcal{X} \to \mathcal{Y} $$ to denote that it's parametrized by $\theta$.

Review: K-Nearest Neighbors

Suppose we are given a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. At inference time, we receive a query point $x'$ and we want to predict its label $y'$.

The consensus $y_\mathcal{N}$ can be determined by voting, weighted average, etc.

Non-Parametric Models

Nearest neighbors is an example of a non-parametric model. Whether a model is parametric or non-parametric is a key distinguishing characteristic for machine learning models.

A parametric model $f_\theta : \mathcal{X} \to \mathcal{Y}$ is defined by a finite set of parameters $\theta \in \Theta$ whose dimensionality is constant with respect to the size of the dataset. Linear models of the form $$ f_\theta(x) = \theta^\top x $$ are an example of a parametric model.

In a non-parametric model, the function $f$ uses the entire training dataset (or a post-processed version of it) to make predictions, as in $K$-Nearest Neighbors. In other words, the complexity of the model increases with dataset size.

Non-parametric models have the advantage of not losing any information at training time. However, they are also computationally less tractable and may easily overfit the training set.

Algorithm: K-Nearest Neighbors

Part 4: Logistic Regression

Next, we are going to see a simple parametric classification algorithm that addresses many of these limitations of Nearest Neighbors.

Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

Binary Classification and the Iris Dataset

We are going to start by looking at binary (two-class) classification.

To keep things simple, we will use the Iris dataset. We will be distinguishing class 0 (Iris Setosa) from the other two classes.

Review: Least Squares

Recall that the linear regression algorithm fits a linear model of the form $$ f(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x. $$

It minimizes the mean squared error (MSE) $$J(\theta)= \frac{1}{2n} \sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2$$ on a dataset $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We could also use the above model for a classification problem in which $\mathcal{Y} = \{0, 1\}$.

Least squares returns an acceptable decision boundary on this dataset. However, it is problematic for a few reasons.

The Logistic Function

To address this problem, we will look at a different hypothesis class. We will choose models of the form: $$ f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$ where $$ \sigma(z) = \frac{1}{1 + \exp(-z)} $$ is known as the sigmoid or logistic function.

The logistic function $\sigma : \mathbb{R} \to [0,1]$ "squeezes" points from the real line into $[0,1]$.

The Logistic Function: Properties

The sigmoid function is defined as $$ \sigma(z) = \frac{1}{1 + \exp(-z)}. $$ A few observations: $\sigma(z) \to 1$ as $z \to \infty$ and $\sigma(z) \to 0$ as $z \to -\infty$; $\sigma(0) = 1/2$; and its derivative is $\sigma'(z) = \sigma(z)(1-\sigma(z))$, a fact we will use shortly.

Let's implement our model using the sigmoid function.
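A minimal numpy version might look as follows (the names sigmoid and f are ours, not from a library):

import numpy as np

def sigmoid(z):
    # the logistic function: squeezes any real number into [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

def f(X, theta):
    # model f(x) = sigma(theta^T x), applied to each row of the feature matrix X
    return sigmoid(X @ theta)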

Review: Probabilistic Least Squares

Recall that least squares can be interpreted as fitting a Gaussian probabilistic model $$ p(y | x; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y - \theta^\top x)^2}{2 \sigma^2} \right).$$

The log-likelihood of this model at a point $(x,y)$ equals \begin{align*} \log L(\theta) = \log p(y | x; \theta) = \text{const}_1 \cdot (y - \theta^\top x)^2 + \text{const}_2 \end{align*} for some constants $\text{const}_1, \text{const}_2$.

Least squares thus amounts to fitting a Gaussian $\mathcal{N}(y; \mu(x), \sigma)$ with a standard deviation $\sigma$ of one and a mean of $\mu(x) = \theta^\top x$.

A Probabilistic Approach to Classification

We can take this probabilistic perspective to derive a new algorithm for binary classification.

We will start by using our logistic model to parametrize a probability distribution as follows: \begin{align*} p(y=1 | x;\theta) & = \sigma(\theta^\top x) \\ p(y=0 | x;\theta) & = 1-\sigma(\theta^\top x). \end{align*} A probability over $y\in \{0,1\}$ of the form $P(y=1) = p$ is called Bernoulli.

Note that we can write this more compactly as \begin{align*} p(y | x;\theta) = \sigma(\theta^\top x)^y \cdot (1-\sigma(\theta^\top x))^{1-y} \end{align*}

Review: Conditional Maximum Likelihood

A general approach to fitting conditional models of the form $P_\theta(y|x)$ is to minimize the expected KL divergence with respect to the data distribution: $$ \min_\theta \mathbb{E}_{x \sim \mathbb{P}_\text{data}} \left[ D(P_\text{data}(y|x) \mid\mid P_\theta(y|x)) \right]. $$

With a bit of math, we can show that the maximum likelihood objective becomes $$ \max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(y|x). $$ This is the principle of conditional maximum likelihood.
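The bit of math is short: the expected KL divergence splits into a term that does not depend on $\theta$ and the negated expected log-likelihood, $$ \mathbb{E}_{x \sim \mathbb{P}_\text{data}} \left[ D(P_\text{data}(y|x) \mid\mid P_\theta(y|x)) \right] = \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log \frac{P_\text{data}(y|x)}{P_\theta(y|x)} = \underbrace{\mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\text{data}(y|x)}_{\text{does not depend on } \theta} - \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(y|x), $$ so minimizing the left-hand side over $\theta$ is equivalent to maximizing $\mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(y|x)$.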

Applying Maximum Likelihood

Following the principle of maximum likelihood, we want to optimize the following objective defined over a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. \begin{align*} L(\theta) & = \prod_{i=1}^n p(y^{(i)} \mid x^{(i)} ; \theta) \\ & = \prod_{i=1}^n \sigma(\theta^\top x^{(i)})^{y^{(i)}} \cdot (1-\sigma(\theta^\top x^{(i)}))^{1-y^{(i)}}. \end{align*}

The negative log of this objective is also often called the log-loss, or cross-entropy.

This model and objective function define logistic regression, one of the most widely used classification algorithms (the name "regression" is an unfortunate misnomer!).

Let's implement the likelihood objective.
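A minimal numpy sketch of this objective, implemented in log space for numerical stability (assuming a feature matrix X and a 0/1 label vector y; the small epsilon is our own addition to avoid log(0)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # log L(theta) = sum_i [ y_i * log p_i + (1 - y_i) * log(1 - p_i) ], with p_i = sigma(theta^T x_i)
    p, eps = sigmoid(X @ theta), 1e-12
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))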

Review: Gradient Descent

If we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update: $$ \theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}). $$

As code, this method may look as follows:

theta, theta_prev = random_initialization()              # two (distinct) random starting points
while norm(theta - theta_prev) > convergence_threshold:  # stop once the parameters stop changing
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)  # step against the gradient of J

Derivatives of the Log-Likelihood

Let's work out the gradient for our log likelihood objective:

\begin{align*} & \frac{\partial \log L(\theta)}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \log \left( \sigma(\theta^\top x)^y \cdot (1-\sigma(\theta^\top x))^{1-y} \right) \\ & = \left( y\cdot \frac{1}{\sigma(\theta^\top x)} - (1-y) \frac{1}{1-\sigma(\theta^\top x)} \right) \frac{\partial}{\partial \theta_j} \sigma(\theta^\top x) \\ & = \left( y\cdot \frac{1}{\sigma(\theta^\top x)} - (1-y) \frac{1}{1-\sigma(\theta^\top x)} \right) \sigma(\theta^\top x) (1-\sigma(\theta^\top x)) \frac{\partial}{\partial \theta_j} \theta^\top x \\ & = \left( y\cdot (1-\sigma(\theta^\top x)) - (1-y) \sigma(\theta^\top x) \right) x_j \\ & = \left( y - f_\theta(x) \right) x_j. \end{align*}

Gradient of the Log-Likelihood

Using the above expression, we obtain the following gradient of the log-likelihood at a single training example $(x, y)$: \begin{align*} \nabla_\theta \log L (\theta) = \left( y - f_\theta(x) \right) \cdot \bf{x}. \end{align*}

Let's implement the gradient.
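A matching numpy sketch of the gradient, summed over a whole dataset (same assumed X, y, and sigmoid as above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    # gradient of the log-likelihood: sum_i (y_i - sigma(theta^T x_i)) * x_i
    return X.T @ (y - sigmoid(X @ theta))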

Gradient Descent for Logistic Regression

Putting this together, we obtain a complete learning algorithm, logistic regression.

theta, theta_prev = random_initialization()            # two (distinct) random starting points
while abs(J(theta) - J(theta_prev)) > conv_threshold:  # stop once the objective stops changing
    theta_prev = theta
    # parameter update using the gradient (y - f_theta(x)) * x derived above
    theta = theta_prev - step_size * (f(x, theta_prev) - y) * x

Let's implement this algorithm.
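For concreteness, here is one way the full loop could look with numpy on the binary Iris task from above (class 0 vs. the rest, first two features plus an intercept); the step size and iteration count are arbitrary choices, not the lecture's settings:

import numpy as np
from sklearn.datasets import load_iris

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# binary Iris task: class 0 (Setosa) vs. the rest, first two features plus an intercept column
iris = load_iris()
X = np.hstack([np.ones((len(iris.data), 1)), iris.data[:, :2]])
y = (iris.target == 0).astype(float)

theta, step_size = np.zeros(X.shape[1]), 0.1
for _ in range(5000):
    # gradient ascent on the log-likelihood: theta <- theta + alpha * mean_i (y_i - f(x_i)) x_i
    theta = theta + step_size * X.T @ (y - sigmoid(X @ theta)) / len(y)

train_accuracy = ((sigmoid(X @ theta) > 0.5) == y).mean()
print(theta, train_accuracy)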

Let's now visualize the result.

This is how we would use the algorithm via sklearn.
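A sketch of what this might look like (the exact arguments in the lecture notebook may differ; here we reuse the same binary task):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data[:, :2], (iris.target == 0).astype(int)  # class 0 vs. the rest

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:5]))        # hard class labels
print(clf.predict_proba(X[:5]))  # class probabilities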

Observations About Logistic Regression

Algorithm: Logistic Regression

Part 5: Multi-Class Classification

Finally, let's look at an extension of logistic regression to an arbitrary number of classes.

Review: Logistic Regression

Logistic regression fits models of the form: $$ f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$ where $$ \sigma(z) = \frac{1}{1 + \exp(-z)} $$ is known as the sigmoid or logistic function.

Multi-Class Classification

Logistic regression only applies to binary classification problems. What if we have an arbitrary number of classes $K$?

Let's load the full multi-class version of the Iris dataset.
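One possible way to do this (keeping only the first two features, as in the earlier visualizations):

import numpy as np
from sklearn.datasets import load_iris

# all three Iris classes; only the first two features are kept for easy visualization
iris = load_iris()
X, y = iris.data[:, :2], iris.target
print(X.shape, np.unique(y))  # (150, 2) [0 1 2]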

The Softmax Function

The logistic function $\sigma : \mathbb{R} \to [0,1]$ can be seen as mapping an input $z \in \mathbb{R}$ to a probability.

Its multi-class extension $\vec \sigma : \mathbb{R}^K \to [0,1]^K$ maps a $K$-dimensional input $\vec z \in \mathbb{R}^K$ to a $K$-dimensional vector of probabilities.

Each component of $\vec \sigma(\vec z)$ is defined as $$ \sigma(\vec z)_k = \frac{\exp(z_k)}{\sum_{l=1}^K \exp(z_l)}. $$ We call this the softmax function.

When $K=2$, this looks as follows: $$ \sigma(\vec z)_1 = \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)}. $$

We can assume that $z_1 = 0$ (so that $\exp(z_1) = 1$) without loss of generality, since dividing the numerator and denominator by $\exp(z_1)$ doesn't change any of the probabilities. Thus we obtain: $$ \sigma(\vec z)_1 = \frac{1}{1 + \exp(z_2)}. $$

This is essentially our sigmoid function. Hence softmax generalizes the sigmoid function.
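A minimal numpy sketch of the softmax function (subtracting the maximum before exponentiating is a standard numerical-stability trick; by the same invariance argument as above, it does not change the output):

import numpy as np

def softmax(z):
    # softmax(z)_k = exp(z_k) / sum_l exp(z_l); shifting by max(z) avoids overflow
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # a probability vector that sums to 1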

The Softmax Model

We can use the softmax function to define a $K$-class classification model.

In the binary classification setting, we mapped weights $\theta$ and features $x$ into a probability as follows: $$ \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$

In the multi-class setting, we define a model $f : \mathcal{X} \to [0,1]^K$ that outputs the probability of class $k$ based on the features $x$ and class-specific weights $\theta_k$: $$ \vec \sigma(\theta^\top x)_k = \frac{\exp(\theta_k^\top x)}{\sum_{l=1}^K \exp(\theta_l^\top x)}, $$ where $\theta$ denotes the collection of class weights $(\theta_1, \theta_2, \ldots, \theta_K)$.

The model's parameters $(\theta_1, \theta_2, \ldots, \theta_K)$ live in $\Theta^{K}$, where $\Theta = \mathbb{R}^d$ is the parameter space of logistic regression.

You may have noticed that this model is slightly over-parametrized: adding the same vector $v$ to every $\theta_k$ results in an equivalent model, since the common factor $\exp(v^\top x)$ cancels between the numerator and denominator. For this reason, it is often assumed that one of the class weights $\theta_l = 0$.
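To make the shapes concrete, here is a small sketch in which the class weights $\theta_k$ are stored as the rows of a $K \times d$ matrix (the names are ours; zeroing the last row illustrates the convention just mentioned):

import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def f(x, Theta):
    # Theta has shape (K, d); row k holds the class-k weights theta_k
    # returns a length-K vector of class probabilities
    return softmax(Theta @ x)

K, d = 3, 4
Theta = np.random.randn(K, d)
Theta[-1] = 0.0  # optional convention: fix one class's weights to zero
print(f(np.ones(d), Theta))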

Softmax Regression

We again take a probabilistic perspective to derive a $K$-class classification algorithm based on this model.

We will start by using our softmax model to parametrize a probability distribution as follows: \begin{align*} p(y=k | x;\theta) & = \vec \sigma(\theta^\top x)_k \end{align*}

This is called a categorical distribution, and it generalizes the Bernoulli.

Following the principle of maximum likelihood, we want to optimize the following objective defined over a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. \begin{align*} L(\theta) & = \prod_{i=1}^n p(y^{(i)} \mid x^{(i)} ; \theta) = \prod_{i=1}^n \vec \sigma(\theta^\top x^{(i)})_{y^{(i)}} \end{align*}

This model and objective function define softmax regression. (The term "regression" here is again a misnomer.)

Let's now apply softmax regression to the Iris dataset by using the implementation from sklearn.
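A sketch of what this might look like (in recent versions of sklearn, LogisticRegression fits a multinomial, i.e. softmax, model by default when there are more than two classes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # three classes, first two features

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))           # training accuracy
print(clf.predict_proba(X[:3]))  # per-class probabilities for a few points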

Algorithm: Softmax Regression