Lecture 10: Dual Formulation of Support Vector Machines

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: Lagrange Duality

In this lecture, we continue looking at Support Vector Machines (SVMs), and define a new formulation of the max-margin problem.

Before we do that, we start with a general concept -- Lagrange duality.

Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Attributes + Features} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

Review: Maximizing the Margin

We saw that maximizing the margin of a linear model amounts to solving the following optimization problem. \begin{align*} \min_{\theta,\theta_0} \frac{1}{2}||\theta||^2 \; & \\ \text{subject to } \; & y^{(i)}((x^{(i)})^\top\theta+\theta_0)\geq 1 \; \text{for all $i$} \end{align*}

We are going to look at a different way of optimizing this objective. But first, we start by defining Lagrange duality.

Constrained Optimization Problems

We will look at constrained optimization problems of the form \begin{align*} \min_{\theta \in \mathbb{R}^d} \; & J(\theta) \\ \text{such that } \; & c_k(\theta) \leq 0 \text{ for $k =1,2,\ldots,K$} \end{align*} where $J(\theta)$ is the optimization objective and each $c_k(\theta) : \mathbb{R}^d \to \mathbb{R}$ is a constraint.

Our goal is to find a $\theta$ that makes $J(\theta)$ as small as possible while keeping every $c_k(\theta)$ non-positive.

Optimization with Penalties

Another way of approaching the above goal is via: $$\min_\theta \mathcal{L}(\theta, \lambda) = J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta)$$ for some non-negative vector of Lagrange multipliers $\lambda \in [0, \infty)^K$. We call $\mathcal{L}(\theta, \lambda)$ the Lagrangian.

Penalties are another way of enforcing constraints.
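As a small illustration of how the penalty term acts, here is a sketch on a hypothetical one-dimensional problem (minimize $\theta^2$ subject to $1 - \theta \leq 0$). The objective, the constraint, and the multiplier values are assumptions chosen only for this example.

```python
def J(theta):
    # Hypothetical toy objective.
    return theta ** 2

def c(theta):
    # Hypothetical toy constraint c(theta) <= 0, i.e. theta >= 1.
    return 1.0 - theta

def lagrangian(theta, lam):
    # L(theta, lambda) = J(theta) + lambda * c(theta), with a single constraint.
    return J(theta) + lam * c(theta)

# A violated constraint (theta = 0.5) is charged a positive penalty ...
penalized = lagrangian(0.5, 4.0)
# ... while a strictly satisfied one (theta = 1.5) contributes a negative term.
rewarded = lagrangian(1.5, 4.0)

# For the multiplier lambda = 2, minimizing L over a grid of theta values
# lands on theta = 1, which is also the solution of the constrained problem.
theta_grid = [i / 100 for i in range(-300, 301)]
theta_best = min(theta_grid, key=lambda t: lagrangian(t, 2.0))
```

For other values of $\lambda$ the minimizer of the Lagrangian moves away from $\theta = 1$, so the choice of multiplier matters.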

Penalties vs. Constraints

Penalties and constraints are closely related. Consider our constrained optimization problem: \begin{align*} \min_{\theta \in \mathbb{R}^d} \; & J(\theta) \\ \text{such that } \; & c_k(\theta) \leq 0 \text{ for $k =1,2,\ldots,K$} \end{align*}

We define its primal Lagrange form to be $$\min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathcal{L}(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \left(J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right)$$

These two forms have the same optimum $\theta^*$!

Why is this true? Consider again $$\min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathcal{L}(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \left(J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right)$$

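To see this, fix $\theta$ and look at the inner maximization over $\lambda \geq 0$. If some constraint is violated ($c_k(\theta) > 0$), we can send $\lambda_k \to \infty$, so $\mathcal{P}(\theta) = \infty$. If every constraint holds ($c_k(\theta) \leq 0$ for all $k$), each term $\lambda_k c_k(\theta)$ is at most zero, the maximum is attained when $\lambda_k c_k(\theta) = 0$ for all $k$, and $\mathcal{P}(\theta) = J(\theta)$.

Thus, $\min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta)$ is the solution to our initial optimization problem.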
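The behaviour of $\mathcal{P}(\theta)$ can be checked on a hypothetical one-dimensional problem (minimize $\theta^2$ subject to $1 - \theta \leq 0$). In this sketch the inner maximum over $\lambda \geq 0$ is written in closed form: it is infinite when the constraint is violated and equals $J(\theta)$ otherwise. The toy problem and the grid are assumptions made for illustration.

```python
import math

def J(theta):
    # Hypothetical toy objective.
    return theta ** 2

def c(theta):
    # Hypothetical toy constraint c(theta) <= 0, i.e. theta >= 1.
    return 1.0 - theta

def primal(theta):
    # max over lambda >= 0 of J(theta) + lambda * c(theta):
    # unbounded if c(theta) > 0, otherwise attained at lambda * c(theta) = 0.
    return math.inf if c(theta) > 0 else J(theta)

# Finite multipliers already show the blow-up at an infeasible point.
growing = [J(0.5) + lam * c(0.5) for lam in (1.0, 10.0, 100.0)]

# Minimizing P over a grid recovers the constrained optimum theta = 1.
theta_grid = [i / 100 for i in range(-300, 301)]
theta_star = min(theta_grid, key=primal)
```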

Lagrange Dual

Now consider the following problem over $\lambda\geq 0$: $$\max_{\lambda \geq 0}\mathcal{D}(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta, \lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left(J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).$$

We call this the Lagrange dual of the primal optimization problem $\min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta)$. We can always construct a dual for the primal.

Lagrange Duality

The dual is interesting because we always have: $$ \max_{\lambda \geq 0}\mathcal{D}(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta, \lambda) \leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathcal{L}(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta)$$

Moreover, in many interesting cases, we have $$ \max_{\lambda \geq 0}\mathcal{D}(\lambda) = \min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta).$$ Thus, the primal and the dual are equivalent!
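Both statements can be checked numerically on a hypothetical toy problem (minimize $\theta^2$ subject to $1 - \theta \leq 0$). The sketch below approximates the dual function by a grid search over $\theta$ and then maximizes over a grid of $\lambda$ values; the problem and both grids are assumptions chosen for illustration.

```python
def lagrangian(theta, lam):
    # Hypothetical toy problem: minimize theta^2 subject to 1 - theta <= 0.
    return theta ** 2 + lam * (1.0 - theta)

theta_grid = [i / 100 for i in range(-300, 301)]
lam_grid = [i / 10 for i in range(0, 51)]

# Dual function D(lambda): inner minimization over theta (on the grid).
dual_values = [min(lagrangian(t, lam) for t in theta_grid) for lam in lam_grid]
dual_opt = max(dual_values)
lam_star = lam_grid[dual_values.index(dual_opt)]

# Primal optimum: smallest objective value among feasible grid points.
primal_opt = min(t ** 2 for t in theta_grid if 1.0 - t <= 0)
```

On this convex problem every dual value stays below the primal optimum, and the best dual value matches it.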

Example: Regularization

Consider a regularized supervised learning problem with a penalty term: $$ \min_{\theta \in \Theta} L(\theta) + \lambda \cdot R(\theta). $$

We may also enforce an explicit constraint on the complexity of the model: \begin{align*} \min_{\theta \in \Theta} \; & L(\theta) \\ \text{such that } \; & R(\theta) \leq \lambda' \end{align*} We will not prove this, but solving this problem is equivalent to solving the penalized problem for some $\lambda > 0$, which is in general a different number from $\lambda'$.

In other words, we can regularize by explicitly enforcing $R(\theta)$ to be less than a value or we can penalize $R(\theta)$.
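A one-dimensional sketch of this correspondence, using a hypothetical loss $L(\theta) = (\theta - 3)^2$ and regularizer $R(\theta) = \theta^2$ (both assumptions made for illustration): the penalized problem with $\lambda = 2$ and the constrained problem with $\lambda' = 1$ select the same $\theta$.

```python
def loss(theta):
    # Hypothetical toy loss L(theta).
    return (theta - 3.0) ** 2

def reg(theta):
    # Hypothetical toy regularizer R(theta).
    return theta ** 2

theta_grid = [i / 100 for i in range(-500, 501)]

# Penalized form: minimize L(theta) + lambda * R(theta).
lam = 2.0
theta_penalized = min(theta_grid, key=lambda t: loss(t) + lam * reg(t))

# Constrained form: minimize L(theta) subject to R(theta) <= lambda'.
# Here lambda' is read off from the penalized solution.
lam_prime = reg(theta_penalized)
theta_constrained = min((t for t in theta_grid if reg(t) <= lam_prime), key=loss)
```

Note that the matching values $\lambda = 2$ and $\lambda' = 1$ are not the same number.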

We are now going to see another application of Lagrangians in the context of SVMs.

Part 2: Dual Formulation of SVMs

Let's now apply Lagrange duality to support vector machines.

Review: Binary Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Binary Classification: The target variable $y$ is discrete and takes on one of $K=2$ possible values.

In this lecture, we assume $\mathcal{Y} = \{-1, +1\}$.

Review: Linear Model Family

In this lecture, we will work with linear models of the form: \begin{align*} f_\theta(x) & = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + ... + \theta_d \cdot x_d \end{align*} where $x \in \mathbb{R}^d$ is a vector of features and $y \in \{-1, 1\}$ is the target. The $\theta_j$ are the parameters of the model.

We can represent the model in a vectorized form \begin{align*} f_\theta(x) = \theta^\top x + \theta_0. \end{align*}

Review: Geometric Margin

We define the geometric margin $\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$ as $$ \gamma^{(i)} = y^{(i)}\left( \frac{\theta^\top x^{(i)} + \theta_0}{||\theta||} \right). $$ This also corresponds to the distance from $x^{(i)}$ to the hyperplane.
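The formula translates directly into code. The following is a minimal sketch; the parameter values and the two data points are hypothetical.

```python
import math

def geometric_margin(theta, theta0, x, y):
    # y * (theta^T x + theta_0) / ||theta||
    score = sum(t * xi for t, xi in zip(theta, x)) + theta0
    norm = math.sqrt(sum(t * t for t in theta))
    return y * score / norm

theta, theta0 = (3.0, 4.0), -5.0   # hypothetical hyperplane, ||theta|| = 5

margin_pos = geometric_margin(theta, theta0, (3.0, 4.0), +1)   # (9 + 16 - 5) / 5
margin_neg = geometric_margin(theta, theta0, (0.0, 0.0), -1)   # -(-5) / 5
margin_bad = geometric_margin(theta, theta0, (0.0, 0.0), +1)   # wrong side of the hyperplane
```

A correctly classified point has a positive margin, and a misclassified point has a negative one.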

Review: Maximizing the Margin

We saw that maximizing the margin of a linear model amounts to solving the following optimization problem. \begin{align*} \min_{\theta,\theta_0} \frac{1}{2}||\theta||^2 \; & \\ \text{subject to } \; & y^{(i)}((x^{(i)})^\top\theta+\theta_0)\geq 1 \; \text{for all $i$} \end{align*}

We are now going to optimize this objective in a different way, using Lagrange duality.

Review: Penalties vs. Constraints

Penalties and constraints are closely related. Consider our constrained optimization problem: \begin{align*} \min_{\theta \in \mathbb{R}^d} \; & J(\theta) \\ \text{such that } \; & c_k(\theta) \leq 0 \text{ for $k =1,2,\ldots,K$} \end{align*}

We define its primal Lagrange form to be $$\min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathcal{L}(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \left(J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right)$$

These two forms have the same optimum $\theta^*$!

The Lagrangian of the SVM Problem

Consider the following objective, the Lagrangian of the max-margin optimization problem.

\begin{align*} L(\theta, \theta_0, \lambda) = \frac{1}{2}||\theta||^2 + \sum_{i=1}^n \lambda_i \left(1-y^{(i)}((x^{(i)})^\top\theta+\theta_0)\right) \end{align*}

Intuitively, we have rewritten each constraint as $1-y^{(i)}((x^{(i)})^\top\theta+\theta_0) \leq 0$, put it inside the objective function, and weighted it by a penalty $\lambda_i$.

Review: Lagrange Dual

Consider the following problem over $\lambda\geq 0$: $$\max_{\lambda \geq 0}\mathcal{D}(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta, \lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \left(J(\theta) + \sum_{k=1}^K \lambda_k c_k(\theta) \right).$$

We call this the Lagrange dual of the primal optimization problem $\min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta)$. We can always construct a dual for the primal.

The Dual of the SVM Problem

Consider optimizing the above Lagrangian over $\theta, \theta_0$ for any value of $\lambda$. $$\min_{\theta, \theta_0} L(\theta, \theta_0, \lambda) = \min_{\theta, \theta_0} \left( \frac{1}{2}||\theta||^2 + \sum_{i=1}^n \lambda_i \left(1-y^{(i)}((x^{(i)})^\top\theta+\theta_0)\right)\right)$$ This objective is quadratic in $\theta$; hence it has a single minimum in $\theta$.

We can find it by setting the derivative to zero and solving for $\theta, \theta_0$. This yields: \begin{align*} \theta & = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)} \\ 0 & = \sum_{i=1}^n \lambda_i y^{(i)} \end{align*}
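Spelling out this step, the derivatives of the Lagrangian with respect to $\theta$ and $\theta_0$ are \begin{align*} \nabla_\theta L(\theta, \theta_0, \lambda) & = \theta - \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)} \\ \frac{\partial}{\partial \theta_0} L(\theta, \theta_0, \lambda) & = -\sum_{i=1}^n \lambda_i y^{(i)} \end{align*} and setting both to zero gives the two conditions above. With $\theta = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}$, the individual terms of the Lagrangian become \begin{align*} \frac{1}{2}||\theta||^2 & = \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \sum_{i=1}^n \lambda_i y^{(i)} (x^{(i)})^\top\theta & = \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \theta_0 \sum_{i=1}^n \lambda_i y^{(i)} & = 0 \end{align*} so the quadratic terms combine with an overall factor of $-\frac{1}{2}$ and $\theta_0$ drops out.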

Substituting this into the Lagrangian we obtain: \begin{align*} L(\lambda) = \min_{\theta, \theta_0} L(\theta, \theta_0, \lambda) & = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \end{align*} as well as $0 = \sum_{i=1}^n \lambda_i y^{(i)}$ and $\lambda_i \geq 0$ for all $i$.

Maximizing this expression over $\lambda$ gives the following form of the dual $\max_{\lambda\geq 0} \mathcal{D}(\lambda) = \max_{\lambda\geq 0} \min_{\theta, \theta_0} L(\theta, \theta_0, \lambda)$: \begin{align*} \max_{\lambda} & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \text{subject to } \; & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\ & \lambda_i \geq 0 \; \text{for all $i$} \end{align*}
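To make the dual objective concrete, here is a sketch that evaluates it on a hypothetical two-point dataset, $x^{(1)} = (1, 0)$ with $y^{(1)} = +1$ and $x^{(2)} = (-1, 0)$ with $y^{(2)} = -1$. The dataset and the grid search are assumptions for illustration. The equality constraint forces $\lambda_1 = \lambda_2$, so a one-dimensional scan suffices.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(lam, X, y):
    # sum_i lam_i - 0.5 * sum_i sum_k lam_i lam_k y_i y_k x_i^T x_k
    n = len(X)
    quad = sum(lam[i] * lam[k] * y[i] * y[k] * dot(X[i], X[k])
               for i in range(n) for k in range(n))
    return sum(lam) - 0.5 * quad

X = [(1.0, 0.0), (-1.0, 0.0)]   # hypothetical toy inputs
y = [1, -1]

# sum_i lam_i y_i = 0 means lam_1 = lam_2 = a, so we scan over a >= 0.
grid = [i / 100 for i in range(0, 201)]
best_a = max(grid, key=lambda a: dual_objective([a, a], X, y))
best_value = dual_objective([best_a, best_a], X, y)

# The corresponding primal solution is theta = (1, 0), with objective 0.5 * ||theta||^2.
primal_value = 0.5 * dot((1.0, 0.0), (1.0, 0.0))
```

On this example the best dual value coincides with the primal objective at the max-margin solution.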

Lagrange Duality in SVMs

Recall that in general, we have: $$ \max_{\lambda \geq 0}\mathcal{D}(\lambda) = \max_{\lambda \geq 0} \min_{\theta \in \mathbb{R}^d} \mathcal{L}(\theta, \lambda) \leq \min_{\theta \in \mathbb{R}^d} \max_{\lambda \geq 0} \mathcal{L}(\theta, \lambda) = \min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta)$$

In the case of the SVM problem, one can show that $$ \max_{\lambda \geq 0}\mathcal{D}(\lambda) = \min_{\theta \in \mathbb{R}^d} \mathcal{P}(\theta).$$ Thus, the primal and the dual are equivalent!

Properties of the Dual

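We can make several observations about the dual \begin{align*} \max_{\lambda} & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \text{subject to } \; & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \;\text{and}\; \lambda_i \geq 0 \; \text{for all $i$} \end{align*}

  1. It is a constrained quadratic optimization problem.
  2. The number of variables $\lambda_i$ equals $n$, the number of data points.
  3. The objective depends on the inputs only through the inner products $(x^{(i)})^\top x^{(k)}$.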

When to Solve the Dual

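When should we be solving the dual or the primal?

The primal has one variable per feature (plus the offset $\theta_0$), while the dual has one variable $\lambda_i$ per training example. With few features and many examples, the primal is the smaller problem; with many features and fewer examples, the dual is.

The dual also depends on the inputs only through the inner products $(x^{(i)})^\top x^{(k)}$. In the next lecture, we will see how we can use this property to solve machine learning problems with a very large number of features (even possibly infinite!).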

Part 3: Practical Considerations for SVM Duals

We continue our discussion of the dual formulation of the SVM with additional practical details about how the dual formulation is defined and used.

Review: Binary Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Binary Classification: The target variable $y$ is discrete and takes on one of $K=2$ possible values.

In this lecture, we assume $\mathcal{Y} = \{-1, +1\}$.

Review: Primal and Dual Formulations

Recall that the max-margin hyperplane can be formulated as the solution to the following primal optimization problem. \begin{align*} \min_{\theta,\theta_0} \frac{1}{2}||\theta||^2 \; & \\ \text{subject to } \; & y^{(i)}((x^{(i)})^\top\theta+\theta_0)\geq 1 \; \text{for all $i$} \end{align*}

The solution to this problem also happens to be given by the following dual problem: \begin{align*} \max_{\lambda} & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \text{subject to } \; & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\ & \lambda_i \geq 0 \; \text{for all $i$} \end{align*}

Review: Non-Separable Problems

Our dual problem assumes that a separating hyperplane exists. However, what if the classes are non-separable? Then the primal constraints cannot all be satisfied, so our optimization problem does not have a solution and we need to modify it.

Our solution is going to be to make each constraint "soft", by introducing "slack" variables, which allow the constraint to be violated. $$ y^{(i)}((x^{(i)})^\top\theta+\theta_0)\geq 1 - \xi_i. $$

In the optimization problem, we assign a penalty $C$ to these slack variables to obtain: \begin{align*} \min_{\theta,\theta_0, \xi}\; & \frac{1}{2}||\theta||^2 + C \sum_{i=1}^n \xi_i \; \\ \text{subject to } \; & y^{(i)}((x^{(i)})^\top\theta+\theta_0)\geq 1 - \xi_i \; \text{for all $i$} \\ & \xi_i \geq 0 \end{align*}
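For fixed $\theta, \theta_0$, the smallest slack that satisfies both constraints is $\xi_i = \max(0, 1 - y^{(i)}((x^{(i)})^\top\theta+\theta_0))$. The sketch below uses this to evaluate the soft-margin objective; the parameters, data points, and value of $C$ are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(theta, theta0, X, y, C):
    # Smallest feasible slack per example: xi_i = max(0, 1 - y_i * (theta^T x_i + theta_0)).
    slacks = [max(0.0, 1.0 - yi * (dot(theta, xi) + theta0)) for xi, yi in zip(X, y)]
    objective = 0.5 * dot(theta, theta) + C * sum(slacks)
    return objective, slacks

theta, theta0 = (1.0, 0.0), 0.0                            # hypothetical parameters
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 0.0)]      # hypothetical inputs
y = [1, 1, -1, -1]

objective, slacks = soft_margin_objective(theta, theta0, X, y, C=1.0)
```

Points outside the margin get zero slack, a point inside the margin gets a slack between 0 and 1, and the misclassified last point gets a slack above 1.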

This is the primal problem. Let's now form its dual.

Non-Separable Dual

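We can also formulate the dual to this problem. First, the Lagrangian $L(\lambda, \mu,\theta,\theta_0,\xi)$ equals \begin{align*} \frac{1}{2}||\theta||^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \lambda_i \left(y^{(i)}((x^{(i)})^\top\theta+\theta_0)- 1 + \xi_i\right) - \sum_{i=1}^n \mu_i\xi_i, \end{align*} where $\lambda_i \geq 0$ are the multipliers of the margin constraints and $\mu_i \geq 0$ are the multipliers of the constraints $\xi_i \geq 0$.

The dual objective of this problem will equal $$\mathcal{D}(\lambda, \mu) = \min_{\theta,\theta_0,\xi} L(\lambda, \mu,\theta,\theta_0,\xi).$$

As earlier, we can solve for the optimal $\theta, \theta_0$ in closed form and plug back the resulting values into the objective. Setting the derivative with respect to $\xi_i$ to zero additionally gives $C - \lambda_i - \mu_i = 0$; since $\mu_i \geq 0$, this forces $\lambda_i \leq C$.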

We can then show that the dual takes the following form: \begin{align*} \max_{\lambda} & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \text{subject to } \; & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \\ & C \geq \lambda_i \geq 0 \; \text{for all $i$} \end{align*}

Coordinate Descent

Coordinate descent is a general way to optimize functions $f(x)$ of multiple variables $x \in \mathbb{R}^d$:

  1. Choose a dimension $j \in \{1,2,\ldots,d\}$.
  2. Optimize $f(x_1, x_2, \ldots, x_j, \ldots, x_d)$ over $x_j$ while keeping the other variables fixed.

Here, we visualize coordinate descent applied to a 2D quadratic function.

The red line shows the trajectory of coordinate descent. Each "step" in the trajectory is an iteration of the algorithm. Image from Wikipedia.
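The two steps can be sketched in code for a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with a symmetric positive definite $A$, where the minimization over a single coordinate has a closed form. The matrix, vector, starting point, and cyclic choice of coordinate below are assumptions made for illustration.

```python
def coordinate_descent(A, b, x0, sweeps=50):
    """Minimize f(x) = 0.5 * x^T A x - b^T x by exact coordinate updates."""
    x = list(x0)
    trajectory = [tuple(x)]
    d = len(x)
    for _ in range(sweeps):
        for j in range(d):  # step 1: choose a dimension (here: cycle through them)
            # Step 2: set df/dx_j = 0 with the other coordinates held fixed.
            rest = sum(A[j][k] * x[k] for k in range(d) if k != j)
            x[j] = (b[j] - rest) / A[j][j]
            trajectory.append(tuple(x))
    return x, trajectory

A = [[2.0, 1.0], [1.0, 2.0]]   # hypothetical 2D quadratic, minimized at (1, 1)
b = [3.0, 3.0]
x_min, path = coordinate_descent(A, b, x0=[-2.0, 3.0])
```

Each entry of `path` differs from the previous one in at most one coordinate, which is what produces the axis-aligned "staircase" trajectory.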

Sequential Minimal Optimization

We can apply a form of coordinate descent to solve the dual: \begin{align*} \max_{\lambda} & \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i=1}^n \sum_{k=1}^n \lambda_i \lambda_k y^{(i)} y^{(k)} (x^{(i)})^\top x^{(k)} \\ \text{subject to } \; & \sum_{i=1}^n \lambda_i y^{(i)} = 0 \;\text{and}\; C \geq \lambda_i \geq 0 \; \text{for all $i$} \end{align*}

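A popular, efficient algorithm is Sequential Minimal Optimization (SMO). Because of the constraint $\sum_{i=1}^n \lambda_i y^{(i)} = 0$, a single $\lambda_i$ cannot change on its own, so SMO updates two variables at a time:

  1. Take a pair $\lambda_i, \lambda_j$, possibly using heuristics to guide the choice of $i, j$.
  2. Reoptimize the objective over $\lambda_i, \lambda_j$ while keeping the other variables fixed.
  3. Repeat the above until convergence.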
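The following is a simplified sketch of an SMO-style pairwise update, not a full SMO implementation: it sweeps over all pairs in a fixed order instead of using selection heuristics, and it uses the standard closed-form two-variable update with clipping to the box $[0, C]$. The function name, the toy dataset, and the value of $C$ are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_sweeps(X, y, C=10.0, sweeps=50, tol=1e-9):
    """Pairwise coordinate ascent on the SVM dual (simplified SMO-style sketch)."""
    n = len(X)
    K = [[dot(X[i], X[k]) for k in range(n)] for i in range(n)]
    lam = [0.0] * n
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            for j in range(i + 1, n):
                # Prediction errors without the offset; theta_0 cancels in g_i - g_j.
                g_i = sum(lam[m] * y[m] * K[m][i] for m in range(n)) - y[i]
                g_j = sum(lam[m] * y[m] * K[m][j] for m in range(n)) - y[j]
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]   # equals ||x_i - x_j||^2
                if eta <= tol:
                    continue
                # Limits that keep both multipliers in [0, C] and sum_i lam_i y_i unchanged.
                if y[i] != y[j]:
                    lo, hi = max(0.0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
                else:
                    lo, hi = max(0.0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
                new_j = min(hi, max(lo, lam[j] + y[j] * (g_i - g_j) / eta))
                if abs(new_j - lam[j]) > tol:
                    lam[i] += y[i] * y[j] * (lam[j] - new_j)
                    lam[j] = new_j
                    changed = True
        if not changed:   # no pair can be improved any further
            break
    return lam

X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]   # hypothetical separable data
y = [1, 1, -1, -1]
lam = smo_sweeps(X, y)
```

On this toy dataset only the two closest points of opposite classes, $(2, 0)$ and $(0, 0)$, end up with non-zero multipliers.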

Obtaining a Primal Solution from the Dual

Next, assuming we can solve the dual, how do we find a separating hyperplane $\theta, \theta_0$?

Recall that we already found an expression for the optimal $\theta^*$ (in the separable case) as a function of $\lambda$: $$ \theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}. $$

Once we know $\theta^*$ it is easy to check that the solution to $\theta_0$ is given by $$ \theta_0^* = -\frac{\max_{i:y^{(i)}=-1} (\theta^*)^\top x^{(i)} + \min_{i:y^{(i)}=+1} (\theta^*)^\top x^{(i)}}{2}. $$ This places the hyperplane halfway between the closest negative and the closest positive example.
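A sketch of this recovery step, taking the maximum over the negative examples and the minimum over the positive ones. The toy dataset and the dual solution `lam` are hypothetical values assumed for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recover_primal(lam, X, y):
    n, d = len(X), len(X[0])
    # theta = sum_i lam_i y_i x_i
    theta = [sum(lam[i] * y[i] * X[i][j] for i in range(n)) for j in range(d)]
    scores = [dot(theta, x) for x in X]
    max_neg = max(s for s, yi in zip(scores, y) if yi == -1)
    min_pos = min(s for s, yi in zip(scores, y) if yi == 1)
    # Offset that puts the hyperplane halfway between the two closest classes' points.
    theta0 = -(max_neg + min_pos) / 2.0
    return theta, theta0

X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]   # hypothetical separable data
y = [1, 1, -1, -1]
lam = [0.5, 0.0, 0.5, 0.0]                              # assumed dual solution for this data

theta, theta0 = recover_primal(lam, X, y)
margins = [yi * (dot(theta, xi) + theta0) for xi, yi in zip(X, y)]
```

The recovered hyperplane satisfies every primal constraint $y^{(i)}((x^{(i)})^\top\theta+\theta_0) \geq 1$ on this data.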

Support Vectors

A powerful property of the SVM dual is that at the optimum, most variables $\lambda_i$ are zero! Thus, $\theta$ is a sum of a small number of points: $$ \theta^* = \sum_{i=1}^n \lambda_i y^{(i)} x^{(i)}. $$

The points for which $\lambda_i > 0$ lie exactly on the margin: they satisfy $y^{(i)}((x^{(i)})^\top\theta+\theta_0) = 1$ and are the points closest to the hyperplane.

These are called support vectors.
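As a small check of this property, the sketch below picks out the support vectors from a dual solution and evaluates the constraint $y^{(i)}((x^{(i)})^\top\theta+\theta_0)$ at each point. The dataset, the dual solution, and the hyperplane are hypothetical values assumed for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]   # hypothetical separable data
y = [1, 1, -1, -1]
lam = [0.5, 0.0, 0.5, 0.0]                              # assumed dual solution
theta, theta0 = (1.0, 0.0), -1.0                        # corresponding hyperplane

# Support vectors: examples with a strictly positive multiplier.
support = [i for i, l in enumerate(lam) if l > 1e-8]

# Only the support vectors contribute to theta = sum_i lam_i y_i x_i.
theta_from_support = [sum(lam[i] * y[i] * X[i][j] for i in support) for j in range(2)]

# Constraint values y_i * (theta^T x_i + theta_0): equal to 1 on the margin.
constraint_values = [yi * (dot(theta, xi) + theta0) for xi, yi in zip(X, y)]
```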

Notation and The Iris Dataset

To demonstrate how to use the dual version of the SVM, we are going to again use the Iris flower dataset.

We will look at the binary classification version of this dataset.

Let's visualize this dataset.

We can run the dual version of the SVM by importing an implementation from sklearn:
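A minimal sketch of such a call, assuming scikit-learn's `SVC` with a linear kernel. The feature subset (first two columns), the Setosa-versus-rest labelling, and the value of `C` are illustrative assumptions. To my understanding, `dual_coef_` stores the products $y^{(i)} \lambda_i$ for the support vectors only.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[:, :2]                      # assumption: use the first two features
y = np.where(iris.target == 0, 1, -1)     # assumption: Setosa (+1) vs. the rest (-1)

clf = SVC(kernel="linear", C=1000.0)      # large C: close to the hard-margin problem
clf.fit(X, y)

signed_lam = clf.dual_coef_[0]            # y_i * lambda_i, support vectors only
theta = signed_lam @ clf.support_vectors_ # theta = sum_i lambda_i y_i x_i
theta0 = clf.intercept_[0]
n_support = len(clf.support_)
```

Here `theta` should agree with the fitted `clf.coef_`, and the number of support vectors is typically much smaller than the number of training points.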

Algorithm: Support Vector Machine Classification (Dual Form)