{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i_f5u2x9nn6I", "slideshow": { "slide_type": "slide" } }, "source": [ "# **Lecture 12: Support Vector Machines**\n", "In this lecture, we are going to cover support vector machines (SVMs), one of the most successful classification algorithms in machine learning." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 12.1. Classification Margins\n", "\n", "We start the presentation of SVMs by defining the classification *margin*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.1.1. Review and Motivation \n", "\n", "### 12.1.1.1. Review of Binary Classification\n", "\n", "Consider a training dataset $\\mathcal{D} = \\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \\ldots, (x^{(n)}, y^{(n)})\\}$.\n", "Recall that we distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$. \n", "\n", "1. __Regression__: The target variable $y \\in \\mathcal{Y}$ is continuous: $\\mathcal{Y} \\subseteq \\mathbb{R}$.\n", "\n", "2. __Binary Classification__: The target variable $y$ is discrete and takes on one of $K=2$ possible values.\n", "\n", "In this lecture, we focus on binary classification and assume $\\mathcal{Y} = \\{-1, +1\\}$.\n", "\n", "#### Linear Model Family\n", "\n", "In this lecture, we will work with linear models of the form:\n", "\n", "$$\n", "\\begin{align*}\n", "f_\\theta(x) & = \\theta_0 + \\theta_1 \\cdot x_1 + \\theta_2 \\cdot x_2 + ... + \\theta_d \\cdot x_d\n", "\\end{align*}\n", "$$\n", "\n", "where $x \\in \\mathbb{R}^d$ is a vector of features and $y \\in \\{-1, 1\\}$ is the target. 
The $\\theta_j$ are the *parameters* of the model.\n", "We can represent the model in a vectorized form as\n", "\n", "$$\n", "\\begin{align*}\n", "f_\\theta(x) = \\theta^\\top x + \\theta_0.\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 12.1.1.2. Binary Classification Problem and The Iris Dataset\n", "\n", "In this lecture, we will again use the Iris flower dataset. We will transform this problem into a binary classification task by merging the two non-Setosa flowers into one class.\n", "We use $\\mathcal{Y} =\\{-1,1\\}$ as the label space.\n", "\n", "The resulting dataset is partly shown below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n", "  <thead>\n", "    <tr><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th><th>target</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>-1</td></tr>\n", "    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>-1</td></tr>\n", "    <tr><th>8</th><td>4.4</td><td>2.9</td><td>1.4</td><td>0.2</td><td>-1</td></tr>\n", "    <tr><th>12</th><td>4.8</td><td>3.0</td><td>1.4</td><td>0.1</td><td>-1</td></tr>\n", "    <tr><th>16</th><td>5.4</td><td>3.9</td><td>1.3</td><td>0.4</td><td>-1</td></tr>\n", "  </tbody>\n", "</table>
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "8 4.4 2.9 1.4 0.2 \n", "12 4.8 3.0 1.4 0.1 \n", "16 5.4 3.9 1.3 0.4 \n", "\n", " target \n", "0 -1 \n", "4 -1 \n", "8 -1 \n", "12 -1 \n", "16 -1 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn import datasets\n", "\n", "# Load the Iris dataset\n", "iris = datasets.load_iris(as_frame=True)\n", "iris_X, iris_y = iris.data, iris.target\n", "\n", "# subsample: keep every fourth data point\n", "iris_X = iris_X.loc[::4]\n", "iris_y = iris_y.loc[::4]\n", "\n", "# create a binary classification dataset with labels +/- 1\n", "iris_y2 = iris_y.copy()\n", "iris_y2[iris_y2==2] = 1\n", "iris_y2[iris_y2==0] = -1\n", "\n", "# print part of the dataset\n", "pd.concat([iris_X, iris_y2], axis=1).head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As in earlier lectures, we visualize this dataset using `matplotlib`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "plt.rcParams['figure.figsize'] = [12, 4]\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "# create 2d version of dataset (first two features: sepal length and width) and a plotting grid\n", "X = iris_X.to_numpy()[:,:2]\n", "x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", "y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))\n", "\n", "# Plot also the training points\n", "p1 = plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=60, cmap=plt.cm.Paired)\n", "plt.xlabel('Sepal length')\n", "plt.ylabel('Sepal width')\n", "plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Not Setosa'], loc='lower right')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.1.2. Comparing Classification Algorithms\n", "\n", "We have seen different types of approaches to classification. When fitting a model, there may be many valid decision boundaries. How do we select one of them?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Consider the following three classification algorithms from `sklearn`. Each of them outputs a different classification boundary." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression, Perceptron, RidgeClassifier\n", "models = [LogisticRegression(), Perceptron(), RidgeClassifier()]\n", "\n", "def fit_and_create_boundary(model):\n", "    # fit the model and predict a label at every point of the plotting grid\n", "    model.fit(X, iris_y2)\n", "    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "    return Z\n", "\n", "plt.figure(figsize=(12,3))\n", "for i, model in enumerate(models):\n", "    plt.subplot(1, 3, i+1)\n", "    Z = fit_and_create_boundary(model)\n", "    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired) \n", "\n", "    # Plot also the training points\n", "    plt.scatter(X[:, 0], X[:, 1], c=iris_y2, edgecolors='k', cmap=plt.cm.Paired)\n", "    plt.title('Algorithm %d' % (i+1))\n", "    plt.xlabel('Sepal length')\n", "    plt.ylabel('Sepal width')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### 12.1.2.1. Classification Scores\n", "\n", "Most classification algorithms output not just a class label but a score.\n", "For example, logistic regression returns the class probability\n", "\n", "$$ \n", "p(y=1 \\mid x) = \\sigma(\\theta^\\top x) \\in [0,1] \n", "$$\n", "\n", "If the class probability is $>0.5$, the model outputs class $1$. \n", "The score is an estimate of confidence; it also represents how far we are from the decision boundary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 12.1.2.2. The Max-Margin Principle\n", "\n", "Intuitively, we want to select boundaries with high *margin*. \n", "This means that we are as confident as possible for every point and we are as far as possible from the decision boundary." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Several of the separating boundaries in our previous example had low margin: they came too close to the training points." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.linear_model import Perceptron, RidgeClassifier\n", "from sklearn.svm import SVC\n", "models = [SVC(kernel='linear', C=10000), Perceptron(), RidgeClassifier()]\n", "\n", "def fit_and_create_boundary(model):\n", "    # fit the model and predict a label at every point of the plotting grid\n", "    model.fit(X, iris_y2)\n", "    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n", "    Z = Z.reshape(xx.shape)\n", "    return Z\n", "\n", "plt.figure(figsize=(12,3))\n", "for i, model in enumerate(models):\n", "    plt.subplot(1, 3, i+1)\n", "    Z = fit_and_create_boundary(model)\n", "    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired) \n", "\n", "    # Plot also the training points\n", "    plt.scatter(X[:, 0], X[:, 1], c=iris_y2, edgecolors='k', cmap=plt.cm.Paired)\n", "    if i == 0:\n", "        plt.title('Good Margin')\n", "    else:\n", "        plt.title('Bad Margin')\n", "    plt.xlabel('Sepal length')\n", "    plt.ylabel('Sepal width')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Below, we plot a decision boundary between the two classes (solid line) that has a high margin. The two dashed lines lie at the margin.\n", "\n", "Points that are on the margin are circled in black. A good decision boundary is as far away as possible from the points at the margin." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(2.25, 4.0)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "#https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html\n", "from sklearn import svm\n", "\n", "# fit the model, don't regularize for illustration purposes\n", "clf = svm.SVC(kernel='linear', C=1000) # we'll explain this algorithm shortly\n", "clf.fit(X, iris_y2)\n", "\n", "plt.figure(figsize=(5,5))\n", "plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)\n", "Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)\n", "\n", "# plot decision boundary and margins\n", "plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,\n", " linestyles=['--', '-', '--'])\n", "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,\n", " linewidth=1, facecolors='none', edgecolors='k')\n", "plt.xlim([4.6, 6])\n", "plt.ylim([2.25, 4])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## 12.1.3. The Functional Classification Margin\n", "\n", "How can we define the concept of margin more formally?\n", "\n", "We can try to define the margin $\\tilde \\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$ as\n", "\n", "$$ \n", "\\tilde \\gamma^{(i)} = y^{(i)} \\cdot f(x^{(i)}) = y^{(i)} \\cdot \\left( \\theta^\\top x^{(i)} + \\theta_0 \\right). \n", "$$\n", "\n", "We call this the *functional* margin. Let's analyze it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We defined the functional margin as\n", "\n", "$$ \n", "\\tilde\\gamma^{(i)} = y^{(i)} \\cdot \\left( \\theta^\\top x^{(i)} + \\theta_0 \\right).\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* If $y^{(i)}=1$, then the margin $\\tilde\\gamma^{(i)}$ is large when the model score $f(x^{(i)}) = \\theta^\\top x^{(i)} + \\theta_0$ is positive and large." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* Thus, we are classifying $x^{(i)}$ correctly and with high confidence." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* If $y^{(i)}=-1$, then the margin $\\tilde\\gamma^{(i)}$ is large when the model score $f(x^{(i)}) = \\theta^\\top x^{(i)} + \\theta_0$ is negative and large in absolute value." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* We are again classifying $x^{(i)}$ correctly and with high confidence." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Thus higher margin means higher confidence at each input point. However, we have a problem. \n", "\n", "* If we rescale the parameters $\\theta, \\theta_0$ by a scalar $\\alpha > 0$, we get new parameters $\\alpha \\theta, \\alpha \\theta_0$.\n", "\n", "* The new parameters $\\alpha \\theta, \\alpha \\theta_0$ don't change the classification of points, because the sign of the score stays the same.\n", "\n", "* However, the margin $y^{(i)} \\cdot \\left( \\alpha \\theta^\\top x^{(i)} + \\alpha \\theta_0 \\right) = \\alpha \\cdot y^{(i)} \\cdot \\left( \\theta^\\top x^{(i)} + \\theta_0 \\right)$ is now scaled by $\\alpha$!\n", "\n", "It doesn't make sense that the same classification boundary can have different margins when we rescale it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.1.4. The Geometric Classification Margin\n", "\n", "We define the *geometric* margin $\\gamma^{(i)}$ with respect to a training example $(x^{(i)}, y^{(i)})$ as\n", "\n", "$$ \n", "\\gamma^{(i)} = y^{(i)}\\left( \\frac{\\theta^\\top x^{(i)} + \\theta_0}{||\\theta||} \\right). \n", "$$\n", "\n", "We call it geometric because $\\gamma^{(i)}$ equals the distance between $x^{(i)}$ and the hyperplane." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* We normalize the functional margin by $||\\theta||$.\n", "\n", "* Rescaling the weights no longer changes the margin, so it cannot be made arbitrarily large." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Let's make sure our intuition about the margin holds.\n", "\n", "$$ \n", "\\gamma^{(i)} = y^{(i)}\\left( \\frac{\\theta^\\top x^{(i)} + \\theta_0}{||\\theta||} \\right). \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* If $y^{(i)}=1$, then the margin $\\gamma^{(i)}$ is large when the model score $f(x^{(i)}) = \\theta^\\top x^{(i)} + \\theta_0$ is positive and large." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* Thus, we are classifying $x^{(i)}$ correctly and with high confidence." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* The same holds when $y^{(i)}=-1$. We again capture our intuition that increasing the margin means increasing the confidence at each input point." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 12.1.4.1. Geometric Intuitions\n", "\n", "The margin $\\gamma^{(i)}$ is called geometric because it corresponds to the distance from $x^{(i)}$ to the separating hyperplane $\\theta^\\top x + \\theta_0 = 0$ (dashed line below).\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Suppose that $y^{(i)}=1$ ($x^{(i)}$ lies on the positive side of the boundary). Then:\n", "1. The points $x$ that lie on the decision boundary are those for which $\\theta^\\top x + \\theta_0 = 0$ (the score is exactly zero, halfway between the labels $+1$ and $-1$)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "2. The vector $\\frac{\\theta}{||\\theta||}$ is perpendicular to the hyperplane $\\theta^\\top x + \\theta_0 = 0$ and has unit norm (fact from calculus)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "
\n", "\n", "3. Let $x_0$ be the point on the boundary closest to $x^{(i)}$. Then by definition of the margin\n", "$x^{(i)} = x_0 + \\gamma^{(i)} \\frac{\\theta}{||\\theta||}$ or\n", "\n", "$$ \n", "x_0 = x^{(i)} - \\gamma^{(i)} \\frac{\\theta}{||\\theta||}. \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "4. Since $x_0$ is on the hyperplane, $\\theta^\\top x_0 + \\theta_0 = 0$, or\n", "\n", "$$\n", "\\theta^\\top \\left(x^{(i)} - \\gamma^{(i)} \\frac{\\theta}{||\\theta||} \\right) + \\theta_0 = 0.\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "5. Solving for $\\gamma^{(i)}$ and using the fact that $\\theta^\\top \\theta = ||\\theta||^2$, we obtain\n", "\n", "$$ \n", "\\gamma^{(i)} = \\frac{\\theta^\\top x^{(i)} + \\theta_0}{||\\theta||}. \n", "$$\n", "\n", "This is exactly our geometric margin for $y^{(i)}=1$. The case $y^{(i)}=-1$ can be proven in a similar way." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We can use our formula for $\\gamma$ to precisely plot the margins on our earlier plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plot decision boundary and margins\n", "plt.figure(figsize=(5,5))\n", "plt.scatter(X[:, 0], X[:, 1], c=iris_y2, s=30, cmap=plt.cm.Paired)\n", "plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,\n", " linestyles=['--', '-', '--'])\n", "plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,\n", " linewidth=1, facecolors='none', edgecolors='k')\n", "plt.xlim([4.6, 6.1])\n", "plt.ylim([2.25, 4])\n", "\n", "# plot margin vectors\n", "theta = clf.coef_[0]\n", "theta0 = clf.intercept_\n", "for idx in clf.support_[:3]:\n", " x0 = X[idx]\n", " y0 = iris_y2.iloc[idx]\n", " margin_x0 = (theta.dot(x0) + theta0)[0] / np.linalg.norm(theta)\n", " w = theta / np.linalg.norm(theta)\n", " plt.plot([x0[0], x0[0]-w[0]*margin_x0], [x0[1], x0[1]-w[1]*margin_x0], color='blue')\n", " plt.scatter([x0[0]-w[0]*margin_x0], [x0[1]-w[1]*margin_x0], color='blue')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 12.2. The Max-Margin Classifier\n", "\n", "We have seen a way to measure the confidence level of a classifier at a data point using the notion of a *margin*.\n", "Next, we are going to see how to maximize the margin of linear classifiers." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.2.1. Maximizing the Margin\n", "\n", "We want to define an objective that will result in maximizing the margin. As a first attempt, consider the following optimization problem.\n", "\n", "$$\n", "\\begin{align*}\n", "\\max_{\\theta,\\theta_0,\\gamma} \\gamma \\; & \\\\\n", "\\text{subject to } \\; & y^{(i)}\\frac{(x^{(i)})^\\top\\theta+\\theta_0}{||\\theta||}\\geq \\gamma \\; \\text{for all $i$} \n", "\\end{align*}\n", "$$\n", "\n", "This maximises the smallest margin over the $(x^{(i)}, y^{(i)})$. 
It guarantees each point has margin at least $\\gamma$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This problem is difficult to optimize because of the division by $||\\theta||$ and we would like to simplify it. First, consider the equivalent problem:\n", "\n", "$$\n", "\\begin{align*}\n", "\\max_{\\theta,\\theta_0,\\gamma} \\gamma \\; & \\\\\n", "\\text{subject to } \\; & y^{(i)}((x^{(i)})^\\top\\theta+\\theta_0)\\geq \\gamma ||\\theta|| \\; \\text{for all $i$}\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that this problem has an extra degree of freedom: \n", "\n", "* Suppose we multiply $\\theta, \\theta_0$ by some constant $c >0$.\n", "\n", "* This yields another valid solution!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To enforce uniqueness, we add another constraint that doesn't change the optimal decision boundary:\n", "\n", "$$ \n", "||\\theta|| \\cdot \\gamma = 1. \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This ensures we cannot rescale $\\theta$ and also asks our linear model to assign each $x^{(i)}$ a score of magnitude at least $1$ with the correct sign:\n", "\n", "$$ \n", "y^{(i)}((x^{(i)})^\\top\\theta+\\theta_0)\\geq 1 \\; \\text{for all $i$} \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "If the constraint $||\\theta|| \\cdot \\gamma = 1$ holds, then we know that $\\gamma = 1/||\\theta||$ and we can replace $\\gamma$ in the optimization problem to obtain:\n", "\n", "$$\n", "\\begin{align*}\n", "\\max_{\\theta,\\theta_0} \\frac{1}{||\\theta||} \\; & \\\\\n", "\\text{subject to } \\; & y^{(i)}((x^{(i)})^\\top\\theta+\\theta_0)\\geq 1 \\; \\text{for all $i$}\n", "\\end{align*}\n", "$$\n", "\n", "The solution of this problem is still the same." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Finally, instead of maximizing $1/||\\theta||$, we can minimize $||\\theta||$, or equivalently we can minimize $\\frac{1}{2}||\\theta||^2$.\n", "\n", "$$\n", "\\begin{align*}\n", "\\min_{\\theta,\\theta_0} \\frac{1}{2}||\\theta||^2 \\; & \\\\\n", "\\text{subject to } \\; & y^{(i)}((x^{(i)})^\\top\\theta+\\theta_0)\\geq 1 \\; \\text{for all $i$}\n", "\\end{align*}\n", "$$\n", "\n", "This is now a quadratic program that can be solved using off-the-shelf optimization algorithms!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.2.2. Algorithm: Linear Support Vector Machine Classification\n", "The above derivation yields the optimization problem solved by the support vector machine. We can succinctly define the algorithm components.\n", "\n", "* __Type__: Supervised learning (binary classification).\n", "\n", "* __Model family__: Linear decision boundaries.\n", "\n", "* __Objective function__: Max-margin optimization.\n", "\n", "* __Optimizer__: Quadratic optimization algorithms.\n", "\n", "* __Probabilistic interpretation__: No simple interpretation!\n", "\n", "Later, we will see several other versions of this algorithm." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 12.3. Soft Margins and the Hinge Loss\n", "\n", "Let's continue looking at how we can maximize the margin." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.3.1. Non-Separable Problems\n", "\n", "So far, we have assumed that a separating hyperplane exists. However, what if the classes are non-separable? Then our optimization problem does not have a solution and we need to modify it." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our solution is going to be to make each constraint \"soft\", by introducing \"slack\" variables, which allow the constraint to be violated.\n", "\n", "$$\n", "y^{(i)}((x^{(i)})^\\top\\theta+\\theta_0)\\geq 1 - \\xi_i.\n", "$$\n", "\n", "* If we can classify a point with a perfect score of $\\geq 1$, then $\\xi_i=0$.\n", "\n", "* If we cannot assign a perfect score, we only require a score of $1-\\xi_i$ for some $\\xi_i > 0$.\n", "\n", "* We define the optimization problem such that the $\\xi_i$ are chosen to be as small as possible." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In the optimization problem, we assign a penalty $C$ to these slack variables to obtain:\n", "\n", "$$\n", "\\begin{align*}\n", "\\min_{\\theta,\\theta_0, \\xi}\\; & \\frac{1}{2}||\\theta||^2 + C \\sum_{i=1}^n \\xi_i \\; \\\\\n", "\\text{subject to } \\; & y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\geq 1 - \\xi_i \\; \\text{for all $i$} \\\\\n", "& \\xi_i \\geq 0\n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.3.2. Towards an Unconstrained Objective\n", "\n", "Let's further modify things. 
Moving around terms in the inequality we get:\n", "\n", "$$\n", "\\begin{align*}\n", "\\min_{\\theta,\\theta_0, \\xi}\\; & \\frac{1}{2}||\\theta||^2 + C \\sum_{i=1}^n \\xi_i \\; \\\\\n", "\\text{subject to } \\; & \\xi_i \\geq 1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right) \\; \\xi_i \\geq 0 \\; \\text{for all $i$} \n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If $0 \\geq 1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)$, we classified $x^{(i)}$ perfectly and $\\xi_i = 0$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If $0 < 1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)$, then $\\xi_i = 1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Thus, $\\xi_i = \\max\\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right), 0 \\right)$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We simplify notation a bit by writing $(x)^+ = \\max(x,0)$.\n", "\n", "This yields:\n", "\n", "$$\n", "\\xi_i = \\max\\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right), 0 \\right) = \\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\right)^+\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Since $\\xi_i = \\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\right)^+$, we can take\n", "\n", "$$\n", "\\begin{align*}\n", "\\min_{\\theta,\\theta_0, \\xi}\\; & \\frac{1}{2}||\\theta||^2 + C \\sum_{i=1}^n \\xi_i \\; \\\\\n", "\\text{subject to } \\; & \\xi_i \\geq 1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right) \\; \\xi_i \\geq 0 \\; \\text{for all $i$} \n", "\\end{align*}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "And we turn it into the following by plugging in the definition of $\\xi_i$:\n", "\n", "$$ \n", "\\min_{\\theta,\\theta_0}\\; \\frac{1}{2}||\\theta||^2 + C \\sum_{i=1}^n \\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\right)^+ \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Dividing the objective by $C>0$ does not change its minimizer, so this is equivalent to\n", "\n", "$$ \\min_{\\theta,\\theta_0}\\; \\sum_{i=1}^n \\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\right)^+ + \\frac{\\lambda}{2}||\\theta||^2 $$\n", "\n", "with $\\lambda = 1/C > 0$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We have now turned our optimization problem into an unconstrained form:\n", "\n", "$$ \n", "\\min_{\\theta,\\theta_0}\\; \\sum_{i=1}^n \\underbrace{\\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\right)^+}_\\text{hinge loss} + \\underbrace{\\frac{\\lambda}{2}||\\theta||^2}_\\text{regularizer} \n", "$$\n", "\n", "* The hinge loss penalizes incorrect predictions.\n", "* The L2 regularizer ensures the weights are small and well-behaved." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.3.3. The Hinge Loss\n", "\n", "Consider again our new loss term for a label $y$ and a prediction $f$:\n", "\n", "$$ \n", "L(y, f) = \\max\\left(1 - y \\cdot f, 0\\right). \n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine the behavior of this loss on different $y, f$:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* If the prediction $f$ has the same sign as $y$ and $|f| \\geq 1$, the loss is zero. In other words, if the class is correct, no penalty is applied as long as the absolute value of the score $f$ is at least 1." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* However, if the prediction $f$ is of the wrong sign, or $|f| < 1$, the loss is $|y - f|$. Thus, we penalize incorrect predictions, or predictions that are too close to the midpoint between the two class labels (which is at zero, since the labels are $\\pm 1$)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Let's visualize a few losses $L(y=1,f)$, as a function of $f$, including hinge." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'L(y=1,f)')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# define the losses for a target of y=1\n", "hinge_loss = lambda f: np.maximum(1 - f, 0)\n", "l2_loss = lambda f: (1-f)**2\n", "l1_loss = lambda f: np.abs(f-1)\n", "\n", "# plot them\n", "fs = np.linspace(0, 2)\n", "plt.plot(fs, l1_loss(fs), fs, l2_loss(fs), fs, hinge_loss(fs), linewidth=9, alpha=0.5)\n", "plt.legend(['L1 Loss', 'L2 Loss', 'Hinge Loss'])\n", "plt.xlabel('Prediction f')\n", "plt.ylabel('L(y=1,f)')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can make a few interesting observations:\n", "* The hinge loss is linear like the L1 loss.\n", "* But it only penalizes errors that are on the \"wrong\" side: \n", "    * We have an error of $|f-y|$ if the true class is $1$ and $f < 1$.\n", "    * We don't penalize for predicting $f>1$ if the true class is $1$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(fs, hinge_loss(fs), linewidth=9, alpha=0.5)\n", "plt.legend(['Hinge Loss'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 12.3.3.1. Properties of the Hinge Loss\n", "\n", "The hinge loss is one of the most widely used losses in machine learning. We summarize here several important properties of the hinge loss.\n", "\n", "* It penalizes errors \"that matter,\" hence is less sensitive to outliers.\n", "\n", "* Minimizing a regularized hinge loss optimizes for a high margin.\n", "\n", "* The loss is non-differentiable at the point $y \\cdot f = 1$, which may make it more challenging to optimize." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# 12.4. Optimization for SVMs\n", "\n", "We have seen a new way to formulate the SVM objective. Let's now see how to optimize it." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.4.0. Review\n", "\n", "### 12.4.0.1. Review: SVM Objective\n", "\n", "Maximizing the margin can be done by solving the following problem:\n", "\n", "$$ \n", "\\min_{\\theta,\\theta_0}\\; \\sum_{i=1}^n \\underbrace{\\left(1 - y^{(i)}\\left((x^{(i)})^\\top\\theta+\\theta_0\\right)\\right)^+}_\\text{hinge loss} + \\underbrace{\\frac{\\lambda}{2}||\\theta||^2}_\\text{regularizer} \n", "$$\n", "\n", "* The hinge loss penalizes incorrect predictions.\n", "* The L2 regularizer ensures the weights are small and well-behaved." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can easily implement this objective in `numpy`.\n", "First we define the model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def f(X, theta):\n", "    \"\"\"The linear model we are trying to fit.\n", "\n", "    Parameters:\n", "    theta (np.array): d-dimensional vector of parameters\n", "    X (np.array): (n,d)-dimensional data matrix\n", "\n", "    Returns:\n", "    y_pred (np.array): n-dimensional vector of predicted targets\n", "    \"\"\"\n", "    return X.dot(theta)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And then we define the objective." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def svm_objective(theta, X, y, C=.1):\n", "    \"\"\"The cost function, J, describing the goodness of fit.\n", "\n", "    Parameters:\n", "    theta (np.array): d-dimensional vector of parameters (the last entry is the bias)\n", "    X (np.array): (n,d)-dimensional design matrix (the last column is all ones)\n", "    y (np.array): n-dimensional vector of targets in {-1, +1}\n", "    C (float): strength of the L2 regularizer\n", "\n", "    Returns:\n", "    J (float): mean hinge loss plus the L2 penalty on the non-bias weights\n", "    \"\"\"\n", "    hinge = np.maximum(1 - y * f(X, theta), 0)\n", "    return hinge.mean() + 0.5 * C * np.linalg.norm(theta[:-1])**2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### 12.4.0.2. Review: Gradient Descent\n", "If we want to minimize $J(\\theta)$, we start with an initial guess $\\theta_0$ for the parameters and repeat the following update:\n", "\n", "$$ \n", "\\theta_i := \\theta_{i-1} - \\alpha \\cdot \\nabla_\\theta J(\\theta_{i-1}). 
\n", "$$\n", "\n", "As code, this method may look as follows:\n", "```python\n", "theta, theta_prev = random_initialization()\n", "while norm(theta - theta_prev) > convergence_threshold:\n", "    theta_prev = theta\n", "    theta = theta_prev - step_size * gradient(theta_prev)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.4.1. 
A Gradient for the Hinge Loss?\n", "\n", "What is the gradient for the hinge loss with a linear $f$?\n", "\n", "$$ \n", "J(\\theta) = \\max\\left(1 - y \\cdot f_\\theta(x), 0\\right) = \\max\\left(1 - y \\cdot \\theta^\\top x, 0\\right). \n", "$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here, you see the linear part of $J$ that behaves like $1 - y \\cdot f_\\theta(x)$ (when $y \\cdot f_\\theta(x) < 1$) in orange:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.plot(fs, hinge_loss(fs), fs[:25], hinge_loss(fs[:25]), linewidth=9, alpha=0.5)\n", "plt.legend(['Hinge Loss', r'Hinge Loss when $y \\cdot f < 1$'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "When $y \\cdot f_\\theta(x) < 1$, we are in the \"orange line\" part and $J(\\theta)$ behaves like $1 - y \\cdot f_\\theta(x)$.\n", "\n", "\n", "Hence the gradient in this regime is:\n", "\n", "$$\n", "\\nabla_\\theta J(\\theta) = -y \\cdot \\nabla_\\theta f_\\theta(x) = -y \\cdot x\n", "$$\n", "\n", "where we used $\\nabla_\\theta \\theta^\\top x = x$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "When $y \\cdot f_\\theta(x) > 1$, we are in the \"flat\" part and $J(\\theta) = 0$.\n", "Hence the gradient is also just zero!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "What is the gradient for the hinge loss with a linear $f$?\n", "\n", "$$ \n", "J(\\theta) = \\max\\left(1 - y \\cdot f_\\theta(x), 0\\right) = \\max\\left(1 - y \\cdot \\theta^\\top x, 0\\right). \n", "$$\n", "\n", "When $y \\cdot f_\\theta(x) = 1$, we are in the \"kink\", and the gradient is not defined!\n", "* In practice, we can either take the gradient when $y \\cdot f_\\theta(x) > 1$ or the gradient when $y \\cdot f_\\theta(x) < 1$ or anything in between. Any such choice is called a *subgradient*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### 12.4.1.1. A Steepest Descent Direction for the Hinge Loss\n", "\n", "We can define a \"gradient\"-like function $\\tilde \\nabla_\\theta J(\\theta)$ for the hinge loss\n", "\n", "$$ \n", "J(\\theta) = \\max\\left(1 - y \\cdot f_\\theta(x), 0\\right) = \\max\\left(1 - y \\cdot \\theta^\\top x, 0\\right). 
\n", "$$\n", "\n", "It equals:\n", "\n", "$$\n", "\\tilde \\nabla_\\theta J(\\theta) = \\begin{cases} -y \\cdot x & \\text{ if $y \\cdot f_\\theta(x) < 1$} \\\\ 0 & \\text{ otherwise} \\end{cases} \n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.4.2. (Sub-)Gradient Descent for SVM\n", "\n", "Putting this together, we obtain a gradient descent algorithm (technically, it's called subgradient descent).\n", "\n", "\n", "```python\n", "theta, theta_prev = random_initialization()\n", "while abs(J(theta) - J(theta_prev)) > conv_threshold:\n", "    theta_prev = theta\n", "    theta = theta_prev - step_size * approximate_gradient\n", "```\n", "\n", "Let's implement this algorithm." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "First, we implement the approximate gradient." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def svm_gradient(theta, X, y, C=.1):\n", "    \"\"\"The (approximate) gradient of the cost function.\n", "\n", "    Parameters:\n", "    theta (np.array): d-dimensional vector of parameters\n", "    X (np.array): (n,d)-dimensional design matrix\n", "    y (np.array): n-dimensional vector of targets\n", "\n", "    Returns:\n", "    subgradient (np.array): d-dimensional subgradient\n", "    \"\"\"\n", "    # points with y * f >= 1 lie on the flat part of the hinge and contribute zero\n", "    yy = y.copy()\n", "    yy[y * f(X, theta) >= 1] = 0\n", "    # average the per-example subgradients -y * x over the dataset\n", "    subgradient = np.mean(-yy * X.T, axis=1)\n", "    # add the gradient of the L2 regularizer (the bias, stored last, is not regularized)\n", "    subgradient[:-1] += C * theta[:-1]\n", "    return subgradient" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And then we implement subgradient descent." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration 0. J: 3.728947\n", "Iteration 1000. J: 0.376952\n", "Iteration 2000. 
J: 0.359075\n", "Iteration 3000. J: 0.351587\n", "Iteration 4000. J: 0.344411\n", "Iteration 5000. J: 0.337912\n", "Iteration 6000. J: 0.331617\n", "Iteration 7000. J: 0.326604\n", "Iteration 8000. J: 0.322224\n", "Iteration 9000. J: 0.319250\n", "Iteration 10000. J: 0.316727\n", "Iteration 11000. J: 0.314800\n", "Iteration 12000. J: 0.313181\n", "Iteration 13000. J: 0.311843\n", "Iteration 14000. J: 0.310667\n", "Iteration 15000. J: 0.309561\n", "Iteration 16000. J: 0.308496\n", "Iteration 17000. J: 0.307523\n", "Iteration 18000. J: 0.306614\n", "Iteration 19000. J: 0.305768\n", "Iteration 20000. J: 0.305068\n", "Iteration 21000. J: 0.304293\n" ] } ], "source": [ "threshold = 5e-4\n", "step_size = 1e-2\n", "\n", "theta, theta_prev = np.ones((3,)), np.zeros((3,))\n", "iteration = 0\n", "\n", "# append a constant feature so that the last entry of theta acts as the bias\n", "iris_X['one'] = 1\n", "X_train = iris_X.iloc[:,[0,1,-1]].to_numpy()\n", "y_train = iris_y2.to_numpy()\n", "\n", "# run subgradient descent until the parameters stop changing\n", "while np.linalg.norm(theta - theta_prev) > threshold:\n", "    if iteration % 1000 == 0:\n", "        print('Iteration %d. J: %.6f' % (iteration, svm_objective(theta, X_train, y_train)))\n", "    theta_prev = theta\n", "    gradient = svm_gradient(theta, X_train, y_train)\n", "    theta = theta_prev - step_size * gradient\n", "    iteration += 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can visualize the results to convince ourselves we found a good boundary." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Evaluate the model on a grid and threshold its output at zero to get predicted classes\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, .02), np.arange(y_min, y_max, .02))\n", "Z = f(np.c_[xx.ravel(), yy.ravel(), np.ones(xx.ravel().shape)], theta)\n", "Z[Z<0] = 0\n", "Z[Z>0] = 1\n", "\n", "# Put the result into a color plot\n", "Z = Z.reshape(xx.shape)\n", "plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)\n", "\n", "# Also plot the training points\n", "plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.Paired)\n", "plt.xlabel('Sepal length')\n", "plt.ylabel('Sepal width')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 12.4.3. Algorithm: Linear Support Vector Machine Classification\n", "The above procedure describes subgradient descent optimization for support vector machines. The algorithm card below summarizes this algorithm and its components.\n", "\n", "* __Type__: Supervised learning (binary classification).\n", "\n", "* __Model family__: Linear decision boundaries.\n", "\n", "* __Objective function__: L2-regularized hinge loss.\n", "\n", "* __Optimizer__: Subgradient descent.\n", "\n", "* __Probabilistic interpretation__: No simple interpretation!" 
] } ], "metadata": { "accelerator": "GPU", "celltoolbar": "Slideshow", "colab": { "collapsed_sections": [], "name": "neural-ode.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" }, "rise": { "controlsTutorial": false, "height": 900, "help": false, "margin": 0, "maxScale": 2, "minScale": 0.2, "progress": true, "scroll": true, "theme": "simple", "width": 1200 }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 1 }