{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i_f5u2x9nn6I", "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "\n", "# Lecture 11: Kernels\n", "\n", "### Applied Machine Learning\n", "\n", "__Volodymyr Kuleshov__
Cornell Tech" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1: The Kernel Trick: Motivation\n", "\n", "So far, the majority of the machine learning models we have seen have been *linear*.\n", "\n", "In this lecture, we will see a general way to make many of these models *non-linear*. We will use a new idea called *kernels*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Linear Regression\n", "\n", "Recall that a linear model has the form\n", "$$f(x) = \\sum_{j=0}^d \\theta_j \\cdot x_j = \\theta^\\top x,$$\n", "where $x$ is a vector of features and we used the notation $x_0 = 1$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We pick $\\theta$ to minimize the (L2-regularized) mean squared error (MSE):\n", "$$J(\\theta)= \\frac{1}{2n} \\sum_{i=1}^n(y^{(i)} - \\theta^\\top x^{(i)})^2 + \\frac{\\lambda}{2}\\sum_{j=1}^d \\theta_j^2$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Review: Polynomials\n", "\n", "Recall that a polynomial of degree $p$ is a function of the form\n", "$$\n", "a_p x^p + a_{p-1} x^{p-1} + ... + a_{1} x + a_0.\n", "$$\n", "\n", "Below are some examples of polynomial functions." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Review: Polynomial Regression\n", "\n", "Given a one-dimensional continuous variable $x$, we can define a feature function $\\phi : \\mathbb{R} \\to \\mathbb{R}^{p+1}$ as\n", "$$\\phi(x) = \\begin{bmatrix}\n", "1 \\\\\n", "x \\\\\n", "x^2 \\\\\n", "\\vdots \\\\\n", "x^p\n", "\\end{bmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The class of models of the form\n", "$$f_\\theta(x) := \\sum_{j=0}^p \\theta_j x^j = \\theta^\\top \\phi(x)$$\n", "with parameters $\\theta$ and polynomial features $\\phi$ is the set of polynomials of degree at most $p$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Towards General Non-Linear Features\n", "\n", "Any non-linear feature map $\\phi : \\mathbb{R}^d \\to \\mathbb{R}^p$ can be used to obtain general models of the form\n", "$$f_\\theta(x) := \\theta^\\top \\phi(x)$$\n", "that are highly non-linear in $x$ but linear in $\\theta$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Featurized Design Matrix\n", "\n", "It is useful to represent the featurized dataset as a matrix $\\Phi \\in \\mathbb{R}^{n \\times p}$:\n", "\n", "$$\\Phi = \\begin{bmatrix}\n", "\\phi(x^{(1)})_1 & \\phi(x^{(1)})_2 & \\ldots & \\phi(x^{(1)})_p \\\\\n", "\\phi(x^{(2)})_1 & \\phi(x^{(2)})_2 & \\ldots & \\phi(x^{(2)})_p \\\\\n", "\\vdots \\\\\n", "\\phi(x^{(n)})_1 & \\phi(x^{(n)})_2 & \\ldots & \\phi(x^{(n)})_p\n", "\\end{bmatrix}\n", "=\n", "\\begin{bmatrix}\n", "- & \\phi(x^{(1)})^\\top & - \\\\\n", "- & \\phi(x^{(2)})^\\top & - \\\\\n", "& \\vdots & \\\\\n", "- & \\phi(x^{(n)})^\\top & - \\\\\n", "\\end{bmatrix}\n", ".$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Featurized Normal Equations\n", "\n", "The normal equations provide a closed-form solution for $\\theta$:\n", "$$\\theta = (X^\\top X + \\lambda I)^{-1} X^\\top y.$$\n", "\n", "When the vectors of attributes $x^{(i)}$ are featurized, we can write this as\n", "$$\\theta = (\\Phi^\\top \\Phi + \\lambda I)^{-1} \\Phi^\\top y.$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Push-Through Matrix Identity\n", "\n", "We can modify this expression by using a version of the [push-through matrix identity](https://en.wikipedia.org/wiki/Woodbury_matrix_identity#Discussion):\n", "$$(\\lambda I + U V)^{-1} U = U (\\lambda I + V U)^{-1}$$\n", "where $U \\in \\mathbb{R}^{n \\times m}$, $V \\in \\mathbb{R}^{m \\times n}$, and both inverses exist (for example, when $\\lambda > 0$ and $V = U^\\top$)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Proof sketch: Start with $U (\\lambda I + V U) = (\\lambda I + U V) U$ and multiply both sides by $(\\lambda I + V U)^{-1}$ on the right and $(\\lambda I + U V)^{-1}$ on the left." 
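, "\n", "\n", "Written out in full, and assuming both inverses exist, the steps of this sketch are as follows. Here $I_n$ and $I_m$ denote identity matrices of the appropriate sizes, a notation introduced only for this derivation:\n", "\\begin{align*}\n", "U (\\lambda I_m + V U) & = \\lambda U + U V U = (\\lambda I_n + U V) U \\\\\n", "(\\lambda I_n + U V)^{-1} U (\\lambda I_m + V U) & = U \\\\\n", "(\\lambda I_n + U V)^{-1} U & = U (\\lambda I_m + V U)^{-1}.\n", "\\end{align*}"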
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Normal Equations: Dual Form\n", "\n", "We can apply the identity $(\\lambda I + U V)^{-1} U = U (\\lambda I + V U)^{-1}$ with $U=\\Phi^\\top$ and $V=\\Phi$ to the normal equations\n", "\n", "$$\\theta = (\\Phi^\\top \\Phi + \\lambda I)^{-1} \\Phi^\\top y$$\n", "\n", "to obtain the *dual* form:\n", "\n", "$$\\theta = \\Phi^\\top (\\Phi \\Phi^\\top + \\lambda I)^{-1} y.$$\n", "\n", "The first form inverts a $p \\times p$ matrix, which takes $O(p^3)$ time; the second inverts an $n \\times n$ matrix in $O(n^3)$ time and is faster when $p > n$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Feature Representations for Parameters\n", "\n", "An interesting corollary of the dual form\n", "$$\\theta = \\Phi^\\top \\underbrace{(\\Phi \\Phi^\\top + \\lambda I)^{-1} y}_\\alpha$$\n", "is that the optimal $\\theta$ is a linear combination of the $n$ training set features:\n", "$$\\theta = \\sum_{i=1}^n \\alpha_i \\phi(x^{(i)}).$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here, the weights $\\alpha_i$ are the entries of $(\\Phi \\Phi^\\top + \\lambda I)^{-1} y$ and equal\n", "$$\\alpha_i = \\sum_{j=1}^n L_{ij} y^{(j)}$$\n", "where $L = (\\Phi \\Phi^\\top + \\lambda I)^{-1}.$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Predictions From Features\n", "\n", "Consider now a prediction $\\phi(x')^\\top \\theta$ at a new input $x'$:\n", "$$\\phi(x')^\\top \\theta = \\sum_{i=1}^n \\alpha_i \\phi(x')^\\top \\phi(x^{(i)}).$$\n", "\n", "The crucial observation is that the features $\\phi(x)$ are never used directly in this equation. Only their dot products are used!\n", "\n", "This observation will be at the heart of a powerful new idea called *the kernel trick*." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Learning From Feature Products\n", "\n", "We also don't need the features $\\phi$ for learning $\\theta$, just their dot products!\n", "First, recall that each row $i$ of $\\Phi$ is the $i$-th featurized input $\\phi(x^{(i)})^\\top$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Thus $K = \\Phi \\Phi^\\top$ is the matrix of dot products between all pairs of $\\phi(x^{(i)})$:\n", "$$K_{ij} = \\phi(x^{(i)})^\\top \\phi(x^{(j)}).$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can compute $\\alpha = (K+\\lambda I)^{-1}y$ and use it for predictions\n", "$$\\phi(x')^\\top \\theta = \\sum_{i=1}^n \\alpha_i \\phi(x')^\\top \\phi(x^{(i)}),$$\n", "and all this only requires dot products, not the features $\\phi$ themselves!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Kernel Trick\n", "\n", "The above observations hint at a powerful new idea -- if we can compute dot products of features $\\phi(x)$ efficiently, then we will be able to use high-dimensional features easily.\n", "\n", "It turns out that we can do this for many ML algorithms -- we call this the Kernel Trick." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "# Part 2: The Kernel Trick: An Example\n", "\n", "Many ML algorithms can be written down as optimization problems in which the features $\\phi(x)$ only appear as dot products $\\phi(x)^\\top \\phi(z)$ that can be computed efficiently.\n", "\n", "Let's look at an example." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Linear Regression\n", "\n", "Recall that a linear model has the form\n", "$$f(x) = \\sum_{j=0}^d \\theta_j \\cdot x_j = \\theta^\\top x,$$\n", "where $x$ is a vector of features and we used the notation $x_0 = 1$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Non-Linear Features\n", "\n", "Any non-linear feature map $\\phi : \\mathbb{R}^d \\to \\mathbb{R}^p$ can be used in this way to obtain general models of the form\n", "$$f_\\theta(x) := \\theta^\\top \\phi(x)$$\n", "that are highly non-linear in $x$ but linear in $\\theta$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Featurized Design Matrix\n", "\n", "It is useful to represent the featurized dataset as a matrix $\\Phi \\in \\mathbb{R}^{n \\times p}$:\n", "\n", "$$\\Phi = \\begin{bmatrix}\n", "\\phi(x^{(1)})_1 & \\phi(x^{(1)})_2 & \\ldots & \\phi(x^{(1)})_p \\\\\n", "\\phi(x^{(2)})_1 & \\phi(x^{(2)})_2 & \\ldots & \\phi(x^{(2)})_p \\\\\n", "\\vdots \\\\\n", "\\phi(x^{(n)})_1 & \\phi(x^{(n)})_2 & \\ldots & \\phi(x^{(n)})_p\n", "\\end{bmatrix}\n", "=\n", "\\begin{bmatrix}\n", "- & \\phi(x^{(1)})^\\top & - \\\\\n", "- & \\phi(x^{(2)})^\\top & - \\\\\n", "& \\vdots & \\\\\n", "- & \\phi(x^{(n)})^\\top & - \\\\\n", "\\end{bmatrix}\n", ".$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Normal Equations\n", "\n", "The normal equations provide a closed-form solution for $\\theta$:\n", "\n", "$$\\theta = (\\Phi^\\top \\Phi + \\lambda I)^{-1} \\Phi^\\top y.$$\n", "\n", "They also can be written in this form:\n", "\n", "$$\\theta = \\Phi^\\top (\\Phi \\Phi^\\top + \\lambda I)^{-1} y.$$\n", "\n", "The first approach takes $O(p^3)$ time; the second is $O(n^3)$ and is faster when $p > n$." 
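, "\n", "\n", "As a small sanity check that the two forms agree (the numbers below are chosen purely for illustration and do not come from the lecture), take $n = 1$, a two-dimensional feature vector, $\\Phi = \\begin{bmatrix} 1 & 2 \\end{bmatrix}$, $y = 1$, and $\\lambda = 1$:\n", "\\begin{align*}\n", "(\\Phi^\\top \\Phi + \\lambda I)^{-1} \\Phi^\\top y & = \\begin{bmatrix} 2 & 2 \\\\ 2 & 5 \\end{bmatrix}^{-1} \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix} = \\frac{1}{6} \\begin{bmatrix} 5 & -2 \\\\ -2 & 2 \\end{bmatrix} \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix} = \\begin{bmatrix} 1/6 \\\\ 1/3 \\end{bmatrix} \\\\\n", "\\Phi^\\top (\\Phi \\Phi^\\top + \\lambda I)^{-1} y & = \\begin{bmatrix} 1 \\\\ 2 \\end{bmatrix} (5 + 1)^{-1} \\cdot 1 = \\begin{bmatrix} 1/6 \\\\ 1/3 \\end{bmatrix}.\n", "\\end{align*}\n", "The first form inverts a $2 \\times 2$ matrix, while the second only inverts a scalar."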
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Learning From Feature Products\n", "\n", "An interesting corollary is that the optimal $\\theta$ is a linear combination of the $n$ training set features:\n", "$$\\theta = \\sum_{i=1}^n \\alpha_i \\phi(x^{(i)}).$$\n", "We can compute a prediction $\\phi(x')^\\top \\theta$ for $x'$ without ever using the features (only their dot products):\n", "$$\\phi(x')^\\top \\theta = \\sum_{i=1}^n \\alpha_i \\phi(x')^\\top \\phi(x^{(i)}).$$\n", "Equally importantly, we can learn $\\theta$ from only dot products." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Polynomial Regression\n", "\n", "Note that a $p$-th degree polynomial\n", "\n", "$$\n", "a_p x^p + a_{p-1} x^{p-1} + ... + a_{1} x + a_0\n", "$$\n", "\n", "forms a linear model with parameters $a_p, a_{p-1}, ..., a_0$.\n", "This means we can use our algorithms for linear models to learn non-linear functions of $x$!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Specifically, given a one-dimensional continuous variable $x$, we can define a feature function $\\phi : \\mathbb{R} \\to \\mathbb{R}^{p+1}$ as\n", "\n", "$$\\phi(x) = \\begin{bmatrix}\n", "1 \\\\\n", "x \\\\\n", "x^2 \\\\\n", "\\vdots \\\\\n", "x^p\n", "\\end{bmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then the class of models of the form\n", "$$f_\\theta(x) := \\sum_{j=0}^p \\theta_j x^j = \\theta^\\top \\phi(x)$$\n", "with parameters $\\theta$ encompasses the set of polynomials of degree at most $p$. Specifically,\n", "* It is non-linear in the input variable $x$, meaning that we can model complex data relationships.\n", "* It is a linear model as a function of the parameters $\\theta$, meaning that we can use our familiar ordinary least squares algorithm to fit the parameters $\\theta$." 
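, "\n", "\n", "As a concrete illustration (with values chosen here only as an example), take $p = 2$ and $\\theta = (1, -2, 3)^\\top$. Then at $x = 2$ we have $\\phi(x) = (1, 2, 4)^\\top$ and\n", "$$f_\\theta(2) = \\theta^\\top \\phi(2) = 1 \\cdot 1 + (-2) \\cdot 2 + 3 \\cdot 4 = 9,$$\n", "which is the value of the quadratic $1 - 2x + 3x^2$ at $x = 2$."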
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Kernel Trick: A First Example\n", "\n", "Can we compute the dot product $\\phi(x)^\\top \\phi(x')$ of polynomial features $\\phi(x)$ more efficiently than using the standard definition of a dot product? Let's look at an example.\n", "\n", "To start, consider polynomial features $\\phi : \\mathbb{R}^d \\to \\mathbb{R}^{d^2}$ of the form\n", "\n", "$$\\phi(x)_{ij} = x_i x_j \\;\\text{ for } i,j \\in \\{1,2,\\ldots,d\\}.$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For $d=3$ this looks like\n", "$$\\small \\phi(x) = \\begin{bmatrix}\n", "x_1 x_1 \\\\\n", "x_1 x_2 \\\\\n", "x_1 x_3 \\\\\n", "x_2 x_1 \\\\\n", "x_2 x_2 \\\\\n", "x_2 x_3 \\\\\n", "x_3 x_1 \\\\\n", "x_3 x_2 \\\\\n", "x_3 x_3 \\\\\n", "\\end{bmatrix}.\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The dot product of $x$ and $z$ in feature space equals:\n", "$$\\phi(x)^\\top \\phi(z) = \\sum_{i=1}^d \\sum_{j=1}^d x_i x_j z_i z_j$$\n", "Computing this dot product involves a sum over $d^2$ terms and takes $O(d^2)$ time." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "An alternative way of computing the dot product $\\phi(x)^\\top \\phi(z)$ is to instead compute $(x^\\top z)^2$. One can check that this has the same result:\n", "\\begin{align*}\n", "(x^\\top z)^2 & = (\\sum_{i=1}^d x_i z_i)^2 \\\\\n", "& = (\\sum_{i=1}^d x_i z_i) \\cdot (\\sum_{j=1}^d x_j z_j) \\\\\n", "& = \\sum_{i=1}^d \\sum_{j=1}^d x_i z_i x_j z_j \\\\\n", "& = \\phi(x)^\\top \\phi(z)\n", "\\end{align*}\n", "\n", "However, computing $(x^\\top z)^2$ can be done in only $O(d)$ time! 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "This is a very powerful idea:\n", "* We can compute the dot product between $O(d^2)$ features in only $O(d)$ time.\n", "* We can use high-dimensional features within ML algorithms that only rely on dot products (like kernelized ridge regression) without incurring extra costs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Kernel Trick: Polynomial Features\n", "\n", "The number of polynomial features $\\phi_p$ of degree $p$ when $x \\in \\mathbb{R}^d$ \n", "\n", "$$\\phi_p(x)_{i_1, i_2, \\ldots, i_p} = x_{i_1} x_{i_2} \\cdots x_{i_p} \\;\\text{ for } i_1, i_2, \\ldots, i_p \\in \\{1,2,\\ldots,d\\}$$\n", "\n", "scales as $O(d^p)$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, we can compute the dot product $\\phi_p(x)^\\top \\phi_p(z)$ in this feature space in only $O(d)$ time for any $p$ as:\n", "$$\\phi_p(x)^\\top \\phi_p(z) = (x^\\top z)^p.$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Algorithm: Kernelized Polynomial Ridge Regression\n", "\n", "* __Type__: Supervised learning (Regression)\n", "* __Model family__: Polynomials.\n", "* __Objective function__: L2-regularized mean squared error (ridge regression).\n", "* __Optimizer__: Normal equations (dual form).\n", "* __Probabilistic interpretation__: No simple interpretation!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Kernel Trick: General Idea\n", "\n", "Many types of features $\\phi(x)$ have the property that their dot product $\\phi(x)^\\top \\phi(z)$ can be computed more efficiently than if we had to form these features explicitly." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Also, we will see that many algorithms in machine learning can be written down as optimization problems in which the features $\\phi(x)$ only appear as dot products $\\phi(x)^\\top \\phi(z)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The *Kernel Trick* means that we can use complex non-linear features within these algorithms with little additional computational cost." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Examples of algorithms in which we can use the Kernel Trick:\n", "* Supervised learning algorithms: linear regression, logistic regression, support vector machines, etc.\n", "* Unsupervised learning algorithms: PCA, density estimation.\n", "\n", "We will look at more examples shortly." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "# Part 3: The Kernel Trick in SVMs\n", "\n", "Many ML algorithms can be written down as optimization problems in which the features $\\phi(x)$ only appear as dot products $\\phi(x)^\\top \\phi(z)$ that can be computed efficiently.\n", "\n", "We will now see how SVMs can benefit from the Kernel Trick as well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Binary Classification\n", "\n", "Consider a training dataset $\\mathcal{D} = \\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \\ldots, (x^{(n)}, y^{(n)})\\}$.\n", "\n", "We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$. \n", "\n", "1. __Regression__: The target variable $y \\in \\mathcal{Y}$ is continuous: $\\mathcal{Y} \\subseteq \\mathbb{R}$.\n", "2. 
__Binary Classification__: The target variable $y$ is discrete and takes on one of $K=2$ possible values.\n", "\n", "In this lecture, we assume $\\mathcal{Y} = \\{-1, +1\\}$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: SVM Model Family\n", "\n", "We will consider models of the form\n", "\n", "\\begin{align*}\n", "f_\\theta(x) = \\theta^\\top \\phi(x) + \\theta_0\n", "\\end{align*}\n", "\n", "where $x$ is the input and $y \\in \\{-1, 1\\}$ is the target. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Primal and Dual Formulations\n", "\n", "Recall that the max-margin hyperplane can be formulated as the solution to the following *primal* optimization problem.\n", "\\begin{align*}\n", "\\min_{\\theta,\\theta_0, \\xi}\\; & \\frac{1}{2}||\\theta||^2 + C \\sum_{i=1}^n \\xi_i \\; \\\\\n", "\\text{subject to } \\; & y^{(i)}((x^{(i)})^\\top\\theta+\\theta_0)\\geq 1 - \\xi_i \\; \\text{for all $i$} \\\\\n", "& \\xi_i \\geq 0\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The solution to this problem also happens to be given by the following *dual* problem:\n", "\\begin{align*}\n", "\\max_{\\lambda} & \\sum_{i=1}^n \\lambda_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{k=1}^n \\lambda_i \\lambda_k y^{(i)} y^{(k)} (x^{(i)})^\\top x^{(k)} \\\\\n", "\\text{subject to } \\; & \\sum_{i=1}^n \\lambda_i y^{(i)} = 0 \\\\\n", "& C \\geq \\lambda_i \\geq 0 \\; \\text{for all $i$}\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Primal Solution\n", "\n", "We can obtain a primal solution from the dual via the following equation:\n", "$$\n", "\\theta^* = \\sum_{i=1}^n \\lambda_i^* y^{(i)} \\phi(x^{(i)}).\n", "$$\n", "\n", "Ignoring the $\\theta_0$ term for now, the score at a new point $x'$ will equal\n", 
"$$\n", "\\theta^\\top \\phi(x') = \\sum_{i=1}^n \\lambda_i^* y^{(i)} \\phi(x^{(i)})^\\top \\phi(x').\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Kernel Trick in SVMs\n", "\n", "Notice that in both equations, the features $\\phi(x)$ are never used directly. Only their *dot product* is used.\n", "\\begin{align*}\n", "\\sum_{i=1}^n \\lambda_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{k=1}^n \\lambda_i \\lambda_k y^{(i)} y^{(k)} \\phi(x^{(i)})^\\top \\phi(x^{(k)}) \\\\\n", "\\theta^\\top \\phi(x') = \\sum_{i=1}^n \\lambda_i^* y^{(i)} \\phi(x^{(i)})^\\top \\phi(x').\n", "\\end{align*}\n", "\n", "If we can compute the dot product efficiently, we can potentially use very complex features." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The Kernel Trick in SVMs\n", "\n", "More generally, given features $\\phi(x)$, suppose that we have a function $K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ that outputs dot products between the features of inputs in $\\mathcal{X}$:\n", "\n", "$$K(x, z) = \\phi(x)^\\top \\phi(z).$$\n", "\n", "We will call $K$ the *kernel* function." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Recall that an example of a useful kernel function is\n", "$$K(x,z) = (x \\cdot z)^p$$\n", "because it computes the dot product of polynomial features of degree $p$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Then notice that we can rewrite the dual of the SVM as\n", "\\begin{align*}\n", "\\max_{\\lambda} & \\sum_{i=1}^n \\lambda_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{k=1}^n \\lambda_i \\lambda_k y^{(i)} y^{(k)} K(x^{(i)}, x^{(k)}) \\\\\n", "\\text{subject to } \\; & \\sum_{i=1}^n \\lambda_i y^{(i)} = 0 \\\\\n", "& C \\geq \\lambda_i \\geq 0 \\; \\text{for all $i$}\n", "\\end{align*}\n", "and predictions at a new point $x'$ are given by $\\sum_{i=1}^n \\lambda_i^* y^{(i)} K(x^{(i)}, x').$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Using our earlier trick, we can use polynomial features of any degree $p$ in SVMs without forming these features and at little extra cost!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Algorithm: Kernelized Support Vector Machine Classification (Dual Form)\n", "\n", "* __Type__: Supervised learning (binary classification)\n", "* __Model family__: Non-linear decision boundaries.\n", "* __Objective function__: Dual of SVM optimization problem.\n", "* __Optimizer__: Sequential minimal optimization.\n", "* __Probabilistic interpretation__: No simple interpretation!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "# Part 4: Types of Kernels\n", "\n", "Now that we have seen the kernel trick, let's look at several examples of kernels." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Linear Model Family\n", "\n", "We will consider models of the form\n", "\n", "\\begin{align*}\n", "f_\\theta(x) = \\theta^\\top \\phi(x) + \\theta_0\n", "\\end{align*}\n", "\n", "where $x$ is the input and $y$ is the target. 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Kernel Trick for Ridge Regression\n", "\n", "The normal equations provide a closed-form solution for $\\theta$:\n", "\n", "$$\\theta = (\\Phi^\\top \\Phi + \\lambda I)^{-1} \\Phi^\\top y.$$\n", "\n", "They also can be written in this form:\n", "\n", "$$\\theta = \\Phi^\\top (\\Phi \\Phi^\\top + \\lambda I)^{-1} y.$$\n", "\n", "The first approach takes $O(p^3)$ time; the second is $O(n^3)$ and is faster when $p > n$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "An interesting corollary is that the optimal $\\theta$ is a linear combination of the $n$ training set features:\n", "$$\\theta = \\sum_{i=1}^n \\alpha_i \\phi(x^{(i)}).$$\n", "We can compute a prediction $\\phi(x')^\\top \\theta$ for $x'$ without ever using the features (only their dot products):\n", "$$\\phi(x')^\\top \\theta = \\sum_{i=1}^n \\alpha_i \\phi(x')^\\top \\phi(x^{(i)}).$$\n", "Equally importantly, we can learn $\\theta$ from only dot products." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Kernel Trick in SVMs\n", "\n", "Notice that in both equations, the features $\\phi(x)$ are never used directly. Only their *dot product* is used.\n", "\\begin{align*}\n", "\\sum_{i=1}^n \\lambda_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{k=1}^n \\lambda_i \\lambda_k y^{(i)} y^{(k)} \\phi(x^{(i)})^\\top \\phi(x^{(k)}) \\\\\n", "\\theta^\\top \\phi(x') = \\sum_{i=1}^n \\lambda_i^* y^{(i)} \\phi(x^{(i)})^\\top \\phi(x').\n", "\\end{align*}\n", "\n", "If we can compute the dot product efficiently, we can potentially use very complex features." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Definition: Kernels\n", "\n", "The *kernel* corresponding to features $\\phi(x)$ is a function $K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ that outputs dot products between the features of inputs in $\\mathcal{X}$:\n", "$$K(x, z) = \\phi(x)^\\top \\phi(z).$$\n", "\n", "We will also consider general functions $K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ and call these *kernel functions*." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Kernels have various interpretations:\n", "* The dot product or geometrical angle between $x$ and $z$\n", "* A notion of similarity between $x$ and $z$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In order to illustrate kernels, we will use this dataset." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(-3.0, 3.0)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "