Lecture 3: Optimization and Linear Regression¶

Applied Machine Learning¶

Volodymyr Kuleshov
Cornell Tech

Part 1: Optimization and Calculus Background¶

In the previous lecture, we learned what a supervised machine learning problem is.

Before we turn our attention to Linear Regression, we will first dive deeper into the question of optimization.

Review: Components of A Supervised Machine Learning Problem¶

At a high level, a supervised machine learning problem has the following structure:

$$\text{Dataset} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model}$$

The predictive model is chosen to model the relationship between inputs and targets. For instance, it can predict future targets.

Optimizer: Notation¶

At a high level, an optimizer takes

• an objective $J$ (also called a loss function) and
• a model class $\mathcal{M}$

and finds a model $f \in \mathcal{M}$ with the smallest value of the objective $J$:
\begin{align*} \min_{f \in \mathcal{M}} J(f) \end{align*}

Intuitively, this is the function that best "fits" the data in the training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$.

We will use a quadratic function as our running example for an objective $J$.

We can visualize it.
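For instance, here is a minimal numpy sketch (the particular quadratic below is our own illustrative choice, not fixed by the lecture):

import numpy as np
import matplotlib.pyplot as plt

def quadratic_objective(theta):
    # An illustrative quadratic objective with its minimum at theta = 1.
    return 0.5 * (2 * theta - 2) ** 2

thetas = np.linspace(-2, 4, 100)
plt.plot(thetas, quadratic_objective(thetas))
plt.xlabel('theta')
plt.ylabel('J(theta)')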

Calculus Review: Derivatives¶

Recall that the derivative $$\frac{d f(\theta_0)}{d \theta}$$ of a univariate function $f : \mathbb{R} \to \mathbb{R}$ is the instantaneous rate of change of the function $f(\theta)$ with respect to its parameter $\theta$ at the point $\theta_0$.

Calculus Review: Partial Derivatives¶

The partial derivative $$\frac{\partial f(\theta_0)}{\partial \theta_j}$$ of a multivariate function $f : \mathbb{R}^d \to \mathbb{R}$ is the derivative of $f$ with respect to $\theta_j$ while all other inputs $\theta_k$ for $k\neq j$ are held fixed.

The gradient $\nabla_\theta f$ further extends the derivative to multivariate functions $f : \mathbb{R}^d \to \mathbb{R}$, and is defined at a point $\theta_0$ as

$$\nabla_\theta f (\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

The $j$-th entry of the vector $\nabla_\theta f (\theta_0)$ is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of $f$ with respect to the $j$-th component of $\theta$.

We will continue using the quadratic function above as our running example. Let's write down its derivative and visualize it alongside the function.
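Continuing the sketch above, the derivative follows from the chain rule, and we can plot it next to the function:

def quadratic_derivative(theta):
    # By the chain rule, d/dtheta [0.5 * (2 theta - 2)^2] = 2 * (2 theta - 2).
    return 2 * (2 * theta - 2)

# Reuse the thetas grid from the previous sketch.
plt.plot(thetas, quadratic_objective(thetas), label='function')
plt.plot(thetas, quadratic_derivative(thetas), label='derivative')
plt.legend()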

Next, we will use gradients to define an important algorithm called gradient descent.


Gradient descent is a very common optimization algorithm used in machine learning.

The intuition behind gradient descent is to repeatedly compute the gradient to determine the direction in which the function decreases most steeply (the direction opposite the gradient) and take a step in that direction.

More formally, if we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update until $\theta$ is no longer changing: $$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)


In the above algorithm, we stop when $||\theta_i - \theta_{i-1}||$ is small.

It's easy to implement this algorithm in numpy.

We can now visualize gradient descent.
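Here is one possible numpy implementation, reusing quadratic_derivative from the sketch above (the step size, threshold, and helper names are our illustrative choices):

def gradient_descent(gradient, step_size=0.1, convergence_threshold=1e-6):
    # Start theta_prev at infinity so the loop body executes at least once.
    theta, theta_prev = np.random.randn(), np.inf
    history = [theta]  # record the iterates so we can visualize the descent
    while np.abs(theta - theta_prev) > convergence_threshold:
        theta_prev = theta
        theta = theta_prev - step_size * gradient(theta_prev)
        history.append(theta)
    return theta, history

theta_opt, history = gradient_descent(quadratic_derivative)
# Plot the sequence of iterates on top of the objective.
plt.plot(history, [quadratic_objective(t) for t in history], 'o-')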

Part 2: Gradient Descent in Linear Models¶

Let's now use gradient descent to derive a supervised learning algorithm for linear models.

If we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update: $$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)


Review: Linear Model Family¶

Recall that a linear model has the form \begin{align*} y & = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + ... + \theta_d \cdot x_d \end{align*} where $x \in \mathbb{R}^d$ is a vector of features and $y$ is the target. The $\theta_j$ are the parameters of the model.

By using the notation $x_0 = 1$, we can represent the model in a vectorized form $$f_\theta(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x.$$

Let's define our model in Python.
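A minimal sketch, assuming each data point is stored as a row of a matrix X whose first column is the constant feature $x_0 = 1$:

def f(X, theta):
    # Each row of X is one input x (with x_0 = 1 prepended),
    # so X @ theta computes theta^T x for every data point at once.
    return X @ theta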

An Objective: Mean Squared Error¶

We pick $\theta$ to minimize the mean squared error (MSE). Slight variants of this objective are also known as the residual sum of squares (RSS) or the sum of squared residuals (SSR). $$J(\theta)= \frac{1}{2n} \sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2$$ In other words, we are looking for the best compromise in $\theta$ over all the data points.

Let's implement mean squared error.
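One possible numpy implementation, reusing the model f defined above:

def mean_squared_error(theta, X, y):
    # J(theta) = 1/(2n) * sum_i (y_i - theta^T x_i)^2
    return 0.5 * np.mean((y - f(X, theta)) ** 2)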

Mean Squared Error: Partial Derivatives¶

Let's work out the partial derivatives of the MSE loss for a linear model. For simplicity, we differentiate the term corresponding to a single data point $(x, y)$.

\begin{align*} \frac{\partial J(\theta)}{\partial \theta_j} & = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( f_\theta(x) - y \right)^2 \\ & = \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( f_\theta(x) - y \right) \\ & = \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^d \theta_k \cdot x_k - y \right) \\ & = \left( f_\theta(x) - y \right) \cdot x_j \end{align*}

We can use this derivation to obtain an expression for the gradient of the MSE for a linear model:
\begin{align*} \nabla_\theta J (\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix} = \begin{bmatrix} \left( f_\theta(x) - y \right) \cdot x_1 \\ \left( f_\theta(x) - y \right) \cdot x_2 \\ \vdots \\ \left( f_\theta(x) - y \right) \cdot x_d \end{bmatrix} = \left( f_\theta(x) - y \right) \cdot \mathbf{x}. \end{align*}
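Averaging this per-example gradient over the dataset (to match the $\frac{1}{2n}$ scaling in $J$), a vectorized numpy sketch, reusing f from earlier, is:

def mse_gradient(theta, X, y):
    # Average of (f_theta(x) - y) * x over all rows of X,
    # i.e. (1/n) * X^T (X theta - y).
    return X.T @ (f(X, theta) - y) / X.shape[0]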

The UCI Diabetes Dataset¶

In this section, we are going to again use the UCI Diabetes Dataset.

• For each patient, we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0 to 300).
• We are interested in understanding how BMI affects an individual's diabetes risk.
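The lecture does not fix a particular loading method; one option is the copy of this dataset that ships with scikit-learn (note that its features come pre-standardized, so the BMI column below is on a normalized scale):

from sklearn import datasets

# Load the UCI diabetes data: ten physiological features per patient.
X_full, y = datasets.load_diabetes(return_X_y=True)
bmi = X_full[:, 2]  # column 2 holds the (standardized) BMI feature

plt.scatter(bmi, y)
plt.xlabel('BMI')
plt.ylabel('Diabetes risk score')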

Putting this together with the gradient descent algorithm, we obtain a learning method for training linear models.

theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * (f(x, theta)-y) * x


This update rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff learning rule.
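As one concrete instantiation, here is a sketch of batch gradient descent on the BMI feature loaded above, reusing f and mse_gradient from the earlier sketches (the step size and iteration count are our illustrative choices, not the lecture's settings):

# Design matrix for a one-feature model: an all-ones intercept column and BMI.
X = np.stack([np.ones_like(bmi), bmi], axis=1)

theta = np.zeros(2)
step_size, n_iters = 1.0, 1000
for _ in range(n_iters):
    theta = theta - step_size * mse_gradient(theta, X, y)

# Plot the fitted regression line over the data.
order = np.argsort(bmi)
plt.scatter(bmi, y)
plt.plot(bmi[order], f(X, theta)[order], color='red')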

Part 3: Ordinary Least Squares¶

In practice, there is a more effective way than gradient descent to find linear model parameters.

We will see this method here, which will lead to our first non-toy algorithm: Ordinary Least Squares.

Review: The Gradient¶

The gradient $\nabla_\theta f$ further extends the derivative to multivariate functions $f : \mathbb{R}^d \to \mathbb{R}$, and is defined at a point $\theta_0$ as

$$\nabla_\theta f (\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

In other words, the $j$-th entry of the vector $\nabla_\theta f (\theta_0)$ is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of $f$ with respect to the $j$-th component of $\theta$.

The UCI Diabetes Dataset¶

In this section, we are going to again use the UCI Diabetes Dataset.

• For each patient, we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0 to 300).
• We are interested in understanding how BMI affects an individual's diabetes risk.

Notation: Design Matrix¶

Machine learning algorithms are most easily defined in the language of linear algebra. Therefore, it will be useful to represent the entire dataset as one matrix $X \in \mathbb{R}^{n \times d}$, of the form: $$X = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_d \\ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_d \\ \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_d \end{bmatrix},$$ whose $i$-th row is the transposed feature vector $(x^{(i)})^\top$.
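For instance, a sketch reusing the scikit-learn diabetes features loaded earlier (the all-ones column is only needed when the model includes an intercept $\theta_0$):

n, d = X_full.shape  # rows index patients, columns index features
# Prepend a column of ones so that x_0 = 1 for every data point.
X_design = np.hstack([np.ones((n, 1)), X_full])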