Lecture 3: Optimization and Linear Regression¶

Applied Machine Learning¶

Volodymyr Kuleshov
Cornell Tech

Part 1: Optimization and Calculus Background¶

In the previous lecture, we learned what a supervised machine learning problem is.

Before we turn our attention to Linear Regression, we will first dive deeper into the question of optimization.

Review: Components of A Supervised Machine Learning Problem¶

At a high level, a supervised machine learning problem has the following structure:

$$\text{Dataset} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model}$$

The predictive model is chosen to model the relationship between inputs and targets. For instance, it can predict future targets.

Optimizer: Notation¶

At a high level, an optimizer takes

• an objective $J$ (also called a loss function) and
• a model class $\mathcal{M}$

and finds a model $f \in \mathcal{M}$ with the smallest value of the objective $J$:
\begin{align*} \min_{f \in \mathcal{M}} J(f) \end{align*}

Intuitively, this is the function that best "fits" the data in the training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1,2,...,n\}$.

We will use a quadratic function as our running example for an objective $J$.

We can visualize it.
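For instance, here is a minimal numpy sketch (the particular quadratic below is our own illustrative choice, not fixed by the lecture):

import numpy as np
import matplotlib.pyplot as plt

def quadratic_objective(theta):
    # An illustrative quadratic objective with its minimum at theta = 1.
    return 0.5 * (2 * theta - 2) ** 2

thetas = np.linspace(-2, 4, 100)
plt.plot(thetas, quadratic_objective(thetas))
plt.xlabel('theta')
plt.ylabel('J(theta)')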

Calculus Review: Derivatives¶

Recall that the derivative $$\frac{d f(\theta_0)}{d \theta}$$ of a univariate function $f : \mathbb{R} \to \mathbb{R}$ is the instantaneous rate of change of the function $f(\theta)$ with respect to its parameter $\theta$ at the point $\theta_0$.

Calculus Review: Partial Derivatives¶

The partial derivative $$\frac{\partial f(\theta_0)}{\partial \theta_j}$$ of a multivariate function $f : \mathbb{R}^d \to \mathbb{R}$ is the derivative of $f$ with respect to $\theta_j$ while all other inputs $\theta_k$ for $k\neq j$ are held fixed.

The gradient $\nabla_\theta f$ further extends the derivative to multivariate functions $f : \mathbb{R}^d \to \mathbb{R}$, and is defined at a point $\theta_0$ as

$$\nabla_\theta f (\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

The $j$-th entry of the vector $\nabla_\theta f (\theta_0)$ is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of $f$ with respect to the $j$-th component of $\theta$.

We will continue using the quadratic function above as our running example. Let's write down its derivative and visualize it alongside the function.
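Continuing the sketch above, the derivative follows from the chain rule, and we can plot it next to the function:

def quadratic_derivative(theta):
    # By the chain rule, d/dtheta [0.5 * (2 theta - 2)^2] = 2 * (2 theta - 2).
    return 2 * (2 * theta - 2)

# Reuse the thetas grid from the previous sketch.
plt.plot(thetas, quadratic_objective(thetas), label='function')
plt.plot(thetas, quadratic_derivative(thetas), label='derivative')
plt.legend()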

Next, we will use gradients to define an important algorithm called gradient descent.


Gradient descent is a very common optimization algorithm used in machine learning.

The intuition behind gradient descent is to repeatedly compute the gradient to determine the direction in which the function decreases most steeply (the direction opposite the gradient) and take a step in that direction.

More formally, if we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update until $\theta$ is no longer changing: $$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)


In the above algorithm, we stop when $||\theta_i - \theta_{i-1}||$ is small.

It's easy to implement this algorithm in numpy.

We can now visualize gradient descent.
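Here is one possible numpy implementation, reusing quadratic_derivative from the sketch above (the step size, threshold, and helper names are our illustrative choices):

def gradient_descent(gradient, step_size=0.1, convergence_threshold=1e-6):
    # Start theta_prev at infinity so the loop body executes at least once.
    theta, theta_prev = np.random.randn(), np.inf
    history = [theta]  # record the iterates so we can visualize the descent
    while np.abs(theta - theta_prev) > convergence_threshold:
        theta_prev = theta
        theta = theta_prev - step_size * gradient(theta_prev)
        history.append(theta)
    return theta, history

theta_opt, history = gradient_descent(quadratic_derivative)
# Plot the sequence of iterates on top of the objective.
plt.plot(history, [quadratic_objective(t) for t in history], 'o-')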

Part 2: Gradient Descent in Linear Models¶

Let's now use gradient descent to derive a supervised learning algorithm for linear models.

If we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update: $$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)


Review: Linear Model Family¶

Recall that a linear model has the form \begin{align*} y & = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + ... + \theta_d \cdot x_d \end{align*} where $x \in \mathbb{R}^d$ is a vector of features and $y$ is the target. The $\theta_j$ are the parameters of the model.

By using the notation $x_0 = 1$, we can represent the model in a vectorized form $$f_\theta(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x.$$

Let's define our model in Python.
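A minimal sketch, assuming each data point is stored as a row of a matrix X whose first column is the constant feature $x_0 = 1$:

def f(X, theta):
    # Each row of X is one input x (with x_0 = 1 prepended),
    # so X @ theta computes theta^T x for every data point at once.
    return X @ theta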

An Objective: Mean Squared Error¶

We pick $\theta$ to minimize the mean squared error (MSE). Slight variants of this objective are also known as the residual sum of squares (RSS) or the sum of squared residuals (SSR). $$J(\theta)= \frac{1}{2n} \sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2$$ In other words, we are looking for the best compromise in $\theta$ over all the data points.

Let's implement mean squared error.
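One possible numpy implementation, reusing the model f defined above:

def mean_squared_error(theta, X, y):
    # J(theta) = 1/(2n) * sum_i (y_i - theta^T x_i)^2
    return 0.5 * np.mean((y - f(X, theta)) ** 2)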

Mean Squared Error: Partial Derivatives¶

Let's work out the partial derivatives of the MSE loss for a linear model. For simplicity, we differentiate the term corresponding to a single data point $(x, y)$.

\begin{align*} \frac{\partial J(\theta)}{\partial \theta_j} & = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( f_\theta(x) - y \right)^2 \\ & = \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( f_\theta(x) - y \right) \\ & = \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^d \theta_k \cdot x_k - y \right) \\ & = \left( f_\theta(x) - y \right) \cdot x_j \end{align*}

We can use this derivation to obtain an expression for the gradient of the MSE for a linear model:
\begin{align*} \nabla_\theta J (\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix} = \begin{bmatrix} \left( f_\theta(x) - y \right) \cdot x_1 \\ \left( f_\theta(x) - y \right) \cdot x_2 \\ \vdots \\ \left( f_\theta(x) - y \right) \cdot x_d \end{bmatrix} = \left( f_\theta(x) - y \right) \cdot \mathbf{x}. \end{align*}
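Averaging this per-example gradient over the dataset (to match the $\frac{1}{2n}$ scaling in $J$), a vectorized numpy sketch, reusing f from earlier, is:

def mse_gradient(theta, X, y):
    # Average of (f_theta(x) - y) * x over all rows of X,
    # i.e. (1/n) * X^T (X theta - y).
    return X.T @ (f(X, theta) - y) / X.shape[0]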

The UCI Diabetes Dataset¶

In this section, we are going to again use the UCI Diabetes Dataset.

• For each patient, we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0 to 300).
• We are interested in understanding how BMI affects an individual's diabetes risk.
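The lecture does not fix a particular loading method; one option is the copy of this dataset that ships with scikit-learn (note that its features come pre-standardized, so the BMI column below is on a normalized scale):

from sklearn import datasets

# Load the UCI diabetes data: ten physiological features per patient.
X_full, y = datasets.load_diabetes(return_X_y=True)
bmi = X_full[:, 2]  # column 2 holds the (standardized) BMI feature

plt.scatter(bmi, y)
plt.xlabel('BMI')
plt.ylabel('Diabetes risk score')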

Putting this together with the gradient descent algorithm, we obtain a learning method for training linear models.

theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * (f(x, theta)-y) * x


This update rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff learning rule.
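As one concrete instantiation, here is a sketch of batch gradient descent on the BMI feature loaded above, reusing f and mse_gradient from the earlier sketches (the step size and iteration count are our illustrative choices, not the lecture's settings):

# Design matrix for a one-feature model: an all-ones intercept column and BMI.
X = np.stack([np.ones_like(bmi), bmi], axis=1)

theta = np.zeros(2)
step_size, n_iters = 1.0, 1000
for _ in range(n_iters):
    theta = theta - step_size * mse_gradient(theta, X, y)

# Plot the fitted regression line over the data.
order = np.argsort(bmi)
plt.scatter(bmi, y)
plt.plot(bmi[order], f(X, theta)[order], color='red')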

Part 3: Ordinary Least Squares¶

In practice, there is a more effective way than gradient descent to find linear model parameters.

We will see this method here, which will lead to our first non-toy algorithm: Ordinary Least Squares.

Review: The Gradient¶

The gradient $\nabla_\theta f$ further extends the derivative to multivariate functions $f : \mathbb{R}^d \to \mathbb{R}$, and is defined at a point $\theta_0$ as

$$\nabla_\theta f (\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

In other words, the $j$-th entry of the vector $\nabla_\theta f (\theta_0)$ is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of $f$ with respect to the $j$-th component of $\theta$.

The UCI Diabetes Dataset¶

In this section, we are going to again use the UCI Diabetes Dataset.

• For each patient, we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0 to 300).
• We are interested in understanding how BMI affects an individual's diabetes risk.

Notation: Design Matrix¶

Machine learning algorithms are most easily defined in the language of linear algebra. Therefore, it will be useful to represent the entire dataset as one matrix $X \in \mathbb{R}^{n \times d}$, of the form: $$X = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_d \\ x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_d \\ \vdots & \vdots & \ddots & \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_d \end{bmatrix},$$ whose $i$-th row is the transposed feature vector $(x^{(i)})^\top$.
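For instance, a sketch reusing the scikit-learn diabetes features loaded earlier (the all-ones column is only needed when the model includes an intercept $\theta_0$):

n, d = X_full.shape  # rows index patients, columns index features
# Prepend a column of ones so that x_0 = 1 for every data point.
X_design = np.hstack([np.ones((n, 1)), X_full])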