Lecture 14: Neural Networks

Applied Machine Learning

Volodymyr Kuleshov, Jin Sun
Cornell Tech

Part 1: An Artifical Neuron

In this lecture, we will learn about a new class of machine learning algorithms inspired by the brain.

We will start by defining a few building blocks for these algorithms, and draw connections to neuroscience.

Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Attributes + Features} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

Review: Binary Classification

In supervised learning, we fit a model of the form $$ f : \mathcal{X} \to \mathcal{Y} $$ that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

In classification, the space of targets $\mathcal{Y}$ is discrete. Classification is binary if $\mathcal{Y} = \{0,1\}$

Review: Logistic Regression

Logistic regression fits a model of the form $$ f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$ where $$ \sigma(z) = \frac{1}{1 + \exp(-z)} $$ is known as the sigmoid or logistic function.

A Biological Neuron

In order to define an artifical neuron, let's look first at a biological one.

An Artificial Neuron: Example

We can imitate this machinery using an idealized artifical neuron.

An Artificial Neuron: Notation

More formally, we say that a neuron is a model $f : \mathbb{R}^d \to [0,1]$, with the following components:

Perceptron

If we use a step function as the activation function, we obtain the classic Perceptron model:

$$ f(x) = \begin{cases} 1 & \text{if $\theta^\top x>0$}, \\ 0 & \text{otherwise} \end{cases} $$

This models a neuron that fires if the inputs are sufficiently large, and doesn't otherwise.

We can visualize the activation function of the Perceptron.

Logistic Regression as an Artifical Neuron

Logistic regression is a model of the form $$ f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$ that can be interpreted as a neuron that uses the sigmoid as the activation function.

The sigmoid activation function encodes the idea of a neuron firing if the inputs exceed a threshold, makes make the activation function "smooth".

Activation Functions

There are many other activation functions that can be used. In practice, these two work better than the sigmoid:

We can easily visualize these.

Classification Dataset: Iris Flowers

To demonstrate classification algorithms, we are going to use the Iris flower dataset.

We are going to define an artificial neuron for the binary classification problem (class-0 vs the rest).

This is a visualization of the dataset.

Below, we define neuron with a sigmoid activation function (and its gradient).

We can optimize is using gradient descent.

This neuron learns a linear decision boundary that separates the data.

Algorithm: Artificial Neuron

Part 2: Artificial Neural Networks

Let's now see how we can connect neurons into networks that form complex models that further mimic the brain.

Review: Artificial Neuron

We say that a neuron is a model $f : \mathbb{R}^d \to [0,1]$, with the following components:

Review: Logistic Regression as Neuron

Logistic regression is a model of the form $$ f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}, $$ that can be interpreted as a neuron that uses the sigmoid as the activation function.

Neural Networks: Intuition

A neural network is a directed graph in which a node is a neuron that takes as input the outputs of the neurons that are connected to it.

Networks are typically organized in layers.

Neural Networks: Layers

A neural network layer is a model $f : \mathbb{R}^d \to \mathbb{R}^p$ that applies $p$ neurons in parallel to an input $x$. $$ f(x) = \begin{bmatrix} \sigma(w_1^\top x) \\ \sigma(w_2^\top x) \\ \vdots \\ \sigma(w_p^\top x) \end{bmatrix}. $$ where each $w_k$ is the vector of weights for the $k$-th neuron. We refer to $p$ as the size of the layer.

The first output of the layer is a neuron with weights $w_1$:

The second neuron has weights $w_2$:

The third neuron has weights $w_3$:


The parameters of the layer are $w_1, w_2, w_3$.

By combining the $w_k$ into one matrix $W$, we can write in a more succinct vectorized form: $$f(x) = \sigma(W\cdot x) = \begin{bmatrix} \sigma(w_1^\top x) \\ \sigma(w_2^\top x) \\ \vdots \\ \sigma(w_p^\top x) \end{bmatrix}, $$ where $\sigma(W\cdot x)_k = \sigma(w_k^\top x)$ and $W_{kj} = (w_k)_j$.

Visually, we can represent this as follows:

Neural Networks: Notation

A neural network is a model $f : \mathbb{R}^d \to \mathbb{R}$ that consists of a composition of $L$ neural network layers: $$ f(x) = f_L \circ f_{L-1} \circ \ldots f_l \circ \ldots f_1 (x). $$ The final layer $f_L$ has size one (assuming the neural net has one ouput); intermediary layers $f_l$ can have any number of neurons.

The notation $f \circ g(x)$ denotes the composition $f(g(x))$ of functions.

We can visualize this graphically as follows.

Example of a Neural Network

Let's implement a small two layer neural net with 3 hidden units.

This implementation looks as follows.

Later in this lecture, we will see how to train this model using gradient descent.

Types of Neural Network Layers

There are many types of neural network layers that can exist. Here are a few:

Algorithm: (Fully-Connected) Neural Network

Pros and Cons of Neural Nets

Neural networks are very powerful models.

They also have important drawbacks.

Part 3: Backpropagation

We have defined what is an artificial neural network.

Let's now see how we can train it so that it performs well on given tasks.

Review: Neural Network Layers

A neural network layer is a model $f : \mathbb{R}^d \to \mathbb{R}^p$ that applies $p$ neurons in parallel to an input $x$. $$f(x) = \sigma(W\cdot x) = \begin{bmatrix} \sigma(w_1^\top x) \\ \sigma(w_2^\top x) \\ \vdots \\ \sigma(w_p^\top x) \end{bmatrix}, $$ where each $w_k$ is the vector of weights for the $k$-th neuron and $W_{kj} = (w_k)_j$. We refer to $p$ as the size of the layer.

Review: Neural Networks

A neural network is a model $f : \mathbb{R} \to \mathbb{R}$ that consists of a composition of $L$ neural network layers: $$ f(x) = f_L \circ f_{L-1} \circ \ldots f_1 (x). $$ The final layer $f_L$ has size one (assuming the neural net has one ouput); intermediary layers $f_l$ can have any number of neurons.

The notation $f \circ g(x)$ denotes the composition $f(g(x))$ of functions

We can visualize this graphically as follows.

Review: The Gradient

The gradient $\nabla_\theta f$ further extends the derivative to multivariate functions $f : \mathbb{R}^d \to \mathbb{R}$, and is defined at a point $\theta$ as $$ \nabla_\theta f (\theta) = \begin{bmatrix} \frac{\partial f(\theta)}{\partial \theta_1} \\ \frac{\partial f(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta)}{\partial \theta_d} \end{bmatrix}.$$ In other words, the $j$-th entry of the vector $\nabla_\theta f (\theta)$ is the partial derivative $\frac{\partial f(\theta)}{\partial \theta_j}$ of $f$ with respect to the $j$-th component of $\theta$.

Review: Gradient Descent

If we want to optimize an objective $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update until the function is no longer decreasing: $$ \theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}). $$

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)

Backpropagation

To apply gradient descent, we need to calculate gradients for every parameter in a neural network model $f$: $$\frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \cdots, \frac{\partial J}{\partial \theta_d}$$

It might be possible to do it manually when the network is small. But it is nearly impossible and very much error-prone to compute gradients for larger networks.

Backpropagation is a way of calculating gradients efficiently for neural network models with arbitrary number of layers and neurons.

The core idea of it is something we are actually very familiar with: the chain rule.

Review: Chain Rule of Calculus

If we have two differentiable functions $f(x)$ and $g(x)$, and $$F(x) = f \circ g (x)$$ then the derivative of $F(x)$ is: $$ F^\prime (x) = f^\prime (g(x)) \cdot g^\prime (x).$$

Let $y=f(u)$ and $u=g(x)$, we also have: $$ \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$

Chain Rule in Neural Nets

A neural network is a model $f : \mathbb{R} \to \mathbb{R}$ that consists of a composition of $L$ neural network layers: $$ f(x) = f_L \circ f_{L-1} \circ \ldots f_1 (x). $$

Let $y_l$ denote the output $f_l \circ f_{-1} \circ f_1(x)$ of layer $l$.

The chain rule tells us to compute $\frac{\partial J}{\partial \theta_l}$ for all parameters $\theta_l$ in layer $l$. We can break the computation down as: $$ \frac{\partial J}{\partial \theta_l} = \frac{\partial J}{\partial y_L} \frac{\partial y_L}{\partial y_{L-1}} \cdots \frac{\partial y_{l+1}}{\partial y_l} \frac{\partial y_l}{\partial \theta_l}, $$ where $y_L, y_{L-1} \cdots y_l$ are the outputs from each layer.

Note that the computation of $\frac{\partial J}{\partial y_l}$ can be re-used for computing gradients for all $\theta$ in layers before $l$.

This is the key idea of backpropagation: local gradients computation for each layer can be 'chained' to obtain gradients.

Illustration:

Let's start with a single layer $y = f(x;\theta)$:

With the output $y$, target label $\hat{y}$, and loss function $J$, we can compute the loss (error) of the prediction.

The backpropagation starts from the output layer and moves backwards.

We first need to compute the gradients of the loss to the output.

After we have those, then using the chain rule, we can compute the gradients with respect to the network parameters $\theta$.

We can keep working upstream and compute gradients to the input. After that we finish the backpropagation in this layer.

As you can see, this operation can be easily extended to more layers between the input and output.

We can apply this process recursively to obtain derivatives for any number of layers.

Backprogragation by Hand

Let's work out by hand what backpropagation would do on our two layer neural network.

For our two layer fully connected network with sigmoid activation, the network is composed of following functions:

$$\mathbf{h} = \sigma(\mathbf{W}^T \mathbf{x})$$$$y = \sigma(\mathbf{V}^T \mathbf{h}),$$

where $\mathbf{x} = [x_1,x_2]^T, \mathbf{h} = [h_1,h_2,h_3]^T, \mathbf{W} \in \mathbb{R}^{2\times3}, \mathbf{V} \in \mathbb{R}^{3\times1}$, and $\sigma$ is the sigmoid function.

In our example, we have the following values:

$\mathbf{x} = [5.0,3.0]^T,~~~~\hat{y} = 1$ means it is positive class.

$\mathbf{W} = \begin{bmatrix} 1.0 & -1.0 & 3.0\\ 2.0 & 2.0 & -1.0 \end{bmatrix}$

$\mathbf{V} = [0.1,0.5,-0.1]^T$

We can compute the output of the hidden layer, $\mathbf{h}$:

$h_1 = \sigma (W_{11} \cdot x_1 + W_{21} \cdot x_2) = \sigma (1.0\times5.0 + 2.0\times3.0) = 0.99998329857$

We can compute the output of the hidden layer, $\mathbf{h}$: \begin{align*} h_1 &= \sigma (W_{11} \cdot x_1 + W_{21} \cdot x_2) = \sigma (1.0\times5.0 + 2.0\times3.0) = 0.9999 \\ h_2 &= \sigma (W_{12} \cdot x_1 + W_{22} \cdot x_2) = \sigma (-1.0\times5.0 + 2.0\times3.0) = 0.7310 \end{align*}


We can compute the output of the hidden layer, $\mathbf{h}$: \begin{align*} h_1 &= \sigma (W_{11} \cdot x_1 + W_{21} \cdot x_2) = \sigma (1.0\times5.0 + 2.0\times3.0) = 0.9999 \\ h_2 &= \sigma (W_{12} \cdot x_1 + W_{22} \cdot x_2) = \sigma (-1.0\times5.0 + 2.0\times3.0) = 0.7310 \\ h_3 &= \sigma (W_{13} \cdot x_1 + W_{23} \cdot x_2) = \sigma (3.0\times5.0 + -1.0\times3.0) = 0.9999 \end{align*}

Similarly we can get the output of $y$:

$y = \sigma (V_1 \cdot h_1 + V_2 \cdot h_2 + V_3 \cdot h_3) = 0.590378$

Next, we compute the binary cross entropy loss with the target label $1$: $J(y, \hat{y}) = - \log (y) = 0.52699$

We can also compute the gradient (shown in red): $\frac{\mathrm{d}J}{\mathrm{d}{y}} = - 1/y = -1.69383$

We are now ready to kick start the backpropagation steps.

We are going to compute the gradients to the weights in the output layer: $$\frac{\mathrm{d}J}{\mathrm{d}{V}} = \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\mathrm{d}y}{\mathrm{d}{V}}$$ Recall $y = \sigma(\mathbf{V}^T \mathbf{h}) = \sigma (V_1 \cdot h_1 + V_2 \cdot h_2 + V_3 \cdot h_3)$ and $\sigma' = \sigma(1-\sigma)$: $$\frac{\partial y}{\partial V_1} = y (1-y) h_1,~~\frac{\partial y}{\partial V_2} = y (1-y) h_2,~~\frac{\partial y}{\partial V_3} = y (1-y) h_3$$

Applying these formulas, we obtain the gradients of $\mathbf{V}$: \begin{align*} \frac{\partial J}{\partial V_1} &= \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\partial y}{\partial V_1} = -1.69 \times 0.59 \times (1-0.59) \times 0.99998 = -0.41\\ \frac{\partial J}{\partial V_2} &= \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\partial y}{\partial V_2} = -1.69 \times 0.59 \times (1- 0.59) \times 0.7311 = -0.30\\ \frac{\partial J}{\partial V_3} &= \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\partial y}{\partial V_3} = -1.69 \times 0.59 \times (1-0.59) \times 0.99999 = -0.41 \end{align*}

Next, we want to compute gradients at the hidden layer:

$$\frac{\mathrm{d}J}{\mathrm{d}{h}} = \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\mathrm{d}y}{\mathrm{d}{h}}$$

Similarly to the previous slide: \begin{align*} \frac{\partial J}{\partial h_1} &= \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\partial y}{\partial h_1} = -1.69 \times 0.59 \times (1-0.59) \times 0.1 = -0.04096\\ \frac{\partial J}{\partial h_2} &= \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\partial y}{\partial h_2} = -1.69 \times 0.59 \times (1- 0.59) \times 0.5 = -0.2048\\ \frac{\partial J}{\partial h_3} &= \frac{\mathrm{d}J}{\mathrm{d}{y}} \frac{\partial y}{\partial h_3} = -1.69 \times 0.59 \times (1-0.59) \times -0.1 = 0.04096 \end{align*}

Since we have another linear layer with sigmoid activation, the way we compute gradients will be the same as in the output layer. \begin{align*} \frac{\partial J}{\partial h_1} \frac{\partial h_1}{\partial W_{11}} &= -0.041 \times 0.99998 \times (1-0.99998) \times 5 = -3\times10^{-6} \\ \frac{\mathrm{d}J}{\mathrm{d}{h_1}} \frac{\partial h_1}{\partial W_{12}} &= -0.041 \times 0.99998 \times (1- 0.99998) \times 3 = -2\times10^{-6} \end{align*}

We can compute the rest the same way.

Note the gradients to the weights connecting to $h_2$ are larger in magnitude than others.

And now we have the gradients to all the learnable weights in this two layer network and we can tune the weights by gradient descenet.

The gradients tell us how much to change for each weight so that the loss will become smaller.

Now let's implement backprop with the simple neural network model we defined earlier.

We start by implementing the building block of our network: a linear layer with sigmoid activation.

Then we can stack the single layers to construct a two layer network.

We can run with our previous example to check if the results are consistent with our manual computation.

Another sanity check is to perform gradient descent on the single sample input and see if we can achieve close to zero loss.

You can try to change the target label below to see the network is able to adapt in either case.

Summary