Lecture 15: Deep Learning

Applied Machine Learning

Volodymyr Kuleshov, Jin Sun
Cornell Tech

Part 1: What is Deep Learning?

Deep learning is a relatively new and powerful subfield of machine learning closely tied to neural networks.

Let's find out what deep learning is, and then we will see some deep learning algorithms.

Review: Neural Network Layers

A neural network layer is a model $f : \mathbb{R}^d \to \mathbb{R}^p$ that applies $p$ neurons in parallel to an input $x$. $$f(x) = \sigma(W\cdot x) = \begin{bmatrix} \sigma(w_1^\top x) \\ \sigma(w_2^\top x) \\ \vdots \\ \sigma(w_p^\top x) \end{bmatrix}, $$ where each $w_k$ is the vector of weights for the $k$-th neuron and $W_{kj} = (w_k)_j$. We refer to $p$ as the size of the layer.

Review: Neural Networks

A neural network is a model $f : \mathbb{R}^d \to \mathbb{R}$ that consists of a composition of $L$ neural network layers: $$ f(x) = f_L \circ f_{L-1} \circ \ldots \circ f_1 (x). $$ The final layer $f_L$ has size one (assuming the neural net has one output); intermediate layers $f_l$ can have any number of neurons.

The notation $f \circ g(x)$ denotes the composition $f(g(x))$ of functions.

We can visualize this graphically as follows.

What is Deep Learning?

In a nutshell, deep learning is a modern evolution of the field of artificial neural networks that emphasizes:

Expressivity of Deep Models

Why is deep learning powerful? One reason is that deep neural networks can represent complex models very compactly.

In practice, deep neural networks can learn very complex mappings such as $\text{image} \to \text{text description}$ that other algorithms cannot.

Representation Learning

How does a deep neural network use its representational power?

This is an example of representations that a deep neural network learns from data.

Scaling to Large Datasets

Deep learning models also scale to very large datasets.

Classical algorithms like linear regression saturate after a certain dataset size. Deep learning models keep improving as we add more data.

Computational Scalability

Deep learning models benefit from large datasets because they can easily exploit specialized computational hardware.

Connections to Neuroscience

As artificial neurons are designed to mimic biological ones, deep neural networks have strong connections to biological neural systems.

One of the most interesting connections is to the visual processing pathway in the brain.

There is ample evidence that deep neural networks perform computations very similar to those of the visual cortex (Kravitz et al.).

Successes of Deep Learning

Deep learning has been a real breakthrough in machine learning over the last decade.

One recent and impressive application is image generation.

These faces are not real! See: https://thispersondoesnotexist.com/

Pros and Cons of Deep Neural Networks

Deep neural nets (DNNs) are among the most powerful ML models.

However, DNNs can also be slow and hard to train, and they require a lot of data.

Try it out! https://transformer.huggingface.co/doc/gpt2-large

GPT-3 is more powerful and able to generate structured text such as HTML/CSS, tables, Python programs, food recipes and much more.

https://github.com/elyase/awesome-gpt3

Challenges of Deep Learning

Deep learning is powerful, but it can be challenging to apply it to real-world problems.

Next we are going to look into a particular type of deep neural networks that is widely used for static data such as images: Convolutional Neural Networks (CNN).

But first we need to review the foundational operations of CNNs: Convolutions and Pooling.

Part 2: Convolutions and Pooling

These basic operations are the building blocks of modern convolutional neural networks.

Review: What is Deep Learning?

In a nutshell, deep learning is a modern evolution of the field of artificial neural networks that emphasizes:

Definition of Convolution

Let $f \in \mathbb{R}^n$ and $g \in \mathbb{R}^m$ be two vectors, called the filter and the signal respectively. Typically, $n < m$.

In deep learning, a convolution $(f * g ) : \mathbb{Z} \to \mathbb{R}$ is typically defined as \begin{align} (f * g ) [p] &\triangleq \underbrace{\sum_{t=1}^{n} f[t] g[p+t]}_\text{dot product of $f$ with part of $g$} \end{align} where $g[t] = 0$ when $t \leq 0$ or $t>m$.

This is best understood via a picture:

The green sequence [1,0,-1] is the filter. The signal $g$ is in gray, and the outputs of the convolution $$(f * g)[p] = 1 \cdot g[p+1] + 0 \cdot g[p+2] - 1 \cdot g[p+3]$$ are in yellow.

On a small technical note, what we have defined is called the cross-correlation in mathematics.

The convolution is technically defined as $\sum_{t=1}^{n} f[t] g[p-t]$, but in deep learning both formulas effectively give the same results and the cross-correlation is used in practice (and is called "convolution").

We can implement a convolution in Python as follows.
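The code cell from the lecture is not reproduced here; below is a minimal numpy sketch consistent with the definition above (the function name `conv1d` is ours, not the lecture's).

```python
import numpy as np

def conv1d(f, g):
    """Compute (f * g)[p] for every p, using the cross-correlation
    convention above. g is zero-padded, matching g[t] = 0 outside [1, m]."""
    n, m = len(f), len(g)
    g_padded = np.concatenate([np.zeros(n - 1), g, np.zeros(n - 1)])
    # dot product of f with each length-n window of the padded signal
    return np.array([f @ g_padded[p:p + n] for p in range(m + n - 1)])
```

For numpy arrays this is equivalent to the built-in `np.correlate(g, f, mode="full")`.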

Example: Edge Detection

To gain more intuition about convolution, let's look at another example in 1D.

We start by defining a filter and a signal as numpy arrays.
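The exact arrays from the lecture are not shown here; a plausible reconstruction is:

```python
import numpy as np

# An edge-detecting filter: it computes the difference between neighbors.
f = np.array([1., -1.])

# A signal with "jumps" between 0 and 1 at a few locations.
g = np.array([0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0.])
```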

Here, the signal has "jumps" or "edges" between 0 and 1 at a few locations.

The output of the convolution equals $\pm 1$ at these edges, and zero everywhere else. If $g$ were an audio signal, then $f$ would detect boundaries between silence and sound.
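Applying the convolution (here via numpy's built-in cross-correlation) makes this concrete:

```python
out = np.correlate(g, f, mode="valid")  # computes g[p] - g[p+1] at each p
print(out)  # -1 at 0 -> 1 jumps, +1 at 1 -> 0 jumps, 0 everywhere else
```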

Example: Smoothing

Another application of convolutions is to smooth the signal $g$.

We again define a filter and a signal as numpy arrays.

The filter $f$ is a "mini bell curve". When applied to the same signal, it smooths out the "kinks".
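A plausible reconstruction of this step (the filter values are illustrative, not the lecture's exact numbers):

```python
import numpy as np

# A small Gaussian filter -- a "mini bell curve" -- normalized to sum to 1.
t = np.linspace(-2, 2, 9)
f = np.exp(-t ** 2)
f = f / f.sum()

# Convolving averages nearby values of g, smoothing out the kinks.
g_smooth = np.correlate(g, f, mode="same")
```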

What Are Convolutions Doing?

The convolution applies the filter $f$ onto the signal $g$.

We can use convolutions in machine learning models to learn the filters.

Convolutions in 2D

Convolutions can be extended to 2D by expanding dimensions of $f$ and $g$.


In this example, $f$ is a 3x3 array (gray area). $g$ is a 5x5 array (blue area). The result of the convolution is the top green area.
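As a sketch of the same computation (assuming `scipy` is available; the array values are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

f = np.random.randn(3, 3)  # a 3x3 filter (the gray area)
g = np.random.randn(5, 5)  # a 5x5 signal (the blue area)

# "valid" keeps only the positions where f fits entirely inside g,
# which yields the 3x3 output shown in green.
out = correlate2d(g, f, mode="valid")
print(out.shape)  # (3, 3)
```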

In the context of neural networks, the filter $f$ holds learnable weights, and $g$ is the input coming from the previous layer.

The results of applying the convolution are fed to the next layer.

Review: Neural Network Layers

A neural network layer is a model $f : \mathbb{R}^d \to \mathbb{R}^p$ that applies $p$ neurons in parallel to an input $x$. $$f(x) = \sigma(W\cdot x) = \begin{bmatrix} \sigma(w_1^\top x) \\ \sigma(w_2^\top x) \\ \vdots \\ \sigma(w_p^\top x) \end{bmatrix}, $$ where each $w_k$ is the vector of weights for the $k$-th neuron and $W_{kj} = (w_k)_j$. We refer to $p$ as the size of the layer.

Convolutional Layers

A convolution layer is a model $f : \mathbb{R}^d \to \mathbb{R}^p$ that applies $p$ convolutions in parallel to an input $x$. $$f(x) = \text{conv}(W, x) = \begin{bmatrix} \text{conv}(w_1, x) \\ \text{conv}(w_2, x) \\ \vdots \\ \text{conv}(w_p, x) \\ \end{bmatrix}, $$ where each $w_k$ is the weights for the $k$-th convolution filter (think $f$ in our 1D example).
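To make the definition concrete, here is a minimal numpy sketch of a (single-channel, activation-free) convolutional layer; the names `conv_layer`, `W`, and `x` are ours:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(W, x):
    """Apply p convolution filters W[0], ..., W[p-1] in parallel to x."""
    return np.stack([correlate2d(x, w, mode="valid") for w in W])

W = np.random.randn(4, 3, 3)   # p = 4 filters, each of size 3x3
x = np.random.randn(8, 8)      # a single-channel 2D input
print(conv_layer(W, x).shape)  # (4, 6, 6): four activation maps
```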

Understanding 2D Convolutional Layers

Each convolution filter in the conv layer produces an output called an activation map; different filters produce different activation maps, so each filter yields its own.

All activation maps in the same conv layer are stacked together to form the final multi-channel output:

The activation map contains useful information about the input image.

Attributes of Convolutional Layers

The most important attributes of a convolutional layer are the number of filters, the size of each filter (the kernel size), the stride with which the filters slide across the input, and the amount of zero-padding around the input.

Example: Edge Detection in 2D

Let's revisit our edge detection example, but this time the signal $g$ will be a 2D image.

We will use two filters, each of which is defined below. We will also load an image as the signal $g$.

We can now convolve and visualize the filters with the image.
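The exact filters and image from the lecture are not reproduced here; a plausible version uses Sobel-like filters and a hypothetical image file:

```python
import numpy as np
from scipy.signal import correlate2d
import matplotlib.pyplot as plt

# Two Sobel-like edge filters (illustrative values): one responds to
# horizontal edges, the other to vertical edges.
f_horiz = np.array([[ 1.,  2.,  1.],
                    [ 0.,  0.,  0.],
                    [-1., -2., -1.]])
f_vert = f_horiz.T

g = plt.imread("image.png")  # "image.png" is a hypothetical file name
if g.ndim == 3:
    g = g.mean(axis=2)       # collapse color channels to grayscale

for f in (f_horiz, f_vert):
    activation = correlate2d(g, f, mode="valid")
    plt.imshow(np.abs(activation), cmap="gray")
    plt.show()
```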

The result is an activation map with high activation around horizontal and vertical edges -- the parts of the image where the intensity changes from light to dark.

When to Use Convolutions?

Convolutions work best on inputs containing "attributes" that are interesting, but whose location in the input is not important.

Images are by nature good examples of this assumption. Convolutional layers can learn similar functions to fully connected layers, but with far fewer parameters.

Pooling Layers

A pooling layer is a model $f : \mathbb{R}^d \to \mathbb{R}^p$ that applies pooling operations to an input $x$. $$f(x) = \text{pooling}(x), $$ where $\text{pooling}$ is a pre-defined operation applied over the input $x$.

A pooling layer does not have learnable weights.

Pooling

Pooling is a common operation to be added after a convolution layer. It is applied to each activation map separately and reduces the spatial size of its input.

Max Pooling: for each region in the input, return the max value. This is the most common type.

Average Pooling: for each region in the input, return the average value.
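A minimal numpy sketch of both operations over non-overlapping 2x2 regions (the function name `pool2d` is ours; it assumes the input dimensions are divisible by the region size):

```python
import numpy as np

def pool2d(x, size=2, op=np.max):
    """Apply op (np.max or np.mean) over non-overlapping size x size
    regions of x; assumes both dimensions of x are divisible by size."""
    h, w = x.shape
    regions = x.reshape(h // size, size, w // size, size)
    return op(regions, axis=(1, 3))

x = np.arange(16.).reshape(4, 4)
print(pool2d(x, op=np.max))   # 2x2 max-pooled output
print(pool2d(x, op=np.mean))  # 2x2 average-pooled output
```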

Purpose of Pooling

Pooling reduces the spatial size of the activation maps. This lowers the computation and memory required by later layers, and it makes the extracted features more robust to small translations of the input.

Part 3: Convolutional Neural Networks

Convolutional neural networks use convolutional layers to process signals such as images, audio, and even text.

Review: Convolution

Convolutions apply a filter $f$ (gray) over a signal $g$ (blue). The output is an activation map (green).

Review: Pooling

Pooling is a common operation to be added after a convolution layer. It is applied to each activation map separately and reduces the spatial size of its input.

Convolutional Neural Networks

A convolutional neural network (CNN) is a model $f : \mathbb{R}^d \to \mathbb{R}$ that consists of a composition of $L$ neural network layers that contain convolutions: $$ f(x) = f_L \circ f_{L-1} \circ \ldots \circ f_1 (x). $$

The final layer $f_L$ is often a fully connected output layer of size one.

Typically, CNNs are built from consecutive convolution + activation + pooling layers that are grouped into blocks, as in the sketch below.
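As a hedged illustration of this block pattern (written in PyTorch, which the lecture does not necessarily use; all layer sizes are illustrative):

```python
import torch.nn as nn

# Two conv + activation + pooling blocks followed by a fully connected
# output layer of size one, assuming 3x32x32 inputs.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 1),
)
```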

Next we are going to see a few famous examples.

LeNet for MNIST Digits Recognition

LeNet successfully used CNNs for digit recognition [LeCun, Bottou, Bengio, Haffner 1998].
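A hedged sketch of a LeNet-style architecture in PyTorch (details such as the activation functions and subsampling layers vary across descriptions of the original):

```python
import torch.nn as nn

# A LeNet-style network for 1x32x32 digit images; the original used
# tanh activations and subsampling layers similar to average pooling.
lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # -> 6x14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # -> 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),  # one score per digit class
)
```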

AlexNet and ImageNet

ImageNet classification with deep convolutional neural networks [Krizhevsky, Sutskever, Hinton 2012]

Starting from AlexNet, the performance of image classification on ImageNet has entered a new era:

http://sqlml.azurewebsites.net/2017/09/12/convolutional-neural-network/

Convolutional neural networks with more layers (and other tricks) have since exceeded human performance on such tasks.

Algorithm: Convolutional Neural Network

CNNs and Feature Learning

Before neural networks, computer vision algorithms used hand-crafted features to extract information from raw image inputs.

Designing such features is hard and, most importantly, they might not be optimal for a particular task.

One set of features that used to be popular was the Scale-Invariant Feature Transform (SIFT).


https://www.vlfeat.org/overview/sift.html

Another common set of features was the Histogram of Oriented Gradients (HOG).

https://sarthakahuja.org/public/docs/report_ped_detection.pdf

By using CNNs, we can now extract features from input images at different levels of abstraction, without the need to design hand-crafted features.

Visualizing CNN Internals

CNNs can be seen as extracting features at low-, mid-, and high-levels of abstraction that are relevant to corresponding visual concepts.

Below, we reproduce feature visualization from the paper Visualizing and Understanding Convolutional Networks [Zeiler, Fergus 2013].

For a convolutional network with

$$ \text{Input} \rightarrow \text{Conv Layer1} \rightarrow \text{Conv Layer2} \rightarrow \cdots \rightarrow \text{Conv Layer5} \rightarrow \cdots \rightarrow \text{Output,}$$

we can visualize what each layer is doing by finding the image patches with the highest activation responses.
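A hedged sketch of one way to do this in PyTorch, using a forward hook to record a layer's activations (reusing the `lenet` sketch above; a real analysis would use trained weights and real images rather than random tensors):

```python
import torch

# Record one conv layer's activations with a forward hook, then find
# where a given filter responds most strongly.
activations = {}
handle = lenet[0].register_forward_hook(   # first conv layer, for example
    lambda module, inputs, output: activations.update(conv=output.detach())
)
lenet(torch.randn(1, 1, 32, 32))           # a stand-in for a real image batch
handle.remove()

act = activations["conv"][0, 0]            # activation map of the first filter
row, col = divmod(act.argmax().item(), act.shape[1])
print(f"strongest response at ({row}, {col})")
```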

Layer 1 focuses on colored edges and blobs.

Layers 2-3 focus on object parts, such as the wheels of a car, or the beak of a bird.

Layers 4-5 focus on object parts and even entire objects.

Pros and Cons of CNNs

CNNs are powerful tools because they encode visual information efficiently.

Their main drawbacks are computational and data requirements.