Lecture 23: Course Overview

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Congratulations on Finishing Applied Machine Learning!

You have made it! This is our last machine learning lecture, in which we will do an overview of the different algorithms seen in the course.

A Map of Applied Machine Learning

We will go through the following map of algorithms from the course.

Supervised Machine Learning

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Dataset}}_\text{Features, Attributes} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer} \to \text{Predictive Model} $$

The predictive model is chosen to model the relationship between inputs and targets. For instance, it can predict future targets.
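The structure above can be sketched in a few lines of NumPy. The toy one-dimensional dataset below is hypothetical; the learning algorithm combines a linear model class, a squared-error objective, and a closed-form least-squares optimizer.

```python
import numpy as np

# Dataset: toy features X and targets y (hypothetical data).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Learning algorithm = model class (linear) + objective (squared error)
#                      + optimizer (closed-form least squares).
Phi = np.hstack([np.ones_like(X), X])          # add an intercept feature
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# The resulting predictive model maps new inputs to predicted targets.
def model(x_new):
    return theta[0] + theta[1] * x_new
```

Here `model` can predict targets for inputs not seen during training, e.g. `model(5.0)`.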

Linear Regression

In linear regression, we fit a model $$ f_\theta(x) := \theta^\top \phi(x) $$ that is linear in $\theta$.

The features $\phi(x) : \mathbb{R} \to \mathbb{R}^p$ may be non-linear in $x$ (e.g., polynomial features), allowing us to fit complex functions.
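As a small sketch, the model below is linear in $\theta$ yet fits a quadratic function of $x$ via a polynomial feature map; the data is a hypothetical noiseless quadratic.

```python
import numpy as np

# Toy data from a quadratic ground truth (hypothetical).
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2

def phi(x, degree=2):
    """Polynomial feature map phi(x) = (1, x, ..., x^degree)."""
    return np.vstack([x**d for d in range(degree + 1)]).T

# f_theta(x) = theta^T phi(x) is linear in theta, so least squares applies.
theta, *_ = np.linalg.lstsq(phi(x), y, rcond=None)
```

The recovered `theta` matches the true coefficients (1, 2, 3) even though the function is non-linear in $x$.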

Overfitting

Overfitting is one of the most common failure modes of machine learning.
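A quick illustration of this failure mode, assuming noisy samples from a line: a degree-9 polynomial can interpolate ten training points, driving training error to essentially zero, while its held-out error is typically far worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from y = x (hypothetical toy data).
x_train = np.linspace(0, 1, 10)
y_train = x_train + 0.3 * rng.standard_normal(10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = x_test + 0.3 * rng.standard_normal(10)

def fit_and_errors(degree):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

tr1, te1 = fit_and_errors(1)   # simple linear model
tr9, te9 = fit_and_errors(9)   # degree 9 interpolates the 10 points
```

The complex model wins on training error but fits the noise, not the signal.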

Regularization

The idea of regularization is to penalize complex models that may overfit the data.

Regularized least squares (ridge regression) optimizes the following objective: $$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top \phi(x^{(i)}) \right)^2 + \frac{\lambda}{2} \cdot ||\theta||_2^2. $$ If we instead penalize the L1 norm $||\theta||_1$, we obtain the LASSO.
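Setting the gradient of the ridge objective above to zero gives a closed-form estimator, sketched below on hypothetical synthetic data; larger $\lambda$ shrinks the weights toward zero.

```python
import numpy as np

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(1)
n, p = 50, 5
Phi = rng.standard_normal((n, p))
y = Phi @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(n)

def ridge(Phi, y, lam):
    """Minimizer of J(theta): (Phi^T Phi + n*lam*I)^(-1) Phi^T y."""
    n, p = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(p), Phi.T @ y)

theta_small = ridge(Phi, y, lam=1e-6)   # close to ordinary least squares
theta_large = ridge(Phi, y, lam=10.0)   # heavily shrunk toward zero
```

The penalty trades a little bias for lower variance, which combats overfitting.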

Regression vs. Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.
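As a tiny illustration, the same inputs can appear in either problem type; what distinguishes them is whether the targets are continuous or discrete (the data below is hypothetical).

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

y_regression = np.array([1.4, 2.2, 2.9, 4.3])   # y in R: regression
y_classification = np.array([0, 0, 1, 1])       # y in {0, 1}: classification

K = len(np.unique(y_classification))            # number of classes
```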

Parametric vs. Non-Parametric Models

Nearest neighbors is an example of a non-parametric model.
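A minimal sketch of $k$-nearest-neighbors classification on hypothetical 2D data: the "model" is just the stored training set, so its size grows with $n$ rather than being fixed by a parameter vector.

```python
import numpy as np

# Toy training data: two well-separated clusters (hypothetical).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest
    return np.bincount(nearest).argmax()          # majority vote
```

There is no training step at all; all computation happens at prediction time.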

Probabilistic vs. Non-Probabilistic Models

A probabilistic model is a probability distribution $$P(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}(x,y)$.

If we know $P(x,y)$, we can use the conditional $P(y|x)$ for prediction.
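A sketch of this on a toy discrete distribution (the probability table is hypothetical): given the joint $P(x,y)$, the conditional is $P(y|x) = P(x,y) / \sum_y P(x,y)$.

```python
import numpy as np

# Joint P(x, y) over 2 inputs and 2 labels, stored as a table.
P_xy = np.array([[0.3, 0.1],    # P(x=0, y=0), P(x=0, y=1)
                 [0.1, 0.5]])   # P(x=1, y=0), P(x=1, y=1)

def conditional(x):
    """P(y | x) = P(x, y) / P(x), with P(x) = sum over y of P(x, y)."""
    return P_xy[x] / P_xy[x].sum()

def predict(x):
    return conditional(x).argmax()
```

For example, `conditional(0)` gives $P(y|x{=}0) = (0.75, 0.25)$, so the prediction at $x=0$ is $y=0$.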

Maximum Likelihood Learning

Maximum likelihood is an objective that can be used to fit any probabilistic model: $$ \theta_\text{MLE} = \arg\max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P(x, y; \theta). $$ It minimizes the KL divergence between the model and data distributions: $$\theta_\text{MLE} = \arg\min_\theta \text{KL}(P_\text{data} \mid\mid P_\theta).$$
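As a sketch, consider a Bernoulli model $P(y{=}1) = \theta$ fit to hypothetical coin-flip data: scanning candidate parameters shows the average log-likelihood is maximized at the empirical frequency.

```python
import numpy as np

# Toy observations: 6 ones out of 8 (hypothetical).
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Average log-likelihood of each candidate theta on a grid.
thetas = np.linspace(0.01, 0.99, 99)
log_lik = np.array([np.mean(y * np.log(t) + (1 - y) * np.log(1 - t))
                    for t in thetas])
theta_mle = thetas[log_lik.argmax()]   # lands at mean(y) = 0.75
```

The maximizer matching the empirical mean is exactly what the KL-divergence view predicts: the model matches the data distribution as closely as it can.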

Discriminative vs. Generative Models

There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

We can obtain predictions from generative models via $\max_y P_\theta(x,y)$.
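A sketch of generative classification, assuming hypothetical class-conditional Gaussians: the model defines $P(x,y) = P(y)\, \mathcal{N}(x; \mu_y, 1)$, and we predict $\arg\max_y P(x,y)$.

```python
import numpy as np

priors = np.array([0.5, 0.5])   # P(y) for the two classes
means = np.array([-2.0, 2.0])   # mu_y for the two classes

def joint(x):
    """P(x, y) for each class y: prior times unit-variance Gaussian density."""
    likelihood = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)
    return priors * likelihood

def predict(x):
    return joint(x).argmax()
```

Because the model is generative, the same $P(x,y)$ could also be used to sample new $x$'s, which a discriminative model cannot do.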

The Max-Margin Principle

Intuitively, we want to select linear decision boundaries with high margin.

This means that every training point is classified as confidently as possible, i.e., it lies as far as possible from the decision boundary.
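The margin of a linear boundary $w^\top x + b$ can be computed directly, as sketched below on hypothetical separable data: it is the smallest distance $y_i (w^\top x_i + b) / ||w||$ over the training points, and max-margin learning prefers the boundary that makes this quantity largest.

```python
import numpy as np

# Toy separable data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def margin(w, b):
    """Smallest signed distance of any training point to the boundary."""
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# Two boundaries that both separate the data; the first keeps every
# point farther away, so max-margin would prefer it.
m1 = margin(np.array([1.0, 1.0]), 0.0)
m2 = margin(np.array([1.0, 0.0]), 0.0)
```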

The Kernel Trick

Many algorithms in machine learning only involve dot products $\phi(x)^\top \phi(z)$ but not the features $\phi$ themselves.

We can often compute $\phi(x)^\top \phi(z)$ very efficiently for complex $\phi$ using a kernel function $K(x,z) = \phi(x)^\top \phi(z)$. This is the kernel trick.
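A quick numerical check of this idea for the quadratic kernel $K(x,z) = (x^\top z)^2$, whose implicit feature map contains all pairwise products $x_i x_j$:

```python
import numpy as np

def K(x, z):
    """Quadratic kernel: one dot product, then a square."""
    return (x @ z) ** 2

def phi(x):
    """Explicit feature map: all d^2 pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
# K(x, z) equals phi(x)^T phi(z) without ever forming the d^2 features.
```

For $d$-dimensional inputs, `K` costs $O(d)$ while the explicit features cost $O(d^2)$, which is the payoff of the trick.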

Tree-Based Models

Decision trees output a target based on a tree of human-interpretable decision rules.
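A learned tree reads like nested if-then rules; the sketch below hand-codes one such tree for a hypothetical loan-approval task (the features and thresholds are illustrative, not learned).

```python
# Each branch of the tree is a readable decision rule.
def approve_loan(income, credit_score):
    if income > 50_000:
        if credit_score > 600:
            return "approve"
        return "deny"
    if credit_score > 750:
        return "approve"
    return "deny"
```

This interpretability is a key reason tree-based models remain popular in practice.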

Neural Networks

Neural network models are inspired by the brain.
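Concretely, a neural network stacks linear maps with non-linear activations. Below is a sketch of a two-layer network's forward pass; the weights are hypothetical, not trained.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def mlp(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)   # hidden layer: linear map + non-linearity
    return W2 @ h + b2      # output layer: another linear map

# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)
out = mlp(np.array([2.0, 1.0]), W1, b1, W2, b2)
```

In practice the weights are fit by gradient descent on a loss, just like the other parametric models in this course.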

Unsupervised Learning

We have a dataset without labels. Our goal is to learn something interesting about the structure of the data.
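A sketch of learning structure without labels: a few iterations of $k$-means clustering on hypothetical data with two obvious groups.

```python
import numpy as np

# Unlabeled toy data forming two clusters (hypothetical).
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [5.0, 5.0], [5.5, 5.0], [5.0, 5.5]])
centers = X[[0, 3]].astype(float)   # initialize from two data points

for _ in range(5):
    # Assign each point to its nearest center, then recompute centers.
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
```

The algorithm discovers the two groups with no targets at all, which is the defining feature of unsupervised learning.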

How To Decide Which Algorithm to Use

One factor is how much data you have. The algorithms that work well in the small data regime (fewer than 10,000 examples) differ from those suited to the big data regime, and some practical advice applies regardless of dataset size.

What's Next? Ideas for Courses

Consider the following courses to keep learning about ML:

What's Next? Ideas for Research

In order to get involved in research, I recommend:

What's Next? Ideas for Industry Projects

Finally, a few ideas for how to get more practice applying ML in the real world:

Thank You For Taking Applied Machine Learning!