Previously, we derived maximum likelihood learning as a general way of fitting machine learning models.
We will now see how the algorithms we've seen so far are special cases of this principle.
A probabilistic model is a probability distribution $$P(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1].$$ This model can approximate the data distribution $P_\text{data}(x,y)$.
If we know $P(x,y)$, we can use the conditional $P(y|x)$ for prediction.
Probabilistic models may also have parameters $\theta \in \Theta$, which we denote as $$P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1].$$
A general approach to learning conditional models of the form $P_\theta(y|x)$ is to minimize the expected KL divergence with respect to the data distribution: $$ \min_\theta \mathbb{E}_{x \sim P_\text{data}} \left[ D(P_\text{data}(y|x) \mid\mid P_\theta(y|x)) \right]. $$
With a bit of math, we can show that this objective is equivalent to the maximum likelihood objective $$ \max_\theta \mathbb{E}_{x, y \sim P_\text{data}} \log P_\theta(y|x). $$ This is the principle of conditional maximum likelihood.
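Concretely, one way to see this is to expand the definition of the KL divergence:
\begin{align*}
\mathbb{E}_{x \sim P_\text{data}} \left[ D(P_\text{data}(y|x) \mid\mid P_\theta(y|x)) \right]
&= \mathbb{E}_{x \sim P_\text{data}} \, \mathbb{E}_{y \sim P_\text{data}(y|x)} \left[ \log P_\text{data}(y|x) - \log P_\theta(y|x) \right] \\
&= \text{const.} - \mathbb{E}_{x, y \sim P_\text{data}} \log P_\theta(y|x),
\end{align*}
where the constant collects the terms that do not depend on $\theta$. Minimizing the expected KL divergence over $\theta$ is therefore the same as maximizing the expected conditional log-likelihood.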
Recall that the linear regression algorithm fits a linear model of the form $$ f(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x. $$
It minimizes the mean squared error (MSE) $$J(\theta)= \frac{1}{2n} \sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2$$ on a dataset $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.
Is there a specific reason for us to be optimizing the mean squared error to fit our linear model?
The answer to this can be found by looking at the algorithm from a probabilistic perspective.
Let's derive a probabilistic algorithm by defining a class of probabilistic models and using maximum likelihood as our objective.
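Concretely, following the linear model above, we take the conditional distribution of $y$ to be Gaussian with mean $\mu(x) = \theta^\top x$ and a fixed standard deviation $\sigma$: $$ P_\theta(y \mid x) = \mathcal{N}(y; \mu(x), \sigma) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y - \theta^\top x)^2}{2\sigma^2} \right). $$ The conditional log-likelihood of the dataset is then
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \log P_\theta(y^{(i)} \mid x^{(i)})
&= -\frac{1}{2\sigma^2} \cdot \frac{1}{n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \text{const.},
\end{align*}
where the constant does not depend on $\theta$. Maximizing this expression over $\theta$ is therefore equivalent to minimizing $\frac{1}{2n}\sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2$.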
Note how this is a mean squared error (MSE) objective!
Thus, minimizing MSE is equivalent to maximizing the log-likelihood of a Normal distribution $\mathcal{N}(y; \mu(x), \sigma)$.
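As a quick sanity check, here is a minimal numerical sketch on synthetic data, with the noise level $\sigma$ held fixed: fitting the linear model by minimizing the MSE via the normal equations and fitting it by maximizing the Gaussian log-likelihood with a generic optimizer should recover the same parameters.

```python
# Minimal check: minimizing MSE and maximizing the Gaussian log-likelihood
# (with a fixed sigma) yield the same linear-model parameters.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # intercept + features
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.3 * rng.normal(size=n)

# MSE solution via the normal equations
theta_mse = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood under y | x ~ N(theta^T x, sigma^2) with sigma held fixed
sigma = 1.0
def neg_log_likelihood(theta):
    residuals = y - X @ theta
    return 0.5 * np.sum(residuals ** 2) / sigma ** 2 + n * np.log(sigma * np.sqrt(2 * np.pi))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
print(np.allclose(theta_mse, theta_mle, atol=1e-3))  # True, up to optimizer tolerance
```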
This is an example of how we can interpret a machine learning algorithm in a probabilistic framework.
We will see many algorithms that have these kinds of interpretations. Here are some simple extensions.
We can use a Gaussian model and also parametrize the standard deviation.
We can also parametrize other distributions, not just the Gaussian.
This yields many new machine learning algorithms.
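As one concrete instance (a sketch for count-valued targets $y \in \{0, 1, 2, \ldots\}$), we could take $P_\theta(y \mid x) = \text{Poisson}(y; \lambda_\theta(x))$ with rate $\lambda_\theta(x) = \exp(\theta^\top x)$. The conditional log-likelihood is then $$ \sum_{i=1}^n \left( y^{(i)} \, \theta^\top x^{(i)} - \exp\left(\theta^\top x^{(i)}\right) \right) + \text{const.}, $$ and maximizing it recovers the Poisson regression algorithm.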
We can also use what we learned about Bayesian ML to interpret several algorithms that we've seen as special cases of the Bayesian framework.
In Bayesian statistics, $\theta$ is a random variable whose value happens to be unknown.
We formulate two models: a likelihood model $P(x, y \mid \theta)$ that describes how the data are generated given the parameters, and a prior $P(\theta)$ that encodes our beliefs about the parameters before seeing any data.
Together, these two models define the joint distribution $$ P(x, y, \theta) = P(x, y \mid \theta) P(\theta) $$ in which both the $x, y$ and the parameters $\theta$ are random variables.
Recall that in maximum a posteriori (MAP) learning, we optimize the objective \begin{align*} \theta_\text{MAP} = \arg\max_\theta \left( \log \prod_{i=1}^n P(x^{(i)}, y^{(i)} \mid \theta) + \log P(\theta) \right). \end{align*}
Note that this is the same formula as in maximum likelihood learning, except that we have added the prior term $\log P(\theta)$.
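One way to see where this objective comes from: assuming the data points are sampled independently given $\theta$, Bayes' rule gives the posterior
\begin{align*}
P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})} \propto \left( \prod_{i=1}^n P(x^{(i)}, y^{(i)} \mid \theta) \right) P(\theta),
\end{align*}
where $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$ denotes the dataset. Taking logarithms and dropping $P(\mathcal{D})$, which does not depend on $\theta$, shows that maximizing the posterior probability of $\theta$ yields exactly the MAP objective above.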
Recall that the ridge regression algorithm fits a linear model $$ f(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x. $$
We minimize the L2-regularized mean squared error (MSE) $$J(\theta)= \frac{1}{2n} \sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2 + \frac{\lambda}{2}\sum_{j=1}^d \theta_j^2$$ on a dataset $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. The term $\frac{1}{2}\sum_{j=1}^d \theta_j^2 = \frac{1}{2}||\theta||_2^2$ is called the regularizer.
We can interpret ridge regression as maximum a posteriori (MAP) estimation as follows.
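Here is a sketch of this interpretation, assuming a Gaussian likelihood with unit noise variance, $P(y \mid x, \theta) = \mathcal{N}(y; \theta^\top x, 1)$, and an independent Gaussian prior on each weight, $P(\theta) = \prod_{j=1}^d \mathcal{N}(\theta_j; 0, \tau)$. Using the conditional likelihood $P(y \mid x, \theta)$ (the marginal $P(x)$ does not depend on $\theta$ and can be dropped), the MAP objective becomes
\begin{align*}
\theta_\text{MAP} &= \arg\max_\theta \left( \sum_{i=1}^n \log P(y^{(i)} \mid x^{(i)}, \theta) + \log P(\theta) \right) \\
&= \arg\max_\theta \left( -\frac{1}{2} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 - \frac{1}{2\tau^2} \sum_{j=1}^d \theta_j^2 \right) \\
&= \arg\min_\theta \left( \frac{1}{2} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{1}{2\tau^2} \sum_{j=1}^d \theta_j^2 \right),
\end{align*}
which is the ridge objective $J(\theta)$ up to an overall rescaling by $1/n$ (and the corresponding rescaling of the regularization coefficient).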
Thus, we see that ridge regression actually amounts to performing MAP estimation with a Gaussian prior. The strength of the regularizer $\lambda$ scales as $1/\tau^2$: the more concentrated the prior (the smaller $\tau$), the stronger the regularization.
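The correspondence is easy to check numerically. Below is a minimal sketch on synthetic data, assuming unit noise variance and using scikit-learn's un-averaged ridge objective $\|y - X\theta\|_2^2 + \alpha \|\theta\|_2^2$, for which the matching regularization strength is $\alpha = 1/\tau^2$.

```python
# Minimal check: the maximizer of (Gaussian log-likelihood + Gaussian log-prior)
# coincides with the ridge regression solution for alpha = 1 / tau^2.
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

tau = 2.0                # prior standard deviation of each theta_j (assumed)
alpha = 1.0 / tau ** 2   # corresponding ridge regularization strength

theta_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Negative log-posterior under y | x, theta ~ N(theta^T x, 1), theta_j ~ N(0, tau^2),
# dropping additive constants that do not depend on theta.
def neg_log_posterior(theta):
    return 0.5 * np.sum((y - X @ theta) ** 2) + 0.5 * np.sum(theta ** 2) / tau ** 2

theta_map = minimize(neg_log_posterior, x0=np.zeros(d)).x
print(np.allclose(theta_ridge, theta_map, atol=1e-3))  # True, up to optimizer tolerance
```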
Very often, we can interpret classical ML algorithms as applications of the probabilistic or Bayesian approaches (although we can derive them in other ways as well!)
Let's now look at an example of a fully Bayesian machine learning algorithm.
This section is still under construction and not part of the main lecture.
Suppose we now want to predict the value of $y$ from $x$. Unlike in the frequentist setting, we no longer have a single estimate $\theta$ of the model parameters; instead, we have a full posterior distribution over them.
The Bayesian approach to predicting $y$ given an input $x$ and a training dataset $\mathcal{D}$ is to average the predictions of all possible models: $$ P(y | x, \mathcal{D}) = \int_\theta P(y \mid x, \theta) P(\theta \mid \mathcal{D}) d\theta. $$ This is called the posterior predictive distribution. Note how each prediction $P(y \mid x, \theta)$ is weighted by the posterior probability of $\theta$ given $\mathcal{D}$.
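For the Gaussian model used in the ridge example (a sketch, assuming a likelihood $y \mid x, \theta \sim \mathcal{N}(\theta^\top x, \sigma^2)$ and a prior $\theta \sim \mathcal{N}(0, \tau^2 I)$), this integral can be computed in closed form. The posterior is Gaussian, $P(\theta \mid \mathcal{D}) = \mathcal{N}(\theta; \mu_n, \Sigma_n)$, with $$ \Sigma_n = \left( \frac{1}{\sigma^2} X^\top X + \frac{1}{\tau^2} I \right)^{-1}, \qquad \mu_n = \frac{1}{\sigma^2} \Sigma_n X^\top y, $$ where $X$ and $y$ stack the training inputs and targets. The posterior predictive distribution at a new input $x$ is then also Gaussian: $$ P(y \mid x, \mathcal{D}) = \mathcal{N}\left( y; \; x^\top \mu_n, \; x^\top \Sigma_n x + \sigma^2 \right). $$ Its mean is a ridge-like point prediction, while its variance is larger for inputs $x$ in directions poorly covered by the training data, a kind of uncertainty estimate that a single point estimate of $\theta$ cannot provide.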