Lecture 8: Naive Bayes

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: Text Classification

We will now do a quick detour to talk about an important application area of machine learning: text classification.

Afterwards, we will see how text classification motivates new classification algorithms.

Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

Text Classification

An interesting instance of a classification problem is classifying text.

Classification Dataset: Twenty Newsgroups

To illustrate the text classification problem, we will use a popular dataset called 20-newsgroups.

Let's load this dataset.
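Below is a minimal sketch of how this could be done with scikit-learn's fetch_20newsgroups; the particular subset of categories is an assumption made for illustration, not necessarily the lecture's exact choice.

```python
# A sketch: load a training subset of 20-newsgroups with scikit-learn.
# The category list below is an illustrative assumption.
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(
    subset='train', categories=categories, shuffle=True, random_state=42
)
print(twenty_train.target_names)   # the class names
print(len(twenty_train.data))      # the number of training documents
```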

Feature Representations for Text

Each data point $x$ in this dataset is a sequence of characters of arbitrary length.

How do we transform these into $d$-dimensional features $\phi(x)$ that can be used with our machine learning algorithms?

Bag of Words Representations

Perhaps the most widely used approach to representing text documents is called "bag of words".

We start by defining a vocabulary $V$ containing all the possible words we are interested in, e.g.: $$ V = \{\text{church}, \text{doctor}, \text{fervently}, \text{purple}, \text{slow}, ...\} $$

A bag of words representation of a document $x$ is a function $\phi : \mathcal{X} \to \{0,1\}^{|V|}$ that outputs a feature vector $$ \phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array} $$ of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise.

Let's see an example of this approach on 20-newsgroups.

We start by computing these features using the sklearn library.
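As a sketch of this step (using the variable names count_vect and X_train that appear below), binary bag-of-words features can be computed with CountVectorizer; passing binary=True is one way to obtain 0-1 indicators rather than raw word counts.

```python
# A sketch: turn raw documents into 0-1 bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(binary=True)             # binary=True gives 0-1 indicators
X_train = count_vect.fit_transform(twenty_train.data)
print(X_train.shape)                                  # (num documents, vocabulary size)
```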

In sklearn, we can retrieve the index of the component of $\phi(x)$ associated with each word using the expression count_vect.vocabulary_.get(word):

Our featurized dataset is in the matrix X_train. We can use the above indices to retrieve the 0-1 value that has been computed for each word:
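For example, continuing the sketch above:

```python
# Look up the feature index of a word and read off its 0-1 value
# in the first training document.
idx = count_vect.vocabulary_.get('church')
print(idx)              # column of phi(x) corresponding to the word "church"
print(X_train[0, idx])  # 1 if document 0 contains the word, 0 otherwise
```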

Practical Considerations

In practice, we may use some additional modifications of this technique:

Classification Using BoW Features

Let's now have a look at the performance of classification over bag of words features.

Now that we have a feature representation $\phi(x)$, we can apply the classifier of our choice, such as logistic regression.

And now we can use this model for predicting on new inputs.
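A sketch of both steps, continuing from the features computed above (the two new documents are made-up test inputs):

```python
# A sketch: fit logistic regression on bag-of-words features
# and predict the class of new documents.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, twenty_train.target)

docs_new = ['God is love', 'OpenGL on the GPU is fast']   # made-up example inputs
X_new = count_vect.transform(docs_new)                    # reuse the same vocabulary
for doc, k in zip(docs_new, clf.predict(X_new)):
    print(f'{doc!r} => {twenty_train.target_names[k]}')
```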

Summary of Text Classification

Part 2: Naive Bayes

Next, we are going to look at Naive Bayes --- a generative classification algorithm. We will apply Naive Bayes to the text classification problem.

Review: Classification

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

  1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
  2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

Review: Text Classification

An interesting instance of a classification problem is classifying text.

Review: Bag of Words Features

Given a vocabulary $V$, a bag of words representation of a document $x$ is a function $\phi : \mathcal{X} \to \{0,1\}^{|V|}$ that outputs a feature vector $$ \phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array} $$ of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise.

Review: Generative Models

There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

Given a new datapoint $x'$, we can match it against each class model and find the class that looks most similar to it: \begin{align*} \arg \max_y \log p(y | x) = \arg \max_y \log \frac{p(x | y) p(y)}{p(x)} = \arg \max_y \log p(x | y) p(y), \end{align*} where we have applied Bayes' rule in the second equation.

Review: Gaussian Discriminant Model

The GDA algorithm defines the following model family.

Thus, $P_\theta(x,y)$ is a mixture of $K$ Gaussians: $$P_\theta(x,y) = \sum_{k=1}^K P_\theta(y=k) P_\theta(x|y=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$$

Problem 1: Discrete Data

What would happen if we used GDA to perform text classification? The first problem we face is that the input data is discrete: $$ \phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array} $$ This data does not follow a Normal distribution; hence, the GDA model is clearly misspecified.

Problem 2: High Dimensionality

A first solution is to assume that $x$ is sampled from a categorical distribution that assigns a probability to each possible state of $x$. $$ p(x) = p \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array} \right. \left. \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \vdots \\ \;\text{purple} \end{array} \right) = 0.0012 $$

However, if the dimensionality $d$ of $x$ is high (e.g., vocabulary has size 10,000), $x$ can take a huge number of values ($2^{10000}$ in our example). We need to specify $2^{d}-1$ parameters for the categorical distribution.

Naive Bayes Assumption

In order to deal with high-dimensional $x$, we simplify the problem by making the Naive Bayes assumption: $$ p(x|y) = \prod_{j=1}^d p(x_j \mid y) $$ In other words, the probability $p(x|y)$ factorizes over each dimension.

Bernoulli Naive Bayes Model

We can apply the Naive Bayes assumption to obtain a model for when $x$ is in a bag of words representation.

The Bernoulli Naive Bayes model $P_\theta(x,y)$ is defined as follows:

Formally, we have: \begin{align*} P_\theta(y) & = \text{Categorical}(\phi_1,\phi_2,\ldots,\phi_K) \\ P_\theta(x_j=1|y=k) & = \text{Bernoulli}(\psi_{jk}) \\ P_\theta(x|y=k) & = \prod_{j=1}^d P_\theta(x_j|y=k) \end{align*}
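Note how much the Naive Bayes assumption buys us: the factorized model needs only $K-1$ parameters for $P_\theta(y)$ plus $dK$ Bernoulli parameters $\psi_{jk}$, instead of the $2^d-1$ parameters of the full categorical distribution. For a vocabulary of size $d = 10{,}000$ and $K = 20$ classes, that is roughly $200{,}000$ parameters rather than $2^{10000}-1$.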

Part 3: Naive Bayes: Learning

We are going to continue our discussion of Naive Bayes.

We will now turn our attention to learning the parameters of the model and using them to make predictions.

Review: Text Classification

An interesting instance of a classification problem is classifying text.

Review: Bag of Words Features

Given a vocabulary $V$, a bag of words representation of a document $x$ is a function $\phi : \mathcal{X} \to \{0,1\}^{|V|}$ that outputs a feature vector $$ \phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array} $$ of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise.

Bernoulli Naive Bayes Model

The Bernoulli Naive Bayes model $P_\theta(x,y)$ is defined as follows:

Review: Maximum Likelihood Learning

In order to fit probabilistic models, we use the following objective: $$ \max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(x, y). $$ This seeks to find a model that assigns high probability to the training data.

Let's use maximum likelihood to fit the Bernoulli Naive Bayes model. Note that the model parameters $\theta$ are the union of the parameters of each sub-model: $$\theta = (\phi_1, \phi_2,\ldots, \phi_K, \psi_{11}, \psi_{21}, \ldots, \psi_{dK}).$$

Learning a Bernoulli Naive Bayes Model

Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\mid i=1,2,\ldots,n\}$, we want to optimize the log-likelihood $\ell(\theta) = \log L(\theta)$: \begin{align*} \ell(\theta) & = \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)}) = \sum_{i=1}^n \log P_\theta(x^{(i)} | y^{(i)}) + \sum_{i=1}^n \log P_\theta(y^{(i)}) \\ & = \sum_{k=1}^K \sum_{j=1}^d \underbrace{\sum_{i :y^{(i)} =k} \log P(x^{(i)}_j | y^{(i)} ; \psi_{jk})}_\text{all the terms that involve $\psi_{jk}$} + \underbrace{\sum_{i=1}^n \log P(y^{(i)} ; \vec \phi)}_\text{all the terms that involve $\vec \phi$}. \end{align*}

Notice that each parameter $\psi_{jk}$ appears in only one of the inner sums, and the parameters $\vec \phi$ appear only in the last term.

As in Gaussian Discriminant Analysis, the log-likelihood decomposes into a sum of terms. To optimize for some $\psi_{jk}$, we only need to look at the set of terms that contain $\psi_{jk}$: $$ \arg\max_{\psi_{jk}} \ell(\theta) = \arg\max_{\psi_{jk}} \sum_{i :y^{(i)} =k} \log p(x^{(i)}_j | y^{(i)} ; \psi_{jk}). $$

Similarly, optimizing for $\vec \phi = (\phi_1, \phi_2, \ldots, \phi_K)$ only involves a single term: $$ \max_{\vec \phi} \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)} ; \theta) = \max_{\vec\phi} \sum_{i=1}^n \log P_\theta(y^{(i)} ; \vec \phi). $$

Optimizing the Model Parameters

These observations greatly simplify the optimization of the model. Let's first consider the optimization over $\vec \phi = (\phi_1, \phi_2, \ldots, \phi_K)$.

As in Gaussian Discriminant Analysis, we can take a derivative over $\phi_k$ and set it to zero to obtain $$ \phi_k = \frac{n_k}{n}$$ for each $k$, where $n_k = |\{i : y^{(i)} = k\}|$ is the number of training targets with class $k$.

Thus, the optimal $\phi_k$ is just the proportion of data points with class $k$ in the training set!

Similarly, we can maximize the likelihood for the other parameters to obtain closed-form solutions: \begin{align*} \psi_{jk} = \frac{n_{jk}}{n_k}, \end{align*} where $n_{jk} = |\{i : x^{(i)}_j = 1 \text{ and } y^{(i)} = k\}|$ is the number of training examples $x^{(i)}$ with label $k$ that contain word $j$.

Each $\psi_{jk}$ is simply the proportion of documents in class $k$ that contain the word $j$.
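A minimal numpy sketch of these closed-form estimates, assuming X is an (n, d) binary array and y a vector of labels in {0, ..., K-1} (the function name and argument layout are illustrative, not the lecture's code):

```python
import numpy as np

def fit_bernoulli_nb(X, y, K):
    """Closed-form maximum likelihood estimates for Bernoulli Naive Bayes (a sketch).

    X : (n, d) binary array of bag-of-words features.
    y : (n,) array of class labels in {0, ..., K-1}.
    """
    n, d = X.shape
    phi = np.zeros(K)        # class priors phi_k = n_k / n
    psi = np.zeros((d, K))   # word probabilities psi_jk = n_jk / n_k
    for k in range(K):
        X_k = X[y == k]
        phi[k] = X_k.shape[0] / n
        psi[:, k] = X_k.mean(axis=0)  # fraction of class-k documents containing each word
    return phi, psi
```

In practice one often adds Laplace smoothing to avoid estimates $\psi_{jk}$ that are exactly $0$ or $1$; the plain maximum likelihood estimates are shown here to match the derivation.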

Querying the Model

How do we ask the model for predictions? As discussed earlier, we can apply Bayes' rule: $$\arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P_\theta(y).$$ Thus, we can compute $P_\theta(x|y=k)P_\theta(y=k)$ for each class $k$ and choose the class that explains the data best.
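A sketch of this prediction rule, reusing numpy and the phi, psi returned by the fitting sketch above, and working in log space to avoid numerical underflow (the small eps constant is a numerical safeguard added here, not part of the model):

```python
def predict_bernoulli_nb(X, phi, psi, eps=1e-12):
    """Return arg max_k log P(x|y=k) P(y=k) for each row of X (a sketch)."""
    log_psi = np.log(psi + eps)            # log P(x_j = 1 | y = k)
    log_not_psi = np.log(1 - psi + eps)    # log P(x_j = 0 | y = k)
    # log P(x|y=k) = sum_j [ x_j log psi_jk + (1 - x_j) log(1 - psi_jk) ]
    log_joint = X @ log_psi + (1 - X) @ log_not_psi + np.log(phi + eps)
    return np.argmax(log_joint, axis=1)
```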

Classification Dataset: Twenty Newsgroups

To illustrate the text classification problem, we will use a popular dataset called 20-newsgroups.

Let's load this dataset.

Example: Text Classification

Let's see how this approach can be used in practice on the text classification dataset.

Let's see an example of this approach on 20-newsgroups.

We start by computing these features using the sklearn library.

Let's compute the maximum likelihood model parameters on our dataset.

We can compute predictions using Bayes' rule.

We can measure the accuracy on the training set:
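As a sketch of this end-to-end pipeline, scikit-learn's built-in BernoulliNB can be fit on the same X_train features from earlier; note that it applies Laplace smoothing by default (alpha=1.0), so it is a smoothed variant of the plain maximum likelihood estimates derived above.

```python
# A sketch: fit a (smoothed) Bernoulli Naive Bayes model and measure training accuracy.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

nb = BernoulliNB()                    # alpha=1.0 Laplace smoothing by default
nb.fit(X_train, twenty_train.target)
y_pred = nb.predict(X_train)          # predictions via Bayes' rule
print('train accuracy:', accuracy_score(twenty_train.target, y_pred))
```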

Algorithm: Bernoulli Naive Bayes

Part 4: Discriminative vs. Generative Algorithms

We conclude our lectures on generative algorithms by revisiting the question of how they compare to discriminative algorithms.

Review: Generative Models

There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

Given a new datapoint $x'$, we can match it against each class model and find the class that looks most similar to it: \begin{align*} \arg \max_y \log p(y | x) = \arg \max_y \log \frac{p(x | y) p(y)}{p(x)} = \arg \max_y \log p(x | y) p(y), \end{align*} where we have applied Bayes' rule in the second equation.

Review: Gaussian Discriminant Model

The GDA algorithm defines the following model family.

Thus, $P_\theta(x,y)$ is a mixture of $K$ Gaussians: $$P_\theta(x,y) = \sum_{k=1}^K P_\theta(y=k) P_\theta(x|y=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$$

Classification Dataset: Iris Flowers

To look at properties of generative algorithms, let's look again at the Iris flower dataset.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

If we only consider the first two feature columns, we can visualize the dataset in 2D.
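A minimal sketch of loading the data and producing this 2D visualization:

```python
# A sketch: load Iris and scatter-plot the first two feature columns.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X_iris, y_iris = iris.data[:, :2], iris.target   # keep only the first two features

plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris)
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```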

Linear Discriminant Analysis

When the covariances $\Sigma_k$ in GDA are equal, we have an algorithm called Linear Discriminant Analysis or LDA.

Let's try this algorithm on the Iris flower dataset.

We may compute the parameters of this model similarly to how we did for GDA.

We can compute predictions using Bayes' rule.

We visualize predictions like we did earlier.
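A sketch of these steps using scikit-learn's LinearDiscriminantAnalysis (which implements LDA with a shared covariance estimate), continuing from the Iris variables defined above:

```python
# A sketch: fit LDA on two Iris features and visualize its decision regions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_iris, y_iris)

# Evaluate the classifier on a dense grid to draw the decision boundaries.
xx, yy = np.meshgrid(np.linspace(X_iris[:, 0].min() - 1, X_iris[:, 0].max() + 1, 200),
                     np.linspace(X_iris[:, 1].min() - 1, X_iris[:, 1].max() + 1, 200))
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris)
plt.show()
```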

Linear Discriminant Analysis outputs decision boundaries that are linear.

Softmax and logistic regression also produce linear boundaries. In fact, both types of algorithms make use of the same model class.

What is their difference then?

Generative vs. Discriminative Model Classes

In binary classification, we can also show that the conditional probability $P_\theta(y|x)$ of a Bernoulli Naive Bayes or LDA model has the form $$ P_\theta(y|x) = \frac{P_\theta(x|y)P_\theta(y)}{\sum_{y'\in \mathcal{Y}}P_\theta(x|y')P_\theta(y')} = \frac{1}{1+\exp(-\gamma^\top x)} $$ for some set of parameters $\gamma$ (whose expression can be derived from $\theta$), which is the same form as Logistic Regression!
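To see where this form comes from in the Bernoulli Naive Bayes case, here is a sketch of the derivation, written for binary labels $y \in \{0,1\}$ with the bias term absorbed into $\gamma$ by appending a constant feature to $x$: $$ P_\theta(y=1|x) = \frac{P_\theta(x|y=1)P_\theta(y=1)}{P_\theta(x|y=1)P_\theta(y=1) + P_\theta(x|y=0)P_\theta(y=0)} = \frac{1}{1+\exp(-a(x))}, $$ where $$ a(x) = \log \frac{P_\theta(x|y=1)P_\theta(y=1)}{P_\theta(x|y=0)P_\theta(y=0)} = \sum_{j=1}^d x_j \log \frac{\psi_{j1}(1-\psi_{j0})}{\psi_{j0}(1-\psi_{j1})} + \sum_{j=1}^d \log \frac{1-\psi_{j1}}{1-\psi_{j0}} + \log\frac{\phi_1}{\phi_0} $$ is linear in $x$.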

Does this mean that the two sets of algorithms are equivalent? No! Although they assume the same model class $\mathcal{M}$, they use a different objective $J$ to select a model in $\mathcal{M}$.

Generative Models vs. Logistic Regression

Given that both algorithms find linear boundaries, how should one choose between the two?

Other Features of Generative Models

Generative models can also do things that discriminative models can't do.

Discriminative Approaches

Discriminative algorithms are deservedly very popular.

Generative Approaches

But generative algorithms also have many advantages: