 # Lecture 8: Naive Bayes¶

### Applied Machine Learning¶

Volodymyr Kuleshov
Cornell Tech

# Part 1: Text Classification¶

We will now do a quick detour to talk about an important application area of machine learning: text classification.

Afterwards, we will see how text classification motivates new classification algorithms.

# Review: Classification¶

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depnding on the targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

# Text Classification¶

An interesting instance of a classification problem is classifying text.

• Includes a lot applied problems: spam filtering, fraud detection, medical record classification, etc.
• Inputs $x$ are sequences of words of an arbitrary length.
• The dimensionality of text inputs is usually very large, proportional to the size of the vocabulary.

# Classification Dataset: Twenty Newsgroups¶

To illustrate the text classification problem, we will use a popular dataset called 20-newsgroups.

• It contains ~20,000 documents collected approximately evenly from 20 different online newsgroups.
• Each newgroup covers a different topic such as medicine, computer graphics, or religion.
• This dataset is widely used to benchmark text classification and other types of algorithms.

# Feature Representations for Text¶

Each data point $x$ in this dataset is a squence of characters of an arbitrary length.

How do we transform these into $d$-dimensional features $\phi(x)$ that can be used with our machine learning algorithms?

• We may devise hand-crafted features by inspecting the data:
• Does the message contain the word "church"? Does the email of the user originate outside the United States? Is the organization a university? etc.
• We can count the number of occurrences of each word:
• Does this message contain "Aardvark", yes or no?
• Does this message contain "Apple", yes or no?
• ... Does this message contain "Zebra", yes or no?
• Finally, many modern deep learning methods can directly work with sequences of characters of an arbitrary length.

# Bag of Words Representations¶

Perhaps the most widely used approach to representing text documents is called "bag of words".

We start by defining a vocabulary $V$ containing all the possible words we are interested in, e.g.: $$V = \{\text{church}, \text{doctor}, \text{fervently}, \text{purple}, \text{slow}, ...\}$$

A bag of words representation of a document $x$ is a function $\phi(x) \to \{0,1\}^{|V|}$ that outputs a feature vector $$\phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array}$$ of dimension $V$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ convains the $j$-th word in $V$ and $0$ otherwise.

Let's see an example of this approach on 20-newsgroups.

We start by computing these features using the sklearn library.

In sklearn, we can retrieve the index of $\phi(x)$ associated with each word using the expression count_vect.vocabulary_.get(word):

Our featurized dataset is in the matrix X_train. We can use the above indices to retrieve the 0-1 value that has been computed for each word:

# Practical Considerations¶

In practice, we may use some additional modifications of this techinque:

• Sometimes, the feature $\phi(x)_j$ for the $j$-th word holds the count of occurrences of word $j$ instead of just the binary occurrence.
• The raw text is usually preprocessed. One common technique is stemming, in which we only keep the root of the word.
• e.g. "slowly", "slowness", both map to "slow"
• Filtering for common stopwords such as "the", "a", "and". Similarly, rare words are also typically excluded.

# Classification Using BoW Features¶

Let's now have a look at the performance of classification over bag of words features.

Now that we have a feature representation $\phi(x)$, we can apply the classifier of our choice, such as logistic regression.

And now we can use this model for predicting on new inputs.

# Summary of Text Classification¶

• Classifying text normally requires specifiyng features over the raw data.
• A widely used representation is "bag of words", in which features are occurrences or counts of words.
• Once text is featurized, any off-the-shelf supervised learning algorithm can be applied, but some work better than others, as we will see next. # Part 2: Naive Bayes¶

Next, we are going to look at Naive Bayes --- a generative classification algorithm. We will apply Naive Bayes to the text classification problem.

# Review: Classification¶

Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

We distinguish between two types of supervised learning problems depnding on the targets $y^{(i)}$.

1. Regression: The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$.
2. Classification: The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \ldots y_K\}$. Each discrete value corresponds to a class that we want to predict.

# Review: Text Classification¶

An interesting instance of a classification problem is classifying text.

• Includes a lot applied problems: spam filtering, fraud detection, medical record classification, etc.
• Inputs $x$ are sequences of words of an arbitrary length.
• The dimensionality of text inputs is usually very large, proportional to the size of the vocabulary.

# Review: Bag of Words Features¶

Given a vocabulary $V$, a bag of words representation of a document $x$ is a function $\phi(x) \to \{0,1\}^{|V|}$ that outputs a feature vector $$\phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array}$$ of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ convains the $j$-th word in $V$ and $0$ otherwise.

# Review: Generative Models¶

There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

Given a new datapoint $x'$, we can match it against each class model and find the class that looks most similar to it: \begin{align*} \arg \max_y \log p(y | x) = \arg \max_y \log \frac{p(x | y) p(y)}{p(x)} = \arg \max_y \log p(x | y) p(y), \end{align*} where we have applied Bayes' rule in the second equation.

# Review: Gaussian Discriminant Model¶

The GDA algorithm defines the following model family.

• The probability $P(x\mid y=k)$ of the data under class $k$ is a multivariate Gaussian $\mathcal{N}(x; \mu_k, \Sigma_k)$ with parameters $\mu_k, \Sigma_k$.
• The distribution over classes is Categorical, denoted $\text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$. Thus, $P_\theta(y=k) = \phi_k$.

Thus, $P_\theta(x,y)$ is a mixture of $K$ Gaussians: $$P_\theta(x,y) = \sum_{k=1}^K P_\theta(y=k) P_\theta(x|y=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$$

# Problem 1: Discrete Data¶

What would happen if we used GDA to perform text classification? The first problem we face is that the input data is discrete: $$\phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array}$$ This data does not follows a Normal distribution, hence the GDA model is clearly misspecified.

# Problem 2: High Dimensionality¶

A first solution is to assume that $x$ is sampled from a categorical distribution that assigns a probability to each possible state of $x$. $$p(x) = p \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array} \right. \left. \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \vdots \\ \;\text{purple} \end{array} \right) = 0.0012$$

However, if the dimensionality $d$ of $x$ is high (e.g., vocabulary has size 10,000), $x$ can take a huge number of values ($2^{10000}$ in our example). We need to specify $2^{d}-1$ parameters for the categorical distribution.

# Naive Bayes Assumption¶

In order to deal with high-dimensional $x$, we simplify the problem by making the Naive Bayes assumption: $$p(x|y) = \prod_{j=1}^d p(x_j \mid y)$$ In other words, the probability $p(x|y)$ factorizes over each dimension.

• For example, if $x$ is a binary bag of words representation, then $p(x_j | y)$ is the probability of seeing the $j$-th word.
• We can model each $p(x_j | y)$ via a Bernoulli distribution, which has only one parameter.
• Hence, it takes only $d$ parameters (instead of $2^d-1$) to specify the entire distribution $p(x|y) = \prod_{j=1}^d p(x_j \mid y)$.

# Bernoulli Naive Bayes Model¶

We can apply the Naive Bayes assumption to obtain a model for when $x$ is in a bag of words representation.

The Bernoulli Naive Bayes model $P_\theta(x,y)$ is defined as follows:

• The distribution over classes is Categorical, denoted $\text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$. Thus, $P_\theta(y=k) = \phi_k$.
• The conditional probability of the data under class $k$ factorizes as $P_\theta(x|y=k) = \prod_{j=1}^d P(x_j \mid y=k)$ (the Naive Bayes assumption), where each $P_\theta(x_j \mid y=k)$ is a $\text{Bernoullli}(\psi_{jk})$.

Formally, we have: \begin{align*} P_\theta(y) & = \text{Categorical}(\phi_1,\phi_2,\ldots,\phi_K) \\ P_\theta(x_j=1|y=k) & = \text{Bernoullli}(\psi_{jk}) \\ P_\theta(x|y=k) & = \prod_{j=1}^d P_\theta(x_j|y=k) \end{align*} # Part 3: Naive Bayes: Learning¶

We are going to continue our discussion of Naive Bayes.

We will now turn our attention to learnig the parameters of the model and using them to make predictions.

# Review: Text Classification¶

An interesting instance of a classification problem is classifying text.

• Includes a lot applied problems: spam filtering, fraud detection, medical record classification, etc.
• Inputs $x$ are sequences of words of an arbitrary length.
• The dimensionality of text inputs is usually very large, proportional to the size of the vocabulary.

# Review: Bag of Words Features¶

Given a vocabulary $V$, a bag of words representation of a document $x$ is a function $\phi(x) \to \{0,1\}^{|V|}$ that outputs a feature vector $$\phi(x) = \left( \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ \vdots \\ \end{array} \right) \begin{array}{l} \;\text{church} \\ \;\text{doctor} \\ \;\text{fervently} \\ \\ \;\text{purple} \\ \\ \end{array}$$ of dimension $|V|$. The $j$-th component $\phi(x)_j$ equals $1$ if $x$ convains the $j$-th word in $V$ and $0$ otherwise.

# Bernoulli Naive Bayes Model¶

The Bernoulli Naive Bayes model $P_\theta(x,y)$ is defined as follows:

• The distribution over classes is Categorical, denoted $\text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$. Thus, $P_\theta(y=k) = \phi_k$.
• The conditional probability of the data under class $k$ factorizes as $P_\theta(x|y=k) = \prod_{j=1}^d P(x_j \mid y=k)$ (the Naive Bayes assumption), where each $P_\theta(x_j \mid y=k)$ is a $\text{Bernoullli}(\psi_{jk})$.

# Review: Maximum Likelihood Learning¶

In order to fit probabilistic models, we use the following objective: $$\max_\theta \mathbb{E}_{x, y \sim \mathbb{P}_\text{data}} \log P_\theta(x, y).$$ This seeks to find a model that assigns high probability to the training data.

Let's use maximum likelihood to fit the Bernoulli Naive Bayes model. Note that model parameterss $\theta$ are the union of the parameters of each sub-model: $$\theta = (\phi_1, \phi_2,\ldots, \phi_K, \psi_{11}, \psi_{21}, \ldots, \psi_{dK}).$$

# Learning a Bernoulli Naive Bayes Model¶

Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\mid i=1,2,\ldots,n\}$, we want to optimize the log-likelihood $\ell(\theta) = \log L(\theta)$: \begin{align*} \ell(\theta) & = \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)}) = \sum_{i=1}^n \log P_\theta(x^{(i)} | y^{(i)}) + \sum_{i=1}^n \log P_\theta(y^{(i)}) \\ & = \sum_{k=1}^K \sum_{j=1}^d \underbrace{\sum_{i :y^{(i)} =k} \log P(x^{(i)}_j | y^{(i)} ; \psi_{jk})}_\text{all the terms that involve $\psi_{jk}$} + \underbrace{\sum_{i=1}^n \log P(y^{(i)} ; \vec \phi)}_\text{all the terms that involve $\vec \phi$}. \end{align*}

Notice that each parameter $\psi_{jk}$ is found in only one subset of terms and the $\phi_k$ are also in the same set of terms.

As in Gaussian Discriminant Analysis, the log-likelihood decomposes into a sum of terms. To optimize for some $\psi_{jk}$, we only need to look at the set of terms that contain $\psi_{jk}$: $$\arg\max_{\psi_{jk}} \ell(\theta) = \arg\max_{\psi_{jk}} \sum_{i :y^{(i)} =k} \log p(x^{(i)}_j | y^{(i)} ; \psi_{jk}).$$

Similarly, optimizing for $\vec \phi = (\phi_1, \phi_2, \ldots, \phi_K)$ only involves a single term: $$\max_{\vec \phi} \sum_{i=1}^n \log P_\theta(x^{(i)}, y^{(i)} ; \theta) = \max_{\vec\phi} \sum_{i=1}^n \log P_\theta(y^{(i)} ; \vec \phi).$$

# Optimizing the Model Parameters¶

These observations greatly simplify the optimization of the model. Let's first consider the optimization over $\vec \phi = (\phi_1, \phi_2, \ldots, \phi_K)$.

As in Gaussian Discriminant Analysis, we can take a derivative over $\phi_k$ and set it to zero to obtain $$\phi_k = \frac{n_k}{n}$$ for each $k$, where $n_k = |\{i : y^{(i)} = k\}|$ is the number of training targets with class $k$.

Thus, the optimal $\phi_k$ is just the proportion of data points with class $k$ in the training set!

Similarly, we can maximize the likelihood for the other parameters to obtain closed form solutions: \begin{align*} \psi_{jk} = \frac{n_{jk}}{n_k}. \end{align*} where $|\{i : x^{(i)}_j = 1 \text{ and } y^{(i)} = k\}|$ is the number of $x^{(i)}$ with label $k$ and a positive occurrence of word $j$.

Each $\psi_{jk}$ is simply the proportion of documents in class $k$ that contain the word $j$.

# Querying the Model¶

How do we ask the model for predictions? As discussed earler, we can apply Bayes' rule: $$\arg\max_y P_\theta(y|x) = \arg\max_y P_\theta(x|y)P(y).$$ Thus, we can estimate the probability of $x$ and under each $P_\theta(x|y=k)P(y=k)$ and choose the class that explains the data best.

# Classification Dataset: Twenty Newsgroups¶

To illustrate the text classification problem, we will use a popular dataset called 20-newsgroups.

• It contains ~20,000 documents collected approximately evenly from 20 different online newsgroups.
• Each newgroup covers a different topic such as medicine, computer graphics, or religion.
• This dataset is widely used to benchmark text classification and other types of algorithms.

# Example: Text Classification¶

Let's see how this approach can be used in practice on the text classification dataset.

• We will learn a good set of parameters for a Bernoulli Naive Bayes model
• We will compare the outputs to the true predictions.

Let's see an example of this approach on 20-newsgroups.

We start by computing these features using the sklearn library.

Let's compute the maximum likelihood model parameters on our dataset.

We can compute predictions using Bayes' rule.

We can measure the accuracy on the training set:

# Algorithm: Bernoulli Naive Bayes¶

• Type: Supervised learning (multi-class classification)
• Model family: Mixtures of Bernoulli distributions
• Objective function: Log-likelihood.
• Optimizer: Closed form solution. # Part 4: Discriminative vs. Generative Algorithms¶

We conclude our lectures on generative algorithms by revisting the question of how they compare to discriminative algorithms.

# Review: Generative Models¶

There are two types of probabilistic models: generative and discriminative. \begin{align*} \underbrace{P_\theta(x,y) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{generative model} & \;\; & \underbrace{P_\theta(y|x) : \mathcal{X} \times \mathcal{Y} \to [0,1]}_\text{discriminative model} \end{align*}

Given a new datapoint $x'$, we can match it against each class model and find the class that looks most similar to it: \begin{align*} \arg \max_y \log p(y | x) = \arg \max_y \log \frac{p(x | y) p(y)}{p(x)} = \arg \max_y \log p(x | y) p(y), \end{align*} where we have applied Bayes' rule in the second equation.

# Review: Gaussian Discriminant Model¶

The GDA algorithm defines the following model family.

• The probability $P(x\mid y=k)$ of the data under class $k$ is a multivariate Gaussian $\mathcal{N}(x; \mu_k, \Sigma_k)$ with parameters $\mu_k, \Sigma_k$.
• The distribution over classes is Categorical, denoted $\text{Categorical}(\phi_1, \phi_2, ..., \phi_K)$. Thus, $P_\theta(y=k) = \phi_k$.

Thus, $P_\theta(x,y)$ is a mixture of $K$ Gaussians: $$P_\theta(x,y) = \sum_{k=1}^K P_\theta(y=k) P_\theta(x|y=k) = \sum_{k=1}^K \phi_k \mathcal{N}(x; \mu_k, \Sigma_k)$$

# Classification Dataset: Iris Flowers¶

To look at properties of generative algorithms, let's look again at the Iris flower dataset.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

If we only consider the first two feature columns, we can visualize the dataset in 2D.

# Linear Discriminant Analysis¶

When the covariances $\Sigma_k$ in GDA are equal, we have an algorithm called Linear Discriminant Analysis or LDA.

Let's try this algorithm on the Iris flower dataset.

We may compute the parameters of this model similarly to how we did for GDA.

We can compute predictions using Bayes' rule.

We visualize predictions like we did earlier.

Linear Discriminant Analysis outputs decision boundaries that are linear.

Softmax or Logistic regression also produce linear boundaries. In fact, both types of algorithms make use of the same model class.

What is their difference then?

# Generative vs. Discriminative Model Classes¶

In binary classification, we can also show that the conditional probability $P_\theta(y|x)$ of a Bernoulli Naive Bayes or LDA model has the form $$P_\theta(y|x) = \frac{P_\theta(x|y)P_\theta(y)}{\sum_{y'\in \mathcal{Y}}P_\theta(x|y')P_\theta(y')} = \frac{1}{1+\exp(-\gamma^\top x)}$$ for some set of parameters $\gamma$ (whose expression can be derived from $\theta$), which is the same form as Logistic Regression!

Does it mean that the two sets of algorithms are equivalent? No! They assume the same model class $\mathcal{M}$, they use a different objective $J$ to select a model in $\mathcal{M}$.

# Generative Models vs. Logistic Regression¶

Given that both algorithms find linear boundaries, how should one choose between the two?

• Bernoulli Naive Bayes or LDA assumes a logistic form for $p(y|x)$. But converse is not true: logistic regression does not assume a NB or LDA model for $p(x,y)$.
• Generative models make stronger modeling assumptions. If these assumptions hold true, the generative models will perform better.
• But if they don't, logistic regression will be more robust to outliers and model misspecification, and achieve higher accuracy.

# Other Features of Generative Models¶

Generative models can also do things that discriminative models can't do.

• Generation: we can sample $x \sim p(x|y)$ to generate new data (images, audio).
• Missing value imputation: if $x_j$ is missing, we can infer it using $p(x|y)$.
• Outlier detection: given a new $x'$, we can try detecting via $p(x')$ if $x'$ is invalid.

# Discriminative Approaches¶

Discriminative algorithms are deservingly very popular.

• Most state-of-the-art algorithms for classification are discriminative
• They are often more accurate because they make fewer modeling assumptions.

# Generative Approaches¶

But generative algorithms also have many advantages:

• Can do more than just prediction: generation, fill-in missing features, etc.
• Can include extra prior knowledge; if prior knowledge is correct, model will be more accurate.
• Often have closed-form solutions, hence are faster to train.