{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i_f5u2x9nn6I", "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "\n", "# Lecture 8: Naive Bayes\n", "\n", "### Applied Machine Learning\n", "\n", "__Volodymyr Kuleshov__
Cornell Tech" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Part 1: Text Classification\n", "\n", "We will now do a quick detour to talk about an important application area of machine learning: text classification. \n", "\n", "Afterwards, we will see how text classification motivates new classification algorithms." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Classification\n", "\n", "Consider a training dataset $\\mathcal{D} = \\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \\ldots, (x^{(n)}, y^{(n)})\\}$.\n", "\n", "We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$. \n", "\n", "1. __Regression__: The target variable $y \\in \\mathcal{Y}$ is continuous: $\\mathcal{Y} \\subseteq \\mathbb{R}$.\n", "2. __Classification__: The target variable $y$ is discrete and takes on one of $K$ possible values: $\\mathcal{Y} = \\{y_1, y_2, \\ldots y_K\\}$. Each discrete value corresponds to a *class* that we want to predict." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Text Classification\n", "\n", "An interesting instance of a classification problem is classifying text.\n", "* Includes a lot of applied problems: spam filtering, fraud detection, medical record classification, etc.\n", "* Inputs $x$ are sequences of words of an arbitrary length.\n", "* The dimensionality of text inputs is usually very large, proportional to the size of the vocabulary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classification Dataset: Twenty Newsgroups\n", "\n", "To illustrate the text classification problem, we will use a popular dataset called 20-newsgroups. 
\n", "* It contains ~20,000 documents collected approximately evenly from 20 different online newsgroups.\n", "* Each newsgroup covers a different topic such as medicine, computer graphics, or religion.\n", "* This dataset is widely used to benchmark text classification and other types of algorithms." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's load this dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _20newsgroups_dataset:\n", "\n", "The 20 newsgroups text dataset\n", "------------------------------\n", "\n", "The 20 newsgroups dataset comprises around 18000 newsgroups posts on\n", "20 topics split in two subsets: one for training (or development)\n", "and the other one for testing (or for performance evaluation). The split\n", "between the train and test set is based upon a messages posted before\n", "and after a specific date.\n", "\n", "This module contains two loaders. 
The first one,\n", ":func:sklearn.datasets.fetch_20newsgroups,\n", "returns a list of the raw texts that can be fed to text feature\n", "extractors such as :class:sklearn.feature_extraction.text.CountVectorizer\n", "with custom parameters so as to extract feature vectors.\n", "The second one, :func:sklearn.datasets.fetch_20newsgroups_vectorized,\n", "returns ready-to-use features, i.e., it is not necessary to use a feature\n", "extractor.\n", "\n", "**Data Set Characteristics:**\n", "\n", " ================= ==========\n", " Classes 20\n", " Samples total 18846\n", " Dimensionality 1\n", " Features text\n", " ================= ==========\n", "\n", "Usage\n", "~~~~~\n", "\n", "\n" ] } ], "source": [ "#https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html\n", " \n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.datasets import fetch_20newsgroups\n", "\n", "# for this lecture, we will restrict our attention to just 4 different newsgroups:\n", "categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']\n", "\n", "# load the dataset\n", "twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)\n", "\n", "# print some information on it\n", "print(twenty_train.DESCR[:1100])" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The set of targets in this dataset are the newsgroup topics:\n", "twenty_train.target_names" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: s0612596@let.rug.nl (M.M. 
Zwart)\n", "Subject: catholic church poland\n", "Organization: Faculteit der Letteren, Rijksuniversiteit Groningen, NL\n", "Lines: 10\n", "\n", "Hello,\n", "\n", "I'm writing a paper on the role of the catholic church in Poland after 1989. \n", "Can anyone tell me more about this, or fill me in on recent books/articles(\n", "in english, german or french). Most important for me is the role of the \n", "church concerning the abortion-law, religious education at schools,\n", "birth-control and the relation church-state(government). Thanx,\n", "\n", " Masja,\n", "\"M.M.Zwart\"\n", "\n" ] } ], "source": [ "# Let's examine one data point\n", "print(twenty_train.data[3])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2257\n" ] } ], "source": [ "# We have about 2k data points in total\n", "print(len(twenty_train.data))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Feature Representations for Text\n", "\n", "Each data point $x$ in this dataset is a sequence of characters of an arbitrary length.\n", "\n", "How do we transform these into $d$-dimensional features $\\phi(x)$ that can be used with our machine learning algorithms?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "* We may devise hand-crafted features by inspecting the data:\n", " * Does the message contain the word \"church\"? Does the email of the user originate outside the United States? Is the organization a university? etc. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* We can count the number of occurrences of each word:\n", " * Does this message contain \"Aardvark\", yes or no?\n", " * Does this message contain \"Apple\", yes or no?\n", " * ... Does this message contain \"Zebra\", yes or no?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Finally, many modern deep learning methods can directly work with sequences of characters of an arbitrary length." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bag of Words Representations\n", "\n", "Perhaps the most widely used approach to representing text documents is called \"bag of words\".\n", "\n", "We start by defining a vocabulary $V$ containing all the possible words we are interested in, e.g.:\n", "$$V = \\{\\text{church}, \\text{doctor}, \\text{fervently}, \\text{purple}, \\text{slow}, ...\\}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A bag of words representation of a document $x$ is a function $\\phi(x) \\to \\{0,1\\}^{|V|}$ that outputs a feature vector\n", "$$\n", "\\phi(x) = \\left( \n", "\\begin{array}{c}\n", "0 \\\\\n", "1 \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "\\end{array}\n", "\\right)\n", "\\begin{array}{l}\n", "\\;\\text{church} \\\\\n", "\\;\\text{doctor} \\\\\n", "\\;\\text{fervently} \\\\\n", "\\\\\n", "\\;\\text{purple} \\\\\n", "\\\\\n", "\\end{array}\n", "$$\n", "of dimension $|V|$. The $j$-th component $\\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's see an example of this approach on 20-newsgroups.\n", "\n", "We start by computing these features using the sklearn library." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(2257, 35788)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# vectorize the training set\n", "count_vect = CountVectorizer(binary=True)\n", "X_train = count_vect.fit_transform(twenty_train.data)\n", "X_train.shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In sklearn, we can retrieve the index of $\\phi(x)$ associated with each word using the expression count_vect.vocabulary_.get(word):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index for the word \"church\": 8609\n", "Index for the word \"computer\": 9338\n" ] } ], "source": [ "# The CountVectorizer class records the index j associated with each word in V\n", "print('Index for the word \"church\": ', count_vect.vocabulary_.get(u'church'))\n", "print('Index for the word \"computer\": ', count_vect.vocabulary_.get(u'computer'))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our featurized dataset is in the matrix X_train. We can use the above indices to retrieve the 0-1 value that has been computed for each word:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "From: s0612596@let.rug.nl (M.M. Zwart)\n", "Subject: catholic church poland\n", "Organization: Faculteit der Letteren, Rijksuniversiteit Groningen, NL\n", "Lines: 10\n", "\n", "Hello,\n", "\n", "I'm writing a paper on the role of the catholic church in Poland after 1989. 
\n", "Can anyone tell me more about this, or fill me in on recent books/articles(\n", "in english, german or french). Most important for me is the role of the \n", "church concerning the abortion-law, religious education at schools,\n", "birth-control and the relation church-state(government). Thanx,\n", "\n", " Masja,\n", "\"M.M.Zwart\"\n", "\n", "------------------------------------------------------------\n", "Value at the index for the word \"church\": 1\n", "Value at the index for the word \"computer\": 0\n", "Value at the index for the word \"doctor\": 0\n", "Value at the index for the word \"important\": 1\n" ] } ], "source": [ "# We can examine if any of these words are present in our previous datapoint\n", "print(twenty_train.data[3])\n", "\n", "# let's see if it contains these words\n", "print('---'*20)\n", "print('Value at the index for the word \"church\": ', X_train[3, count_vect.vocabulary_.get(u'church')])\n", "print('Value at the index for the word \"computer\": ', X_train[3, count_vect.vocabulary_.get(u'computer')])\n", "print('Value at the index for the word \"doctor\": ', X_train[3, count_vect.vocabulary_.get(u'doctor')])\n", "print('Value at the index for the word \"important\": ', X_train[3, count_vect.vocabulary_.get(u'important')])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Practical Considerations\n", "\n", "In practice, we may use some additional modifications of this technique:" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Sometimes, the feature $\\phi(x)_j$ for the $j$-th word holds the count of occurrences of word $j$ instead of just the binary occurrence." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* The raw text is usually preprocessed. One common technique is *stemming*, in which we only keep the root of the word.\n", " * e.g. 
\"slowly\", \"slowness\", both map to \"slow\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Filtering for common *stopwords* such as \"the\", \"a\", \"and\". Similarly, rare words are also typically excluded." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classification Using BoW Features\n", "\n", "Let's now have a look at the performance of classification over bag of words features.\n", "\n", "Now that we have a feature representation $\\phi(x)$, we can apply the classifier of our choice, such as logistic regression." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n", "[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s finished\n" ] }, { "data": { "text/plain": [ "LogisticRegression(C=100000.0, multi_class='multinomial', verbose=True)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "# Create an instance of Softmax and fit the data.\n", "logreg = LogisticRegression(C=1e5, multi_class='multinomial', verbose=True)\n", "logreg.fit(X_train, twenty_train.target)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And now we can use this model for predicting on new inputs." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'God is love' => soc.religion.christian\n", "'OpenGL on the GPU is fast' => comp.graphics\n" ] } ], "source": [ "docs_new = ['God is love', 'OpenGL on the GPU is fast']\n", "\n", "X_new = count_vect.transform(docs_new)\n", "predicted = logreg.predict(X_new)\n", "\n", "for doc, category in zip(docs_new, predicted):\n", " print('%r => %s' % (doc, twenty_train.target_names[category]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Summary of Text Classification\n", "\n", "* Classifying text normally requires specifying features over the raw data.\n", "* A widely used representation is \"bag of words\", in which features are occurrences or counts of words." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Once text is featurized, any off-the-shelf supervised learning algorithm can be applied, but some work better than others, as we will see next." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "# Part 2: Naive Bayes\n", "\n", "Next, we are going to look at Naive Bayes --- a generative classification algorithm. We will apply Naive Bayes to the text classification problem.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Review: Classification\n", "\n", "Consider a training dataset $\\mathcal{D} = \\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \\ldots, (x^{(n)}, y^{(n)})\\}$.\n", "\n", "We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$. \n", "\n", "1. __Regression__: The target variable $y \\in \\mathcal{Y}$ is continuous: $\\mathcal{Y} \\subseteq \\mathbb{R}$.\n", "2. 
__Classification__: The target variable $y$ is discrete and takes on one of $K$ possible values: $\\mathcal{Y} = \\{y_1, y_2, \\ldots y_K\\}$. Each discrete value corresponds to a *class* that we want to predict." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Text Classification\n", "\n", "An interesting instance of a classification problem is classifying text.\n", "* Includes a lot of applied problems: spam filtering, fraud detection, medical record classification, etc.\n", "* Inputs $x$ are sequences of words of an arbitrary length.\n", "* The dimensionality of text inputs is usually very large, proportional to the size of the vocabulary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Bag of Words Features\n", "\n", "Given a vocabulary $V$, a bag of words representation of a document $x$ is a function $\\phi(x) \\to \\{0,1\\}^{|V|}$ that outputs a feature vector\n", "$$\n", "\\phi(x) = \\left( \n", "\\begin{array}{c}\n", "0 \\\\\n", "1 \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "\\end{array}\n", "\\right)\n", "\\begin{array}{l}\n", "\\;\\text{church} \\\\\n", "\\;\\text{doctor} \\\\\n", "\\;\\text{fervently} \\\\\n", "\\\\\n", "\\;\\text{purple} \\\\\n", "\\\\\n", "\\end{array}\n", "$$\n", "of dimension $|V|$. The $j$-th component $\\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise." 
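] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [
"To make the review concrete, here is a minimal pure-Python sketch of the binary bag of words mapping (the tiny vocabulary and document below are invented for illustration):\n",
"```python\n",
"vocab = ['church', 'doctor', 'fervently', 'purple', 'slow']\n",
"doc = 'the doctor spoke slowly about the church'\n",
"words = set(doc.split())\n",
"# phi[j] = 1 if the j-th vocabulary word occurs in the document\n",
"phi = [1 if w in words else 0 for w in vocab]\n",
"print(phi)  # [1, 1, 0, 0, 0]; note 'slowly' does not match 'slow' without stemming\n",
"```"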
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Generative Models\n", "\n", "There are two types of probabilistic models: *generative* and *discriminative*.\n", "\\begin{align*}\n", "\\underbrace{P_\\theta(x,y) : \\mathcal{X} \\times \\mathcal{Y} \\to [0,1]}_\\text{generative model} & \\;\\; & \\underbrace{P_\\theta(y|x) : \\mathcal{X} \\times \\mathcal{Y} \\to [0,1]}_\\text{discriminative model}\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Given a new datapoint $x'$, we can match it against each class model and find the class that looks most similar to it:\n", "\\begin{align*}\n", "\\arg \\max_y \\log p(y | x) = \\arg \\max_y \\log \\frac{p(x | y) p(y)}{p(x)} = \\arg \\max_y \\log p(x | y) p(y),\n", "\\end{align*}\n", "where we have applied Bayes' rule in the second equation." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Gaussian Discriminant Model\n", "\n", "The GDA algorithm defines the following model family.\n", "* The probability $P(x\\mid y=k)$ of the data under class $k$ is a [multivariate Gaussian](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) $\\mathcal{N}(x; \\mu_k, \\Sigma_k)$ with parameters\n", "$\\mu_k, \\Sigma_k$.\n", "* The distribution over classes is [Categorical](https://en.wikipedia.org/wiki/Categorical_distribution), denoted $\\text{Categorical}(\\phi_1, \\phi_2, ..., \\phi_K)$. Thus, $P_\\theta(y=k) = \\phi_k$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Thus, $P_\\theta(x,y)$ is a mixture of $K$ Gaussians:\n", "$$P_\\theta(x,y) = \\sum_{k=1}^K P_\\theta(y=k) P_\\theta(x|y=k) = \\sum_{k=1}^K \\phi_k \\mathcal{N}(x; \\mu_k, \\Sigma_k)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Problem 1: Discrete Data\n", "\n", "What would happen if we used GDA to perform text classification?\n", "The first problem we face is that the input data is discrete:\n", "$$\n", "\\phi(x) = \\left( \n", "\\begin{array}{c}\n", "0 \\\\\n", "1 \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "\\end{array}\n", "\\right)\n", "\\begin{array}{l}\n", "\\;\\text{church} \\\\\n", "\\;\\text{doctor} \\\\\n", "\\;\\text{fervently} \\\\\n", "\\\\\n", "\\;\\text{purple} \\\\\n", "\\\\\n", "\\end{array}\n", "$$\n", "This data does not follow a Normal distribution, hence the GDA model is clearly misspecified." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Problem 2: High Dimensionality\n", "\n", "A first solution is to assume that $x$ is sampled from a categorical distribution that assigns a probability to each possible state of $x$.\n", "$$\n", "p(x) = p \\left( \n", "\\begin{array}{c}\n", "0 \\\\\n", "1 \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "0 \n", "\\end{array}\n", "\\right.\n", "\\left.\n", "\\begin{array}{l}\n", "\\;\\text{church} \\\\\n", "\\;\\text{doctor} \\\\\n", "\\;\\text{fervently} \\\\\n", "\\vdots \\\\\n", "\\;\\text{purple}\n", "\\end{array}\n", "\\right) = 0.0012\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, if the dimensionality $d$ of $x$ is high (e.g., vocabulary has size 10,000), $x$ can take a huge number of values ($2^{10000}$ in our example). We need to specify $2^{d}-1$ parameters for the categorical distribution." 
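] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [
"To get a feel for this blow-up, here is a quick sketch (the vocabulary size below is a made-up toy value, not the 20-newsgroups vocabulary) comparing the $2^d - 1$ parameters of the full categorical model with a budget of one parameter per word:\n",
"```python\n",
"d = 30  # hypothetical (tiny) vocabulary size\n",
"full_categorical = 2 ** d - 1  # one parameter per possible state of x, minus one\n",
"one_per_word = d               # one parameter per dimension of x\n",
"print(full_categorical, one_per_word)  # 1073741823 30\n",
"```\n",
"\n",
"Even for a vocabulary of just 30 words, the full categorical model needs over a billion parameters."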
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Naive Bayes Assumption\n", "\n", "In order to deal with high-dimensional $x$, we simplify the problem by making the *Naive Bayes* assumption:\n", "$$p(x|y) = \\prod_{j=1}^d p(x_j \\mid y)$$\n", "In other words, the probability $p(x|y)$ factorizes over each dimension." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "* For example, if $x$ is a binary bag of words representation, then $p(x_j | y)$ is the probability of seeing the $j$-th word." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* We can model each $p(x_j | y)$ via a Bernoulli distribution, which has only one parameter. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Hence, it takes only $d$ parameters (instead of $2^d-1$) to specify the entire distribution $p(x|y) = \\prod_{j=1}^d p(x_j \\mid y)$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bernoulli Naive Bayes Model\n", "\n", "We can apply the Naive Bayes assumption to obtain a model for when $x$ is in a bag of words representation.\n", "\n", "The *Bernoulli Naive Bayes* model $P_\\theta(x,y)$ is defined as follows:\n", "* The distribution over classes is [Categorical](https://en.wikipedia.org/wiki/Categorical_distribution), denoted $\\text{Categorical}(\\phi_1, \\phi_2, ..., \\phi_K)$. Thus, $P_\\theta(y=k) = \\phi_k$.\n", "* The conditional probability of the data under class $k$ factorizes as $P_\\theta(x|y=k) = \\prod_{j=1}^d P(x_j \\mid y=k)$ (the Naive Bayes assumption), where each $P_\\theta(x_j \\mid y=k)$ is a $\\text{Bernoulli}(\\psi_{jk})$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Formally, we have:\n", "\\begin{align*}\n", "P_\\theta(y) & = \\text{Categorical}(\\phi_1,\\phi_2,\\ldots,\\phi_K) \\\\\n", "P_\\theta(x_j|y=k) & = \\text{Bernoulli}(\\psi_{jk}) \\\\\n", "P_\\theta(x|y=k) & = \\prod_{j=1}^d P_\\theta(x_j|y=k)\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " \n", "# Part 3: Naive Bayes: Learning\n", "\n", "We are going to continue our discussion of Naive Bayes.\n", "\n", "We will now turn our attention to learning the parameters of the model and using them to make predictions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Text Classification\n", "\n", "An interesting instance of a classification problem is classifying text.\n", "* Includes a lot of applied problems: spam filtering, fraud detection, medical record classification, etc.\n", "* Inputs $x$ are sequences of words of an arbitrary length.\n", "* The dimensionality of text inputs is usually very large, proportional to the size of the vocabulary." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Bag of Words Features\n", "\n", "Given a vocabulary $V$, a bag of words representation of a document $x$ is a function $\\phi(x) \\to \\{0,1\\}^{|V|}$ that outputs a feature vector\n", "$$\n", "\\phi(x) = \\left( \n", "\\begin{array}{c}\n", "0 \\\\\n", "1 \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "0 \\\\\n", "\\vdots \\\\\n", "\\end{array}\n", "\\right)\n", "\\begin{array}{l}\n", "\\;\\text{church} \\\\\n", "\\;\\text{doctor} \\\\\n", "\\;\\text{fervently} \\\\\n", "\\\\\n", "\\;\\text{purple} \\\\\n", "\\\\\n", "\\end{array}\n", "$$\n", "of dimension $|V|$. The $j$-th component $\\phi(x)_j$ equals $1$ if $x$ contains the $j$-th word in $V$ and $0$ otherwise." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bernoulli Naive Bayes Model\n", "\n", "The *Bernoulli Naive Bayes* model $P_\\theta(x,y)$ is defined as follows:\n", "* The distribution over classes is [Categorical](https://en.wikipedia.org/wiki/Categorical_distribution), denoted $\\text{Categorical}(\\phi_1, \\phi_2, ..., \\phi_K)$. Thus, $P_\\theta(y=k) = \\phi_k$.\n", "* The conditional probability of the data under class $k$ factorizes as $P_\\theta(x|y=k) = \\prod_{j=1}^d P(x_j \\mid y=k)$ (the Naive Bayes assumption), where each $P_\\theta(x_j \\mid y=k)$ is a $\\text{Bernoulli}(\\psi_{jk})$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Maximum Likelihood Learning\n", "\n", "In order to fit probabilistic models, we use the following objective:\n", "$$\\max_\\theta \\mathbb{E}_{x, y \\sim \\mathbb{P}_\\text{data}} \\log P_\\theta(x, y).$$\n", "This seeks to find a model that assigns high probability to the training data." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's use maximum likelihood to fit the Bernoulli Naive Bayes model. 
Note that the model parameters $\\theta$ are the union of the parameters of each sub-model:\n", "$$\\theta = (\\phi_1, \\phi_2,\\ldots, \\phi_K, \\psi_{11}, \\psi_{21}, \\ldots, \\psi_{dK}).$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Learning a Bernoulli Naive Bayes Model\n", "\n", "Given a dataset $\\mathcal{D} = \\{(x^{(i)}, y^{(i)})\\mid i=1,2,\\ldots,n\\}$, we want to optimize the log-likelihood $\\ell(\\theta) = \\log L(\\theta)$:\n", "\\begin{align*}\n", "\\ell(\\theta) & = \\sum_{i=1}^n \\log P_\\theta(x^{(i)}, y^{(i)}) = \\sum_{i=1}^n \\log P_\\theta(x^{(i)} | y^{(i)}) + \\sum_{i=1}^n \\log P_\\theta(y^{(i)}) \\\\\n", "& = \\sum_{k=1}^K \\sum_{j=1}^d \\underbrace{\\sum_{i :y^{(i)} =k} \\log P(x^{(i)}_j | y^{(i)} ; \\psi_{jk})}_\\text{all the terms that involve $\\psi_{jk}$} + \\underbrace{\\sum_{i=1}^n \\log P(y^{(i)} ; \\vec \\phi)}_\\text{all the terms that involve $\\vec \\phi$}.\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that each parameter $\\psi_{jk}$ appears in only one group of terms, and the parameters $\\vec \\phi$ appear together in a single separate group." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As in Gaussian Discriminant Analysis, the log-likelihood decomposes into a sum of terms. 
To optimize for some $\\psi_{jk}$, we only need to look at the set of terms that contain $\\psi_{jk}$:\n", "$$\\arg\\max_{\\psi_{jk}} \\ell(\\theta) = \\arg\\max_{\\psi_{jk}} \\sum_{i :y^{(i)} =k} \\log p(x^{(i)}_j | y^{(i)} ; \\psi_{jk}).$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Similarly, optimizing for $\\vec \\phi = (\\phi_1, \\phi_2, \\ldots, \\phi_K)$ only involves a single term:\n", "$$\\max_{\\vec \\phi} \\sum_{i=1}^n \\log P_\\theta(x^{(i)}, y^{(i)} ; \\theta) = \\max_{\\vec\\phi} \\sum_{i=1}^n \\log P_\\theta(y^{(i)} ; \\vec \\phi).$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Optimizing the Model Parameters\n", "\n", "These observations greatly simplify the optimization of the model. Let's first consider the optimization over $\\vec \\phi = (\\phi_1, \\phi_2, \\ldots, \\phi_K)$. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As in Gaussian Discriminant Analysis, we can take a derivative over $\\phi_k$ and set it to zero to obtain \n", "$$\\phi_k = \\frac{n_k}{n}$$\n", "for each $k$, where $n_k = |\\{i : y^{(i)} = k\\}|$ is the number of training targets with class $k$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Thus, the optimal $\\phi_k$ is just the proportion of data points with class $k$ in the training set!" 
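] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [
"In code, this closed-form solution is a one-liner; here is a minimal numpy sketch on a toy label vector (the labels below are invented for illustration, not the newsgroups targets):\n",
"```python\n",
"import numpy as np\n",
"\n",
"y = np.array([0, 0, 1, 2, 2, 2])  # toy class labels with K = 3\n",
"# phi_k = n_k / n: the proportion of data points in each class\n",
"phis = np.bincount(y, minlength=3) / len(y)\n",
"print(phis)  # proportions 2/6, 1/6, 3/6\n",
"```"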
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Similarly, we can maximize the likelihood for the other parameters to obtain closed-form solutions:\n", "\\begin{align*}\n", "\\psi_{jk} = \\frac{n_{jk}}{n_k}.\n", "\\end{align*}\n", "where $n_{jk} = |\\{i : x^{(i)}_j = 1 \\text{ and } y^{(i)} = k\\}|$ is the number of $x^{(i)}$ with label $k$ and a positive occurrence of word $j$.\n", "\n", "Each $\\psi_{jk}$ is simply the proportion of documents in class $k$ that contain the word $j$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Querying the Model\n", "\n", "How do we ask the model for predictions? As discussed earlier, we can apply Bayes' rule:\n", "$$\\arg\\max_y P_\\theta(y|x) = \\arg\\max_y P_\\theta(x|y)P(y).$$\n", "Thus, we can evaluate $P_\\theta(x|y=k)P(y=k)$ for each class $k$ and choose the class that explains the data best." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classification Dataset: Twenty Newsgroups\n", "\n", "To illustrate the text classification problem, we will use a popular dataset called 20-newsgroups. \n", "* It contains ~20,000 documents collected approximately evenly from 20 different online newsgroups.\n", "* Each newsgroup covers a different topic such as medicine, computer graphics, or religion.\n", "* This dataset is widely used to benchmark text classification and other types of algorithms." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's load this dataset." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. 
_20newsgroups_dataset:\n", "\n", "The 20 newsgroups text dataset\n", "------------------------------\n", "\n", "The 20 newsgroups dataset comprises around 18000 newsgroups posts on\n", "20 topics split in two subsets: one for training (or development)\n", "and the other one for testing (or for performance evaluation). The split\n", "between the train and test set is based upon a messages posted before\n", "and after a specific date.\n", "\n", "This module contains two loaders. The first one,\n", ":func:sklearn.datasets.fetch_20newsgroups,\n", "returns a list of the raw texts that can be fed to text feature\n", "extractors such as :class:sklearn.feature_extraction.text.CountVectorizer\n", "with custom parameters so as to extract feature vectors.\n", "The second one, :func:sklearn.datasets.fetch_20newsgroups_vectorized,\n", "returns ready-to-use features, i.e., it is not necessary to use a feature\n", "extractor.\n", "\n", "**Data Set Characteristics:**\n", "\n", " ================= ==========\n", " Classes 20\n", " Samples total 18846\n", " Dimensionality 1\n", " Features text\n", " ================= ==========\n", "\n", "Usage\n", "~~~~~\n", "\n", "\n" ] } ], "source": [ "#https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html\n", " \n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.datasets import fetch_20newsgroups\n", "\n", "# for this lecture, we will restrict our attention to just 4 different newsgroups:\n", "categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']\n", "\n", "# load the dataset\n", "twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)\n", "\n", "# print some information on it\n", "print(twenty_train.DESCR[:1100])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Example: Text Classification\n", "\n", "Let's see how this approach can be used in practice on the text 
classification dataset.\n", "* We will learn a good set of parameters for a Bernoulli Naive Bayes model.\n", "* We will compare the model's outputs to the true labels." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's see an example of this approach on 20-newsgroups.\n", "\n", "We start by computing these features using the sklearn library." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(2257, 1000)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "# vectorize the training set\n", "count_vect = CountVectorizer(binary=True, max_features=1000)\n", "y_train = twenty_train.target\n", "X_train = count_vect.fit_transform(twenty_train.data).toarray()\n", "X_train.shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's compute the maximum likelihood model parameters on our dataset." 
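As a warm-up, the closed-form estimates $\phi_k = n_k/n$ and $\psi_{jk} = n_{jk}/n_k$ can be sketched in pure Python on a tiny made-up binary dataset (the data and variable names below are illustrative, not part of the lecture's 20-newsgroups pipeline):

```python
# Toy sketch of the closed-form maximum likelihood estimates
# phi_k = n_k / n and psi_jk = n_jk / n_k on a made-up binary dataset
# (4 documents, 3 words, 2 classes).

# each row is one document's binary bag-of-words vector
X = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
y = [0, 0, 1, 1]
K = 2
n, d = len(X), len(X[0])

phis = [0.0] * K                       # phis[k] = fraction of documents in class k
psis = [[0.0] * d for _ in range(K)]   # psis[k][j] = proportion of class-k docs containing word j
for k in range(K):
    X_k = [x for x, label in zip(X, y) if label == k]
    n_k = len(X_k)
    phis[k] = n_k / n
    for j in range(d):
        psis[k][j] = sum(x[j] for x in X_k) / n_k

print(phis)  # [0.5, 0.5]
print(psis)  # [[1.0, 0.5, 0.5], [0.0, 0.5, 1.0]]
```

The numpy code that follows computes exactly these quantities, vectorized over the full dataset.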
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.21267169 0.25875055 0.26318121 0.26539654]\n" ] } ], "source": [ "# we can implement these formulas over the 20-newsgroups dataset\n", "n = X_train.shape[0] # size of the dataset\n", "d = X_train.shape[1] # number of features in our dataset\n", "K = 4 # number of classes\n", "\n", "# these are the shapes of the parameters\n", "psis = np.zeros([K,d])\n", "phis = np.zeros([K])\n", "\n", "# we now compute the parameters\n", "for k in range(K):\n", "    X_k = X_train[y_train == k]\n", "    psis[k] = np.mean(X_k, axis=0)\n", "    phis[k] = X_k.shape[0] / float(n)\n", "\n", "# print out the class proportions\n", "print(phis)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can compute predictions using Bayes' rule." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 1 3 0 3 3 3 2 2 2]\n" ] } ], "source": [ "# we can implement this in numpy\n", "def nb_predictions(x, psis, phis):\n", "    \"\"\"This returns class assignments and scores under the NB model.\n", "    \n", "    We compute \\arg\\max_y p(y|x) as \\arg\\max_y p(x|y)p(y)\n", "    \"\"\"\n", "    # adjust shapes\n", "    n, d = x.shape\n", "    x = np.reshape(x, (1, n, d))\n", "    psis = np.reshape(psis, (K, 1, d))\n", "    \n", "    # clip probabilities to avoid log(0)\n", "    psis = psis.clip(1e-14, 1-1e-14)\n", "    \n", "    # compute log-probabilities\n", "    logpy = np.log(phis).reshape([K,1])\n", "    logpxy = x * np.log(psis) + (1-x) * np.log(1-psis)\n", "    logpyx = logpxy.sum(axis=2) + logpy\n", "\n", "    return logpyx.argmax(axis=0).flatten(), logpyx.reshape([K,n])\n", "\n", "idx, logpyx = nb_predictions(X_train, psis, phis)\n", "print(idx[:10])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "subslide" } }, "source": [ "We can measure the accuracy on the training set:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.8692955250332299" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(idx==y_train).mean()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'OpenGL on the GPU is fast' => comp.graphics\n" ] } ], "source": [ "docs_new = ['OpenGL on the GPU is fast']\n", "\n", "X_new = count_vect.transform(docs_new).toarray()\n", "predicted, logpyx_new = nb_predictions(X_new, psis, phis)\n", "\n", "for doc, category in zip(docs_new, predicted):\n", "    print('%r => %s' % (doc, twenty_train.target_names[category]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Algorithm: Bernoulli Naive Bayes\n", "\n", "* __Type__: Supervised learning (multi-class classification)\n", "* __Model family__: Mixtures of Bernoulli distributions\n", "* __Objective function__: Log-likelihood\n", "* __Optimizer__: Closed-form solution" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ " \n", "# Part 4: Discriminative vs. Generative Algorithms\n", "\n", "We conclude our lectures on generative algorithms by revisiting the question of how they compare to discriminative algorithms." 
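As a first observation, a generative model can also answer the discriminative query $p(y|x)$: normalizing the joint scores $p(x|y=k)p(y=k)$ over the $K$ classes recovers the posterior. Here is a minimal sketch in log-space; the function name and the log-joint scores are made up for illustration:

```python
import math

# A generative model scores the joint log p(x, y=k); Bayes' rule turns
# these into the discriminative quantity p(y=k | x) by normalizing over k.

def posterior_from_log_joint(log_joint):
    """Normalize log p(x, y=k) scores into p(y=k | x) (a softmax)."""
    m = max(log_joint)                     # subtract the max for numerical stability
    unnorm = [math.exp(s - m) for s in log_joint]
    Z = sum(unnorm)                        # plays the role of p(x), up to a constant
    return [u / Z for u in unnorm]

# hypothetical log-joint scores for one document under K = 4 classes
post = posterior_from_log_joint([-120.3, -118.9, -125.0, -119.4])
print([round(p, 3) for p in post])  # posterior over the 4 classes
print(post.index(max(post)))        # predicted class: 1
```

Note that the arg max of the posterior equals the arg max of the log-joint scores, which is why the prediction rule above never needs to compute $p(x)$.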
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Generative Models\n", "\n", "There are two types of probabilistic models: *generative* and *discriminative*.\n", "\\begin{align*}\n", "\\underbrace{P_\\theta(x,y) : \\mathcal{X} \\times \\mathcal{Y} \\to [0,1]}_\text{generative model} & \\;\\; & \\underbrace{P_\\theta(y|x) : \\mathcal{X} \\times \\mathcal{Y} \\to [0,1]}_\text{discriminative model}\n", "\\end{align*}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Given a new datapoint $x'$, we can match it against each class model and find the class that looks most similar to it:\n", "\\begin{align*}\n", "\\arg \\max_y \\log p(y | x) = \\arg \\max_y \\log \\frac{p(x | y) p(y)}{p(x)} = \\arg \\max_y \\log p(x | y) p(y),\n", "\\end{align*}\n", "where we have applied Bayes' rule in the second equality and dropped the denominator $p(x)$ in the third because it does not depend on $y$." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Review: Gaussian Discriminant Model\n", "\n", "The GDA algorithm defines the following model family.\n", "* The probability $P(x\\mid y=k)$ of the data under class $k$ is a [multivariate Gaussian](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) $\\mathcal{N}(x; \\mu_k, \\Sigma_k)$ with parameters\n", "$\\mu_k, \\Sigma_k$.\n", "* The distribution over classes is [Categorical](https://en.wikipedia.org/wiki/Categorical_distribution), denoted $\\text{Categorical}(\\phi_1, \\phi_2, ..., \\phi_K)$. Thus, $P_\\theta(y=k) = \\phi_k$." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Thus, $P_\\theta(x,y)$ is a mixture of $K$ Gaussians:\n", "$$P_\\theta(x,y) = \\sum_{k=1}^K P_\\theta(y=k) P_\\theta(x|y=k) = \\sum_{k=1}^K \\phi_k \\mathcal{N}(x; \\mu_k, \\Sigma_k)$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classification Dataset: Iris Flowers\n", "\n", "To look at properties of generative algorithms, let's look again at the Iris flower dataset. \n", "\n", "It's a classical dataset originally published by [R. A. Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>sepal length (cm)</th>\n", "      <th>sepal width (cm)</th>\n", "      <th>petal length (cm)</th>\n", "      <th>petal width (cm)</th>\n", "      <th>target</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>0</td></tr>\n", "    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "
\n", "