Lecture 12: Tree-Based Algorithms

Applied Machine Learning

Volodymyr Kuleshov
Cornell Tech

Part 1: Decision Trees

We are now going to see a different way of defining machine learning models, called decision trees.

Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

The UCI Diabetes Dataset

To explain what a decision tree is, we are going to use the UCI diabetes dataset that we have worked with earlier.

Let's start by loading this dataset.
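Since the original loading code is not reproduced here, the following is a minimal sketch that loads the dataset through scikit-learn's built-in copy of it (the lecture's notebook may load it differently):

from sklearn import datasets

# Load the UCI diabetes dataset as pandas dataframes:
# X holds the input features and y the disease progression target.
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)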

We can also look at the data directly.
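For example, assuming the dataframe X loaded above, we can preview the first few rows:

# Show the first few rows of the feature matrix.
X.head()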

Decision Trees: Intuition

Decision trees are machine learning models that mimic how a human would approach this problem.

  1. We start by picking a feature (e.g., age).
  2. Then we branch on that feature based on its value (e.g., age > 65?).
  3. We select and branch on one or more additional features (e.g., is the patient male?).
  4. Finally, we return an output that depends on all the features we've seen (e.g., a male patient over 65).

Decision Trees: Example

Let's first see an example on the diabetes dataset.

We will train a decision tree using its implementation in sklearn.
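As a rough sketch of what that looks like (assuming the X and y loaded above; the depth and exact setup used in the lecture may differ):

from sklearn.tree import DecisionTreeRegressor, export_text

# Fit a small regression tree to the diabetes data.
tree_model = DecisionTreeRegressor(max_depth=3)
tree_model.fit(X, y)

# Print the learned decision rules as text.
print(export_text(tree_model, feature_names=list(X.columns)))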

Decision Rules

Let's now define a decision tree a bit more formally. The first important concept is that of a rule: a decision rule $r : \mathcal{X} \to \{\text{true}, \text{false}\}$ asks a yes-or-no question about the features, e.g., "is $x_\text{age} > 65$?".

Decision Regions

The next important concept is that of a decision region. Applying a sequence of rules recursively splits the feature space $\mathcal{X}$ into disjoint regions; each leaf of the tree corresponds to one such region $R \subseteq \mathcal{X}$.

Decision Trees: Definition

A decision tree is a model $f : \mathcal{X} \to \mathcal{Y}$ of the form $$ f(x) = \sum_{R \in \mathcal{R}} y_R \mathbb{I}\{x \in R\}, $$ where $\mathcal{R}$ is a set of disjoint decision regions and $y_R$ is the prediction (a class or a real value) associated with region $R$.

We can also illustrate decision trees via this figure from Hastie et al.

Pros and Cons of Decision Trees

Decision trees are important models in machine learning:

  * They are highly interpretable: the model mimics the step-by-step reasoning a human would use to make a prediction.
  * They handle both continuous and categorical features and require little data preprocessing.

Their main disadvantages are that:

  * Sufficiently deep trees can easily overfit the training data.
  * They are high-variance: small perturbations of the dataset can produce a very different tree.

Part 2: Learning Decision Trees

We saw how decision trees are represented. How do we now learn them from data?

The Components of A Supervised Machine Learning Algorithm

We can also define the high-level structure of a supervised learning algorithm as consisting of three components:

  * A model class: the set of models we consider (here, decision trees).
  * An objective: a loss function that measures how well a model fits the data.
  * An optimizer: a procedure for finding a model in the model class that minimizes the objective.

Recall: Decision Trees

A decision tree is a model $f : \mathcal{X} \to \mathcal{Y}$ of the form $$ f(x) = \sum_{R \in \mathcal{R}} y_R \mathbb{I}\{x \in R\}. $$

We can also illustrate decision trees via this figure from Hastie et al.

Learning Decision Trees

At a high level, decision trees are grown by adding nodes one at a time.

def build_tree(tree, data):
    # Keep adding decision rules until the tree is complete (e.g., reaches its maximum size).
    while not tree.is_complete():
        region, region_data = tree.get_region()
        new_rule = split_region(region_data)
        tree.add_rule(region, new_rule)

Most often, we build the tree until it reaches a maximum number of nodes. The crux of the algorithm is in split_region.

There is also a recursive formulation of this algorithm:

def build_tree(data, depth):
    if depth < MAX_DEPTH:
        # internal node: greedily pick a rule and split the data on it
        rule, data_left, data_right = get_new_rule(data)
        left_subtree = build_tree(data_left, depth + 1)
        right_subtree = build_tree(data_right, depth + 1)
        return create_node(rule, left_subtree, right_subtree)
    else:
        # leaf node: store the prediction for this region
        return create_terminal_node(data)

Learning New Decision Rules

How does the split_region function choose a new rule $r$? Given a dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$, we greedily choose the rule that achieves the dataset split with the lowest possible loss.

This can be written as the following optimization problem: $$ \min_{r \in \mathcal{U}} \left( \underbrace{L(\{(x, y) \in \mathcal{D} \mid r(x) = \text{true}\})}_\text{left subtree} + \underbrace{L(\{(x, y) \in \mathcal{D} \mid r(x) = \text{false}\})}_\text{right subtree} \right) $$

where $L$ is a loss function over a subset of the data flagged by the rule and $\mathcal{U}$ is the set of possible rules.

What is the set of possible rules? When $x$ has continuous features, the rules have the following form: $$ r(x) = \begin{cases}\text{true} & \text{if } x_j \leq t \\ \text{false} & \text{if } x_j > t \end{cases} $$ for a feature index $j$ and threshold $t \in \mathbb{R}$.

When $x$ has categorical features, rules may have the following form: $$ r(x) = \begin{cases}\text{true} & \text{if } x_j = t_k \\ \text{false} & \text{if } x_j \neq t_k \end{cases} $$ for a feature index $j$ and possible value $t_k$ for $x_j$.

Objectives for Trees: Regression

What loss functions might we want to use? In regression, it is common to minimize the L2 error between the data and the single best prediction we can make on this data: $$ L(\mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \left( y - \texttt{average-y}(\mathcal{D}) \right)^2. $$

If this was a leaf node, we would predict $\texttt{average-y}(\mathcal{D})$, the average $y$ in the data. The above loss measures the resulting squared error.

This results in the following optimization problem for selecting a decision rule: $$ \min_{r \in \mathcal{U}} \sum_{(x, y) \in \mathcal{D} \,\mid\, r(x) = \text{true}} \left( y - p_\text{true}(r) \right)^2 + \sum_{(x, y) \in \mathcal{D} \,\mid\, r(x) = \text{false}} \left( y - p_\text{false}(r) \right)^2 $$

where $p_\text{true}(r) = \texttt{average-y}(\{(x, y) \mid (x, y) \in \mathcal{D} \text{ and } r(x) = \text{true}\})$ and $p_\text{false}(r) = \texttt{average-y}(\{(x, y) \mid (x, y) \in \mathcal{D} \text{ and } r(x) = \text{false}\})$ are the average predictions on each part of the data split.
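To make split_region concrete, here is a minimal numpy sketch (not the lecture's implementation) that solves this optimization problem by scanning every feature and threshold and returning the rule with the lowest total L2 loss; it assumes the region's data is given as numpy arrays X and y:

import numpy as np

def l2_loss(y):
    # Squared error around the best constant prediction, the mean of y.
    return 0.0 if len(y) == 0 else np.sum((y - y.mean()) ** 2)

def split_region(X, y):
    # Greedily pick the feature index j and threshold t with the lowest total loss.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] <= t
            loss = l2_loss(y[mask]) + l2_loss(y[~mask])
            if best is None or loss < best[0]:
                best = (loss, j, t)
    return best  # (loss, feature index j, threshold t)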

Objectives for Trees: Classification

In classification, we may similarly use the misclassification error, which counts the points that would be misclassified: $$ L(\mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \mathbb{I} \left\{ y \neq \texttt{most-common-y}(\mathcal{D}) \right\}. $$

If this was a leaf node, we would predict $\texttt{most-common-y}(\mathcal{D})$, the most common class $y$ in the data. The above loss measures the resulting misclassification error.

Other losses that can be used include the entropy and the Gini index. These all favor splits in which the different classes do not mix.
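For intuition, here is a small sketch computing normalized (per-example) versions of these three losses for the labels in one region, assuming y is an array of non-negative integer class labels:

import numpy as np

def classification_losses(y):
    # Proportion of each class present in the region.
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    misclassification = 1.0 - p.max()   # error of predicting the most common class
    gini = 1.0 - np.sum(p ** 2)         # Gini index
    entropy = -np.sum(p * np.log2(p))   # entropy in bits
    return misclassification, gini, entropy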

Other Practical Considerations

A few additional comments on the above training procedure:

  * In practice, the tree is grown until it reaches a maximum depth or number of nodes, which is a hyperparameter.
  * Deeper trees fit the training data more closely but are more prone to overfitting.
  * The procedure is greedy: it optimizes one split at a time and is not guaranteed to find the globally best tree.

Algorithm: Classification and Regression Trees (CART)

  * Type: Supervised learning (classification and regression).
  * Model family: Decision trees.
  * Objective function: Squared error (regression); misclassification error, Gini index, or entropy (classification).
  * Optimizer: Greedy growing of the tree, adding one decision rule at a time.

Part 3: Bagging

Next, we are going to see a general technique to improve the performance of machine learning algorithms.

We will then apply it to decision trees to define an improved algorithm.

Review: Overfitting

Overfitting is one of the most common failure modes of machine learning.

Recall this example, in which we take random samples around a true function.

Fitting High-Degree Polynomials

Let's see what happens if we fit a high-degree polynomial to random samples of 20 points from this dataset.
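The plotting cells from the notebook are not reproduced here; the sketch below (which assumes a cosine as the true function, a choice that may differ from the lecture's) captures the idea:

import numpy as np

np.random.seed(0)

def true_fn(x):
    return np.cos(1.5 * np.pi * x)

# Draw 20 noisy samples around the true function.
x = np.sort(np.random.rand(20))
y = true_fn(x) + np.random.randn(20) * 0.1

# Fit a high-degree polynomial: it tracks the 20 samples closely
# but oscillates wildly away from them.
coeffs = np.polyfit(x, y, deg=15)
x_plot = np.linspace(0, 1, 100)
y_plot = np.polyval(coeffs, x_plot)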

High-Variance Models

Each small subset of the data that we train on results in a very different model.

An algorithm that has a tendency to overfit is also called high-variance, because it outputs a predictive model that varies a lot if we slightly perturb the dataset.

Bagging: Bootstrap Aggregation

The idea of bagging is to reduce model variance by averaging many models trained on random subsets of the data.

ensemble = []
for i in range(n_models):
    # collect bootstrap samples and fit a model to each
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = ensemble.average_prediction(x_test)

The data samples are taken with replacement and known as bootstrap samples.

Bagged Polynomial Regression

Let's apply bagging to our polynomial regression problem.

We are going to train a large number of polynomial regressions on random subsets of the dataset of points that we created earlier.

We start by training an ensemble of bagged models.

Let's visualize the predictions of the bagged model on each random dataset sample and compare them to the predictions from un-bagged models.
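As a minimal sketch of both steps (reusing x, y, true_fn, and x_plot from the polynomial example above; the number of models is an arbitrary choice):

n_models = 100
ensemble_coeffs = []

for _ in range(n_models):
    # Bootstrap sample: draw indices with replacement from the dataset above.
    idx = np.random.choice(len(x), size=len(x), replace=True)
    ensemble_coeffs.append(np.polyfit(x[idx], y[idx], deg=15))

# The bagged prediction is the average of the individual polynomial predictions.
y_bagged = np.mean([np.polyval(c, x_plot) for c in ensemble_coeffs], axis=0)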

Extensions of Bagging

There exist a few techniques that are closely related to bagging.

Summary: Bagging

Bagging is a general technique that can be used with high-variance ML algorithms.

It averages predictions from multiple models trained on random subsets of the data.

Part 4: Random Forests

Next, let's see how bagging can be applied to decision trees. This will also provide us with a new algorithm.

Review: Bagging

The idea of bagging is to reduce model variance by averaging many models trained on random subsets of the data.

ensemble = []
for i in range(n_models):
    # collect bootstrap samples and fit a model to each
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = Model().fit(X_i, y_i)
    ensemble.append(model)

# output average prediction at test time:
y_test = ensemble.average_prediction(x_test)

The data samples are taken with replacement and known as bootstrap samples.

Review: Decision Trees

A decision tree is a model $f : \mathcal{X} \to \mathcal{Y}$ of the form $$ f(x) = \sum_{R \in \mathcal{R}} y_R \mathbb{I}\{x \in R\}. $$

We can also illustrate decision trees via this figure from Hastie et al.

Classification Dataset: Iris Flowers

Let's now look at the performance of decision trees on a new dataset, Iris flowers.

It's a classical dataset originally published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.
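One way to load it is through scikit-learn's built-in copy of the dataset (a sketch; the lecture may load it differently):

from sklearn import datasets

# Load the Iris dataset: 150 flowers, 4 features, 3 species.
iris = datasets.load_iris(as_frame=True)
print(iris.frame.head())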

Decision Trees on the Flower Dataset

Let's now consider what happens when we train a decision tree on the Iris flower dataset.

The code below will be used to visualize predictions from decision trees on this dataset.

We may now train and visualize a decision tree on this dataset.
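A sketch of one such visualization, restricted to the first two features (sepal length and width) so that the decision regions can be drawn in two dimensions (the lecture's plotting helper may differ):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X2, y2 = iris.data[:, :2], iris.target  # first two features only

# Fit a (fully grown) decision tree classifier.
tree_clf = DecisionTreeClassifier().fit(X2, y2)

# Evaluate the tree on a dense grid and plot the resulting decision regions.
xx, yy = np.meshgrid(np.arange(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 0.02),
                     np.arange(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 0.02))
Z = tree_clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y2, edgecolor='k')
plt.xlabel(iris.feature_names[0]); plt.ylabel(iris.feature_names[1])
plt.show()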

Two Problems With Decision Trees

We see two problems with the output of the decision tree on the Iris dataset:

  * The decision boundaries are jagged rather than smooth, carving out small regions to fit individual points.
  * The tree overfits the data: small artifacts in the training set strongly influence the predictions.

High-Variance Decision Trees

When the trees have sufficiently high depth, they can quickly overfit the data.

Recall that this is called the high variance problem, because small perturbations of the data lead to large changes in model predictions.

Consider the performance of a decision tree classifier on 3 random subsets of the data.

Random Forests

In order to reduce the variance of the basic decision tree, we apply bagging -- the variance reduction technique that we have seen earlier.

We refer to bagged decision trees as Random Forests. (In practice, random forests additionally consider only a random subset of the features at each split, which further decorrelates the individual trees.)

Instantiating our definition of bagging with decision trees, we obtain the following pseudocode definition of random forests:

random_forest = []
for i in range(n_models):
    # collect bootstrap samples and fit a decision tree to each
    X_i, y_i = sample_with_replacement(X, y, n_samples)
    model = DecisionTree().fit(X_i, y_i)
    random_forest.append(model)

# output average prediction at test time:
y_test = random_forest.average_prediction(x_test)

We may implement random forests in Python as follows:
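One way to do this is to bag scikit-learn decision trees directly (a sketch reusing the X2, y2 Iris features from above; the lecture may instead rely on a different helper or on sklearn's RandomForestClassifier):

from sklearn.ensemble import BaggingClassifier

# BaggingClassifier's default base estimator is a decision tree, so this
# trains 100 trees, each on a bootstrap sample of the training data, and
# averages their predictions at test time.
random_forest = BaggingClassifier(n_estimators=100, bootstrap=True)
random_forest.fit(X2, y2)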

Random Forests on the Flower Dataset

Consider now what happens when we deploy random forests on the same dataset as before.

Now, each prediction is the average of the predictions made by the set of bagged decision trees.

The boundaries are much more smooth and well-behaved.

Algorithm: Random Forests

  * Type: Supervised learning (classification and regression).
  * Model family: Averages (ensembles) of decision trees, each trained on a bootstrap sample of the data.
  * Objective function: The same per-tree objectives as CART.
  * Optimizer: Greedy tree growing combined with bagging.

Pros and Cons of Random Forests

Random forests remain a popular machine learning algorithm:

  * Like decision trees, they require little data preprocessing and handle both continuous and categorical features.
  * Bagging greatly reduces the variance of individual trees, yielding smoother, better-behaved decision boundaries.

Their main disadvantages are that:

  * They lose the interpretability of a single decision tree.
  * Training and storing many trees is more computationally expensive than fitting a single tree.