We are now going to see a different way of defining machine learning models called *decision trees*.

At a high level, a supervised machine learning problem has the following structure:

$$ \underbrace{\text{Training Dataset}}_\text{Features + Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

To explain what a decision tree is, we are going to use the UCI diabetes dataset that we have been working with earlier.

Let's start by loading this dataset.

In [4]:

```
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 4]
from sklearn import datasets
# Load the diabetes dataset
diabetes = datasets.load_diabetes(as_frame=True)
print(diabetes.DESCR)
```

We can also look at the data directly.

In [2]:

```
# Load the diabetes dataset
diabetes_X, diabetes_y = diabetes.data, diabetes.target
# create a binary risk feature
diabetes_y_risk = diabetes_y.copy()
diabetes_y_risk[:] = 0
diabetes_y_risk[diabetes_y > 150] = 1
# Print part of the dataset
diabetes_X.head()
```

Out[2]:

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 |

Decision trees are machine learning models that mimic how a human would approach this problem.

- We start by picking a feature (e.g., age).
- Then we *branch* on the feature based on its value (e.g., age > 65?).
- We select and branch on one or more additional features (e.g., is the patient male?).
- Finally, we return an output that depends on all the features we've seen (e.g., a man over 65).
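This human-like branching process can be sketched as a hand-written decision function. The thresholds and output labels below are illustrative assumptions, not values learned from data:

```python
# A hand-coded "decision tree": branch on features one at a time,
# then return an output that depends on all the branches taken.
def risk_level(age, sex):
    """Toy decision procedure; thresholds and labels are made up."""
    if age > 65:            # first rule: branch on age
        if sex == "male":   # second rule: branch on sex
            return "high risk"      # output for men over 65
        return "moderate risk"      # output for women over 65
    return "low risk"               # output for everyone else

print(risk_level(70, "male"))    # high risk
print(risk_level(40, "female"))  # low risk
```

A learning algorithm automates exactly this: it chooses which feature to branch on, where to place each threshold, and what output to return in each case.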

Let's first see an example on the diabetes dataset.

We will train a decision tree using its implementation in `sklearn`.

In [3]:

```
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
# create and fit the model
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(diabetes_X.iloc[:,:4], diabetes_y_risk)
# visualize the model
plot_tree(clf, feature_names=diabetes_X.columns[:4], impurity=False)
plt.show()
```
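Once fitted, the tree can be used for prediction with `predict`, just like any other `sklearn` model. The snippet below is a self-contained sketch on synthetic data (not the diabetes dataset), so it can run on its own:

```python
# Fit a small decision tree on synthetic data and use it to predict.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)  # label depends only on the first feature

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Points far on either side of the learned threshold get the matching label.
print(clf.predict([[1.0, 0.0], [-1.0, 0.0]]))
```

Because the labels depend only on the first feature, even a depth-2 tree recovers a split close to zero on that feature.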

Let's now define a decision tree a bit more formally. The first important concept is that of a rule.

- A decision rule $r : \mathcal{X} \to \{\text{true}, \text{false}\}$ is a partition of the feature space into two disjoint regions, e.g.: $$ r(x) = \begin{cases}\text{true} & \text{if } x_\text{bmi} \leq 0.009 \\ \text{false} & \text{if } x_\text{bmi} > 0.009 \end{cases} $$
- Normally, a rule applies to only one feature or attribute $x_j$ of $x$.
- If $x_j$ is continuous, the rule normally separates inputs $x_j$ into disjoint intervals $(-\infty, c]$ and $(c, \infty)$.
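The rule $r(x)$ from the example above is just a threshold test on one feature, which can be written directly (the threshold $0.009$ comes from the example in the text):

```python
# The decision rule r(x): true on (-inf, c], false on (c, inf).
def r(x_bmi, c=0.009):
    """Threshold rule on the single continuous feature bmi."""
    return x_bmi <= c

print(r(-0.05))  # True  (bmi in (-inf, 0.009])
print(r(0.05))   # False (bmi in (0.009, inf))
```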

The next important concept is that of a decision region.

- A decision region $R\subseteq \mathcal{X}$ is a subset of the feature space defined by the application of a set of rules $r_1, r_2, \ldots, r_m$ and their values $v_1, v_2, \ldots, v_m \in \{\text{true}, \text{false}\}$, i.e.: $$ R = \{x \in \mathcal{X} \mid r_1(x) = v_1 \text{ and } \ldots \text{ and } r_m(x) = v_m \} $$
- For example, a decision region in the diabetes problem is: $$ R = \{x \in \mathcal{X} \mid x_\text{bmi} \leq 0.009 \text{ and } x_\text{bp} > 0.004 \} $$
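The example region above is the conjunction of two rules, so membership in $R$ is a simple boolean test:

```python
# Membership in the decision region R from the text:
# R = { x : bmi <= 0.009 and bp > 0.004 }
def in_region(x_bmi, x_bp):
    return (x_bmi <= 0.009) and (x_bp > 0.004)

print(in_region(0.0, 0.01))   # True: both rules hold
print(in_region(0.02, 0.01))  # False: bmi rule fails
print(in_region(0.0, 0.0))    # False: bp rule fails
```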

A decision tree is a model $f : \mathcal{X} \to \mathcal{Y}$ of the form $$ f(x) = \sum_{R \in \mathcal{R}} y_R \mathbb{I}\{x \in R\}. $$

- Here $\mathbb{I}\{\cdot\}$ is an indicator function (equal to one if its argument is true and zero otherwise), and the values $y_R \in \mathcal{Y}$ are the outputs for each region.
- The set $\mathcal{R}$ is a collection of decision regions. They are obtained by *recursive binary splitting*.
- The rules defining the regions $\mathcal{R}$ can be organized into a tree, with one rule per internal node and regions being the leaves.
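The formula $f(x) = \sum_{R \in \mathcal{R}} y_R \mathbb{I}\{x \in R\}$ can be implemented directly. The sketch below uses the two regions induced by the single bmi rule from earlier, with assumed outputs $y_R \in \{0, 1\}$:

```python
# f(x) = sum over regions R of y_R * I{x in R}.
# Each region is a (membership test, output y_R) pair; since the
# regions partition the input space, exactly one indicator is 1.
regions = [
    (lambda x: x["bmi"] <= 0.009, 0),  # R1: low bmi  -> y = 0
    (lambda x: x["bmi"] > 0.009, 1),   # R2: high bmi -> y = 1
]

def f(x):
    return sum(y_R * int(in_R(x)) for in_R, y_R in regions)

print(f({"bmi": -0.02}))  # 0
print(f({"bmi": 0.03}))   # 1
```

Because the regions are disjoint and cover the whole feature space, the sum always picks out exactly one region's output.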

In [4]:

```
plot_tree(clf, feature_names=diabetes_X.columns[:4], impurity=False)
plt.show()
```

We can also illustrate decision trees via this figure from Hastie et al.