```
import os
import torch
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score
```

During this experiment, we will investigate the maths and implement Decision Trees, Random Forests and Adaboost. We will use Pytorch to create and train these algorithms to predict heart diseases.

Before starting, read the previous article of this series if you haven’t already: **Clustering Methods From Scratch With PyTorch**.

# Heart Disease Dataset

The models will be trained using the Heart Disease Dataset. Fourteen features were selected among 76 attributes:

- age: age in years
- sex:
- 1: male
- 0: female

- cp: chest pain type
- 0: asymptomatic
- 1: atypical angina
- 2: non-anginal pain
- 3: typical angina

- trestbps: resting blood pressure
- chol: serum cholestoral in mg/dl
- fbs: fasting blood sugar > 120 mg/dl
- restecg: resting electrocardiographic results
- 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- 1: normal
- 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

- thalach: maximum heart rate achieved
- exang: exercise induced angina
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- 1: upsloping
- 2: flat
- 3: downsloping

- ca: number of major vessels (0-3) colored by flourosopy
- thal:
- 3 : fixed defect
- 6 : normal
- 7 : reversable defect

- disease:
- 0: artery diameter narrowing < 50%
- 1-3: artery diameter narrowing > 50%, close to 3 is very severe

# Data Preparation

## Load data

`= pd.read_csv(os.path.join('data', 'processed.cleveland.data')) hdisease `

```
# The data does not include the columns names, we did a little trick to add them
= {
columns '63.0' : 'age',
'1.0' : 'sex',
'1.0.1' : 'cp',
'145.0' : 'trestbps',
'233.0' : 'chol',
'1.0.2' : 'fbs',
'2.0' : 'restecg',
'150.0' : 'thalach',
'0.0' : 'exang',
'2.3' : 'oldpeak',
'3.0' : 'slope',
'0.0.1' : 'ca',
'6.0' : 'thal',
'0' : 'disease'
}=columns, inplace=True) hdisease.rename(columns
```

## Replacing missing values

```
# Remove the severity of the disease, 0: healthy, 1: sick
'disease'] > 1, 'disease'] = 1
hdisease.loc[hdisease[
# Setting gather unknown value and assign a new class for them
'ca'] == '?', 'ca'] = 4.0
hdisease.loc[hdisease['thal'] == '?', 'thal'] = 0.0
hdisease.loc[hdisease[
# Setting all values to float
= hdisease.astype(float) hdisease
```

` hdisease.head()`

age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | disease | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 1.0 |

1 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1.0 |

2 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0.0 |

3 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0.0 |

4 | 56.0 | 1.0 | 2.0 | 120.0 | 236.0 | 0.0 | 0.0 | 178.0 | 0.0 | 0.8 | 1.0 | 0.0 | 3.0 | 0.0 |

# Exploratory Data Analysis

## Correlation Heatmap

```
=(20, 8))
plt.figure(figsize=True, square=True, fmt='0.2f') sns.heatmap(hdisease.corr(), annot
```

`<matplotlib.axes._subplots.AxesSubplot at 0x20900bb87c8>`

## Bar Plots

```
= plt.subplots(2, 2, figsize=(12, 10))
_, axes ='disease', y='exang', data=hdisease, ax=axes[0, 0])
sns.barplot(x='disease', y='cp', data=hdisease, ax=axes[0, 1])
sns.barplot(x='disease', y='thalach', data=hdisease, ax=axes[1, 0])
sns.barplot(x='disease', y='chol', data=hdisease, ax=axes[1, 1]) sns.barplot(x
```

`<matplotlib.axes._subplots.AxesSubplot at 0x20902f1b088>`

Most features are highly correlated to the disease column. By interpreting the barplots of healthy and sick patients, it is likely that a patient with heart disease will have the following caracteristics:

- cp: more chest pain
- thalach: a lower maximum heart rate, surprisingly, high beats per minute is a signal of heart health
- exang: increase pain when exercising
- cholesterol levels are similar through sick and healthy patient: no direct link with heart disease

# Decision Trees

Decision trees are defined by recursively partitioning the input space into regions. Each region is partitioned into sub-regions up to a leaf node. Space is sub-divided into non overlapping regions based on some criteria at each node. Once created, a tree can be navigated with a new sample of data following each branch according to the conditions. Each leaf node maps the input to a class or continuous value depending on the task. To understand how to grow a tree, we must introduce the gini index first. Assuming that \(D\) is the data in a leaf and \(N\) the number of samples in \(D\), we estimate the class-conditional probability as follow: \[ \hat{\pi}_c = \frac{1}{N} \sum_{i \in D}\mathbb{I}(y_i = c) \] where \(c\) are the classes of the target. The Gini index of a given leaf is: \[ G_{l} = \sum_{c=1}^{C}\hat{\pi}_c(1 -\hat{\pi}_c) = 1 - \sum_{c=1}^{C}\hat{\pi}^2_c \]

To have the Gini index of a split, we compute the Gini index on each of its leaves and multiply each of them by the proportion of samples in each leaves and sum both values:

\[ G_{Node} = \sum_{i=1}^{K} \frac{k_{li}}{U}G_{li} \] where \(K\) is the number of leaves of the node, \(k_{li}\) is the number of samples in the leaf \(i^{th}\) leaf and \(U\) is the number of sample to split.

In the case of categorical inputs, the most common approach is to consider the split condition as \(x_{ij} = c_k\) or \(x_{ij} \neq c_k\), for each possible class label \(c_k\). The rule to perform a split on a node is as follows:

- if the node is in the maximum depth of the tree, keep it as a leaf.
- for each attribute, we try all the possible thresholds and compute the Gini index of each split, the threshold across all attributes that minimize the Gini index is selected.
- a node is said pure if the split generates a empty leaf, if a node is pure, it is set as a leaf.
- if the node has a lowest score than the Gini index of the best split, we keep it as a leaf.

```
# Check which data are categorical or not
= []
is_categorical for col in hdisease.columns:
# If less than 9 unique values, than the feature is categorical
int(len(np.unique(hdisease[col])) < 9))
is_categorical.append(# Convert to Pytorch tensor
= torch.tensor(is_categorical)[:-1] is_categorical
```

```
# Convert data to numpy
= hdisease.values
data # Get unique values of the target
= torch.from_numpy(np.unique(data[:, -1])) utarget
```

```
= torch.from_numpy(data[:, :-1])
X = torch.from_numpy(data[:, -1]).int() y
```

```
class Node():
"""Base class for a tree node
We will use a single class to have the options to build a Decision Tree,
a Random Forest and Adaboost.
Attributes:
X: A torch tensor for the data.
y: A torch tensor for the labels.
is_categorical_list: A list of boolean, 1 if the features is categorical,
0 otherwise.
max_depth: A positive integer for the depth of the tree.
columns: A list of the name of each column.
depth: A positive integer for the depth of the node.
nb_features: A positive integer for selecting a number of random
features for the split. Only used when training a Random Forest.
weights: A torch tensor for computing the weighted gini index.
Only used when training Adaboost.
col_perm: A torch tensor of the permuted column. Only for Random Forests.
X_full: A torch tensor that is keeping the whole data in case we shuffle
the input for training a Random Forest.
col_full: A list that keeps the whole feature names. Only for Random Forests.
is_catfull: A list that keeps the whole is_categorical_list if training a
Random Forest.
size: A positive integer for the number of samples.
nb_features: A positive integer for the number of features
cutoff: A float for the splitting value of the node.
col: A positive integer as an index for which column to apply the condition
for the split.
left: A Node class for the left child node.
right: A Node class for the right child node.
"""
def __init__(self, X, y, is_categorical, max_depth, columns, depth=0, nb_features=0, weights=None):
"""Init function of the class
This function builds the whole tree, recursively creating each child node down
to the leaves.
Args:
Described in the attribute section of the class
"""
# If training random forest
if nb_features != 0:
self.col_perm = torch.randperm(X.shape[1])[:nb_features]
# We have to keep the permutated data as well as the full data
# because we have to pass the full data to the child nodes
self.X = X[:, self.col_perm]
self.X_full = X
self.columns = columns[self.col_perm]
self.col_full = columns
self.is_categorical_list = is_categorical[self.col_perm]
self.is_catfull = is_categorical
else:
# Regular training of the decision tree
self.X = X
self.columns = columns
self.is_categorical_list = is_categorical
# Weights are used to compute a weighted gini index
# Only when training AdaBoost
self.weights = weights
self.y = y
self.size = float(X.shape[0])
# Prediction if the node will turn into a leaf, or split value
self.cutoff = None
# Column to check for splitting the data
self.col = None
# Child nodes
self.left = None
self.right = None
# Whether or not the split value is categorical
self.depth = depth
self.nb_features = nb_features
# If the node contains only one label, it is set as a leaf
if self.is_pure() or depth == max_depth or self.X.shape[0] == 1:
# Select the predominant categorical value and set it as the prediction
self.make_leaf()
return
# Computing the gini index on the population before the split
= self.gini(y, self.weights)
gini_n = self.find_best_split()
params
= params[0]
gini_s # If no improvement, make the node as a leaf
if gini_s >= gini_n:
self.make_leaf()
else:
self.make_split(params, max_depth)
def gini(self, y, weights=None):
"""Computes the gini index
Args:
y: A torch tensor for the labels of the data.
weights: If training a Random Forest, the weights associated
with each sample.
"""
if weights is None:
# Regular gini index
= y.bincount() / float(y.shape[0])
pi else:
# Weighted gini index
= self.get_weighted_population(y, weights)
pi return 1 - pi.pow(2).sum()
# For weighted gini index:
def get_weighted_population(self, y, weights):
"""Computes the weighted gini index
Instead of counting the samples for each class
we count the weights of each class and divide by the sum of the weights
Args:
y: A torch tensor for the labels of the data.
weights: If training a Random Forest, the weights associated
"""
= torch.zeros((2,), dtype=torch.float32)
pi = torch.where(y == 0)[0]
idx_0 = torch.where(y == 1)[0]
idx_1 0] = weights[idx_0].sum()
pi[1] = weights[idx_1].sum()
pi[return pi / weights.sum()
def clean_attributes(self):
"""Cleans variables that are not usefull to predict"""
del(self.X)
del(self.y)
del(self.weights)
del(self.size)
if self.nb_features != 0:
del(self.X_full)
del(self.col_full)
del(self.is_catfull)
del(self.colperm)
del(self.nb_features)
def is_pure(self):
"""Checks if the node is pure
The node is pure if there is only one label in the population
"""
return len(self.y.unique()) == 1
def get_label(self):
"""Returns the most present label as a prediction if the node turns to a leaf"""
if self.weights is None:
return self.y.bincount().argmax().item()
else:
return self.get_weighted_population(self.y, self.weights).argmax().item()
def make_leaf(self):
"""Makes the node a leaf"""
self.cutoff = self.get_label()
self.clean_attributes()
def make_split(self, params, max_depth):
"""Performs the split
Args:
params: See find_best_split function
max_depht: A positive integer for the maximum of the tree.
"""
self.col = params[1]
self.cutoff = params[2]
# Save the categorical boolean of the selected column for the predict method
self.var_categorical = self.is_categorical_list[self.col].item()
# Recursively split by creating two instances of the Node class using the two groups
if self.nb_features != 0:
= self.col_full
cols = self.is_catfull
categorical_list else:
= self.columns
cols = self.is_categorical_list
categorical_list # Creating child nodes based on the best params
# If training Random Forest, we pass nb_features
# If training AdaBoost, we pass the weights
self.left = Node(X=params[3][0],
=params[3][1],
y=categorical_list,
is_categorical=max_depth,
max_depth=cols,
columns=self.depth + 1,
depth=self.nb_features,
nb_features=params[5])
weightsself.right = Node(X=params[4][0],
=params[4][1],
y=categorical_list,
is_categorical=max_depth,
max_depth=cols,
columns=self.depth + 1,
depth=self.nb_features,
nb_features=params[6])
weightsself.clean_attributes()
def gini_split(self, idx_g_1, idx_g_2, cutoff, feature_idx, best_params):
"""Computes the gini index of the future split
Args:
idx_g_1: A torch tensor for the indices of the first group.
idx_g_2: A torch tensor for the indices of the second group.
cutoff: A float for the split value.
feature_idx: A positive integer for the index of the feature to test.
best_params: See function find_best_split
"""
= self.y[idx_g_1].squeeze(1)
g_1 = self.y[idx_g_2].squeeze(1)
g_2 if self.weights is None:
#Gini index
= (float(g_1.shape[0]) / self.size) * self.gini(g_1)
gini_g1 = (float(g_2.shape[0]) / self.size) * self.gini(g_2)
gini_g2 else:
# Weighted gini index
= self.weights[idx_g_1]
g_1_w = self.weights[idx_g_2]
g_2_w = self.weights.sum()
w_sum = (g_1_w.sum() / w_sum) * self.gini(g_1, g_1_w)
gini_g1 = (g_2_w.sum() / w_sum) * self.gini(g_2, g_2_w)
gini_g2
= (gini_g1 + gini_g2)
gini_split if gini_split < best_params[0]:
0] = gini_split.item()
best_params[1] = feature_idx
best_params[2] = cutoff.item()
best_params[# If training a base learner of a random forest
# pass the full data to child nodes
if self.nb_features != 0:
3] = [self.X_full[idx_g_1].squeeze(1), g_1]
best_params[4] = [self.X_full[idx_g_2].squeeze(1), g_2]
best_params[else:
3] = [self.X[idx_g_1].squeeze(1), g_1]
best_params[4] = [self.X[idx_g_2].squeeze(1), g_2]
best_params[if self.weights is not None:
# Gather weights of each groups to pass them to the child nodes
# for their own weighted gini index
5] = g_1_w
best_params[6] = g_2_w
best_params[return best_params
def find_best_split(self):
"""Finds the best split
Creates a parameter list to store the parameters of the best split
It contains:
0: best gini index
1: column index of the best split
2: value of the best split
3: left group [X, y], less than equal to #3 or belongs to the class #3 if categorical
4: right group [X, y], greater than #3 or does not belong to the class #3
5 : weights of the first group
6 : weights of the second group
"""
= [2, -1, -1, None, None, None, None]
best_params for i in range(self.X.shape[1]):
= self.X[:, i]
vals if self.is_categorical_list[i]:
for cutoff in vals.unique():
= (vals == cutoff).nonzero()
idx_uv = (vals != cutoff).nonzero()
idx_uv_not = self.gini_split(idx_uv, idx_uv_not, cutoff, i, best_params)
best_params else:
for cutoff in vals.unique():
= (vals <= cutoff).nonzero()
idx_leq = (vals > cutoff).nonzero()
idx_ge = self.gini_split(idx_leq, idx_ge, cutoff, i, best_params)
best_params return best_params
def get_dict(self):
"""Returns a dictionary containing nodes and their information"""
= {}
node_dict if self.left is None and self.right is None:
'pred'] = self.cutoff
node_dict[else:
'cutoff'] = self.cutoff
node_dict['feature'] = self.columns[self.col]
node_dict['categorical'] = self.var_categorical
node_dict['left'] = self.left.get_dict()
node_dict['right'] = self.right.get_dict()
node_dict[return node_dict
def predict(self, sample):
"""Takes a single input and predicts its class
Follows the tree based on the conditions
"""
if self.nb_features != 0:
= sample[self.col_perm]
sample_in else:
= sample
sample_in if self.left is None and self.right is None:
return self.cutoff
if self.var_categorical:
if sample_in[self.col] == self.cutoff:
return self.left.predict(sample)
else:
return self.right.predict(sample)
else:
if sample_in[self.col] <= self.cutoff:
return self.left.predict(sample)
else:
return self.right.predict(sample)
```

## Decision Tree Class

```
class DecisionTreeClassifier(BaseEstimator):
"""Class for the Decision Tree Classifier
Class that implements the sklearn methods
Just a wrapper for our node class
Attributes:
max_depth: A positive integer for the maximum depth of the tree.
columns: A list of the name of each column.
nb_features: A positive integer for the number of features
is_categorical: A list of boolean, 1 if the features is categorical,
"""
def __init__(self, max_depth, is_categorical, columns, nb_features=0):
"""Inits the Decision Tree class
Args:
See attributes section.
"""
if nb_features < 0:
raise ValueError('negative integer passed to nb_features.')
self.max_depth = max_depth
# Wether or not each column is a categorical value
self.is_categorical = is_categorical
self.columns = columns
self.root = None
# Number of random features to select
# Only used when building a random forest base learner
# If 0 then train a decision tree using all the features availables
self.nb_features = nb_features
def fit(self, X, y, **kwargs):
"""Trains the model
Needs to get the 'sample_weight' key in kwargs.
Mandatory for using sklearn cross validation.
Args:
X: A torch tensor for the data.
y: A torch tensor for the labels.
"""
if self.nb_features > X.shape[1]:
raise ValueError('parameter np_features should be less than equal to the number of features')
if 'sample_weight' in kwargs.keys():
= kwargs['sample_weight']
weights else:
= None
weights self.root = Node(X, y, self.is_categorical, self.max_depth, self.columns, 0, self.nb_features, weights)
def predict(self, X):
"""Predicts the labels of a batch of input samples"""
if len(X.shape) == 0:
return 'error: can not predict on empty input.'
if len(X.shape) == 1:
return self.root.predict(X)
= torch.zeros((X.shape[0],), dtype=torch.int32)
pred if self.root == None:
return 'error: use the fit method before using predict.'
for i in range(X.shape[0]):
= X[i, :]
sample = self.root.predict(sample)
pred[i] return pred
def __str__(self):
"""Method for printing the tree"""
if self.root == None:
return 'error: use the fit method to print the result of the training.'
= self.root.get_dict()
tree_dict
pprint(tree_dict)return ''
```

```
= DecisionTreeClassifier(max_depth=3,
dt =is_categorical,
is_categorical=hdisease.columns,
columns=13)
nb_features
dt.fit(X, y)print(dt)
```

```
{'categorical': 1,
'cutoff': 3.0,
'feature': 'thal',
'left': {'categorical': 1,
'cutoff': 0.0,
'feature': 'ca',
'left': {'categorical': 0,
'cutoff': 156.0,
'feature': 'trestbps',
'left': {'pred': 0},
'right': {'pred': 1}},
'right': {'categorical': 1,
'cutoff': 4.0,
'feature': 'cp',
'left': {'pred': 1},
'right': {'pred': 0}}},
'right': {'categorical': 1,
'cutoff': 4.0,
'feature': 'cp',
'left': {'categorical': 0,
'cutoff': 0.5,
'feature': 'oldpeak',
'left': {'pred': 1},
'right': {'pred': 1}},
'right': {'categorical': 1,
'cutoff': 0.0,
'feature': 'ca',
'left': {'pred': 0},
'right': {'pred': 1}}}}
```

We just fit the whole data to the model to visualize the structure of the tree. In the next section, we will validate the performance of the model using cross validation and have an interpretation of the tree structure.

## Cross Validation

```
= DecisionTreeClassifier(max_depth=3,
dt =is_categorical,
is_categorical=hdisease.columns,
columns=13)
nb_features# Using 5 K-folds
= cross_val_score(dt, X, y, scoring='accuracy', cv=5).mean()
test_acc print('Accuracy on the test set:', test_acc)
```

`Accuracy on the test set: 0.8110382513661202`

Using cross validation and accuracy as our metric, the model performed well for such a simple classifier. Indeed, 81% accuracy on a 5 k-fold cross validation is acceptable given the complexity of predicting heart disease. Looking at the tree structure, the first split is done on the ‘thal’ feature which has the highest correlation with the target. The most discriminative features are then ‘ca’, ‘cp’ and ‘oldpeak’ and they all have more than 40% correlation with the target. The model uses lower correlated variable such as ‘chol’ and ‘testbps’ inside the deeper nodes.

The ‘thal’ feature contributes the most in the decision. It refers to the Thalium stress test result, consisting in injecting radioactive element into the bloodstream of the patient. The ‘thal’ categorical values refers to the quality of blood circulation. It makes sense that a patient with poor blood circulation is very likely to have a heart disease.

The performance using a decision tree is acceptable, but let’s try to do better using a Random Forest.

# Random Forest

Random forest is a technique known as bootstrap aggregating (bagging) which consists in taking multiple different models and compute the ensemble prediction:

\[\begin{align} f(\boldsymbol{x}) = \sum_{i = 0}^{M}f_m(\boldsymbol{x}) \end{align}\]

where \(f_m\) is the \(m^{th}\) model. The random forest technique is an ensemble model using decision trees as base learners. This ensemble model tries to decorrelate the base learners by learning trees on a randomly chosen subset of features, as well as a randomly chosen subset of samples.

In other words, to train a random forest, we create \(m\) subsets of the training set where we can select multiple times the same samples. These subsets are known as boostrapped datasets. Finally, each base learner is trained in the same fashion as a decision tree except that at each node, when we perform the split, we randomly choose a fixed number of features from the data.

```
class RandomForestClassifier(BaseEstimator):
"""Class for the Random Forest model
Implements the BaseEstimator sklearn class to use cross validation.
Attributes:
max_depth: A positive integer for the maximum depth of the tree.
columns: A list of the name of each column.
nb_features: A positive integer for the number of features
is_categorical: A list of boolean, 1 if the features is categorical,
n_estimators: A positive integer for the number of estimators in the ensemble.
trees: A list of DecisionTreeClassifiers class of the ensemble.
"""
def __init__(self, max_depth, is_categorical, columns, nb_features, n_estimators):
"""Inits the Random Forest class
Args:
See attribute section.
"""
self.max_depth = max_depth
self.is_categorical = is_categorical
self.columns = columns
self.nb_features = nb_features
# Number of trees in the ensemble model
self.n_estimators = n_estimators
self.trees = []
def fit(self, X, y, **kwargs):
"""Trains the random forest"""
if self.nb_features > X.shape[1]:
raise ValueError('parameter np_features should be less than equal to the number of features')
for i in range(self.n_estimators):
# Create boostrapped subset
= self.get_boostrap_data(X, y)
X_bstrp, y_bstrp = DecisionTreeClassifier(self.max_depth, self.is_categorical, self.columns, self.nb_features)
dt # Train on the boostrapped subst
dt.fit(X_bstrp, y_bstrp)self.trees.append(dt)
def get_boostrap_data(self, X, y):
"""Returns the data and labels randomly shuffled"""
= torch.randint(0, X.shape[0] - 1, size=(X.shape[0],))
idx return X[idx, :], y[idx]
def predict_sample(self, X):
"""Predicts on one sample"""
= np.array([0, 0])
preds # Predict with each base learner and return the most voted class
for tree in self.trees:
+= 1
preds[tree.predict(X)] return preds.argmax()
def predict(self, X):
"""Predicts on a batch of samples"""
if len(X.shape) == 0:
return 'error: can not predict on empty input.'
if len(X.shape) == 1:
return self.predict_sample(X)
= torch.zeros((X.shape[0],), dtype=torch.int32)
pred if len(self.trees) == 0:
return 'error: use the fit method before using predict.'
for i in range(X.shape[0]):
= X[i, :]
sample = self.predict_sample(sample)
pred[i] return pred
```

```
= [3, 4]
max_depth_params = [i for i in range(2, 10)]
nb_features_params for _ in range(20):
= np.random.choice(max_depth_params)
max_depth = np.random.choice(nb_features_params)
nb_features = RandomForestClassifier(max_depth=max_depth,
rf =is_categorical,
is_categorical=hdisease.columns,
columns=nb_features,
nb_features=100)
n_estimators# Using 5 K-folds
= cross_val_score(rf, X, y, scoring='accuracy', cv=5).mean()
test_acc print('Test accuracy:\t' + str(test_acc) + ', nb features:\t' + str(nb_features) + ', max depth:\t', str(max_depth))
```

```
Test accuracy: 0.8342622950819673, nb features: 7, max depth: 3
Test accuracy: 0.8243169398907104, nb features: 6, max depth: 4
Test accuracy: 0.8242076502732241, nb features: 6, max depth: 3
Test accuracy: 0.8243715846994537, nb features: 8, max depth: 3
Test accuracy: 0.8408743169398907, nb features: 6, max depth: 3
Test accuracy: 0.8275956284153005, nb features: 8, max depth: 3
Test accuracy: 0.834153005464481, nb features: 8, max depth: 3
Test accuracy: 0.8243169398907104, nb features: 4, max depth: 3
Test accuracy: 0.830928961748634, nb features: 6, max depth: 3
Test accuracy: 0.8342076502732242, nb features: 2, max depth: 3
Test accuracy: 0.8340983606557376, nb features: 7, max depth: 4
Test accuracy: 0.8175409836065575, nb features: 8, max depth: 4
Test accuracy: 0.8375956284153006, nb features: 4, max depth: 3
Test accuracy: 0.8341530054644808, nb features: 9, max depth: 3
Test accuracy: 0.8308743169398907, nb features: 5, max depth: 3
Test accuracy: 0.8275956284153004, nb features: 7, max depth: 4
Test accuracy: 0.8243169398907104, nb features: 2, max depth: 4
Test accuracy: 0.8441530054644808, nb features: 4, max depth: 4
Test accuracy: 0.8375409836065575, nb features: 3, max depth: 4
Test accuracy: 0.8342076502732242, nb features: 7, max depth: 3
```

After a quick parameter search, the best model achieves 84.4% accuracy. A great improvement of 3% comparing to a single decision tree. To go further we will try a technique known as boosting.

# Gradient Boosting

Boosting is an algorithm using multiple weak learners. A weak learner is defined to be slightly better than random guessing. The most common weak learner is known as a stamp. A stamp is an univariate tree with only one split. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data.

The predictions from the weak learners are combined through a weighted majority vote: \[\begin{align} f(\boldsymbol{x}) = \text{sign}\left(\sum_{m=1}^{M}\alpha_m f_m(\boldsymbol{x})\right) \end{align}\]

where \(f(\boldsymbol{x})\) is the prediction of the final model, \(f_m(\boldsymbol{x})\) is the prediction of the weak learner \(m\) and \(\{\alpha_1, \dots, \alpha_M\}\) are the weights computed by the boosting algorithm. Their effect is to give higher influence to the more accurate learners. This formula represent the prediction in case of a binary classification \(\{-1, 1\}\)

The data modification consists of applying weights \(\{\boldsymbol{w_1}, \dots, \boldsymbol{w_N}\}\) to each training samples. Initially the weights are set to \(\frac{1}{N}\) where \(N\) is the number of samples. At iteration \(k\), we perform a data transformation using the weights on each training samples. We fit the data into the model and the samples that were misclassified by the learner \(f_{k-1}\) have their weights increased while the samples that were well classified by \(f_{k-1}\) have their weights decreased. The importance of the harder samples sequentially increase and the learners are forced to concentrate on the observation that are missed on the previous iterations.

AdaBoost, short for Adaptative Boosting, is the first boosting technique that got popular for being able to adapt to the weak learners. In this algorithm, after fitting the weak learner \(m\), we compute the weighted error: \[\begin{align} \text{err}_m = \frac{1}{N}\sum_{i=1}^{N}w_i I\left(y_i \neq f_m(\boldsymbol{x_i})\right) \end{align}\]

Then we set the weights of the model in the final prediction based on its performance: \[\begin{align} \alpha_m = \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right) \end{align}\]

Finally, we update the weights of each samples as follow:

\[\begin{align} w_i &= w_i * \exp(\alpha_m I\left(y_i \neq f_m(\boldsymbol{x}_i)\right)\\ w_i &= w_i * \exp(-\alpha_m I\left(y_i = f_m(\boldsymbol{x}_i)\right) \end{align}\]

The importance of the weak learner’s weight is proportional to its performance during training. In fact, if \(err_m\) is low then \(\alpha_m\) is high and the values of the sample weights \(w_1, \dots, w_N\) get modified by a higher margin. The misclassified samples are scaled up and the well classified samples are scaled down so that the learner will focus on the weakness of the previous one.

Inside the stamps, the splits are selected according to the weighted gini split: \[\begin{align} \hat{\pi}_c = \frac{1}{\sum_{i=1}^{N}w_i} \sum_{i \in D}w_i\mathbb{I}(y_i = c)\\ \text{G} = 1 - \sum_{c=1}^{C}\hat{\pi}^2_c \end{align}\]

Here the samples with high weights contribute more to the gini index than the lower ones.

```
class AdaBoostClassifier(BaseEstimator):
"""Class of the Adaboost model
Attributes:
columns: A list of the name of each column.
is_categorical: A list of boolean, 1 if the features is categorical,
n_estimators: A positive integer for the number of estimators in the ensemble.
learners: A list of DecisionTreeClassifiers for the stamps of the ensemble.
alphas: A torch tensor for the weights defining the importance of each learner
in the final prediction.
"""
def __init__(self, n_estimators=50, is_categorical=None, columns=None):
self.n_estimators = n_estimators
self.learners = []
self.alphas = None
def compute_weighted_err(self, y_true, y_pred, weights):
"""Computes a weighted error based on the weight associated with the samples"""
= torch.where(y_true != y_pred)[0]
idx_neq return weights[idx_neq].sum() / weights.sum()
def set_alpha(self, err):
"""Updates the new weights of the models"""
return torch.log((1 - err) / (err + 1e-10)) / 2
def update_weights(self, weights, y_true, y_pred, alpha):
"""Updates the weights of the samples
Args:
weights: A torch tensor for the weights of each samples.
y_true: A torch tensor for the labels of the data.
y_pred: A torch tensor for the predictions of the model.
alpha: A float for the weight of a learner.
"""
if len(y_true.shape) == 1:
= y_true.unsqueeze(1)
y_true if len(weights.shape) == 1:
= weights.unsqueeze(1)
weights *= torch.exp(alpha * (y_pred != y_true).int())
weights *= torch.exp(-alpha * (y_pred == y_true).int())
weights /= weights.sum()
weights return weights
def fit(self, X, y, **kwargs):
"""Trains the model
Args:
X: A torch tensor for the data.
y: A torch tensor for the labels.
"""
if len(y.shape) == 1:
= y.unsqueeze(1)
y
= X.shape[0]
size # Initialize weights to 1/size
= torch.empty((size,)).fill_(1 / size)
weights = torch.zeros((self.n_estimators, 1))
alphas
for i in range(self.n_estimators):
if len(weights.shape) > 1:
= weights.squeeze(1)
weights
# Each learner is a stamp:
# it has only one split and is using only one feature
= DecisionTreeClassifier(max_depth=1, is_categorical=is_categorical, columns=columns, nb_features=0)
dtc =weights)
dtc.fit(X, y, sample_weight= dtc.predict(X).unsqueeze(1)
pred self.learners.append(dtc)
# Compute weighted error
= self.compute_weighted_err(y, pred, weights)
err # Set weight of the learner
= self.set_alpha(err)
new_alpha = new_alpha
alphas[i] # Update weights
= self.update_weights(weights, y, pred, new_alpha)
weights self.alphas = alphas
def predict(self, X):
"""Predicts the labels of a batch of data"""
= X.shape[0]
size = torch.zeros((size, 1), dtype=torch.float32)
preds for i in range(self.n_estimators):
= self.learners[i].predict(X).unsqueeze(1)
pred # Change zeros to -ones to apply to sign formula discussed above
== 0] = -1
pred[pred += self.alphas[i] * pred
preds = torch.sign(preds)
preds == -1] = 0
preds[preds return preds
```

```
for i in range(10, 150, 10):
= AdaBoostClassifier(n_estimators=i, is_categorical=is_categorical, columns=hdisease.columns)
adaboost = {
fit_params 'n_estimators' : i,
'is_categorical' : is_categorical,
'columns' : hdisease.columns
}= cross_val_score(adaboost, X, y, scoring='accuracy', cv=5).mean()
test_acc print('Test accuracry: ' + str(test_acc) + ', n_estimators: ' + str(i))
```

```
Test accuracry: 0.8374316939890709, n_estimators: 10
Test accuracry: 0.8340437158469944, n_estimators: 20
Test accuracry: 0.837377049180328, n_estimators: 30
Test accuracry: 0.8274863387978142, n_estimators: 40
Test accuracry: 0.8440437158469946, n_estimators: 50
Test accuracry: 0.827431693989071, n_estimators: 60
Test accuracry: 0.8341530054644808, n_estimators: 70
Test accuracry: 0.8275409836065574, n_estimators: 80
Test accuracry: 0.8308743169398907, n_estimators: 90
Test accuracry: 0.8308196721311475, n_estimators: 100
Test accuracry: 0.8274863387978142, n_estimators: 110
Test accuracry: 0.8175956284153004, n_estimators: 120
Test accuracry: 0.8142622950819671, n_estimators: 130
Test accuracry: 0.8274863387978142, n_estimators: 140
```

AdaBoost achieved 84.4% accuracy on this data, the exact same performance as the random forest.

# Conclusion

Decision Trees are very prized by the machine learning community for their interpretability and their handy automatic feature selection. Unfortunately, the decision trees tend to overfit. It is mandatory to limit the max depth parameter to avoid this behavior. Using ensemble techniques is a great way to deal with the tendency of the decision trees to overfit. However, as the model is getting more complex due to aggregation of multiple learners, it is more challenging to interpret a Random Forest or an AdaBoost.

Want some more ? Feel free to read the next article of this series: **Support Vector Machines With CVXOpt**.