```
import os
import sys
sys.path.append('..')

import torch
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
```

In this experiment, we will train a K-nearest neighbors (KNN) model on physicochemical data to predict the quality of red and white wines.

If you haven’t already, feel free to read the previous article of this series: **Logistic Regression From Scratch With PyTorch**.

The KNN will be implemented from scratch using PyTorch, along with a clear explanation of how the model works. Exploratory data analysis and feature selection techniques are also discussed in the notebook.

# Exploratory Data Analysis

## Data Preparation

```
wine_red = pd.read_csv(os.path.join('data', 'winequality-red.csv'), sep=';')
wine_white = pd.read_csv(os.path.join('data', 'winequality-white.csv'), sep=';')
```

`wine_red.head()`

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |

`wine_white.head()`

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |

Let’s merge the data from red and white wines. Both dataframes have the same labels, so we can easily concatenate them.

`wines = pd.concat([wine_white, wine_red])`

## Correlation heatmap

```
plt.figure(figsize=(16, 8))
wines_corr = wines.corr()
sns.heatmap(wines_corr, annot=True, square=True, fmt='0.2f')
```


The features most correlated with the quality are:

- alcohol
- density
- volatile acidity

Most of the features are also highly correlated with each other. Later in the notebook, we will try to select the best features and avoid redundancy using backward elimination.
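As a rough sketch of how the features most correlated with the target can be read off programmatically rather than from the heatmap, the snippet below ranks features by absolute correlation with `quality`. The `toy` dataframe and its values are made up for illustration; on the real data the same two lines would be applied to `wines`.

```python
import pandas as pd

# Toy stand-in for the `wines` dataframe (values are made up for illustration)
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    'density': [0.9978, 0.9970, 0.9962, 0.9951, 0.9940, 0.9932],
    'pH':      [3.51, 3.20, 3.26, 3.16, 3.30, 3.19],
    'quality': [5, 5, 6, 6, 7, 8],
})

# Correlation of every feature with the target, strongest first.
# The absolute value is used because a strong negative correlation
# is as informative as a strong positive one.
target_corr = toy.corr()['quality'].drop('quality')
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```

In this toy example `alcohol` and `density` come out on top and `pH` last, which only reflects the made-up values.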

# Pre-processing

## Convert quality label to categorical values

`sns.barplot(x=wines['quality'], y=wines['alcohol'], palette='hot')`


Wines with a quality between 7 and 9 have higher alcohol values, so we will use this range for the good wines. It therefore looks appropriate to divide the quality label into bad, medium and good wines.

```
tmp = wines['quality'].values
quality_type = np.empty(tmp.shape, dtype='U6')
quality_type[tmp < 5] = 'bad'
quality_type[(4 < tmp) & (tmp < 7)] = 'medium'
quality_type[6 < tmp] = 'good'

# Replace the numeric quality column with the categorical one
quality_type = pd.DataFrame(columns=['quality'], data=quality_type)
wines = wines.drop('quality', axis=1)
wines = pd.concat([wines.reset_index(drop=True), quality_type.reset_index(drop=True)], axis=1)

sns.countplot(x=wines.iloc[:, -1])
```
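The same binning can also be sketched with `pd.cut`, which is an alternative to the boolean masks above rather than what the notebook uses. The bin edges below are chosen to reproduce the thresholds in the text (below 5 is bad, 5 to 6 is medium, 7 and above is good); the `quality` series is a made-up example.

```python
import pandas as pd

quality = pd.Series([3, 4, 5, 6, 7, 9])

# Bins are right-inclusive by default: (0, 4] -> bad, (4, 6] -> medium, (6, 10] -> good
quality_type = pd.cut(quality, bins=[0, 4, 6, 10], labels=['bad', 'medium', 'good'])
print(quality_type.tolist())
```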


## Isolate features from the target

Feature scaling is mandatory when using a K-nearest neighbors model. The KNN algorithm regresses or classifies an input based on its distance to the training samples. If some features have larger values than others, they contribute more to the overall distance calculation and bias the outcome.
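To make the effect of scale concrete, here is a small sketch with two made-up features on very different scales (think of something like total sulfur dioxide next to density). It uses plain NumPy standardization, which is the same zero-mean, unit-variance transform that `StandardScaler` applies; the array values are assumptions for illustration only.

```python
import numpy as np

# Column 0 has values in the tens to hundreds, column 1 varies in the third decimal
X_toy = np.array([
    [ 34.0, 0.9978],
    [170.0, 0.9979],
    [ 36.0, 0.9900],
])

def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())

# Raw distances from sample 0: the large-valued column decides everything
d_raw_1 = euclidean(X_toy[0], X_toy[1])
d_raw_2 = euclidean(X_toy[0], X_toy[2])
print(d_raw_1, d_raw_2)

# After standardization both columns contribute on a comparable scale
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
d_std_1 = euclidean(X_std[0], X_std[1])
d_std_2 = euclidean(X_std[0], X_std[2])
print(d_std_1, d_std_2)
```

Before scaling, sample 2 looks almost identical to sample 0 because the difference in the second column is invisible next to the first. After scaling, the two distances are of similar size.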

```
X = wines.iloc[:, :-1].to_numpy()
y = wines.iloc[:, -1].to_numpy()
```

## Convert category to integer

`y = LabelEncoder().fit_transform(y)`
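When reading the classification reports later on, it helps to know which integer stands for which label. `LabelEncoder` assigns codes in sorted order of the class names, so the sketch below (on a made-up list of labels, keeping the encoder in a variable so that `classes_` can be inspected) shows the mapping for the three labels used here.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(['medium', 'good', 'bad', 'medium'])

# classes_ holds the original labels, indexed by their integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'bad': 0, 'good': 1, 'medium': 2}
```

With these labels, class 0 is bad, class 1 is good and class 2 is medium.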

## Create the Training Split

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert data to pytorch tensor
X_train = torch.from_numpy(X_train)
X_test = torch.from_numpy(X_test)
y_train = torch.from_numpy(y_train)
y_test = torch.from_numpy(y_test)
```

We will use PyTorch for this experiment.

# K-Nearest Neighbors

The K-nearest neighbors classifier looks at the \(K\) points in the training set that are closest to an input \(\mathbf{x}\), counts how many members of each class are present, and returns the empirical fraction as the estimate.

\[\begin{align} &p(\mathbf{y} = c|\boldsymbol{\mathbf{x}}, D, K) \\ &= \frac{1}{K} \sum_{i \in N_K(\boldsymbol{\mathbf{x}}, D)}\mathbb{I}(y_i = c) \\ &\mathbb{I}(e) = \begin{cases} 1 & \text{if}\;e\;\text{is true}\\ 0 & \text{otherwise} \end{cases} \end{align}\]

where \(c\) is the class, \(D\) is the training set, and \(N_K(\mathbf{x}, D)\) is the set of indices of the \(K\) samples in \(D\) closest to \(\mathbf{x}\).
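The formula can be sketched directly in PyTorch. The function name `knn_class_probabilities`, the `n_classes` argument and the toy tensors are assumptions made for this illustration; the notebook's own `knn` function returns only the predominant class rather than the full vector of fractions.

```python
import torch

def knn_class_probabilities(sample, X, y, k_neighbors, n_classes):
    """Empirical class fractions among the k nearest training points."""
    # Euclidean distance from the sample to every training point
    dist = (X - sample).pow(2).sum(dim=1).sqrt()
    # Indices of the k smallest distances, i.e. N_K(x, D)
    _, indices = torch.topk(dist, k_neighbors, largest=False)
    # Count how often each class appears among the neighbors, then divide by K
    counts = torch.bincount(y[indices], minlength=n_classes)
    return counts.float() / k_neighbors

X_toy = torch.tensor([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]])
y_toy = torch.tensor([0, 0, 1, 1, 1])

probs = knn_class_probabilities(torch.tensor([0.05, 0.05]), X_toy, y_toy,
                                k_neighbors=3, n_classes=2)
print(probs)
```

The three nearest points carry the labels 0, 0 and 1, so the estimate is 2/3 for class 0 and 1/3 for class 1.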

```
def knn(sample, X, y, k_neighbors=10):
    """Instance of K-Nearest Neighbors model
    Args:
        sample: A torch tensor for the sample to classify.
        X: A torch tensor for the data.
        y: A torch tensor for the labels.
        k_neighbors: An integer for the number of nearest neighbors
            to consider.
    """
    sample = sample.unsqueeze(1).T
    # Compute the distance with the train set
    dist = (X - sample).pow(2).sum(axis=1).sqrt()

    # Sort the distances
    _, indices = torch.sort(dist)
    y = y[indices]

    # Get the K most similar samples and return the predominant class
    return y[:k_neighbors].bincount().argmax().item()
```

```
def train_knn(X_train, X_test, y_train, y_test, k_neighbors=1):
    """Trains a K-Nearest Neighbors model
    Args:
        X_train: A torch tensor for the training data.
        X_test: A torch tensor for the test data.
        y_train: A torch tensor for the training labels.
        y_test: A torch tensor for the test labels.
        k_neighbors: An integer for the number of nearest neighbors
            to consider.
    """
    # Allocate space for the prediction
    y_pred_test = np.zeros(y_test.shape, dtype=np.uint8)

    # Predict on each sample of the test set
    for i in range(X_test.shape[0]):
        y_pred_test[i] = knn(X_test[i], X_train, y_train, k_neighbors=k_neighbors)

    y_pred_test = torch.from_numpy(y_pred_test).float()
    return y_pred_test
```
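The Python loop over test samples can become slow on larger datasets. As a possible alternative, here is a sketch of a batched prediction based on `torch.cdist` and `topk`. The name `knn_predict_batch` is hypothetical and not part of the notebook; it builds the full test-by-train distance matrix in memory, and its tie-breaking in the majority vote may differ from the `bincount().argmax()` used above.

```python
import torch

def knn_predict_batch(X_train, y_train, X_test, k_neighbors=1):
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dist = torch.cdist(X_test, X_train)
    # Indices of the k smallest distances for every test row
    _, indices = dist.topk(k_neighbors, dim=1, largest=False)
    # Labels of those neighbors, shape (n_test, k_neighbors)
    neighbor_labels = y_train[indices]
    # Majority vote per row
    return torch.mode(neighbor_labels, dim=1).values

# Made-up data: two well separated groups
X_tr = torch.tensor([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_tr = torch.tensor([0, 0, 1, 1])
X_te = torch.tensor([[0.2, 0.3], [5.1, 5.4]])

pred = knn_predict_batch(X_tr, y_tr, X_te, k_neighbors=3)
print(pred)
```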

```
pred_test = train_knn(X_train, X_test, y_train, y_test, k_neighbors=1)
print(classification_report(y_test, pred_test))

# Save data for future visualization
X_train_viz = X_train
X_test_viz = X_test
y_train_viz = y_train
y_test_viz = y_test
```

```
              precision    recall  f1-score   support

           0       0.39      0.17      0.24        53
           1       0.65      0.67      0.66       252
           2       0.88      0.90      0.89       995

    accuracy                           0.83      1300
   macro avg       0.64      0.58      0.60      1300
weighted avg       0.82      0.83      0.82      1300
```

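The KNN achieves good performance given its simplicity: 83% accuracy on the test set. Performance is uneven across classes, though, with a recall of only 0.17 on class 0, which has the fewest samples.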

# Feature Selection

Let’s try to discard some features based on their p-value to see if it improves the results. More on p-value here.

```
columns = wines.columns[:-1]
while len(columns) > 0:
    X_1 = wines[columns]
    X_1 = sm.add_constant(X_1)
    model = sm.OLS(y, X_1).fit()
    # Skip the p-value of the constant term
    pvalues = pd.Series(model.pvalues[1:], index=columns)
    max_idx = np.argmax(pvalues.to_numpy())
    max_pval = pvalues.iloc[max_idx]
    if max_pval > 0.10:
        # Report the column before dropping it; after the drop,
        # columns[max_idx] would point at the next column instead
        print('Dropping column ' + columns[max_idx] + ', pvalue is: ' + str(max_pval))
        columns = columns.drop(columns[max_idx])
    else:
        break
```

```
Dropping column citric acid, pvalue is: 0.6594184624811075
Dropping column free sulfur dioxide, pvalue is: 0.40799376734207193
Dropping column chlorides, pvalue is: 0.16949863758492337
```

`columns`

```
Index(['fixed acidity', 'volatile acidity', 'residual sugar',
'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
dtype='object')
```

```
X = wines[columns].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

X_train = torch.from_numpy(X_train)
X_test = torch.from_numpy(X_test)
y_train = torch.from_numpy(y_train)
y_test = torch.from_numpy(y_test)

y_pred = train_knn(X_train, X_test, y_train, y_test, k_neighbors=1)
print(classification_report(y_test, y_pred))
```

```
              precision    recall  f1-score   support

           0       0.14      0.14      0.14        58
           1       0.53      0.53      0.53       308
           2       0.86      0.86      0.86      1259

    accuracy                           0.77      1625
   macro avg       0.51      0.51      0.51      1625
weighted avg       0.77      0.77      0.77      1625
```

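Feature selection based on p-value made the performance worse: accuracy dropped from 83% to 77%. We tested values of \(k \in [1, 10]\) and the best result is obtained with \(k = 1\). Note that this run also differs from the first one in its split (`test_size=0.25`, `random_state=42`) and skips the `StandardScaler` step, so part of the drop may come from those changes rather than from feature selection alone.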
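The search over \(k\) is not shown in the notebook. A minimal sketch of such a sweep could look like the following, assuming a held-out validation split is used for the comparison so that the test set is not used to pick \(k\). The helper `knn_accuracy` and the two Gaussian blobs are made up for this example and stand in for the wine features.

```python
import torch

def knn_accuracy(X_train, y_train, X_val, y_val, k_neighbors):
    # Batched KNN prediction followed by the fraction of correct labels
    dist = torch.cdist(X_val, X_train)
    _, indices = dist.topk(k_neighbors, dim=1, largest=False)
    y_pred = torch.mode(y_train[indices], dim=1).values
    return (y_pred == y_val).float().mean().item()

torch.manual_seed(0)

# Two made-up Gaussian blobs, 100 points each
X_toy = torch.cat([torch.randn(100, 2), torch.randn(100, 2) + 3.0])
y_toy = torch.cat([torch.zeros(100, dtype=torch.long), torch.ones(100, dtype=torch.long)])

# Random 150 / 50 split into training and validation indices
perm = torch.randperm(200)
train_idx, val_idx = perm[:150], perm[150:]

scores = {
    k: knn_accuracy(X_toy[train_idx], y_toy[train_idx], X_toy[val_idx], y_toy[val_idx], k)
    for k in range(1, 11)
}
best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```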

Let’s now try to select all the features that have a correlation of more than \(0.10\) (in absolute value) with the target.

```
columns = ['alcohol', 'density', 'chlorides', 'volatile acidity']

X = wines[columns].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

X_train = torch.from_numpy(X_train)
X_test = torch.from_numpy(X_test)
y_train = torch.from_numpy(y_train)
y_test = torch.from_numpy(y_test)

y_pred = train_knn(X_train, X_test, y_train, y_test, k_neighbors=1)
print(classification_report(y_test, y_pred))
```

```
              precision    recall  f1-score   support

           0       0.19      0.17      0.18        58
           1       0.51      0.51      0.51       308
           2       0.85      0.86      0.85      1259

    accuracy                           0.77      1625
   macro avg       0.52      0.51      0.51      1625
weighted avg       0.76      0.77      0.77      1625
```

The result is the same as with backward elimination: accuracy stays at 77%. Sometimes all the features are relevant for a given model.

# Conclusion

The K-nearest neighbors algorithm is a simple but rather effective approach in some contexts. KNN has no training step: predictions are computed directly from the stored samples, which is very handy when the amount of data keeps growing, since new samples can be added without retraining anything.

However, KNN is not a model to pick when the data has a high dimensionality. Computing the neighbor distances across a large number of dimensions is not effective. This phenomenon is also called the “curse of dimensionality”. The KNN model is also ineffective when dealing with outliers.
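One common way to illustrate the curse of dimensionality is to look at how the gap between the nearest and the farthest point shrinks relative to the nearest distance as dimensions are added. The sketch below does this on uniformly random points; the function name `distance_contrast` and the choice of 500 points are assumptions for the illustration, not something taken from the wine data.

```python
import torch

torch.manual_seed(0)

def distance_contrast(n_dims, n_points=500):
    """Relative gap between the farthest and the nearest neighbor of a random query."""
    X = torch.rand(n_points, n_dims)
    query = torch.rand(1, n_dims)
    dist = torch.cdist(query, X).squeeze(0)
    return ((dist.max() - dist.min()) / dist.min()).item()

for n_dims in (2, 10, 100, 1000):
    print(n_dims, distance_contrast(n_dims))
```

As the number of dimensions grows, the contrast tends to shrink: the nearest neighbor is barely closer than the farthest one, which makes distance-based voting less informative.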

Want to keep learning? You can jump right into my next article: **Polynomial Regression From Scratch With PyTorch**.