# Practical KMeans Clustering with Python

- Introduction
- Generating sample data
- Feature scaling
- Determining $K$
- Model fitting
- Model accuracy
- Conclusion
- Additional links

## Introduction

KMeans clustering is perhaps the most well-known technique of partitioning similar data into the same clusters. We take $n$ data points and form $K$ clusters, where $K$ needs to be provided by the user and depends on the dataset. Each cluster has a cluster mean (thus KMeans clustering) and the objective is to minimize each cluster’s variance.

### Algorithm

- Choose $K$ number of clusters
- Select $K$ centroids (randomly)
- Assign each data point to closest centroid
- Compute new centroids
- Reassign data points according to new centroids. If no reassignment needed, stop.

With iteration of steps 4) and 5), the centroids will ultimately converge to appropriate centers of the clusters.

## Generating sample data

Let’s generate sample data points that can then be used to build clusters. Python’s `make_blobs`

function is helpful in quickly generating the desired points within a 2D space with the desired Gaussian distribution as shown below:

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
centers = [[1, 1], [-1, -1], [1, -1]]
X, y_true = make_blobs(n_samples=500, centers=centers, cluster_std=0.4,
random_state=31337)
plt.figure(figsize=(12,9))
plt.annotate('amirootyet.com', xy=(0.03, 0.95), xycoords='axes fraction')
plt.scatter(X[:, 0], X[:, 1], s=50)
```

Clearly, we see 3 clusters of data with an isotropic Gaussian distribution. But for the purposes of this exercise, we will assume that we do not know the number of clusters. The next logical step is to determine what the most suitable value for $K$ is in our dataset. But before that, we’ll standardize the dataset.

The array `X`

contains $(x,y)$ coordinates for the generated data points, whereas `y_true`

stores the true cluster numbers. We’ll use `y_true`

later to see how our clustering algorithm performed against this dataset.

## Feature scaling

Performing data scaling is essential in order to normalize or standardize data in most datasets. For our dataset, we will use standardization as shown below. There’s no need to standardize `y_true`

since these values will not be used for model training.

```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
```

## Determining $K$

The visualization clearly shows that there are 3 clusters of data, however a more definitive method of finding $K$ is to run the Elbow method and/or the Silhouette scoring method.

### Elbow method

The notion of Within Cluster Sum of Squares (WCSS) is targeted at minimizing the overall distance between points and their respective centroids in the clusters.

$WCSS = \sum distance{(P_i, C_1)}^2 + \sum distance{(P_i, C_2)}^2 + \sum distance{(P_i, C_3)}^2$

Intuitively, we will choose the number of clusters beyond which WCSS or distortions drops are non-significant.

```
from sklearn.cluster import KMeans
distortions = []
for K in range(1,11):
clusterer = KMeans(n_clusters=K, init='k-means++')
model = clusterer.fit(X)
distortion = model.inertia_
distortions.append(distortion)
Ks = [i for i in range(1,11)]
plt.plot(Ks, distortions)
plt.xlabel('Number of clusters')
plt.ylabel('Distortions')
plt.annotate('amirootyet.com', xy=(0.75, 0.95), xycoords='axes fraction')
plt.xticks([i for i in range(1,11)])
```

Note that the `inertia_`

attribute of the KMeans object will provide us the needed distortion or WCSS value.

### Silhouette method

```
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
scores = []
for K in range(2,11):
clusterer = KMeans(n_clusters=K, init='k-means++')
model = clusterer.fit_predict(X)
score = silhouette_score(X, model)
scores.append(score)
Ks = [i for i in range(2,11)]
plt.plot(Ks, scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.annotate('amirootyet.com', xy=(0.75, 0.95), xycoords='axes fraction')
```

Clearly, in both cases, we see that we have 3 clusters in this dataset. Therefore, during model fitting we will set $K$ to 3.

## Model fitting

Now that we know the appropriate number of clusters, $K$, and have standardized our data, we are ready to fit the KMeans clustering model on the dataset. While performing this exercise, keep in mind to use the `k-means++`

initialization in `scikit-learn`

since it will avoid the trap of poor initial seed values (Random Initialization Trap).

```
from sklearn.cluster import KMeans
classifier = KMeans(n_clusters=3, init='k-means++', random_state=0, n_jobs=-1)
model = classifier.fit(X)
y_pred = model.labels_
```

Instead of generating `model.labels_`

separately, you can also use `fit_predict`

to directly generate the prediction values. Once we have the predicted values, `y_pred`

, we can plot the different clusters to visualize model performance. The converged cluster centers can be accessed using the `cluster_centers_`

attribute of the KMeans model. This can be helpful if we wish to plot the cluster centroids.

```
plt.figure(figsize=(12,9))
plt.annotate('amirootyet.com', xy=(0.03, 0.95), xycoords='axes fraction')
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=50, cmap='Dark2')
```

Looks like the KMeans model did pretty well on this dataset! In the next step, we determine exactly *how* well.

## Model accuracy

We compute the accuracy score since `y_true`

holds the true cluster numbers and allows us to compare against ground truth.

```
from sklearn.metrics import accuracy_score
import numpy as np
print("Accuracy:", accuracy_score(y_true, y_pred))
```

```
Accuracy: 0.998
```

We also build a confusion matrix to visualize where and how many times our model predicted incorrectly.

Only one wrong prediction with an accuracy score of $0.998$ – a very accurate model indeed!

## Conclusion

Of course, in the real-world, we’ll be unlikely to have a 2D feature set with such a uniform Gaussian distribution. However, this post clarifies the practical application of KMeans clustering with `scikit-learn`

.

Note: The `.ipynb`

pertaining to this exercise is available on my Github here.