Exploring Clustering Methods

Old Faithful Dataset

We begin by loading the Old Faithful dataset. The data includes the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. The data can be described as follows:

 A data frame with 272 observations on 2 variables.

eruptions  numeric  Eruption time in mins
waiting    numeric  Waiting time to next eruption
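
For illustration, the data could be loaded along the following lines; the file name faithful.csv and the variable name faithful are assumptions, since the original import is not reproduced here.

```python
import pandas as pd

# Hypothetical source: a local CSV file with columns `eruptions` and `waiting`.
faithful = pd.read_csv("faithful.csv")[["eruptions", "waiting"]].to_numpy()
print(faithful.shape)  # expected: (272, 2)
```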

K-means clustering

We write a function k_means_clustering that performs $k$-means clustering on the input data for a given number of clusters no_of_clusters.

Here data is the matrix $X = \left( \begin{matrix} x_1^\top \\ \vdots \\ x_s^\top \end{matrix} \right) \in \mathbb{R}^{s \times d}$ whose rows are the $s$ $d$-dimensional data vectors.
Optional arguments are:

  1. Initialisation for the centroids (init_centroids).
  2. Maximum number of iterations (maximum_counter).
  3. Tolerance parameter for when the iteration is considered to have converged (tolerance).
  4. A parameter that determines after how many iterations a progress-update is printed (print_output).

If init_centroids is set to None, then we initialise it as a 2D array of normally distributed random variables with mean zero and standard deviation one.
The algorithm stops as soon as $\left(L(z^k, \mu^k) - L(z^{k + 1}, \mu^{k + 1})\right) / L(z^{k + 1}, \mu^{k + 1}) \leq \text{tolerance}$, or once the maximum number of iterations maximum_counter is reached. Here $L$ denotes the $k$-means clustering objective

$$ L(z, \mu) = \sum_{i = 1}^s \sum_{j = 1}^k z_{ij} \| x_i - \mu_j \|^2 \, . $$

The function returns the following:
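
The original function body is not reproduced here; the following is only a minimal sketch that follows the description above, and its return values (the final centroids and the cluster assignments) are an assumption.

```python
import numpy as np

def k_means_clustering(data, no_of_clusters, init_centroids=None,
                       maximum_counter=100, tolerance=1e-6, print_output=10):
    s, d = data.shape
    rng = np.random.default_rng()
    # If no initial centroids are given, draw them from a standard normal distribution.
    if init_centroids is None:
        mu = rng.standard_normal((no_of_clusters, d))
    else:
        mu = np.array(init_centroids, dtype=float)

    L_old = np.inf
    for counter in range(maximum_counter):
        # Assignment step: each point is assigned to its nearest centroid.
        dists = np.linalg.norm(data[:, None, :] - mu[None, :, :], axis=2)
        z = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it.
        for j in range(no_of_clusters):
            if np.any(z == j):
                mu[j] = data[z == j].mean(axis=0)
        # Objective L(z, mu) and relative-decrease stopping criterion.
        L_new = np.sum((data - mu[z]) ** 2)
        if counter % print_output == 0:
            print(f"iteration {counter}: L = {L_new:.4f}")
        if (L_old - L_new) / L_new <= tolerance:
            break
        L_old = L_new
    return mu, z
```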

We cluster the Old Faithful dataset into two clusters, initialising our centroids with the two centroid vectors $\mu_1 = \left( \begin{matrix} 3.5 & 50\end{matrix}\right)^\top$ and $\mu_2 = \left( \begin{matrix} 3 & 80\end{matrix}\right)^\top$.
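
Assuming the faithful array and the sketch above, the call could look like this:

```python
import numpy as np

init = np.array([[3.5, 50.0],
                 [3.0, 80.0]])
centroids, assignments = k_means_clustering(faithful, 2, init_centroids=init)
```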

We plot the resulting clusters.
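
One way to produce such a plot with matplotlib (the colours and labels are our own choice, not taken from the original):

```python
import matplotlib.pyplot as plt

for j, colour in enumerate(["tab:blue", "tab:orange"]):
    cluster = faithful[assignments == j]
    plt.scatter(cluster[:, 0], cluster[:, 1], c=colour, s=15, label=f"cluster {j + 1}")
plt.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="x", s=80, label="centroids")
plt.xlabel("eruptions (min)")
plt.ylabel("waiting (min)")
plt.legend()
plt.show()
```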

This is a fairly good result: the two clusters correspond to the well-known groups of short eruptions with short waiting times and long eruptions with long waiting times.

MNIST Dataset

We define a function to import the required dataset.

The MNIST dataset is a collection of grayscale images of handwritten digits (0, 1, ..., 9).
Here, for simplicity, we only look at the images representing the digits 4, 6 and 8, and to make the algorithms run faster we keep only the first 1000 such images.
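
The original import function is not shown; one possible sketch fetches the data from OpenML via sklearn.datasets.fetch_openml. The helper name load_mnist_subset and the label array mnist_labels are assumptions, while mnist_im_sub is the array name used further below.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def load_mnist_subset(digits=("4", "6", "8"), n_images=1000):
    """Fetch MNIST from OpenML and keep the first n_images of the chosen digits."""
    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    mask = np.isin(mnist.target, digits)
    images, labels = mnist.data[mask], mnist.target[mask].astype(int)
    return images[:n_images], labels[:n_images]

mnist_im_sub, mnist_labels = load_mnist_subset()
print(mnist_im_sub.shape)  # expected: (1000, 784)
```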

Here is a sample of these images.
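
A sample plot along these lines could be produced with matplotlib, reshaping each 784-dimensional row back into a 28×28 image:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 8, figsize=(12, 2))
for ax, image, label in zip(axes, mnist_im_sub, mnist_labels):
    ax.imshow(image.reshape(28, 28), cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
plt.show()
```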

We use the KMeans, GaussianMixture and SpectralClustering methods from sklearn (scikit-learn) and compare them.
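
A sketch of the comparison; the specific parameter choices (n_init, affinity, random_state) are our assumptions, not taken from the original.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture

n_clusters = 3  # digits 4, 6 and 8

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(mnist_im_sub)
gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(mnist_im_sub)
spectral = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                              random_state=0).fit(mnist_im_sub)

kmeans_labels = kmeans.labels_
gmm_labels = gmm.predict(mnist_im_sub)
spectral_labels = spectral.labels_
```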

At first glance, all three methods appear to cluster the images very well.

As a second check, we display the centres of the clusters for KMeans and the GMM (by its nature, SpectralClustering does not produce centroid vectors).
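
Assuming the fitted estimators from the sketch above, the centres can be visualised by reshaping each 784-dimensional centre back into a 28×28 image:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, n_clusters, figsize=(6, 4))
for j in range(n_clusters):
    axes[0, j].imshow(kmeans.cluster_centers_[j].reshape(28, 28), cmap="gray")
    axes[0, j].set_title(f"KMeans centre {j}")
    axes[1, j].imshow(gmm.means_[j].reshape(28, 28), cmap="gray")
    axes[1, j].set_title(f"GMM mean {j}")
for ax in axes.ravel():
    ax.axis("off")
plt.show()
```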

For spectral clustering, we want to examine the embedding that the algorithm produces.
For that we can use sklearn.manifold.SpectralEmbedding.
We apply the embedding $P:\mathbb{R}^{784}\to \mathbb{R}^2$ to transform the array of high-dimensional vectors mnist_im_sub ($1000\times 784$) into a new array mnist_im_emb of dimension $1000\times 2$.
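
A sketch of this step (the affinity choice is an assumption):

```python
from sklearn.manifold import SpectralEmbedding
import matplotlib.pyplot as plt

embedding = SpectralEmbedding(n_components=2, affinity="nearest_neighbors", random_state=0)
mnist_im_emb = embedding.fit_transform(mnist_im_sub)  # shape (1000, 2)

# Colour the embedded points by their true digit label.
plt.scatter(mnist_im_emb[:, 0], mnist_im_emb[:, 1], c=mnist_labels, s=10)
plt.show()
```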

Finally, we would like to get a quantitative estimate of the clustering quality of each of the methods.
Since in this example we know what the ground truth is (the labels given by the MNIST database), we can use an external evaluation.
We use rand_score (imported from sklearn.metrics) to compute the Rand Score for each of the three algorithms.
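
Assuming the predicted labels and the ground-truth array mnist_labels from the sketches above:

```python
from sklearn.metrics import rand_score

for name, pred in [("KMeans", kmeans_labels),
                   ("GMM", gmm_labels),
                   ("Spectral", spectral_labels)]:
    print(f"{name}: Rand score = {rand_score(mnist_labels, pred):.4f}")
```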

Spectral Clustering appears to be the most effective.

Image Compression

We explore another clustering application: compressing colour images.
An 8-bit RGB image uses $256 = 2^8$ possible values per colour channel. In other words, each pixel in the image requires $3\times 8 = 24$ bits of storage.
Suppose we would like to approximate such an image using only $n$ bits per pixel, where $n < 24$. How do we choose the $2^n$ palette colours (i.e. the $3 \times 2^n$ colour-intensity values) so that the image looks similar to the original image? This can be done with clustering methods.

We use KMeans from sklearn.cluster to apply the $k$-means clustering algorithm to the image.
The number of clusters is set to $2^{\text{nbits}}$, where nbits is the number of bits used to store each pixel.
We repeat the process for $\text{nbits} = 1, 2, 6$. Each image is a $480\times 480$ array of RGB values, which we reshape into an array of $230{,}400$ RGB triples (one pixel per row) before clustering.
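
A sketch of the compression step; the variable image (the original 480×480×3 uint8 array) and the helper name compress_image are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_image(image, nbits):
    """Quantise an RGB image to 2**nbits colours using k-means."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)  # one RGB triple per row
    kmeans = KMeans(n_clusters=2 ** nbits, n_init=10, random_state=0).fit(pixels)
    # Replace every pixel by the centre of the cluster it belongs to.
    quantised = kmeans.cluster_centers_[kmeans.labels_]
    return quantised.reshape(h, w, 3).astype(np.uint8)

compressed = {nbits: compress_image(image, nbits) for nbits in (1, 2, 6)}
```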

Our results are pretty good: even with only a few bits per pixel, the compressed images remain recognisably close to the original.