Spotify Music Classification

We want to look at the Spotify© music dataset by Brice Vergnou on Kaggle. The training set consists of $150$ songs, randomly selected from the $195$ available on Kaggle; the testing set consists of the remaining $45$ songs. Each song is characterised by $13$ numbers representing its audio features.

Every song included in the training set is also assigned a binary ($0$ or $1$) label that encodes the dataset author's music preference.

Logistic Regression

We first want a function $\mathtt{linear\_model\_function}$ that implements the linear model function for binary logistic regression defined as $$ f\left(\mathbf{x}, \mathbf{w}\right) = \left\langle \phi\left(\mathbf{x}\right),\mathbf{w}\right\rangle, $$ where $\phi\left(\mathbf{x}\right)$ is an augmented data vector and $\mathbf{w}$ is a weight vector.
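A minimal sketch of such a function is given below; the argument names data_matrix and weights, and the convention that the rows of data_matrix are the augmented vectors $\phi\left(\mathbf{x}^{(i)}\right)$, are our own assumptions.

```python
import numpy as np

def linear_model_function(data_matrix, weights):
    # Linear model f(x, w) = <phi(x), w>, applied row-wise: each row of
    # data_matrix is assumed to already be an augmented data vector phi(x).
    return data_matrix @ weights
```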

Next we want a function $\mathtt{binary\_logistic\_activation\_function}$ that takes a single argument named inputs and returns the sigmoid function $$ \sigma\left(\mathbf{x}\right) = \frac{1}{1+\mathrm{e}^{-\mathbf{x}}} $$ applied elementwise to the NumPy array inputs. Here $\mathbf{x}$ is the mathematical notation for the argument inputs.
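A possible implementation, assuming elementwise application to a NumPy array:

```python
import numpy as np

def binary_logistic_activation_function(inputs):
    # Elementwise sigmoid: sigma(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-np.asarray(inputs, dtype=float)))
```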

Now we need two functions, $\mathtt{binary\_logistic\_prediction\_function}$ and $\mathtt{classification\_accuracy}$, that turn our predictions into classification results and measure how many labels have been classified correctly. The function $\mathtt{binary\_logistic\_prediction\_function}$ takes the argument logistic_values as its input and returns a vector of class labels with binary values in $\left\{0, 1\right\}$ as its output. The function $\mathtt{classification\_accuracy}$ takes two inputs, true_labels and recovered_labels, and returns the fraction of correctly classified labels, i.e. the percentage divided by $100$.
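A sketch of both functions; the decision threshold of $1/2$ on the sigmoid outputs is an assumption.

```python
import numpy as np

def binary_logistic_prediction_function(logistic_values):
    # Threshold the logistic outputs at 1/2 to obtain class labels in {0, 1}.
    return (np.asarray(logistic_values) >= 0.5).astype(int)

def classification_accuracy(true_labels, recovered_labels):
    # Fraction of labels that agree, i.e. the percentage divided by 100.
    return np.mean(np.asarray(true_labels) == np.asarray(recovered_labels))
```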

We now want two functions that implement the cost function for binary logistic regression as well as its gradient, as defined below: $$ \mathrm{L}\left(\mathbf{w}\right) = \frac{1}{s} \sum\limits_{i=1}^s \left(\log\left[1+\exp\left(f\left(\mathbf{x}^{(i)},\mathbf{w}\right)\right)\right] - y_i\cdot f\left(\mathbf{x}^{(i)},\mathbf{w}\right)\right), $$ where $\phi\left(\mathbf{x}^{(i)}\right)$ is the augmented $i$-th data vector and $f$ is a model function. In the case of the linear model function $f\left(\mathbf{x},\mathbf{w}\right) = \left\langle \phi\left(\mathbf{x}\right),\mathbf{w} \right\rangle$ one has $$ \nabla \mathrm{L}\left(\mathbf{w}\right) = \frac{1}{s} \sum\limits_{i=1}^s \left( \phi\left(\mathbf{x}^{(i)}\right)\cdot\sigma \left(\left\langle \phi\left(\mathbf{x}^{(i)}\right),\mathbf{w} \right\rangle \right) - y_i\cdot \phi\left(\mathbf{x}^{(i)}\right) \right), $$ where $y_i$ are the corresponding data labels.
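A sketch of the two functions; the names binary_logistic_regression_cost_function and binary_logistic_regression_gradient and the argument order (data_matrix, data_labels, weights) are our own assumptions.

```python
import numpy as np

def binary_logistic_regression_cost_function(data_matrix, data_labels, weights):
    # L(w): average over the s rows of data_matrix of log(1 + exp(f)) - y * f,
    # with f(x, w) = <phi(x), w> computed row-wise.
    model_values = data_matrix @ weights
    return np.mean(np.log(1.0 + np.exp(model_values)) - data_labels * model_values)

def binary_logistic_regression_gradient(data_matrix, data_labels, weights):
    # Gradient of L(w): (1/s) * sum_i phi(x_i) * (sigma(<phi(x_i), w>) - y_i).
    sigma = 1.0 / (1.0 + np.exp(-(data_matrix @ weights)))
    return data_matrix.T @ (sigma - data_labels) / data_matrix.shape[0]
```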

We implement the gradient descent algorithm, along with a $\mathtt{gradient\_descent\_v2}$ algorithm that includes a stopping criterion to end the process when:

$$ \left\| \nabla L\left(\mathbf{w}^{(k)}\right)\right\|_2 \leq \mathrm{tolerance}, $$

is satisfied. Here $L$ and $\mathbf{w}^{(k)}$ are the mathematical representations of the objective $\mathtt{objective}$ and the weight vector weights at iteration $k$. The parameter tolerance is a non-negative threshold on the Euclidean norm of the gradient. The function $\mathtt{gradient\_descent\_v2}$ takes the arguments $\mathtt{objective}$, $\mathtt{gradient}$, initial_weights, step_size, no_of_iterations, print_output and tolerance. The arguments $\mathtt{objective}$ and $\mathtt{gradient}$ are functions that take (weight-)arrays as arguments and return the scalar value of the objective and the array representation of the corresponding gradient, respectively. The argument initial_weights specifies the initial value of the variable over which we iterate, step_size is the gradient descent step-size parameter, no_of_iterations specifies the maximum number of iterations, print_output determines after how many iterations the function produces a text output, and tolerance controls the norm of the gradient as described in the equation above.
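A sketch of $\mathtt{gradient\_descent\_v2}$ under the assumptions stated above; the exact form of the printed output is our own choice.

```python
import numpy as np

def gradient_descent_v2(objective, gradient, initial_weights, step_size,
                        no_of_iterations, print_output, tolerance):
    weights = np.array(initial_weights, dtype=float)
    for k in range(no_of_iterations):
        grad = gradient(weights)
        # Stopping criterion: the Euclidean norm of the gradient is small enough.
        if np.linalg.norm(grad) <= tolerance:
            print(f"Converged after {k} iterations.")
            break
        weights = weights - step_size * grad
        # Report progress every print_output iterations.
        if print_output and (k + 1) % print_output == 0:
            print(f"Iteration {k + 1}: objective = {objective(weights):.6f}")
    return weights
```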

The code in the following cell

In the following cell we write a function $\mathtt{standardise}$ that standardises the columns of a two-dimensional NumPy array data_matrix. The function returns a triple: the standardised matrix, the row of column averages and the row of column standard deviations. We also include a function $\mathtt{de\_standardise}$ to reverse the process.
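A minimal sketch of the two functions:

```python
import numpy as np

def standardise(data_matrix):
    # Shift and scale each column to zero mean and unit standard deviation.
    column_means = np.mean(data_matrix, axis=0)
    column_stds = np.std(data_matrix, axis=0)
    standardised_matrix = (data_matrix - column_means) / column_stds
    return standardised_matrix, column_means, column_stds

def de_standardise(standardised_matrix, column_means, column_stds):
    # Reverse the standardisation using the stored column statistics.
    return standardised_matrix * column_stds + column_means
```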

Standardising our data.

In order to prepare our standardised data for analysis we also need to build an augmented data matrix. We implement a function $\mathtt{linear\_regression\_data}$ that computes (and returns) the linear regression data_matrix for a given data_inputs matrix.
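A sketch, assuming the augmentation simply prepends a column of ones (for the bias term) to the inputs:

```python
import numpy as np

def linear_regression_data(data_inputs):
    # Augmented data matrix: a leading column of ones followed by the inputs,
    # so that <phi(x), w> includes a bias term.
    no_of_rows = data_inputs.shape[0]
    return np.hstack((np.ones((no_of_rows, 1)), data_inputs))
```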

We now apply the above to the Spotify dataset:
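A sketch of how the pieces might fit together; the array names spotify_training_inputs and spotify_training_labels, as well as the step size, iteration count and tolerance, are hypothetical.

```python
import numpy as np

# Hypothetical arrays: spotify_training_inputs holds the 13 features per
# training song, spotify_training_labels the 0/1 preference labels.
standardised_inputs, column_means, column_stds = standardise(spotify_training_inputs)
spotify_data_matrix = linear_regression_data(standardised_inputs)

spotify_weights = gradient_descent_v2(
    objective=lambda w: binary_logistic_regression_cost_function(
        spotify_data_matrix, spotify_training_labels, w),
    gradient=lambda w: binary_logistic_regression_gradient(
        spotify_data_matrix, spotify_training_labels, w),
    initial_weights=np.zeros(spotify_data_matrix.shape[1]),
    step_size=0.5,
    no_of_iterations=10000,
    print_output=1000,
    tolerance=1e-6)
```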

We now evaluate classification accuracy.
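For example (hypothetical names for the test arrays; note that the test inputs are standardised with the training means and standard deviations):

```python
# Hypothetical arrays spotify_test_inputs / spotify_test_labels for the 45 test songs.
standardised_test_inputs = (spotify_test_inputs - column_means) / column_stds
spotify_test_matrix = linear_regression_data(standardised_test_inputs)

test_predictions = binary_logistic_prediction_function(
    binary_logistic_activation_function(spotify_test_matrix @ spotify_weights))
print(classification_accuracy(spotify_test_labels, test_predictions))
```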

Ridge Regression

We now attempt to increase the classification accuracy by considering ridge-regularised binary logistic regression.

We define two functions $\mathtt{ridge\_binary\_logistic\_regression\_cost\_function}$ and $\mathtt{ridge\_binary\_logistic\_regression\_gradient}$ that take $4$ arguments: the NumPy arrays data_matrix, data_labels and weights, and a positive float number regularisation_parameter. The modified cost function is defined as $$ \mathrm{L}_{\alpha}\left(\mathbf{w}\right) = \mathrm{L}\left(\mathbf{w}\right) + \frac{\alpha}{2}\left\|\mathbf{w}\right\|^2, $$ where $\mathrm{L}\left(\mathbf{w}\right)$ is the cost function for binary logistic regression defined above, while the corresponding gradient is given by $$ \nabla \mathrm{L}_{\alpha}\left(\mathbf{w}\right) = \nabla \mathrm{L}\left(\mathbf{w}\right) + \alpha \mathbf{w}, $$ where $\alpha$ is the mathematical representation of the regularisation_parameter.
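A sketch of the two ridge-regularised functions, reusing the (assumed-named) plain cost and gradient sketches from above:

```python
import numpy as np

def ridge_binary_logistic_regression_cost_function(data_matrix, data_labels,
                                                   weights, regularisation_parameter):
    # L_alpha(w) = L(w) + (alpha / 2) * ||w||^2.
    cost = binary_logistic_regression_cost_function(data_matrix, data_labels, weights)
    return cost + 0.5 * regularisation_parameter * np.dot(weights, weights)

def ridge_binary_logistic_regression_gradient(data_matrix, data_labels,
                                              weights, regularisation_parameter):
    # grad L_alpha(w) = grad L(w) + alpha * w.
    grad = binary_logistic_regression_gradient(data_matrix, data_labels, weights)
    return grad + regularisation_parameter * weights
```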

We implement a function $\mathtt{grid\_search}$ that searches for the minimum value of a given function over a given grid of points. This takes two parameters: the function to be minimised and the grid of points at which it is evaluated.
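A minimal sketch; the return convention (the minimising grid point together with the corresponding function value) is an assumption.

```python
import numpy as np

def grid_search(function, grid_points):
    # Evaluate the function at every grid point and return the minimiser
    # together with the minimal value.
    values = np.array([function(point) for point in grid_points])
    best_index = np.argmin(values)
    return grid_points[best_index], values[best_index]
```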

The code below finds the optimal value of the hyperparameter regularisation_parameter and then computes the spotify_optimal_weights corresponding to this hyperparameter value.
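One way this could be organised is sketched below; the use of a held-out validation split (hypothetical names spotify_validation_matrix, spotify_validation_labels), the grid of candidate values and the gradient descent settings are all assumptions.

```python
import numpy as np

def negative_validation_accuracy(regularisation_parameter):
    # Train ridge logistic regression with the given alpha and return minus
    # the validation accuracy, so that grid_search can minimise it.
    weights = gradient_descent_v2(
        objective=lambda w: ridge_binary_logistic_regression_cost_function(
            spotify_data_matrix, spotify_training_labels, w, regularisation_parameter),
        gradient=lambda w: ridge_binary_logistic_regression_gradient(
            spotify_data_matrix, spotify_training_labels, w, regularisation_parameter),
        initial_weights=np.zeros(spotify_data_matrix.shape[1]),
        step_size=0.5, no_of_iterations=5000, print_output=0, tolerance=1e-6)
    predictions = binary_logistic_prediction_function(
        binary_logistic_activation_function(spotify_validation_matrix @ weights))
    return -classification_accuracy(spotify_validation_labels, predictions)

# Hypothetical grid of candidate regularisation parameters.
regularisation_grid = np.logspace(-4, 1, 20)
optimal_regularisation_parameter, _ = grid_search(negative_validation_accuracy,
                                                  regularisation_grid)

# Retrain with the chosen hyperparameter to obtain the final weights.
spotify_optimal_weights = gradient_descent_v2(
    objective=lambda w: ridge_binary_logistic_regression_cost_function(
        spotify_data_matrix, spotify_training_labels, w,
        optimal_regularisation_parameter),
    gradient=lambda w: ridge_binary_logistic_regression_gradient(
        spotify_data_matrix, spotify_training_labels, w,
        optimal_regularisation_parameter),
    initial_weights=np.zeros(spotify_data_matrix.shape[1]),
    step_size=0.5, no_of_iterations=5000, print_output=0, tolerance=1e-6)
```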

Ridge regularisation does not seem to improve our model's accuracy.