Sound classification with deep learning

Our goal is to classify spoken digits from the AudioMNIST dataset. The training set contains 2400 samples, each represented by 900 Mel-frequency cepstral coefficients (MFCCs), and we use these 900 coefficients as the input features.
Below is a visualisation of what each of these samples looks like as an image.
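The plotting code itself is not reproduced here, but a minimal sketch of this kind of visualisation might look like the following. The variable names X_train and y_train and the 30×30 reshape of the 900 coefficients are assumptions for illustration, not details taken from the original.

```python
import matplotlib.pyplot as plt

# Assumed layout: X_train has shape (2400, 900), y_train holds integer digit labels.
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for digit, ax in enumerate(axes.ravel()):
    sample = X_train[y_train == digit][0]          # first example of each digit
    ax.imshow(sample.reshape(30, 30), aspect="auto")  # 30x30 is an assumed reshape
    ax.set_title(str(digit))
    ax.axis("off")
plt.tight_layout()
plt.show()
```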

We implement a neural network from scratch, using an architecture of only two densely connected layers. Below are all the classes and functions we will need to train the network.
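Those classes and functions are not included here, so the sketch below shows what such a two-dense-layer network trained with plain SGD might look like. The hidden-layer width of 64, the He-style initialisation, and the class names are assumptions rather than details of the original implementation.

```python
import numpy as np

class Dense:
    """Fully connected layer: y = xW + b."""
    def __init__(self, n_in, n_out, rng):
        # Assumed initialisation; the original may differ.
        self.W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                           # cache input for the backward pass
        return x @ self.W + self.b

    def backward(self, grad_out, lr):
        # Gradients w.r.t. parameters and input, followed by a plain SGD update.
        grad_W = self.x.T @ grad_out
        grad_b = grad_out.sum(axis=0)
        grad_in = grad_out @ self.W.T
        self.W -= lr * grad_W
        self.b -= lr * grad_b
        return grad_in

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out, lr):
        return grad_out * self.mask

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; labels are integer class indices."""
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

class TwoLayerNet:
    """900 MFCC features -> hidden layer -> 10 digit classes."""
    def __init__(self, n_in=900, n_hidden=64, n_out=10, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [Dense(n_in, n_hidden, rng), ReLU(), Dense(n_hidden, n_out, rng)]

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return softmax(x)

    def train_step(self, x, labels, lr):
        probs = self.forward(x)
        loss = cross_entropy(probs, labels)
        # Gradient of the cross-entropy loss w.r.t. the pre-softmax logits.
        grad = probs.copy()
        grad[np.arange(labels.shape[0]), labels] -= 1.0
        grad /= labels.shape[0]
        for layer in reversed(self.layers):
            grad = layer.backward(grad, lr)
        return loss

    def predict(self, x):
        return self.forward(x).argmax(axis=1)
```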

We now train the network for 20 epochs over a range of learning rates. Since the data does not come with a separate validation set, we use training accuracy for validation.
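A sketch of such a learning-rate sweep is shown below, building on the TwoLayerNet sketch above. The specific learning rates, the batch size of 32, and the variable names X_train, y_train, X_test, y_test are illustrative assumptions rather than values from the original run.

```python
import numpy as np

learning_rates = [0.001, 0.01, 0.1, 1.0]   # assumed sweep values
n_epochs, batch_size = 20, 32
history = {}

for lr in learning_rates:
    net = TwoLayerNet()                     # fresh network per learning rate
    train_acc, test_acc = [], []
    for epoch in range(n_epochs):
        order = np.random.permutation(len(X_train))
        for start in range(0, len(X_train), batch_size):
            idx = order[start:start + batch_size]
            net.train_step(X_train[idx], y_train[idx], lr)
        # Track accuracy on both splits so the curves can be compared later.
        train_acc.append((net.predict(X_train) == y_train).mean())
        test_acc.append((net.predict(X_test) == y_test).mean())
    history[lr] = (train_acc, test_acc)
```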

As would be expected, our training error is consistently lower than our test error, but there doesn't seem to be any sign of overfitting at this small number of epochs, and accuracy is generally high across all learning rates.

We now train for 100 epochs at the optimal learning rate to see if this improves accuracy.
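A sketch of this longer run, together with a hand-rolled confusion matrix for inspecting per-digit mistakes, might look like the following. The learning rate of 0.1 is a placeholder for whichever rate performed best in the sweep, and the data variable names remain assumptions.

```python
import numpy as np

best_lr = 0.1                               # placeholder for the best rate from the sweep
net = TwoLayerNet()
for epoch in range(100):
    order = np.random.permutation(len(X_train))
    for start in range(0, len(X_train), 32):
        idx = order[start:start + 32]
        net.train_step(X_train[idx], y_train[idx], best_lr)

preds = net.predict(X_test)
print("test accuracy:", (preds == y_test).mean())

# Confusion matrix: rows are true digits, columns are predicted digits.
conf = np.zeros((10, 10), dtype=int)
for t, p in zip(y_test, preds):
    conf[t, p] += 1
print(conf)
```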

Our neural network appears to struggle to differentiate 8s from 6s, but is otherwise highly accurate. Normalising the data, adding more layers, or introducing convolutional layers could perhaps improve this further.